Glossary

This section contains definitions of some of the technical terms used in the field of ANLP.

A-Z Glossary

Agglutination – A language property that results in many ways to create lexical varieties, using prefixes, stems, and suffixes.

Bag-of-Words (BOW) – A representation that turns arbitrary text into fixed-length vectors by counting how many times each word appears. This process is often referred to as vectorization.
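
As a rough illustration of the counting step, the sketch below builds bag-of-words vectors by hand. The function name `bag_of_words` and the whitespace tokenization are assumptions made for this example only; real pipelines usually rely on a proper tokenizer.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, vectors), with one count vector per document."""
    # Naive tokenization (an assumption): lowercase, then split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # A sorted vocabulary gives every word a fixed position in each vector.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocab, vecs = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocab)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vecs)   # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Both documents map to vectors of the same length (the vocabulary size), regardless of how long the original texts are.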

Corpus – A collection of text or audio. The plural is corpora.

Contextualized embeddings – Dynamic vector representations of words that change depending on the context.

Chunking – Chunking (or text chunking) is a type of shallow parsing that analyses a sentence by first identifying its constituent parts (nouns, verbs, adjectives, etc.) and then linking them to higher-order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.). Related term: Shallow Parsing.

Constituency Parsing – Analysing a sentence by breaking it down into nested sub-phrases (constituents), such as noun phrases and verb phrases, usually represented as a parse tree. Related term: Parsing.

Case Grammar – A system of linguistic analysis, focusing on the link between the valence, or number of subjects, objects, etc., of a verb and the grammatical context it requires.

Co-location Analysis – The identification of co-locations (collocations): expressions consisting of two or more words that correspond to some conventional way of saying things.

Character Counting – Counting the number of characters in a line of text, a page, or a group of text.

Concordance – An alphabetical list of the words (especially the important ones) present in a text, usually with citations of the passages in which they are found.

Cosine Similarity – A metric that measures how similar two documents are, irrespective of their size, as the cosine of the angle between their vector representations.
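
A minimal sketch of the computation, assuming each document has already been turned into a numeric vector (for example, by word counts). The function name `cosine_similarity` is chosen for this example only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Convention assumed here: an all-zero vector is similar to nothing.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

short_doc = [1, 2, 0]    # word counts of a short document
long_doc = [10, 20, 0]   # same proportions, ten times longer
print(cosine_similarity(short_doc, long_doc))  # close to 1.0
print(cosine_similarity([1, 0], [0, 1]))       # 0.0, no words in common
```

Because only the direction of the vectors matters, a document and a much longer one with the same word proportions score as nearly identical.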

Camel Case Splitting – Splitting a CamelCase string into its individual words.
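
One possible way to do this is with a regular expression, sketched below. The helper name `split_camel_case` is hypothetical, and the pattern only handles ASCII letters and digits.

```python
import re

def split_camel_case(text):
    """Split a CamelCase identifier into its component words (ASCII only)."""
    # Three alternatives, tried in order:
    #   1. a run of capitals not followed by a lowercase letter (an acronym),
    #   2. an optional capital followed by lowercase letters (a normal word),
    #   3. a run of digits.
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", text)

print(split_camel_case("CamelCaseString"))    # ['Camel', 'Case', 'String']
print(split_camel_case("parseHTTPResponse"))  # ['parse', 'HTTP', 'Response']
```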

Contextualized word embedding – A neural model that learns a generic embedding function for variable length contexts of target words. Related terms: Context2Vec.

Coreference Resolution – Finding all expressions that refer to the same entity in a discourse.

Dialect identification – The task of classifying a text with the correct dialect.

Diglossia – A phenomenon whereby two or more varieties of a language exist in the same speech community.

Derivational language – A language in which new words are created from other words. Usually, derivational morphemes change the POS of the word and its core meaning.

Document Similarity – Computing the similarity between two text documents by transforming the input documents into real-valued vectors.

Gazetteers – A set of lists containing names of entities.

Homonym Detection – Detecting words that are pronounced the same as each other (e.g., “maid” and “made”) or have the same spelling (e.g., “lead weight” and “to lead”).

Inflectional Language – A language in which affixes are added to a word to mark number, tense, possession, and comparison, while the core meaning and POS of the word remain unchanged. In English, all inflectional affixes are suffixes.

Keyword Searching – The technique of finding strings that match a pattern. Related terms: Term Matching, Word Matching.

Lexicon – A component containing semantic and grammatical information about words.

Lemmatization – The use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Morphology – The study of internal word structure.

Named Entity Recognition – Labeling sequences within a sentence that represent an entity, such as a person or an organization.

Parsing – Analyzing an input sentence in terms of grammatical constituents, POS, and syntactic relations.

Part-of-Speech (POS) Tagging – POS Tagging (or Tagging) processes a sequence of words, and attaches a POS tag to each word. Parts of speech are also known as word classes or lexical categories.

Regular Expression – A special sequence of characters that describes a text pattern, for the purpose of searching for or replacing the text that matches it.
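
The short sketch below shows both uses, searching and replacing, with Python's standard `re` module. The sample sentence and the phone-number pattern are made up for illustration.

```python
import re

text = "Call 555-0199 or 555-0123 before 2024-05-01."

# Pattern: three digits, a hyphen, four digits, delimited by word boundaries.
pattern = re.compile(r"\b\d{3}-\d{4}\b")

matches = pattern.findall(text)        # searching
masked = pattern.sub("[PHONE]", text)  # replacing

print(matches)  # ['555-0199', '555-0123']
print(masked)   # Call [PHONE] or [PHONE] before 2024-05-01.
```

The date at the end is left untouched because it never contains three digits followed by a hyphen and four digits.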

Syntax – The linguistic discipline interested in modeling how words are arranged together to make larger sequences in a language.

Semantics – The study of the literal meaning of linguistic expressions.

Semantic Role Labelling (SRL) – The process of detecting the semantic arguments linked with the predicate or verb of a sentence and classifying them into their specific roles. Related terms: Semantic Parsing, Semantic Trees, Shallow Parsing, Shallow Semantic Analysis.

Sentiment Analysis – The process of computationally identifying and categorizing opinions expressed in a piece of text.

Semantic Annotation – The process of attaching to a text document or other unstructured content, metadata about concepts (e.g., people, places, organizations, products or topics) relevant to it.

Summarization – The practice of breaking down long publications into manageable paragraphs or sentences. The procedure extracts important information while also ensuring that the paragraph’s sense is preserved.

Stemming – A crude heuristic process that chops off the ends of words in the hope of reducing them to a common base form correctly most of the time, and often includes the removal of derivational affixes.

Stop-Word Removal – Filtering out common words (stop words) before or after the processing of natural language data (text).
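
A minimal sketch of the filtering step follows. The tiny `STOP_WORDS` set and the whitespace tokenization are assumptions for this example; practical systems use much larger, language-specific lists.

```python
# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "in", "and"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the lowercase tokens of `text` that are not stop words."""
    return [token for token in text.lower().split() if token not in stop_words]

print(remove_stop_words("The history of the Arabic language"))
# ['history', 'arabic', 'language']
```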

Term Extraction – The process of extracting the most relevant words and expressions from text. Related terms: Keyword Extraction, Word Extraction.

Temporal Tagging – The task of finding phrases with temporal meaning within the context of a larger document.

Text Annotation – The practice and the result of adding a note or gloss to a text, which may include highlights or underlining, comments, footnotes, tags, and links.

Topic Modelling – A type of statistical model for discovering the abstract “topics” that occur in a collection of documents.

Term-Document Matrix – A mathematical matrix that describes the frequency of terms that occur in a collection of documents.
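
The layout can be sketched as follows, with one row per term and one column per document. The function name `term_document_matrix` and the whitespace tokenization are assumptions for this example.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, matrix); matrix[i][j] counts term i in document j."""
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({term for tokens in tokenized for term in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # One row per term, one column per document.
    matrix = [[doc_counts[term] for doc_counts in counts] for term in terms]
    return terms, matrix

terms, matrix = term_document_matrix(["red apple", "green apple apple"])
for term, row in zip(terms, matrix):
    print(term, row)
# apple [1, 2]
# green [0, 1]
# red [1, 0]
```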

Textual Entailment Recognition – Deciding, given two text fragments, whether the meaning of one text is entailed by (can be inferred from) the other.

Word Frequency – How often a word appears in a document, divided by how many words there are. Related Terms: Term Frequency, Domain Term Frequency.
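
The definition amounts to a single division, sketched here under the assumption of lowercase, whitespace-based tokenization. The function name `word_frequency` is chosen for this example.

```python
def word_frequency(word, document):
    """Occurrences of `word` divided by the total number of words."""
    tokens = document.lower().split()
    if not tokens:
        return 0.0  # avoid dividing by zero for an empty document
    return tokens.count(word.lower()) / len(tokens)

print(word_frequency("the", "The cat and the hat"))  # 0.4
```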

Word Embedding – A vector representation of a particular word. Word2Vec is one of the most popular techniques for learning word embeddings using a shallow neural network. Related terms: Word2Vec.
