Arabic Natural Language Processing

Arabic natural language processing (ANLP) is a subfield of computational linguistics that aims to model the Arabic language. Natural language processing has two main components: Natural Language Understanding (NLU) and Natural Language Generation (NLG). The field has advanced enormously for some languages, such as English, but not for Arabic. This is partly because English is an international language, and partly because Arabic poses several challenges that fall into two categories: the complexity of the language and the lack of resources.

Arabic Language Complexity

01

Morphological level

At the morphological level, Arabic is a highly inflectional and derivational language: a single root can yield many words, and there are around 10,000 free roots. Clitics can also attach to words, which multiplies the options further. The result is a vast vocabulary: it has been noted that Arabic has 2.5 times the vocabulary growth rate of English and almost 10 times as many out-of-vocabulary words. This makes techniques such as language modeling harder and dependent on more enabling technologies.
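To make the root-and-pattern idea concrete, here is a minimal sketch in Python. It is an illustration only, not a morphological analyzer: the pattern templates, the simplified Latin transliteration, and the names `PATTERNS`, `PROCLITICS`, `apply_pattern`, and `surface_forms` are all assumptions made for this example. The glosses apply to the root k-t-b; other roots will not necessarily produce valid words with every pattern.

```python
from itertools import product

# Hypothetical pattern templates in simplified Latin transliteration.
# The digits 1, 2, 3 mark the slots for the three root consonants.
# Glosses are for the root k-t-b only.
PATTERNS = {
    "1a2a3a": "he wrote",
    "1aa2i3": "writer",
    "ma12uu3": "written",
    "1i2aa3": "book",
    "ma12a3": "office",
}

# Optional proclitics; "" means no clitic, "wa" is the conjunction "and".
PROCLITICS = ["", "wa"]


def apply_pattern(root: str, pattern: str) -> str:
    """Slot the three consonants of `root` into `pattern`."""
    if len(root) != 3:
        raise ValueError("this sketch only handles three-consonant roots")
    slots = {"1": root[0], "2": root[1], "3": root[2]}
    return "".join(slots.get(ch, ch) for ch in pattern)


def surface_forms(root: str) -> list[str]:
    """Combine every proclitic with every pattern for one root."""
    return [
        clitic + apply_pattern(root, pattern)
        for clitic, pattern in product(PROCLITICS, PATTERNS)
    ]


forms = surface_forms("ktb")
print(len(forms), forms)
```

Even with five patterns and one optional clitic, a single root already yields ten surface forms. Real Arabic has far more patterns, inflections, and clitic combinations, which is why the vocabulary grows so quickly.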

02

Syntactic level

At the syntactic level, unlike English, where most sentences follow SVO (subject-verb-object) order, Arabic has a relatively free word order; that is, the words within a sentence have no fixed arrangement. A second source of syntactic ambiguity is that Arabic has two types of sentences, the verbal sentence and the nominal sentence. Moreover, Arabic is a null-subject language: the subject can be dropped without any information being lost. Together, these characteristics result in structural ambiguity.

03

Orthographic ambiguity

The script itself is a source of difficulty for machines. Arabic script has no capitalization and lacks consistent punctuation, and the form of some letters changes depending on their position within the word, so tasks such as sentence segmentation and named entity recognition become harder. The absence of diacritics is another problem that gives rise to several types of ambiguity. Diacritics play two main roles: they determine the core meaning of a word, and they mark its grammatical role within a sentence. They are therefore essential for both semantic and syntactic analysis, and their absence causes challenges at both levels. Diacritics are fully present in religious texts, less present in modern Arabic, and almost absent in Arabic dialects and Arabizi.
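The effect of missing diacritics can be shown with a short Python sketch. It assumes the common diacritics sit in the Unicode range U+064B to U+0652 (tanwin, fatha, damma, kasra, shadda, sukun); the function name `strip_diacritics` and the three example words are chosen for illustration. The words are written as Unicode escapes, with the transliteration in the dictionary keys.

```python
import re

# Common Arabic diacritics: tanwin, fatha, damma, kasra, shadda, sukun.
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Remove diacritic marks, leaving only the base letters."""
    return DIACRITICS.sub("", text)


# Three different words built on the letters kaf, ta, ba.
words = {
    "kataba (he wrote)": "\u0643\u064E\u062A\u064E\u0628\u064E",
    "kutiba (it was written)": "\u0643\u064F\u062A\u0650\u0628\u064E",
    "kutub (books)": "\u0643\u064F\u062A\u064F\u0628",
}

bare = {gloss: strip_diacritics(word) for gloss, word in words.items()}
for gloss, form in bare.items():
    print(gloss, "->", form)

# All three collapse to the same undiacritized string.
print(len(set(bare.values())))
```

Once the diacritics are removed, an active verb, a passive verb, and a plural noun become the same three-letter string, so a system has to rely on context to tell them apart.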

Arabic Types

Classical Arabic (CA)

Classical Arabic is the language of religious texts such as the Holy Quran and the Hadith.

Modern Standard Arabic (MSA)

Modern Standard Arabic (MSA) is the Arabic used in daily formal communication.

Arabic Dialect

Arabic dialects vary from one country to another and from one region to another. This type of Arabic is used in daily informal communication.

Arabizi

Arabizi is Arabic written in Latin script. It is the type of Arabic commonly found on social media platforms.
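Because Arabizi uses Latin script while the other types use Arabic script, a first processing step is often to check which script a text is written in. The following Python sketch is a rough heuristic under stated assumptions: it counts characters in the basic Arabic Unicode block (U+0600 to U+06FF) against ASCII letters, and the function name `guess_script` is hypothetical. A "latin" result only marks a text as an Arabizi candidate; it cannot tell Arabizi apart from English or any other Latin-script language.

```python
def guess_script(text: str) -> str:
    """Return "arabic", "latin", or "unknown" based on letter counts."""
    # Characters in the basic Arabic Unicode block.
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    # Plain ASCII letters; digits and punctuation are ignored.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if arabic == 0 and latin == 0:
        return "unknown"
    return "arabic" if arabic >= latin else "latin"


# "3arabi" is an Arabizi spelling; the escaped string is the same word in Arabic script.
print(guess_script("3arabi"))
print(guess_script("\u0639\u0631\u0628\u064A"))
```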

Major Arabic Dialects

Maghrebi

Maghrebi is the dialect spoken in informal communication across the Maghreb, including Morocco.

Egyptian

Egyptian is the dialect spoken in daily informal communication in Egypt.

Levantine

The Levantine dialect includes the Lebanese, Palestinian, Syrian, and Jordanian dialects.

Iraqi

Iraqi is the dialect used in Iraq for daily informal communication.

Gulf

Gulf dialects include the Qatari, Yemeni, Emirati, and Bahraini dialects, along with several other dialects in the region.