Authors: Wissam Antoun, Fady Baly, Hazem Hajj

 

AraBERT is an Arabic pretrained language model based on Google's BERT architecture, using the same BERT-Base configuration. More details are available in the AraBERT paper and in the AraBERT Meetup.

There are two versions of the model, AraBERTv0.1 and AraBERTv1; the difference is that AraBERTv1 uses pre-segmented text, where prefixes and suffixes are split using the Farasa Segmenter.
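For AraBERTv1, input text should be segmented the same way before tokenization. A minimal preprocessing sketch, assuming the `farasapy` Python wrapper for the Farasa Segmenter (the package and its API are an assumption here, not part of the AraBERT release):

```python
# Minimal sketch: pre-segment Arabic text with Farasa before tokenizing for AraBERTv1.
# Assumes the `farasapy` package (pip install farasapy), which needs a local Java runtime.
from farasa.segmenter import FarasaSegmenter

segmenter = FarasaSegmenter(interactive=True)  # interactive mode keeps the Farasa JVM alive between calls

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب أصبح ضروريا"
segmented = segmenter.segment(text)  # splits prefixes/suffixes, e.g. "الكتاب" -> "ال+ كتاب" (illustrative)
print(segmented)
```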

The model was trained on ~70M sentences or ~23GB of Arabic text with ~3B words.

Source Code Repository: https://github.com/aub-mind/arabert
Paper: https://www.aclweb.org/anthology/2020.osact-1.2.pdf

 

Results (Accuracy)

We evaluated both AraBERT models on different downstream tasks and compared them to mBERT and other state-of-the-art models (to the best of our knowledge). The tasks were Sentiment Analysis on 6 different datasets (HARD, ASTD-Balanced, ArsenTD-Lev, AJGT, LABR, ArSAS), Named Entity Recognition on the ANERcorp dataset, and Arabic Question Answering on Arabic-SQuAD and ARCD.

| Task | prev. SOTA | mBERT | AraBERTv0.1 | AraBERTv1 |
|---|---|---|---|---|
| HARD | 95.7 (ElJundi et al.) | 95.7 | 96.2 | 96.1 |
| ASTD | 86.5 (ElJundi et al.) | 80.1 | 92.2 | 92.6 |
| ArsenTD-Lev | 52.4 (ElJundi et al.) | 51 | 58.9 | 59.4 |
| AJGT | 93 (Dahou et al.) | 83.6 | 93.1 | 93.8 |
| LABR | 87.5 (Dahou et al.) | 83 | 85.9 | 86.7 |
| ANERcorp | 81.7 (BiLSTM-CRF) | 78.4 | 84.2 | 81.9 |
| ARCD | mBERT | EM: 34.2, F1: 61.3 | EM: 51.14, F1: 82.13 | EM: 54.84, F1: 82.15 |
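For reference, a sentiment-analysis fine-tuning step with the HuggingFace Transformers library could look like the sketch below. This is an illustration only, not the exact setup behind the numbers above; the model identifier, label set, and hyperparameters are assumptions.

```python
# Illustrative sentiment-analysis fine-tuning step for AraBERT (not the paper's exact training setup).
# The model ID, labels, and hyperparameters below are assumptions; swap in the dataset you actually use.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "aubmindlab/bert-base-arabert"  # assumed Hub ID for AraBERTv1; v1 expects Farasa-segmented input
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy labeled batch (1 = positive, 0 = negative); real runs iterate over the HARD/ASTD/... splits.
texts = ["خدمة ممتازة وموقع رائع", "تجربة سيئة ولن أكرر الزيارة"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss is computed internally
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```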

Model Weights and Vocab Download

| Models | AraBERTv0.1 | AraBERTv1 |
|---|---|---|
| TensorFlow | Drive Link | Drive Link |
| PyTorch | Drive Link | Drive Link |

You can find the PyTorch models in HuggingFace's Transformers library under the aubmindlab username.
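For example, the weights can be loaded directly through Transformers; the exact repository names under the aubmindlab namespace are assumed here, so check the Hub for the current identifiers:

```python
# Load AraBERT from the HuggingFace Hub (repository names under aubmindlab are assumed; verify on the Hub).
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabert")  # AraBERTv1
model = AutoModel.from_pretrained("aubmindlab/bert-base-arabert")

inputs = tokenizer("مرحبا بكم", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, 768) for the BERT-Base config
```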

If you use this model, please cite us as:

@inproceedings{antoun2020arabert,
  title={AraBERT: Transformer-based Model for Arabic Language Understanding},
  author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
  booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
  pages={9}
}

Acknowledgments

Thanks to the TensorFlow Research Cloud (TFRC) for free access to Cloud TPUs (we couldn't have done it without this program), and to the AUB MIND Lab members for their continuous support. Also thanks to Yakshof and Assafir for data and storage access. Another thanks to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.

Contacts

Wissam Antoun: LinkedIn | Twitter | GitHub | wfa07@mail.aub.edu | wissam.antoun@gmail.com

Fady Baly: LinkedIn | Twitter | GitHub | fgb06@mail.aub.edu | baly.fady@gmail.com

We are looking for sponsors to train BERT-Large and other Transformer models; the sponsor only needs to cover the data storage and compute costs of generating the pretraining data.