Authors: Wissam Antoun, Fady Baly, Hazem Hajj
AraBERT is an Arabic pretrained language model based on Google’s BERT architecture. AraBERT uses the same BERT-Base config. More details are available in the AraBERT PAPER and in the AraBERT Meetup
There are two versions of the model, AraBERTv0.1 and AraBERTv1; the difference is that AraBERTv1 uses pre-segmented text in which prefixes and suffixes are split using the Farasa Segmenter.
The model was trained on ~70M sentences or ~23GB of Arabic text with ~3B words.
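The practical effect of this pre-segmentation can be sketched with a toy example. The splitting rules below are a small made-up sample for illustration only, not actual Farasa output:

```python
# Toy illustration (NOT the real Farasa segmenter) of the input difference:
# AraBERTv0.1 consumes raw words, while AraBERTv1 expects clitics split off
# and marked with '+'. The prefix/suffix lists are a small made-up sample.
def toy_segment(word):
    prefixes = ("و", "ف", "ب", "ال")   # e.g. "and", "so", "by", "the"
    suffixes = ("ها", "هم", "نا")       # e.g. "her", "them", "our"
    parts = []
    stripped = True
    while stripped:                     # peel known prefixes one at a time
        stripped = False
        for p in prefixes:
            if word.startswith(p) and len(word) > len(p) + 1:
                parts.append(p + "+")
                word = word[len(p):]
                stripped = True
                break
    tail = []
    for s in suffixes:                  # peel at most one known suffix
        if word.endswith(s) and len(word) > len(s) + 1:
            tail.append("+" + s)
            word = word[: -len(s)]
            break
    return parts + [word] + tail

print(toy_segment("والكتاب"))  # ['و+', 'ال+', 'كتاب'] ("and the book")
```

AraBERTv0.1 would receive the raw word, while AraBERTv1 would receive the segmented form, so the two models use different vocabularies and are not interchangeable at inference time.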
Source Code Repository: https://github.com/aub-mind/arabert
Paper: https://www.aclweb.org/anthology/2020.osact-1.2.pdf
Results (Accuracy)
We evaluated both AraBERT models on different downstream tasks and compared them to mBERT and other state-of-the-art models (to the best of our knowledge). The tasks were Sentiment Analysis on 6 different datasets (HARD, ASTD-Balanced, ArsenTD-Lev, AJGT, LABR, ArSAS), Named Entity Recognition on ANERcorp, and Arabic Question Answering on Arabic-SQuAD and ARCD.
Task | prev. SOTA | mBERT | AraBERTv0.1 | AraBERTv1 |
---|---|---|---|---|
HARD | 95.7 (ElJundi et al.) | 95.7 | 96.2 | 96.1 |
ASTD | 86.5 (ElJundi et al.) | 80.1 | 92.2 | 92.6 |
ArsenTD-Lev | 52.4 (ElJundi et al.) | 51.0 | 58.9 | 59.4 |
AJGT | 93.0 (Dahou et al.) | 83.6 | 93.1 | 93.8 |
LABR | 87.5 (Dahou et al.) | 83.0 | 85.9 | 86.7 |
ANERcorp | 81.7 (BiLSTM-CRF) | 78.4 | 84.2 | 81.9 |
ARCD | mBERT | EM: 34.2, F1: 61.3 | EM: 51.14, F1: 82.13 | EM: 54.84, F1: 82.15 |
Model Weights and Vocab Download
Models | AraBERTv0.1 | AraBERTv1 |
---|---|---|
TensorFlow | Drive Link | Drive Link |
PyTorch | Drive Link | Drive Link |
You can find the PyTorch models in HuggingFace's Transformers library under the aubmindlab username.
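A minimal loading sketch using the Transformers library; the model ID below is assumed to follow the aubmindlab naming on the HuggingFace Hub:

```python
# Hedged sketch: loading the PyTorch weights through the Transformers library.
# The model ID is an assumption based on the aubmindlab username on the Hub;
# check the Hub page for the exact names of the v0.1 and v1 checkpoints.
from transformers import AutoModel, AutoTokenizer

model_name = "aubmindlab/bert-base-arabertv01"  # AraBERTv0.1
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# v0.1 takes raw Arabic text; the v1 checkpoint expects text already
# run through the Farasa Segmenter before tokenization.
tokens = tokenizer.tokenize("مرحبا بالعالم")
print(tokens)
```

The same pattern works for the v1 checkpoint, provided the input is pre-segmented with Farasa first.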
If you use this model, please cite us as:
    @inproceedings{antoun2020arabert,
      title={AraBERT: Transformer-based Model for Arabic Language Understanding},
      author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
      booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
      pages={9}
    }
Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs; we couldn't have done it without this program. Thanks also to the AUB MIND Lab members for their continuous support, and to Yakshof and Assafir for data and storage access. Another thanks to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.
Contacts
Wissam Antoun: Linkedin | Twitter | Github | wfa07@mail.aub.edu | wissam.antoun@gmail.com
Fady Baly: Linkedin | Twitter | Github | fgb06@mail.aub.edu | baly.fady@gmail.com
We are looking for sponsors to train BERT-Large and other Transformer models. The sponsor only needs to cover the data storage and compute costs of generating the pretraining data.