TY - GEN
T1 - Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
T2 - 1st Workshop on Language Models for Low-Resource Languages, LoResLM 2025 - co-located at the 31st International Conference on Computational Linguistics, COLING 2025
AU - Shang, Guokan
AU - Abdine, Hadi
AU - Khoubrane, Yousef
AU - Mohamed, Amr
AU - Abbahaddou, Yassine
AU - Ennadir, Sofiane
AU - Momayiz, Imane
AU - Ren, Xuguang
AU - Moulines, Eric
AU - Nakov, Preslav
AU - Vazirgiannis, Michalis
AU - Xing, Eric
N1 - Publisher Copyright:
© 2025 Association for Computational Linguistics.
PY - 2025/1/1
Y1 - 2025/1/1
AB - We introduce Atlas-Chat, the first-ever collection of LLMs specifically developed for dialectal Arabic. Focusing on Moroccan Arabic, also known as Darija, we construct our instruction dataset by consolidating existing Darija language resources, creating novel datasets both manually and synthetically, and translating English instructions with stringent quality control. Atlas-Chat-2B, 9B, and 27B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions and performing standard NLP tasks. Notably, our models outperform both state-of-the-art and Arabic-specialized LLMs like LLaMA, Jais, and AceGPT; e.g., our 9B model gains a 13% performance boost over a larger 13B model on DarijaMMLU, part of our newly introduced evaluation suite for Darija covering both discriminative and generative tasks. Furthermore, we perform an experimental analysis of various fine-tuning strategies and base model choices to determine optimal configurations. All our resources are publicly accessible, and we believe our work offers comprehensive design methodologies for instruction-tuning low-resource languages, which are often neglected in favor of data-rich languages by contemporary LLMs.
UR - https://www.scopus.com/pages/publications/105000170936
M3 - Conference contribution
AN - SCOPUS:105000170936
T3 - Proceedings - International Conference on Computational Linguistics, COLING
SP - 9
EP - 30
BT - LoResLM 2025 - 1st Workshop on Language Models for Low-Resource Languages, Proceedings of the Workshop
A2 - Hettiarachchi, Hansi
A2 - Ranasinghe, Tharindu
A2 - Rayson, Paul
A2 - Mitkov, Ruslan
A2 - Gaber, Mohamed
A2 - Premasiri, Damith
A2 - Tan, Fiona Anting
A2 - Uyangodage, Lasitha
PB - Association for Computational Linguistics (ACL)
Y2 - 20 January 2025
ER -