
Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect

Guokan Shang, Hadi Abdine, Yousef Khoubrane, Amr Mohamed, Yassine Abbahaddou, Sofiane Ennadir, Imane Momayiz, Xuguang Ren, Eric Moulines, Preslav Nakov, Michalis Vazirgiannis, Eric Xing

Research output: Chapter in a book, report, anthology or collection › Conference contribution › Peer-reviewed

Abstract

We introduce Atlas-Chat, the first-ever collection of LLMs specifically developed for dialectal Arabic. Focusing on Moroccan Arabic, also known as Darija, we construct our instruction dataset by consolidating existing Darija language resources, creating novel datasets both manually and synthetically, and translating English instructions with stringent quality control. Atlas-Chat-2B, 9B, and 27B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions and performing standard NLP tasks. Notably, our models outperform both state-of-the-art and Arabic-specialized LLMs like LLaMa, Jais, and AceGPT; e.g., our 9B model gains a 13% performance boost over a larger 13B model on DarijaMMLU, in our newly introduced evaluation suite for Darija covering both discriminative and generative tasks. Furthermore, we perform an experimental analysis of various fine-tuning strategies and base model choices to determine optimal configurations. All our resources are publicly accessible, and we believe our work offers comprehensive design methodologies of instruction-tuning for low-resource languages, which are often neglected in favor of data-rich languages by contemporary LLMs.

Original language: English
Title of host publication: LoResLM 2025 - 1st Workshop on Language Models for Low-Resource Languages, Proceedings of the Workshop
Editors: Hansi Hettiarachchi, Tharindu Ranasinghe, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith Premasiri, Fiona Anting Tan, Lasitha Uyangodage
Publisher: Association for Computational Linguistics (ACL)
Pages: 9-30
Number of pages: 22
ISBN (Electronic): 9798891762152
Publication status: Published - 1 Jan 2025
Externally published: Yes
Event: 1st Workshop on Language Models for Low-Resource Languages, LoResLM 2025 - co-located at the 31st International Conference on Computational Linguistics, COLING 2025 - Abu Dhabi, United Arab Emirates
Duration: 20 Jan 2025 → …

Publication series

Name: Proceedings - International Conference on Computational Linguistics, COLING
ISSN (Print): 2951-2093

Conference

Conference: 1st Workshop on Language Models for Low-Resource Languages, LoResLM 2025 - co-located at the 31st International Conference on Computational Linguistics, COLING 2025
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 20/01/25 → …
