From 8b5fcc25b45dc40f2627f929cd0c67837acdcf0f Mon Sep 17 00:00:00 2001
From: Mohammed Safi Ur Rahman Khan
Date: Sun, 1 Sep 2024 13:15:34 +0530
Subject: [PATCH 1/2] update name in paper 842

---
 data/xml/2024.acl.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/data/xml/2024.acl.xml b/data/xml/2024.acl.xml
index aeb7674d3d..6926788cb7 100644
--- a/data/xml/2024.acl.xml
+++ b/data/xml/2024.acl.xml
@@ -10960,7 +10960,7 @@
 <fixed-case>I</fixed-case>ndic<fixed-case>LLMS</fixed-case>uite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for <fixed-case>I</fixed-case>ndian Languages
-Mohammed Khan Indian Institute of Technology, Madras, Dhirubhai Ambani Institute Of Information and Communication Technology
+Mohammed Safi Ur Rahman Khan Indian Institute of Technology, Madras
 Priyam Mehta Gujarat Technological University Ahmedabad
 Ananth Sankar Annamalai University
 Umashankar Kumaravelan AI4Bharat

From 25d164714fbec60323ec925ecfbcd9cacac1c3eb Mon Sep 17 00:00:00 2001
From: Mohammed Safi Ur Rahman Khan
Date: Mon, 2 Sep 2024 10:14:35 +0530
Subject: [PATCH 2/2] Updating author names

---
 data/xml/2024.acl.xml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/data/xml/2024.acl.xml b/data/xml/2024.acl.xml
index 6926788cb7..f6712e1a9d 100644
--- a/data/xml/2024.acl.xml
+++ b/data/xml/2024.acl.xml
@@ -10966,12 +10966,12 @@
 Umashankar Kumaravelan AI4Bharat
 Sumanth Doddapaneni Indian Institute of Technology, Madras
 Suriyaprasaad B Indian Institute of Technology, Madras, Dhirubhai Ambani Institute Of Information and Communication Technology
-Varun G Indian Institute of Information Technology, Design and Manufacturing, Kancheepuram
+Varun Balan G Indian Institute of Information Technology, Design and Manufacturing, Kancheepuram
 Sparsh Jain Guru Gobind Singh Indraprastha University, Dhirubhai Ambani Institute Of Information and Communication Technology
 Anoop Kunchukuttan Microsoft
 Pratyush Kumar Indian Institute of Technology Madras, Dhirubhai Ambani Institute Of Information and Communication Technology
 Raj Dabre National Institute of Information and Communications Technology (NICT), National Institute of Advanced Industrial Science and Technology
-Mitesh Khapra Indian Institute of Technology, Madras, Dhirubhai Ambani Institute Of Information and Communication Technology
+Mitesh M. Khapra Indian Institute of Technology, Madras, Dhirubhai Ambani Institute Of Information and Communication Technology
 15831-15879
 Despite the considerable advancements in English LLMs, the progress in building comparable models for other languages has been hindered due to the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages, containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated manually verified data, unverified yet valuable data, and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction-fine tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and utilize LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and Wikihow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then generate non-toxic responses by feeding these toxic prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and resources released as a part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages.
 2024.acl-long.843