update author name and affiliation in paper 842 of 2024.acl #3835

Merged · 2 commits · Sep 2, 2024
6 changes: 3 additions & 3 deletions data/xml/2024.acl.xml
@@ -10960,18 +10960,18 @@
</paper>
<paper id="843">
<title><fixed-case>I</fixed-case>ndic<fixed-case>LLMS</fixed-case>uite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for <fixed-case>I</fixed-case>ndian Languages</title>
-<author><first>Mohammed</first><last>Khan</last><affiliation>Indian Institute of Technology, Madras, Dhirubhai Ambani Institute Of Information and Communication Technology</affiliation></author>
+<author><first>Mohammed Safi Ur Rahman</first><last>Khan</last><affiliation>Indian Institute of Technology, Madras</affiliation></author>
<author><first>Priyam</first><last>Mehta</last><affiliation>Gujarat Technological University Ahmedabad</affiliation></author>
<author><first>Ananth</first><last>Sankar</last><affiliation>Annamalai University</affiliation></author>
<author><first>Umashankar</first><last>Kumaravelan</last><affiliation>AI4Bharat</affiliation></author>
<author><first>Sumanth</first><last>Doddapaneni</last><affiliation>Indian Institute of Technology, Madras</affiliation></author>
<author><first>Suriyaprasaad</first><last>B</last><affiliation>Indian Institute of Technology, Madras, Dhirubhai Ambani Institute Of Information and Communication Technology</affiliation></author>
-<author><first>Varun</first><last>G</last><affiliation>Indian Institute of Information Technology, Design and Manufacturing, Kancheepuram</affiliation></author>
+<author><first>Varun Balan</first><last>G</last><affiliation>Indian Institute of Information Technology, Design and Manufacturing, Kancheepuram</affiliation></author>
<author><first>Sparsh</first><last>Jain</last><affiliation>Guru Gobind Singh Indraprastha University, Dhirubhai Ambani Institute Of Information and Communication Technology</affiliation></author>
<author><first>Anoop</first><last>Kunchukuttan</last><affiliation>Microsoft</affiliation></author>
<author><first>Pratyush</first><last>Kumar</last><affiliation>Indian Institute of Technology Madras, Dhirubhai Ambani Institute Of Information and Communication Technology</affiliation></author>
<author><first>Raj</first><last>Dabre</last><affiliation>National Institute of Information and Communications Technology (NICT), National Institute of Advanced Industrial Science and Technology</affiliation></author>
-<author><first>Mitesh</first><last>Khapra</last><affiliation>Indian Institute of Technology, Madras, Dhirubhai Ambani Institute Of Information and Communication Technology</affiliation></author>
+<author><first>Mitesh M.</first><last>Khapra</last><affiliation>Indian Institute of Technology, Madras, Dhirubhai Ambani Institute Of Information and Communication Technology</affiliation></author>
<pages>15831-15879</pages>
<abstract>Despite the considerable advancements in English LLMs, the progress in building comparable models for other languages has been hindered due to the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages, containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated manually verified data, unverified yet valuable data, and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction-fine tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and utilize LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and Wikihow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then generate non-toxic responses by feeding these toxic prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and resources released as a part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages.</abstract>
<url hash="28f6a48c">2024.acl-long.843</url>
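
The diff above is plain Anthology XML, so a correction like this can also be checked or applied with a small script. Below is a minimal sketch using Python's standard-library ElementTree, assuming the `<collection>`/`<volume>`/`<paper>` nesting that surrounds the fragment shown; the `PATH` and `FIXES` names are illustrative, and the Anthology repository has its own ingestion tooling, so this is not the project's actual workflow.

```python
# Minimal sketch (not the Anthology's own tooling): apply the three author
# corrections from this PR to data/xml/2024.acl.xml in place.
# Assumes <paper id="843"> is reachable anywhere under the root, which holds
# for the Anthology's <collection>/<volume>/<paper> nesting.
import xml.etree.ElementTree as ET

PATH = "data/xml/2024.acl.xml"  # illustrative; run from the repo root

# (old first, old last) -> (new first, new affiliation or None to keep it)
FIXES = {
    ("Mohammed", "Khan"): ("Mohammed Safi Ur Rahman",
                           "Indian Institute of Technology, Madras"),
    ("Varun", "G"): ("Varun Balan", None),
    ("Mitesh", "Khapra"): ("Mitesh M.", None),
}

tree = ET.parse(PATH)
for paper in tree.getroot().iter("paper"):
    if paper.get("id") != "843":
        continue
    for author in paper.findall("author"):
        first, last = author.find("first"), author.find("last")
        if first is None or last is None:
            continue
        fix = FIXES.get((first.text, last.text))
        if fix is None:
            continue
        new_first, new_affiliation = fix
        first.text = new_first
        if new_affiliation is not None:
            affiliation = author.find("affiliation")
            if affiliation is not None:
                affiliation.text = new_affiliation

tree.write(PATH, encoding="UTF-8", xml_declaration=True)
```

Note that a plain ElementTree round trip may normalize whitespace or entities elsewhere in the file, which is one reason a hand-edited diff, as in this PR, is the safer route for a three-line fix.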