diff --git a/data/xml/2023.icon.xml b/data/xml/2023.icon.xml
index 544ccda893..8c37b4da2c 100644
--- a/data/xml/2023.icon.xml
+++ b/data/xml/2023.icon.xml
@@ -95,7 +95,7 @@
 An Annotated Corpus for Realis Event Detection in Short Stories Written in <fixed-case>E</fixed-case>nglish and Low Resource <fixed-case>A</fixed-case>ssamese Language
 ChaitanyaKirti
 PankajChoudhury
-AshishAn
+AshishAnand
 PrithwijitGuha
 72–81
 This paper presents an annotated corpus of Assamese and English short stories for event trigger detection. This marks a pioneering endeavor in short stories, contributing to developing resources for this genre, especially in the low-resource Assamese language. In the process, 200 short stories were manually annotated in both Assamese and English. The dataset was evaluated and several models were compared for predicting events that are actually happening, i.e., realis events. However, it is expensive to develop manually annotated language resources, especially when the text requires specialist knowledge to interpret. In this regard, TagIT, an automated event annotation tool, is introduced. TagIT is designed to facilitate our objective of expanding the dataset from 200 to 1,000 stories. The best-performing model was employed in TagIT to automate the event annotation process. Extensive experiments were conducted to evaluate the quality of the expanded dataset. This study further illustrates how the combination of an automatic annotation tool and human-in-the-loop participation significantly reduces the time needed to generate a high-quality dataset.
@@ -126,7 +126,7 @@
 AlapanKuila
 SomnathJena
 SudeshnaSarkar
-ParthaChakrabarti
+Partha PratimChakrabarti
 99–119
 In today’s media landscape, where news outlets play a pivotal role in shaping public opinion, it is imperative to address the issue of sentiment manipulation within news text. News writers often inject their own biases and emotional language, which can distort the objectivity of reporting. This paper introduces a novel approach to tackle this problem by reducing the polarity of latent sentiments in news content. Drawing inspiration from adversarial attack-based sentence perturbation techniques and a prompt-based method using ChatGPT, we employ transformation constraints to modify sentences while preserving their core semantics. Using three perturbation methods—replacement, insertion, and deletion—coupled with a context-aware masked language model, we aim to maximize the desired sentiment score for targeted news aspects through a beam search algorithm. Our experiments and human evaluations demonstrate the effectiveness of these two models in achieving reduced sentiment polarity with minimal modifications while maintaining textual similarity, fluency, and grammatical correctness. Comparative analysis confirms the competitive performance of the adversarial attack-based perturbation methods and prompt-based methods, offering a promising solution to foster more objective news reporting and combat emotional language bias in the media.
 2023.icon-1.11
@@ -165,7 +165,7 @@
 Enriching Electronic Health Record with Semantic Features <fixed-case>U</fixed-case>tilising <fixed-case>P</fixed-case>retrained Transformers
-Lena AlMutair
+LenaAlMutair
 EricAtwell
 NishantRavikumar
 151–161
@@ -175,7 +175,7 @@
 Multilingual Multimodal Text Detection in <fixed-case>I</fixed-case>ndo-<fixed-case>A</fixed-case>ryan Languages
-NiharBasisth
+Nihar JyotiBasisth
 EishaHalder
 TusharSachan
 AdvaithaVetagiri
@@ -187,10 +187,10 @@
 Iterative Back Translation Revisited: An Experimental Investigation for Low-resource <fixed-case>E</fixed-case>nglish <fixed-case>A</fixed-case>ssamese Neural Machine Translation
-MazidaAhmed
+Mazida AkhtaraAhmed
 KishoreKashyap
 KuwaliTalukdar
-ParvezBoruah
+Parvez AzizBoruah
 172–179
 Back translation has been an effective strategy to leverage monolingual data on both the source and target sides. Research has opened up several ways to improve the procedure, one of which is iterative back translation, where the monolingual data is repeatedly translated and used for re-training to enhance the model. Despite its success, iterative back translation remains relatively unexplored in low-resource scenarios, particularly for rich Indic languages. This paper presents a comprehensive investigation into the application of iterative back translation to the low-resource English-Assamese language pair. A simplified version of iterative back translation is presented. This study explores various critical aspects associated with back translation, including the balance between original and synthetic data and the refinement of the target (backward) model through retraining on cleaner data. The experimental results demonstrate significant improvements in translation quality. Specifically, the simplified approach to iterative back translation yields a noteworthy +6.38 BLEU score improvement for the English-Assamese translation direction and a +4.38 BLEU score improvement for the Assamese-English translation direction. Further enhancements are observed when incorporating higher-quality, cleaner data for model retraining, highlighting the potential of iterative back translation as a valuable tool for enhancing low-resource neural machine translation (NMT).
 2023.icon-1.17
@@ -231,9 +231,9 @@
 Neural Machine Translation for a Low Resource Language Pair: <fixed-case>E</fixed-case>nglish-<fixed-case>B</fixed-case>odo
-ParvezBoruah
+Parvez AzizBoruah
 KuwaliTalukdar
-MazidaAhmed
+Mazida AkhtaraAhmed
 KishoreKashyap
 295–300
 This paper presents work on Neural Machine Translation for the English-Bodo language pair. English is a language spoken around the world, whereas Bodo is mostly spoken in the North Eastern area of India. This machine translation work is done on a relatively small parallel dataset, as little parallel corpus is available for the English-Bodo pair. The corpus is generally taken from available sources, namely the National Platform of Language Technology (NPLT), Data Management Unit (DMU), Mission Bhashini, Ministry of Electronics and Information Technology, and is also generated internally in-house. Tokenization of raw text is done using the IndicNLP library for Bodo and mosesdecoder for English. Subword tokenization is performed using BPE (Byte Pair Encoding), SentencePiece, and WordPiece. Experiments have been done with two different vocabulary sizes, 8000 and 16000, on a total of around 92,410 parallel sentences. Two standard transformer encoder-decoder models with varying numbers of layers and hidden sizes are built for training on the data using the OpenNMT-py framework. The results are evaluated based on BLEU score on an additional test set. The highest BLEU scores of 11.01 and 14.62 are achieved on the test set for English-to-Bodo and Bodo-to-English translation respectively.
@@ -465,7 +465,7 @@
 Mitigating Clickbait: An Approach to Spoiler Generation Using Multitask Learning
 SayantanPal
 SouvikDas
-RohiniSrihari
+RohiniK. Srihari
 486–490
 With the increasing number of users on social media platforms, the detection and categorization of abusive comments have become crucial, necessitating effective strategies to mitigate their impact on online discussions. However, the intricate and diverse nature of low-resource Indic languages presents a challenge in developing reliable detection methodologies. This research focuses on the task of classifying YouTube comments written in the Tamil language into various categories. To achieve this, we conducted experiments utilizing various multilingual transformer-based models along with data augmentation approaches involving back translation and other pre-processing techniques. Our work provides valuable insights into the effectiveness of various preprocessing methods for this classification task. Our experiments showed that the Multilingual Representations for Indian Languages (MuRIL) transformer model, coupled with round-trip translation and lexical replacement, yielded the most promising results, showcasing a significant improvement of over 15 units in macro F1-score compared to existing baselines. This contribution adds to the ongoing research to mitigate the adverse impact of abusive content on online platforms, emphasizing the utilization of diverse preprocessing strategies and state-of-the-art language models.
 2023.icon-1.43
@@ -516,7 +516,7 @@
 A Survey of using Large Language Models for Generating Infrastructure as Code
-GaneshK. Srivatsa
+Kalahasti GaneshSrivatsa
 SabyasachiMukhopadhyay
 GaneshKatrapati
 ManishShrivastava
@@ -580,7 +580,7 @@
 Mytho-Annotator: An Annotation tool for <fixed-case>I</fixed-case>ndian <fixed-case>H</fixed-case>indu Mythology
 ApurbaPaul
 AnupamMondal
-SainikMahata
+Sainik KumarMahata
 SrijanSeal
 PrasunSarkar
 DipankarDas
@@ -591,9 +591,9 @@
 Transformer-based <fixed-case>B</fixed-case>engali Textual Emotion Recognition
-AtabuzzamanMd.
-Maksuda BilkisBaby
-ShajalalMd.
+Md.Atabuzzaman
+Mst Maksuda BilkisBaby
+Md.Shajalal
 579–587
 Emotion recognition for high-resource languages has progressed significantly. However, resource-constrained languages such as Bengali have not advanced notably due to the lack of large benchmark datasets. Besides this, the scarcity of Bengali language processing tools makes the emotion recognition task more challenging and complicated. Therefore, in this paper we developed the largest such dataset, consisting of almost 12k Bengali texts labeled with six basic emotions. Then, we conducted experiments on our dataset to establish baseline performance, applying machine learning, deep learning, and transformer-based models as emotion classifiers. The experimental results demonstrate that the models achieved promising performance in Bengali emotion recognition.
 2023.icon-1.55
@@ -621,9 +621,9 @@
 Abstractive <fixed-case>H</fixed-case>indi Text Summarization: A Challenge in a Low-Resource Setting
-DaisyLal
+Daisy MonikaLal
 PaulRayson
-KrishnaSingh
+Krishna PratapSingh
 Uma ShankerTiwary
 603–612
 The Internet has led to a surge in text data in Indian languages; hence, text summarization tools have become essential for information retrieval. Due to a lack of data resources, prevailing summarization systems in Indian languages have been primarily dependent on and derived from English text summarization approaches. Despite Hindi being the most widely spoken language in India, progress in Hindi summarization is being delayed due to the lack of properly labeled datasets. In this preliminary work, we address two major challenges in abstractive Hindi text summarization: creating Hindi language summaries and assessing the efficacy of the produced summaries. Since transfer learning (TL) has been shown to be effective in low-resource settings, we assess the effectiveness of a TL-based approach for summarizing Hindi text by performing a comparative analysis using three encoder-decoder models: attention-based (BASE), multi-level (MED), and TL-based (RETRAIN). In relation to the second challenge, we introduce the ICE-H evaluation metric, based on the ICE metric for assessing English language summaries. The Rouge and ICE-H metrics are used for evaluating the BASE, MED, and RETRAIN models. According to the Rouge results, the RETRAIN model produces slightly better abstracts than the BASE and MED models for 20k and 100k training samples. The ICE-H metric, on the other hand, produces inconclusive results, which may be attributed to the limitations of existing Hindi NLP resources, such as word embeddings and POS taggers.
@@ -671,7 +671,7 @@
 Transfer learning in low-resourced <fixed-case>MT</fixed-case>: An empirical study
-SainikMahata
+Sainik KumarMahata
 DipanjanSaha
 DipankarDas
 SivajiBandyopadhyay
@@ -735,8 +735,8 @@
 A Baseline System for <fixed-case>K</fixed-case>hasi and <fixed-case>A</fixed-case>ssamese Bidirectional <fixed-case>NMT</fixed-case> with Zero available Parallel Data: Dataset Creation and System Development
 KishoreKashyap
 KuwaliTalukdar
-MazidaAhmed
-ParvezBoruah
+Mazida AkhtaraAhmed
+Parvez AzizBoruah
 696–702
 In this work we have tried to build a baseline Neural Machine Translation system for Khasi and Assamese in both directions. Both languages are considered low-resourced Indic languages. As far as the language families are concerned, Assamese is a language from the Indo-Aryan family, while Khasi belongs to the Mon-Khmer branch of the Austroasiatic language family. No prior work has investigated the performance of Neural Machine Translation for these two diverse low-resourced languages. It is also worth mentioning that no parallel corpus or test data is available for these two languages. The main contribution of this work is the creation of a Khasi-Assamese parallel corpus and test set. Apart from this, we also created baseline systems in both directions for the said language pair. We obtained a best bilingual evaluation understudy (BLEU) score of 2.78 for the Khasi to Assamese translation direction and 5.51 for the Assamese to Khasi translation direction. We then applied the phrase table injection (phrase augmentation) technique and obtained new, higher BLEU scores of 5.01 and 7.28 for the Khasi to Assamese and Assamese to Khasi translation directions respectively.
 2023.icon-1.69
@@ -758,8 +758,8 @@
 Shikhar KumarSarma
 FarhaNaznin
 KishoreKashyap
-MazidaAhmed
-ParvezBoruah
+Mazida AkhtaraAhmed
+Parvez AzizBoruah
 714–719
 Impressive results have been reported in various works related to low-resource languages using Neural Machine Translation (NMT), where the size of the parallel dataset is relatively small. This work presents a Machine Translation experiment on the low-resource Indian language pair Assamese-Bodo, with a relatively small amount of parallel data. Tokenization of raw data is done with the IndicNLP tool. The NMT model is trained on the preprocessed dataset, and model performance has been observed with varying hyperparameters. Experiments have been completed with vocabulary sizes of 8000 and 16000. A significant increase in BLEU score has been observed on doubling the vocabulary size. Increasing the data size has also contributed to enhanced overall performance. BLEU scores have been recorded with training on a dataset of 70000 parallel sentences, and the results are compared with another round of training on a dataset enhanced with 11500 WordNet parallel sentences. A gold-standard test set of 500 sentences has been used for recording BLEU. The first round reported an overall BLEU of 4.0, with a vocabulary size of 8000. With the same vocabulary size and the WordNet-enhanced dataset, a BLEU score of 4.33 was recorded. A significant increase in BLEU score (6.94) has been observed with a vocabulary size of 16000. The next round of experiments was done with an additional 7000 new sentence pairs and filtering of the entire dataset. The new BLEU recorded was 9.68, with a 16000 vocabulary size. Cross-validation has also been designed and performed, with an experiment on 8-fold data chunks prepared from the 80K total dataset. Impressive BLEU scores (fold-1 through fold-8) of 18.12, 16.28, 18.90, 19.25, 19.60, 18.43, 16.28, and 7.70 have been recorded. The 8th-fold BLEU deviated from the trend, possibly because of non-homogeneous data in the last fold.
 2023.icon-1.71
@@ -771,7 +771,7 @@
 GargiRoy
 AmitBarman
 IndranilDutta
-SudipNaskar
+Sudip KumarNaskar
 720–728
 With the recent surge and exponential growth of social media usage, scrutinizing social media content for the presence of any hateful content is of utmost importance. Researchers have been working diligently for the past decade on distinguishing between content that promotes hatred and content that does not. Traditionally, the main focus has been on analyzing textual content. However, recent research attempts have also commenced on the identification of audio-based content. Nevertheless, studies have shown that relying solely on audio or text-based content may be ineffective, as a recent upsurge indicates that individuals often employ sarcasm in their speech and writing. To overcome these challenges, we present an approach to identify whether a speech promotes hate or not, utilizing both audio and textual representations. Our methodology is based on the Transformer framework and incorporates both audio and text sampling, accompanied by our very own layer called “Attentive Fusion”. The results of our study surpassed previous state-of-the-art techniques, achieving an impressive macro F1 score of 0.927 on the test set.
 2023.icon-1.72