
Adds ACL'23 and EACL'24 Awards #3286

Merged · 6 commits · May 10, 2024
Changes from 4 commits
65 changes: 64 additions & 1 deletion data/xml/2023.acl.xml

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/xml/2023.blackboxnlp.xml
@@ -112,6 +112,7 @@
<url hash="ef09913d">2023.blackboxnlp-1.8</url>
<bibkey>sun-hewitt-2023-character</bibkey>
<doi>10.18653/v1/2023.blackboxnlp-1.8</doi>
<award>Outstanding Paper Award</award>
mjpost marked this conversation as resolved.
</paper>
<paper id="9">
<title>Unveiling Multilinguality in Transformer Models: Exploring Language Specificity in Feed-Forward Networks</title>
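Every hunk in this PR follows the same pattern: a plain `<award>` element is appended as the last child of the relevant `<paper>` entry. For illustration only, here is a minimal lxml sketch of that edit — the helper name is made up, and the PR itself was authored by editing the XML directly:

```python
# Sketch only: append an <award> child to one <paper> entry in an
# Anthology data file, matching the layout shown in the hunks here.
from lxml import etree

def add_award(xml_path: str, paper_id: str, award_text: str) -> None:
    tree = etree.parse(xml_path)
    paper = tree.find(f'.//paper[@id="{paper_id}"]')
    if paper is None:
        raise ValueError(f"no <paper id={paper_id!r}> in {xml_path}")
    award = etree.SubElement(paper, "award")  # appended as the last child
    award.text = award_text
    tree.write(xml_path, encoding="UTF-8", xml_declaration=True)

# e.g. the blackboxnlp entry above:
add_award("data/xml/2023.blackboxnlp.xml", "8", "Outstanding Paper Award")
```

One caveat: round-tripping through lxml can reflow whitespace, so a real edit to these files is safer done by hand or with the repository's own tooling.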
11 changes: 11 additions & 0 deletions data/xml/2024.eacl.xml
@@ -77,6 +77,7 @@
<abstract>Natural Language Processing (NLP) research is increasingly focusing on the use of Large Language Models (LLMs), with some of the most popular ones being either fully or partially closed-source. The lack of access to model details, especially regarding training data, has repeatedly raised concerns about data contamination among researchers. Several attempts have been made to address this issue, but they are limited to anecdotal evidence and trial and error. Additionally, they overlook the problem of indirect data leaking, where models are iteratively improved by using data coming from users. In this work, we conduct the first systematic analysis of work using OpenAI’s GPT-3.5 and GPT-4, the most prominently used LLMs today, in the context of data contamination. By analysing 255 papers and considering OpenAI’s data usage policy, we extensively document the amount of data leaked to these models during the first year after the model’s release. We report that these models have been globally exposed to ∼4.7M samples from 263 benchmarks. At the same time, we document a number of evaluation malpractices emerging in the reviewed papers, such as unfair or missing baseline comparisons and reproducibility issues. We release our results as a collaborative project on https://leak-llm.github.io/, where other researchers can contribute to our efforts.</abstract>
<url hash="59105272">2024.eacl-long.5</url>
<bibkey>balloccu-etal-2024-leak</bibkey>
<award>Best Non-publicized Paper Award</award>
</paper>
<paper id="6">
<title>Archer: A Human-Labeled Text-to-<fixed-case>SQL</fixed-case> Dataset with Arithmetic, Commonsense and Hypothetical Reasoning</title>
@@ -329,6 +330,7 @@
<abstract>Recent studies of the emergent capabilities of transformer-based Natural Language Understanding (NLU) models have indicated that they have an understanding of lexical and compositional semantics. We provide evidence that suggests these claims should be taken with a grain of salt: we find that state-of-the-art Natural Language Inference (NLI) models are sensitive towards minor semantics preserving surface-form variations, which lead to sizable inconsistent model decisions during inference. Notably, this behaviour differs from valid and in-depth comprehension of compositional semantics, however does neither emerge when evaluating model accuracy on standard benchmarks nor when probing for syntactic, monotonic, and logically robust reasoning. We propose a novel framework to measure the extent of semantic sensitivity. To this end, we evaluate NLI models on adversarially generated examples containing minor semantics-preserving surface-form input noise. This is achieved using conditional text generation, with the explicit condition that the NLI model predicts the relationship between the original and adversarial inputs as a symmetric equivalence entailment. We systematically study the effects of the phenomenon across NLI models for <tex-math>\text{\emph{in-}}</tex-math> and <tex-math>\text{\emph{out-of-}}</tex-math> domain settings. Our experiments show that semantic sensitivity causes performance degradations of 12.92% and 23.71% average over <tex-math>\text{\emph{in-}}</tex-math> and <tex-math>\text{\emph{out-of-}}</tex-math> domain settings, respectively. We further perform ablation studies, analysing this phenomenon across models, datasets, and variations in inference and show that semantic sensitivity can lead to major inconsistency within model predictions.</abstract>
<url hash="98c94193">2024.eacl-long.27</url>
<bibkey>arakelyan-etal-2024-semantic</bibkey>
<award>Outstanding Paper Award</award>
</paper>
<paper id="28">
<title>Exploring the Robustness of Task-oriented Dialogue Systems for Colloquial <fixed-case>G</fixed-case>erman Varieties</title>
@@ -587,6 +589,7 @@
<abstract>How can NLP/AI practitioners engage with oral societies and develop locally appropriate language technologies? We report on our experience of working together over five years in a remote community in the far north of Australia, and how we prototyped simple language technologies to support our collaboration. We navigated different understandings of language, the functional differentiation of oral vs institutional languages, and the distinct technology opportunities for each. Our collaboration unsettled the first author’s western framing of language as data for exploitation by machines, and we devised a design pattern that seems better aligned with local interests and aspirations. We call for new collaborations on the design of locally appropriate technologies for languages with primary orality.</abstract>
<url hash="746b433a">2024.eacl-long.50</url>
<bibkey>bird-yibarbuk-2024-centering</bibkey>
<award>Outstanding Paper Award</award>
</paper>
<paper id="51">
<title>Improving the <fixed-case>TENOR</fixed-case> of Labeling: Re-evaluating Topic Models for Content Analysis</title>
@@ -614,6 +617,7 @@
<abstract>We conducted a detailed analysis on the quality of web-mined corpora for two low-resource languages (making three language pairs, English-Sinhala, English-Tamil and Sinhala-Tamil). We ranked each corpus according to a similarity measure and carried out an intrinsic and extrinsic evaluation on different portions of this ranked corpus. We show that there are significant quality differences between different portions of web-mined corpora and that the quality varies across languages and datasets. We also show that, for some web-mined datasets, Neural Machine Translation (NMT) models trained with their highest-ranked 25k portion can be on par with human-curated datasets.</abstract>
<url hash="fc413046">2024.eacl-long.52</url>
<bibkey>ranathunga-etal-2024-quality</bibkey>
<award>Low-Resource Paper Award</award>
</paper>
<paper id="53">
<title><fixed-case>VOLTAGE</fixed-case>: A Versatile Contrastive Learning based <fixed-case>OCR</fixed-case> Methodology for ultra low-resource scripts through Auto Glyph Feature Extraction</title>
@@ -827,6 +831,7 @@
<bibkey>le-bronnec-etal-2024-locost</bibkey>
<revision id="1" href="2024.eacl-long.69v1" hash="7d28d719"/>
<revision id="2" href="2024.eacl-long.69v2" hash="e4b40864" date="2024-03-21">Add an extra acknowlegement.</revision>
<award>Best Paper Award</award>
</paper>
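Note the placement in the LOCOST hunk above: the new `<award>` line lands after the `<revision>` history, so the award stays the final child of the entry. A throwaway consistency check along those lines (a sketch, not part of the Anthology tooling) could be:

```python
# Sketch: flag any <paper> whose <award> precedes a <revision>, i.e.
# any entry that deviates from the ordering used in this diff.
from lxml import etree

def misordered_awards(xml_path: str) -> list[str]:
    tree = etree.parse(xml_path)
    bad = []
    for paper in tree.iter("paper"):
        tags = [child.tag for child in paper]
        if "award" in tags and "revision" in tags:
            last_revision = max(i for i, t in enumerate(tags) if t == "revision")
            if tags.index("award") < last_revision:
                bad.append(paper.get("id"))
    return bad

print(misordered_awards("data/xml/2024.eacl.xml"))  # expected: []
```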
<paper id="70">
<title>A Classification-Guided Approach for Adversarial Attacks against Neural Machine Translation</title>
@@ -995,6 +1000,7 @@
<abstract>Large language models (LLMs) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries. However, this has also raised concerns about the potential misuse of such texts in journalism, education, and academia. In this study, we strive to create automated systems that can detect machine-generated texts and pinpoint potential misuse. We first introduce a large-scale benchmark M4, which is a multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. Through an extensive empirical study of this dataset, we show that it is challenging for detectors to generalize well on instances from unseen domains or LLMs. In such cases, detectors tend to misclassify machine-generated text as human-written. These results show that the problem is far from solved and that there is a lot of room for improvement. We believe that our dataset will enable future research towards more robust approaches to this pressing societal problem. The dataset is available at https://github.com/mbzuai-nlp/M4</abstract>
<url hash="3ba47fcc">2024.eacl-long.83</url>
<bibkey>wang-etal-2024-m4</bibkey>
<award>Resource Paper Award</award>
</paper>
<paper id="84">
<title>A Truly Joint Neural Architecture for Segmentation and Parsing</title>
@@ -1026,6 +1032,7 @@
<abstract>Recently, continuous diffusion models (CDM) have been introduced into non-autoregressive (NAR) text-to-text generation. However, the discrete nature of text increases the difficulty of CDM to generate coherent and fluent texts, and also causes the incompatibility problem between CDM and advanced NLP techniques, especially the popular pre-trained language models (PLMs). To solve it, we propose Diffusion-NAT, which introduces discrete diffusion models (DDM) into NAR text-to-text generation and integrates BART to improve the performance. By revising the decoding process of BART and the typical settings of DDM, we unify the inference process of BART and the denoising process of DDM into the same NAR masked tokens recovering task. In this way, DDM can rely on BART to perform denoising, which can benefit from both the rich pre-learned knowledge of BART and the iterative refining paradigm of DDM. Besides, we also propose the iterative self-prompting strategy to further improve the generation quality. Experimental results on 7 datasets show that our approach can outperform competitive NAR methods, and even surpass autoregressive methods. Our code and data are released at <url>https://github.com/RUCAIBox/DiffusionNAT</url>.</abstract>
<url hash="3e9bbf92">2024.eacl-long.86</url>
<bibkey>zhou-etal-2024-diffusion</bibkey>
<award>Evaluation and Model Insight Award</award>
</paper>
<paper id="87">
<title>Unleashing the Power of Discourse-Enhanced Transformers for Propaganda Detection</title>
@@ -1183,6 +1190,7 @@
<bibkey>senel-etal-2024-kardes</bibkey>
<revision id="1" href="2024.eacl-long.100v1" hash="e31fcfaf"/>
<revision id="2" href="2024.eacl-long.100v2" hash="9e3df1c2" date="2024-03-25">The revision changes the title due to ethical considerations.</revision>
<award>Outstanding Paper Award</award>
</paper>
<paper id="101">
<title>Chaining Event Spans for Temporal Relation Grounding</title>
@@ -1873,6 +1881,7 @@
<abstract>Annotators’ sociodemographic backgrounds (i.e., the individual compositions of their gender, age, educational background, etc.) have a strong impact on their decisions when working on subjective NLP tasks, such as toxic language detection. Often, heterogeneous backgrounds result in high disagreements. To model this variation, recent work has explored sociodemographic prompting, a technique, which steers the output of prompt-based models towards answers that humans with specific sociodemographic profiles would give. However, the available NLP literature disagrees on the efficacy of this technique — it remains unclear for which tasks and scenarios it can help, and the role of the individual factors in sociodemographic prompting is still unexplored. We address this research gap by presenting the largest and most comprehensive study of sociodemographic prompting today. We use it to analyze its influence on model sensitivity, performance and robustness across seven datasets and six instruction-tuned model families. We show that sociodemographic information affects model predictions and can be beneficial for improving zero-shot learning in subjective NLP tasks. However, its outcomes largely vary for different model types, sizes, and datasets, and are subject to large variance with regards to prompt formulations. Most importantly, our results show that sociodemographic prompting should be used with care when used for data annotation or studying LLM alignment.</abstract>
<url hash="b5ed6d7b">2024.eacl-long.159</url>
<bibkey>beck-etal-2024-sensitivity</bibkey>
<award>Social Impact Award</award>
</paper>
<paper id="160">
<title>Threat Behavior Textual Search by Attention Graph Isomorphism</title>
@@ -2111,6 +2120,7 @@
<abstract>The field of machine learning (ML) has gained widespread adoption, leading to significant demand for adapting ML to specific scenarios, which is yet expensive and non-trivial. The predominant approaches towards the automation of solving ML tasks (e.g., AutoML) are often time-consuming and hard to understand for human developers. In contrast, though human engineers have the incredible ability to understand tasks and reason about solutions, their experience and knowledge are often sparse and difficult to utilize by quantitative approaches. In this paper, we aim to bridge the gap between machine intelligence and human knowledge by introducing a novel framework MLCopilot, which leverages the state-of-the-art large language models to develop ML solutions for novel tasks. We showcase the possibility of extending the capability of LLMs to comprehend structured inputs and perform thorough reasoning for solving novel ML tasks. And we find that, after some dedicated design, the LLM can (i) observe from the existing experiences of ML tasks and (ii) reason effectively to deliver promising results for new tasks. The solution generated can be used directly to achieve high levels of competitiveness.</abstract>
<url hash="24fa996c">2024.eacl-long.179</url>
<bibkey>zhang-etal-2024-mlcopilot</bibkey>
<award>Outstanding Paper Award</award>
</paper>
<paper id="180">
<title>Text-Guided Image Clustering</title>
@@ -2745,6 +2755,7 @@
<abstract>We have deployed an LLM-based spoken dialogue system in a real hospital. The ARI social robot embodies our system, which patients and their companions can have multi-party conversations with together. In order to enable this multi-party ability, multimodality is critical. Our system, therefore, receives speech and video as input, and generates both speech and gestures (arm, head, and eye movements). In this paper, we describe our complex setting and the architecture of our dialogue system. Each component is detailed, and a video of the full system is available with the appropriate components highlighted in real-time. Our system decides when it should take its turn, generates human-like clarification requests when the patient pauses mid-utterance, answers in-domain questions (grounding to the in-prompt knowledge), and responds appropriately to out-of-domain requests (like generating jokes or quizzes). This latter feature is particularly remarkable as real patients often utter unexpected sentences that could not be handled previously.</abstract>
<url hash="9e81e1d3">2024.eacl-demo.8</url>
<bibkey>addlesee-etal-2024-multi</bibkey>
<award>Best Demo Award</award>
</paper>
<paper id="9">
<title><fixed-case>S</fixed-case>cam<fixed-case>S</fixed-case>pot: Fighting Financial Fraud in <fixed-case>I</fixed-case>nstagram Comments</title>
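Many titles in these hunks carry `<fixed-case>` markup (the ScamSpot entry above is a dense example), which the Anthology uses to protect casing when titles are exported to BibTeX. A toy illustration of that mapping — not the Anthology's actual exporter:

```python
# Toy sketch: lower <fixed-case> spans to BibTeX brace protection,
# the casing convention this markup exists to preserve.
def fixed_case_to_braces(title_xml: str) -> str:
    return title_xml.replace("<fixed-case>", "{").replace("</fixed-case>", "}")

print(fixed_case_to_braces(
    "<fixed-case>S</fixed-case>cam<fixed-case>S</fixed-case>pot: Fighting "
    "Financial Fraud in <fixed-case>I</fixed-case>nstagram Comments"
))
# -> {S}cam{S}pot: Fighting Financial Fraud in {I}nstagram Comments
```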
1 change: 1 addition & 0 deletions data/xml/D13.xml
@@ -1630,6 +1630,7 @@
<pwcdataset url="https://paperswithcode.com/dataset/sst">SST</pwcdataset>
<pwcdataset url="https://paperswithcode.com/dataset/sst-2">SST-2</pwcdataset>
<pwcdataset url="https://paperswithcode.com/dataset/sst-5">SST-5</pwcdataset>
<award>Test of Time Paper Award</award>
</paper>
<paper id="171">
<title>Open Domain Targeted Sentiment</title>
1 change: 1 addition & 0 deletions data/xml/J98.xml
@@ -45,6 +45,7 @@
<pages>97-123</pages>
<url hash="c38b0492">J98-1004</url>
<bibkey>schutze-1998-automatic</bibkey>
<award>Test of Time Paper Award</award>
</paper>
<paper id="5">
<title>Disambiguating Highly Ambiguous Words</title>
1 change: 1 addition & 0 deletions data/xml/P13.xml
@@ -1848,6 +1848,7 @@
<pages>92–97</pages>
<url hash="6502fbd8">P13-2017</url>
<bibkey>mcdonald-etal-2013-universal</bibkey>
<award>Test of Time Paper Award</award>
</paper>
<paper id="18">
<title>An Empirical Examination of Challenges in <fixed-case>C</fixed-case>hinese Parsing</title>
1 change: 1 addition & 0 deletions data/xml/P98.xml
@@ -1106,6 +1106,7 @@
<pages>704–710</pages>
<url hash="dd5ff5cd">P98-1116</url>
<bibkey>langkilde-knight-1998-generation-exploits</bibkey>
<award>Test of Time Paper Award</award>
</paper>
<paper id="117">
<title>Methods and Practical Issues in Evaluating Alignment Techniques</title>
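The four legacy files (D13, J98, P13, P98) each gain exactly one Test of Time Paper Award. A quick pass to confirm all of them landed — again a sketch, with the file paths taken from this diff:

```python
# Sketch: list every Test of Time award across the legacy volumes.
from lxml import etree

for path in ("data/xml/D13.xml", "data/xml/J98.xml",
             "data/xml/P13.xml", "data/xml/P98.xml"):
    tree = etree.parse(path)
    for award in tree.iterfind(".//paper/award"):
        if award.text == "Test of Time Paper Award":
            paper = award.getparent()  # lxml-specific parent accessor
            print(path, paper.get("id"), paper.findtext("bibkey"))
```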