diff --git a/data/xml/2021.emnlp.xml b/data/xml/2021.emnlp.xml
index 38a5dc7d95..affc6bd715 100644
--- a/data/xml/2021.emnlp.xml
+++ b/data/xml/2021.emnlp.xml
@@ -5461,6 +5461,7 @@
chen-etal-2021-websrc
10.18653/v1/2021.emnlp-main.343
+ WebSRC
SQuAD
diff --git a/data/xml/2022.acl.xml b/data/xml/2022.acl.xml
index 10647dbf94..cee8ceb0ef 100644
--- a/data/xml/2022.acl.xml
+++ b/data/xml/2022.acl.xml
@@ -6800,6 +6800,7 @@ in the Case of Unambiguous Gender
2022.acl-long.420
li-etal-2022-markuplm
10.18653/v1/2022.acl-long.420
+ WebSRC
CLIP Models are Few-Shot Learners: Empirical Studies on VQA and Visual Entailment
@@ -9634,6 +9635,7 @@ in the Case of Unambiguous Gender
10.18653/v1/2022.acl-long.593
pytorch/fairseq
LibriSpeech
+ SALMon
Lite Unified Modeling for Discriminative Reading Comprehension
diff --git a/data/xml/2022.findings.xml b/data/xml/2022.findings.xml
index 876cc587fd..06ca74ec66 100644
--- a/data/xml/2022.findings.xml
+++ b/data/xml/2022.findings.xml
@@ -2295,7 +2295,7 @@
dabre-etal-2022-indicbart
10.18653/v1/2022.findings-acl.145
- AI4Bharat/indic-bart
+ AI4Bharat/indic-bart
FLoRes
FLoRes-101
IndicCorp
@@ -6013,7 +6013,7 @@
guo-etal-2022-longt5
10.18653/v1/2022.findings-naacl.55
- google-research/longt5
+ google-research/longt5
Arxiv HEP-TH citation graph
BigPatent
CNN/Daily Mail
diff --git a/data/xml/2022.lrec.xml b/data/xml/2022.lrec.xml
index f6fb2d5788..1f0716e036 100644
--- a/data/xml/2022.lrec.xml
+++ b/data/xml/2022.lrec.xml
@@ -7182,6 +7182,7 @@
2022.lrec-1.577
jauhar-etal-2022-ms
microsoft/ms-latte
+ 10,000 People - Human Pose Recognition Data
KazakhTTS2: Extending the Open-Source Kazakh TTS Corpus With More Data, Speakers, and Topics
diff --git a/data/xml/2022.naacl.xml b/data/xml/2022.naacl.xml
index 6acf3a1f8d..482f17be72 100644
--- a/data/xml/2022.naacl.xml
+++ b/data/xml/2022.naacl.xml
@@ -2151,6 +2151,7 @@
10.18653/v1/2022.naacl-main.132
x-lance/tie
+ WebSRC
RSTGen: Imbuing Fine-Grained Interpretable Control into Long-FormText Generators
diff --git a/data/xml/2024.aiwolfdial.xml b/data/xml/2024.aiwolfdial.xml
new file mode 100644
index 0000000000..59d1c8234b
--- /dev/null
+++ b/data/xml/2024.aiwolfdial.xml
@@ -0,0 +1,104 @@
+
+
+
+
+ Proceedings of the 2nd International AIWolfDial Workshop
+ YoshinobuKano
+ Association for Computational Linguistics
+ Tokyo, Japan
+ September
+ 2024
+ 2024.aiwolfdial-1
+ aiwolfdial
+ ws
+
+
+ 2024.aiwolfdial-1.0
+ aiwolfdial-2024-1
+
+
+ AIWolfDial 2024: Summary of Natural Language Division of 6th International AIWolf Contest
+ YoshinobuKano
+ YutoSahashi
+ NeoWatanabe
+ KaitoKagaminuma
+ ClausAranha
+ DaisukeKatagami
+ KeiHarada
+ MichimasaInaba
+ TakeshiIto
+ HirotakaOsawa
+ TakashiOtsuki
+ FujioToriumi
+ 1–12
+ We held our 6th annual AIWolf international contest to automatically play the Werewolf game “Mafia”, in which players try to find liars through conversation. The contest aims to promote the development of agents capable of more natural, higher-level conversation, involving longer contexts, personal relationships, semantics, pragmatics, and logic, and to reveal the capabilities and limits of generative AI. In the Natural Language Division of the contest, eight Japanese-speaking agent teams and five English-speaking agents played games against each other. Using the game logs, we performed human subjective evaluations, computed win rates, and conducted detailed log analysis. We found that overall system performance has improved substantially over the previous year, owing to recent advances in LLMs. Several new ideas emerged for improving how LLMs are used, such as summarization, characterization, and logic maintained outside the LLM. However, the systems are far from perfect: the generated talks are sometimes inconsistent with the game actions. Our future work includes revealing whether LLMs can handle the duality of the “liar”, in other words, holding a “true” and a “false” circumstance of the agent at the same time, as well as modeling how these circumstances appear to other agents.
+ 2024.aiwolfdial-1.1
+ kano-etal-2024-aiwolfdial
+
+
+ Text Generation Indistinguishable from Target Person by Prompting Few Examples Using LLM
+ YukaTsubota
+ YoshinobuKano
+ 13–20
+ To achieve smooth and natural communication between a dialogue system and a human, the dialogue system needs to behave in a more human-like manner. Recreating the personality of an actual person can be an effective way to achieve this. This study proposes a method to recreate a personality with a large language model (generative AI) without training, using only prompting techniques to keep the creation cost as low as possible. Collecting a large amount of dialogue data from a specific person is not easy and requires a significant amount of time for training. Therefore, we aim to recreate the personality of a specific individual without using dialogue data. The personality referred to in this paper denotes the image of a person that can be determined solely from the input and output of text dialogues. Our experiments revealed that prompts combining profile information, responses to a few questions, and speaking characteristics extracted from those responses can improve the reproducibility of a specific individual’s personality.
+ 2024.aiwolfdial-1.2
+ tsubota-kano-2024-text
+
+
+ Werewolf Game Agent by Generative AI Incorporating Logical Information Between Players
+ NeoWatanabe
+ YoshinobuKano
+ 21–29
+ In recent years, AI models based on GPT have advanced rapidly. These models are capable of generating text, translating between different languages, and answering questions with high accuracy. However, the process behind their outputs remains a black box, making it difficult to ascertain the data influencing their responses. These AI models do not always produce accurate outputs and are known for generating incorrect information, known as hallucinations, whose causes are hard to pinpoint. Moreover, they still face challenges in solving complex problems that require step-by-step reasoning, despite various improvements like the Chain-of-Thought approach. There’s no guarantee that these models can independently perform logical reasoning from scratch, raising doubts about the reliability and accuracy of their inferences. To address these concerns, this study proposes the incorporation of an explicit logical structure into the AI’s text generation process. As a validation experiment, a text-based agent capable of playing the Werewolf game, which requires deductive reasoning, was developed using GPT-4. By comparing the model combined with an external explicit logical structure and a baseline that lacks such a structure, the proposed method demonstrated superior reasoning capabilities in subjective evaluations, suggesting the effectiveness of adding an explicit logical framework to the conventional AI models.
+ 2024.aiwolfdial-1.3
+ watanabe-kano-2024-werewolf
+
+
+ Enhancing Dialogue Generation in Werewolf Game Through Situation Analysis and Persuasion Strategies
+ ZhiyangQi
+ MichimasaInaba
+ 30–39
+ Recent advancements in natural language processing, particularly with large language models (LLMs) like GPT-4, have significantly enhanced dialogue systems, enabling them to generate more natural and fluent conversations. Despite these improvements, challenges persist, such as managing continuous dialogues, memory retention, and minimizing hallucinations. The AIWolfDial2024 addresses these challenges by employing the Werewolf Game, an incomplete information game, to test the capabilities of LLMs in complex interactive environments. This paper introduces an LLM-based Werewolf Game AI, where each role is supported by situation analysis to aid response generation. Additionally, for the werewolf role, various persuasion strategies, including logical appeal, credibility appeal, and emotional appeal, are employed to effectively persuade other players to align with its actions.
+ 2024.aiwolfdial-1.4
+ qi-inaba-2024-enhancing
+
+
+ Verification of Reasoning Ability using BDI Logic and Large Language Model in AIWolf
+ HirakuGondo
+ HirokiSakaji
+ ItsukiNoda
+ 40–47
+ We attempt to improve the reasoning capability of LLMs in the werewolf game by combining BDI logic with LLMs. While LLMs such as ChatGPT have been developed and used for various tasks, they still have several weaknesses, and logical reasoning is one of them. Therefore, we introduce BDI logic-based prompts to verify the logical reasoning ability of LLMs in werewolf game dialogues. Experiments and evaluations were conducted using “AI-Werewolf,” a communication game for AI with incomplete information. From the results of games played by five agents, we compare the logical reasoning ability of the LLMs using the win rate and the vote rate against the werewolf.
+ 2024.aiwolfdial-1.5
+ 2024.aiwolfdial-1.5.Supplementary_Attachment.zip
+ gondo-etal-2024-verification
+
+
+ Enhancing Consistency of Werewolf AI through Dialogue Summarization and Persona Information
+ YoshikiTanaka
+ TakumasaKaneko
+ HirokiOnozeki
+ NatsumiEzure
+ RyuichiUehara
+ ZhiyangQi
+ TomoyaHiguchi
+ RyutaroAsahara
+ MichimasaInaba
+ 48–57
+ The Werewolf Game is a communication game where players’ reasoning and discussion skills are essential. In this study, we present a Werewolf AI agent developed for the AIWolfDial 2024 shared task, co-hosted with the 17th INLG. In recent years, large language models like ChatGPT have garnered attention for their exceptional response generation and reasoning capabilities. We thus develop the LLM-based agents for the Werewolf Game. This study aims to enhance the consistency of the agent’s utterances by utilizing dialogue summaries generated by LLMs and manually designed personas and utterance examples. By analyzing self-match game logs, we demonstrate that the agent’s utterances are contextually consistent and that the character, including tone, is maintained throughout the game.
+ 2024.aiwolfdial-1.6
+ tanaka-etal-2024-enhancing
+
+
+ An Implementation of Werewolf Agent That does not Truly Trust LLMs
+ TakehiroSato
+ ShintaroOzaki
+ DaisakuYokoyama
+ 58–67
+ Werewolf is an incomplete information game that poses several challenges when creating a computer agent as a player, given the agent’s limited understanding of the situation and the lack of individuality in its utterances (e.g., computer agents are not capable of characterful utterances or situational lying). We propose a werewolf agent that solves some of these difficulties by combining a Large Language Model (LLM) with a rule-based algorithm. In particular, our agent uses the rule-based algorithm to select an output either from an LLM or from a template prepared beforehand, based on the results of analyzing the conversation history with an LLM. This allows the agent to refute in specific situations, identify when to end the conversation, and behave with a persona. As a result, this approach mitigated conversational inconsistencies and facilitated logical utterances. We also conducted a qualitative evaluation, in which our agent was perceived as more human-like than an unmodified LLM. The agent is freely available to help advance research on the Werewolf game.
+ 2024.aiwolfdial-1.7
+ 2024.aiwolfdial-1.7.Supplementary_Attachment.pdf
+ sato-etal-2024-implementation
+
+
+
diff --git a/data/xml/2024.inlg.xml b/data/xml/2024.inlg.xml
new file mode 100644
index 0000000000..54c83e5151
--- /dev/null
+++ b/data/xml/2024.inlg.xml
@@ -0,0 +1,734 @@
+
+
+
+
+ Proceedings of the 17th International Natural Language Generation Conference
+ SaadMahamood
+ Nguyen LeMinh
+ DaphneIppolito
+ Association for Computational Linguistics
+ Tokyo, Japan
+ September
+ 2024
+ 2024.inlg-main
+ inlg
+
+
+ 2024.inlg-main.0
+ inlg-2024-main
+
+
+ AutoTemplate: A Simple Recipe for Lexically Constrained Text Generation
+ HayateIso
+ 1–12
+ Lexically constrained text generation is a constrained text generation task that aims to generate text covering all the given constraint lexicons. Existing approaches tackle this problem with a lexically constrained beam search algorithm or a dedicated model using non-autoregressive decoding, but there is a trade-off between the quality of the generated text and the satisfaction of the hard constraints. We introduce AutoTemplate, a simple yet effective lexically constrained text generation framework divided into template generation and lexicalization tasks. Template generation produces text with placeholders, and lexicalization replaces these placeholders with the constraint lexicons to perform lexically constrained text generation. We conducted experiments on two tasks: keywords-to-sentence generation and entity-guided summarization. Experimental results show that AutoTemplate outperforms competitive baselines on both tasks while satisfying the hard lexical constraints. The code is available at https://github.com/megagonlabs/autotemplate
+ 2024.inlg-main.1
+ 2024.inlg-main.1.Supplementary_Attachment.pdf
+ iso-2024-autotemplate-simple
+
+
+ Noisy Pairing and Partial Supervision for Stylized Opinion Summarization
+ HayateIso
+ XiaolanWang
+ YoshiSuhara
+ 13–23
+ Opinion summarization research has primarily focused on generating summaries reflecting important opinions from customer reviews without paying much attention to the writing style. In this paper, we propose the stylized opinion summarization task, which aims to generate a summary of customer reviews in the desired (e.g., professional) writing style. To tackle the difficulty in collecting customer and professional review pairs, we develop a non-parallel training framework, Noisy Pairing and Partial Supervision (NAPA), which trains a stylized opinion summarization system from non-parallel customer and professional review sets. We create a benchmark ProSum by collecting customer and professional reviews from Yelp and Michelin. Experimental results on ProSum and FewSum demonstrate that our non-parallel training framework consistently improves both automatic and human evaluations, successfully building a stylized opinion summarization model that can generate professionally-written summaries from customer reviews. The code is available at https://github.com/megagonlabs/napa
+ 2024.inlg-main.2
+ 2024.inlg-main.2.Supplementary_Attachment.pdf
+ iso-etal-2024-noisy-pairing
+
+
+ LLM Neologism: Emergence of Mutated Characters due to Byte Encoding
+ RanIwamoto
+ HiroshiKanayama
+ 24–29
+ The process of language generation, which selects the most probable tokens one by one, may intrinsically result in output strings that humans never utter. We name this phenomenon “LLM neologism” and investigate it, focusing on Japanese, Chinese, and Korean, languages where tokens can be smaller than characters. Our findings show that LLM neologism occurs through the combination of two high-frequency words with common tokens. We also clarify the cause of LLM neologism in the tokenization process with limited vocabularies. The results of this study provide important clues for better encoding of multibyte characters, aiming to prevent catastrophic results in AI-generated documents.
+ 2024.inlg-main.3
+ iwamoto-kanayama-2024-llm-neologism
+
+
+ Communicating Uncertainty in Explanations of the Outcomes of Machine Learning Models
+ IngridZukerman
+ SameenMaruf
+ 30–46
+ We consider two types of numeric representations for conveying the uncertainty of predictions made by Machine Learning (ML) models: confidence-based (e.g., “the AI is 90% confident”) and frequency-based (e.g., “the AI was correct in 180 (90%) out of 200 cases”). We conducted a user study to determine which factors influence users’ acceptance of predictions made by ML models, and how the two types of uncertainty representations affect users’ views about explanations. Our results show that users’ acceptance of ML model predictions depends mainly on the models’ confidence, and that explanations that include uncertainty information are deemed better in several respects than explanations that omit it, with frequency-based representations being deemed better than confidence-based representations.
+ 2024.inlg-main.4
+ zukerman-maruf-2024-communicating-uncertainty
+
+
+ Entity-aware Multi-task Training Helps Rare Word Machine Translation
+ MatissRikters
+ MakotoMiwa
+ 47–54
+ Named entities (NE) are integral for preserving context and conveying accurate information in the machine translation (MT) task. Challenges often lie in handling NE diversity, ambiguity, rarity, and ensuring alignment and consistency. In this paper, we explore the effect of NE-aware model fine-tuning to improve handling of NEs in MT. We generate data for NE recognition (NER) and NE-aware MT using common NER tools from Spacy, and align entities in parallel data. Experiments with fine-tuning variations of pre-trained T5 models on NE-related generation tasks between English and German show promising results with increasing amounts of NEs in the output and BLEU score improvements compared to the non-tuned baselines.
+ 2024.inlg-main.5
+ rikters-miwa-2024-entity-aware
+
+
+ CEval: A Benchmark for Evaluating Counterfactual Text Generation
+ Van BachNguyen
+ ChristinSeifert
+ JörgSchlötterer
+ 55–69
+ Counterfactual text generation aims to minimally change a text, such that it is classified differently. Assessing progress in method development for counterfactual text generation is hindered by a non-uniform usage of data sets and metrics in related work. We propose CEval, a benchmark for comparing counterfactual text generation methods. CEval unifies counterfactual and text quality metrics, includes common counterfactual datasets with human annotations, standard baselines (MICE, GDBA, CREST) and the open-source language model LLAMA-2. Our experiments found no perfect method for generating counterfactual text. Methods that excel at counterfactual metrics often produce lower-quality text while LLMs with simple prompts generate high-quality text but struggle with counterfactual criteria. By making CEval available as an open-source Python library, we encourage the community to contribute additional methods and maintain consistent evaluation in future work.
+ 2024.inlg-main.6
+ 2024.inlg-main.6.Supplementary_Attachment.pdf
+ nguyen-etal-2024-ceval-benchmark
+
+
+ Generating from AMRs into High and Low-Resource Languages using Phylogenetic Knowledge and Hierarchical QLoRA Training (HQL)
+ William EduardoSoto Martinez
+ YannickParmentier
+ ClaireGardent
+ 70–81
+ Multilingual generation from Abstract Meaning Representations (AMRs) verbalises AMRs into multiple languages. Previous work has focused on high- and medium-resource languages relying on large amounts of training data. In this work, we consider both high- and low-resource languages, capping training data size at the lower bound set by our low-resource languages, i.e., 31K. We propose a straightforward technique to enhance results on low-resource languages while preserving performance on high-resource ones. We iteratively refine a multilingual model into a set of monolingual models using Low-Rank Adaptation with a training curriculum based on a tree structure; this permits investigating how the languages used at each iteration impact generation performance on high- and low-resource languages. We show an improvement over both mono- and multilingual approaches. Comparing different ways of grouping languages at each iteration step, we find two working configurations: grouping related languages, which promotes transfer, or grouping distant languages, which facilitates regularisation.
+ 2024.inlg-main.7
+ 2024.inlg-main.7.Supplementary_Attachment.pdf
+ soto-martinez-etal-2024-generating-amrs
+
+
+ AMERICANO: Argument Generation with Discourse-driven Decomposition and Agent Interaction
+ ZheHu
+ Hou PongChan
+ YuYin
+ 82–102
+ Argument generation is a challenging task in natural language processing, which requires rigorous reasoning and proper content organization. Inspired by recent chain-of-thought prompting that breaks down a complex task into intermediate steps, we propose Americano, a novel framework with agent interaction for argument generation. Our approach decomposes the generation process into sequential actions grounded on argumentation theory, which first executes actions sequentially to generate argumentative discourse components, and then produces a final argument conditioned on the components. To further mimic the human writing process and improve the left-to-right generation paradigm of current autoregressive language models, we introduce an argument refinement module that automatically evaluates and refines argument drafts based on feedback received. We evaluate our framework on the task of counterargument generation using a subset of the Reddit/CMV dataset. The results show that our method outperforms both end-to-end and chain-of-thought prompting methods and can generate more coherent and persuasive arguments with diverse and rich content.
+ 2024.inlg-main.8
+ hu-etal-2024-americano-argument
+
+
+ Generating Simple, Conservative and Unifying Explanations for Logistic Regression Models
+ SameenMaruf
+ IngridZukerman
+ XuelinSitu
+ CecileParis
+ GholamrezaHaffari
+ 103–120
+ In this paper, we generate and compare three types of explanations of Machine Learning (ML) predictions: simple, conservative and unifying. Simple explanations are concise, conservative explanations address the surprisingness of a prediction, and unifying explanations convey the extent to which an ML model’s predictions are applicable. The results of our user study show that (1) conservative and unifying explanations are liked equally and considered largely equivalent in terms of completeness, helpfulness for understanding the AI, and enticement to act, and both are deemed better than simple explanations; and (2) users’ views about explanations are influenced by the (dis)agreement between the ML model’s predictions and users’ estimations of these predictions, and by the inclusion/omission of features users expect to see in explanations.
+ 2024.inlg-main.9
+ maruf-etal-2024-generating-simple
+
+
+ Extractive Summarization via Fine-grained Semantic Tuple Extraction
+ YubinGe
+ SullamJeoung
+ JanaDiesner
+ 121–133
+ Traditional extractive summarization treats the task as sentence-level classification and requires a fixed number of sentences for extraction. However, this rigid constraint on the number of sentences to extract may hinder model generalization due to varied summary lengths across datasets. In this work, we leverage the interrelation between information extraction (IE) and text summarization, and introduce a fine-grained autoregressive method for extractive summarization through semantic tuple extraction. Specifically, we represent each sentence as a set of semantic tuples, where tuples are predicate-argument structures derived from conducting IE. Then we adopt a Transformer-based autoregressive model to extract the tuples corresponding to the target summary given a source document. In inference, a greedy approach is proposed to select source sentences to cover extracted tuples, eliminating the need for a fixed number. Our experiments on CNN/DM and NYT demonstrate the method’s superiority over strong baselines. Through the zero-shot setting for testing the generalization of models to diverse summary lengths across datasets, we further show our method outperforms baselines, including ChatGPT.
+ 2024.inlg-main.10
+ ge-etal-2024-extractive-summarization
+
+
+ Evaluating RDF-to-text Generation Models for English and Russian on Out Of Domain Data
+ AnnaNikiforovskaya
+ ClaireGardent
+ 134–144
+ While the WebNLG dataset has prompted much research on generation from knowledge graphs, little work has examined how well models trained on the WebNLG data generalise to unseen data, and work has mostly focused on English. In this paper, we introduce novel benchmarks for both English and Russian which contain various ratios of unseen entities and properties. These benchmarks also differ from WebNLG in that some of the graphs stem from Wikidata rather than DBpedia. Evaluating various models for English and Russian on these benchmarks shows a strong decrease in performance, while a qualitative analysis highlights the various types of errors induced by non-i.i.d. data.
+ 2024.inlg-main.11
+ 2024.inlg-main.11.Supplementary_Attachment.pdf
+ nikiforovskaya-gardent-2024-evaluating-rdf
+
+
+ Forecasting Implicit Emotions Elicited in Conversations
+ YurieKoga
+ ShunsukeKando
+ YusukeMiyao
+ 145–152
+ This paper aims to forecast the implicit emotion elicited in the dialogue partner by a textual input utterance. Forecasting the interlocutor’s emotion is beneficial for natural language generation in dialogue systems to avoid generating utterances that make the users uncomfortable. Previous studies forecast the emotion conveyed in the interlocutor’s response, assuming it will explicitly reflect their elicited emotion. However, true emotions are not always expressed verbally. We propose a new task to directly forecast the implicit emotion elicited by an input utterance, which does not rely on this assumption. We compare this task with related ones to investigate the impact of dialogue history and one’s own utterance on predicting explicit and implicit emotions. Our result highlights the importance of dialogue history for predicting implicit emotions. It also reveals that, unlike explicit emotions, implicit emotions show limited improvement in predictive performance with one’s own utterance, and that they are more difficult to predict than explicit emotions. We find that even a large language model (LLM) struggles to forecast implicit emotions accurately.
+ 2024.inlg-main.12
+ 2024.inlg-main.12.Supplementary_Attachment.pdf
+ koga-etal-2024-forecasting-implicit
+
+
+ German Voter Personas Can Radicalize LLM Chatbots via the Echo Chamber Effect
+ MaximilianBleick
+ NilsFeldhus
+ AljoschaBurchardt
+ SebastianMöller
+ 153–164
+ We investigate the impact of LLMs on political discourse with a particular focus on the influence of generated personas on model responses. We find an echo chamber effect from LLM chatbots when provided with German-language biographical information of politicians and voters in German politics, leading to sycophantic responses and the reinforcement of existing political biases. Findings reveal that personas of certain political parties, such as the ‘Alternative für Deutschland’ party, exert a stronger influence on LLMs, potentially amplifying extremist views. Unlike prior studies, we cannot corroborate a tendency for larger models to exert stronger sycophantic behaviour. We propose that further development should aim at reducing sycophantic behaviour in LLMs across all sizes and at diversifying language capabilities in LLMs to enhance inclusivity.
+ 2024.inlg-main.13
+ bleick-etal-2024-german-voter
+
+
+ Quantifying Memorization and Detecting Training Data of Pre-trained Language Models using Japanese Newspaper
+ ShotaroIshihara
+ HiromuTakahashi
+ 165–179
+ Dominant pre-trained language models (PLMs) have demonstrated the potential risk of memorizing and outputting the training data. While this concern has been discussed mainly in English, it is also practically important to focus on domain-specific PLMs. In this study, we pre-trained domain-specific GPT-2 models using a limited corpus of Japanese newspaper articles and evaluated their behavior. Experiments replicated the empirical finding that memorization of PLMs is related to the duplication in the training data, model size, and prompt length, in Japanese the same as in previous English studies. Furthermore, we attempted membership inference attacks, demonstrating that the training data can be detected even in Japanese, which is the same trend as in English. The study warns that domain-specific PLMs, sometimes trained with valuable private data, can “copy and paste” on a large scale.
+ 2024.inlg-main.14
+ ishihara-takahashi-2024-quantifying-memorization
+
+
+ Should We Fine-Tune or RAG? Evaluating Different Techniques to Adapt LLMs for Dialogue
+ SimoneAlghisi
+ MassimoRizzoli
+ GabrielRoccabruna
+ Seyed MahedMousavi
+ GiuseppeRiccardi
+ 180–197
+ We study the limitations of Large Language Models (LLMs) for the task of response generation in human-machine dialogue. Several techniques have been proposed in the literature for different dialogue types (e.g., Open-Domain). However, the evaluations of these techniques have been limited in terms of base LLMs, dialogue types and evaluation metrics. In this work, we extensively analyze different LLM adaptation techniques when applied to different dialogue types. We have selected two base LLMs, Llama-2 and Mistral, and four dialogue types: Open-Domain, Knowledge-Grounded, Task-Oriented, and Question Answering. We evaluate the performance of in-context learning and fine-tuning techniques across datasets selected for each dialogue type. We assess the impact of incorporating external knowledge to ground the generation in both scenarios of Retrieval-Augmented Generation (RAG) and gold knowledge. We adopt consistent evaluation and explainability criteria for automatic metrics and human evaluation protocols. Our analysis shows that there is no universally best technique for adapting large language models, as the efficacy of each technique depends on both the base LLM and the specific type of dialogue. Last but not least, the assessment of the best adaptation technique should include human evaluation to avoid false expectations and outcomes derived from automatic metrics.
+ 2024.inlg-main.15
+ 2024.inlg-main.15.Supplementary_Attachment.pdf
+ alghisi-etal-2024-fine-tune
+
+
+ Automating True-False Multiple-Choice Question Generation and Evaluation with Retrieval-based Accuracy Differential
+ Chen-JuiYu
+ Wen HungLee
+ Lin TseKe
+ Shih-WeiGuo
+ Yao-ChungFan
+ 198–212
+ Creating high-quality True-False (TF) multiple-choice questions (MCQs), with accurate distractors, is a challenging and time-consuming task in education. This paper introduces True-False Distractor Generation (TFDG), a pipeline that leverages pre-trained language models and sentence retrieval techniques to automate the generation of TF-type MCQ distractors. Furthermore, the evaluation of generated TF questions presents a challenge. Traditional metrics like BLEU and ROUGE are unsuitable for this task. To address this, we propose a new evaluation metric called Retrieval-based Accuracy Differential (RAD). RAD assesses the discriminative power of TF questions by comparing model accuracy with and without access to reference texts. It quantitatively evaluates how well questions differentiate between students with varying knowledge levels. This research benefits educators and assessment developers, facilitating the efficient automatic generation of high-quality TF-type MCQs and their reliable evaluation.
+ 2024.inlg-main.16
+ yu-etal-2024-automating-true
+
+
+ Transfer-Learning based on Extract, Paraphrase and Compress Models for Neural Abstractive Multi-Document Summarization
+ YlliasChali
+ ElozinoEgonmwan
+ 213–221
+ Recently, transfer-learning by unsupervised pre-training and fine-tuning has shown great success on a number of tasks. The paucity of data for multi-document summarization (MDS), especially in the news domain, makes this approach practical. However, while existing literature mostly formulates unsupervised learning objectives tailored for or around the summarization problem, we find that MDS can benefit directly from models pre-trained on other downstream supervised tasks such as sentence extraction, paraphrase generation and sentence compression. We carry out experiments to demonstrate the impact of zero-shot transfer-learning from these downstream tasks on MDS, since it is challenging to train end-to-end encoder-decoder models on MDS due to i) the sheer length of the input documents and ii) the paucity of training data. We hope this paper encourages more work on these downstream tasks as a means to mitigating the challenges in neural abstractive MDS.
+ 2024.inlg-main.17
+ chali-egonmwan-2024-transfer-learning
+
+
+ Enhancing Presentation Slide Generation by LLMs with a Multi-Staged End-to-End Approach
+ SambaranBandyopadhyay
+ HimanshuMaheshwari
+ AnandhaveluNatarajan
+ ApoorvSaxena
+ 222–229
+ Generating presentation slides from a long document with multimodal elements such as text and images is an important task. This is time consuming and needs domain expertise if done manually. Existing approaches for generating a rich presentation from a document are often semi-automatic or only put a flat summary into the slides ignoring the importance of a good narrative. In this paper, we address this research gap by proposing a multi-staged end-to-end model which uses a combination of LLM and VLM. We have experimentally shown that compared to applying LLMs directly with state-of-the-art prompting, our proposed multi-staged solution is better in terms of automated metrics and human evaluation.
+ 2024.inlg-main.18
+ 2024.inlg-main.18.Supplementary_Attachment.pdf
+ bandyopadhyay-etal-2024-enhancing-presentation
+
+
+ Is Machine Psychology here? On Requirements for Using Human Psychological Tests on Large Language Models
+ LeaLöhn
+ NiklasKiehne
+ AlexanderLjapunov
+ Wolf-TiloBalke
+ 230–242
+ In an effort to better understand the behavior of large language models (LLM), researchers recently turned to conducting psychological assessments on them. Several studies diagnose various psychological concepts in LLMs, such as psychopathological symptoms, personality traits, and intellectual functioning, aiming to unravel their black-box characteristics. But can we safely assess LLMs with tests that were originally designed for humans? The psychology domain looks back on decades of developing standards of appropriate testing procedures to ensure reliable and valid measures. We argue that analogous standardization processes are required for LLM assessments, given their differential functioning as compared to humans. In this paper, we propose seven requirements necessary for testing LLMs. Based on these, we critically reflect on a sample of 25 recent machine psychology studies. Our analysis reveals (1) the lack of appropriate methods to assess test reliability and construct validity, (2) the unknown strength of construct-irrelevant influences, such as the contamination of pre-training corpora with test material, and (3) the pervasive issue of non-reproducibility of many studies. The results underscore the lack of a general methodology for the implementation of psychological assessments of LLMs and the need to redefine psychological constructs specifically for large language models rather than adopting them from human psychology.
+ 2024.inlg-main.19
+ lohn-etal-2024-machine-psychology
+
+
+ Exploring the impact of data representation on neural data-to-text generation
+ David M.Howcroft
+ Lewis N.Watson
+ OlesiaNedopas
+ DimitraGkatzia
+ 243–253
+ A relatively under-explored area in research on neural natural language generation is the impact of the data representation on text quality. Here we report experiments on two leading input representations for data-to-text generation: attribute-value pairs and Resource Description Framework (RDF) triples. Evaluating the performance of encoder-decoder seq2seq models as well as recent large language models (LLMs) with both automated metrics and human evaluation, we find that the input representation does not seem to have a large impact on the performance of either purpose-built seq2seq models or LLMs. Finally, we present an error analysis of the texts generated by the LLMs and provide some insights into where these models fail.
+ 2024.inlg-main.20
+ 2024.inlg-main.20.Supplementary_Attachment.pdf
+ howcroft-etal-2024-exploring-impact
+
+
+ Automatically Generating IsiZulu Words From Indo-Arabic Numerals
+ ZolaMahlaza
+ TadiwaMagwenzi
+ C. MariaKeet
+ LangaKhumalo
+ 254–271
+ Artificial conversational agents are deployed to assist humans in a variety of tasks. Some of these tasks require the capability to communicate numbers as part of their internal and abstract representations of meaning, such as for banking and scheduling appointments. They currently cannot do so for isiZulu because no algorithms exist for the transformation, owing to a lack of speech and text data, and because the transformation is complex and may depend on the type of noun that is counted. We solved this by extracting and iteratively improving on the rules for speaking and writing numerals as words and creating two algorithms to automate the transformation. Evaluation of the algorithms by two isiZulu grammarians showed that six out of seven number categories were 90-100% correct. The same software was used with an additional set of rules to create a large monolingual text corpus, made up of 771 643 sentences, to enable future data-driven approaches.
+ 2024.inlg-main.21
+ mahlaza-etal-2024-automatically-generating
+
+
+ (Mostly) Automatic Experiment Execution for Human Evaluations of NLP Systems
+ CraigThomson
+ AnyaBelz
+ 272–279
+ Human evaluation is widely considered the most reliable form of evaluation in NLP, but recent research has shown it to be riddled with mistakes, often as a result of manual execution of tasks. This paper argues that such mistakes could be avoided if we were to automate, as much as is practical, the process of performing experiments for human evaluation of NLP systems. We provide a simple methodology that can improve both the transparency and reproducibility of experiments. We show how the sequence of component processes of a human evaluation can be defined in advance, facilitating full or partial automation, detailed preregistration of the process, and research transparency and repeatability.
+ 2024.inlg-main.22
+ thomson-belz-2024-mostly-automatic
+
+
+ Generating Hotel Highlights from Unstructured Text using LLMs
+ Srinivas RameshKamath
+ FahimeSame
+ SaadMahamood
+ 280–288
+ We describe our implementation and evaluation of the Hotel Highlights system which has been deployed live by trivago. This system leverages a large language model (LLM) to generate a set of highlights from accommodation descriptions and reviews, enabling travellers to quickly understand an accommodation’s unique aspects. In this paper, we discuss our motivation for building this system and the human evaluation we conducted, comparing the generated highlights against the source input to assess the degree of hallucinations and/or contradictions present. Finally, we outline the lessons learned and the improvements needed.
+ 2024.inlg-main.23
+ kamath-etal-2024-generating-hotel
+
+
+ Text2Traj2Text: Learning-by-Synthesis Framework for Contextual Captioning of Human Movement Trajectories
+ HikaruAsano
+ RyoYonetani
+ TaikiSekii
+ HirokiOuchi
+ 289–302
+ This paper presents Text2Traj2Text, a novel learning-by-synthesis framework for captioning possible contexts behind shoppers’ trajectory data in retail stores. Our work will impact various retail applications that need better customer understanding, such as targeted advertising and inventory management. The key idea is leveraging large language models to synthesize a diverse and realistic collection of contextual captions as well as the corresponding movement trajectories on a store map. Despite being trained on fully synthesized data, the captioning model can generalize well to trajectories/captions created by real human subjects. Our systematic evaluation confirmed the effectiveness of the proposed framework over competitive approaches in terms of ROUGE and BERT Score metrics.
+ 2024.inlg-main.24
+ asano-etal-2024-text2traj2text-learning
+
+
+ n-gram F-score for Evaluating Grammatical Error Correction
+ ShotaKoyama
+ RyoNagata
+ HiroyaTakamura
+ NaoakiOkazaki
+ 303–313
+ M2 and its variants are the most widely used automatic evaluation metrics for grammatical error correction (GEC), which calculate an F-score using a phrase-based alignment between sentences. However, it is not straightforward at all to align learner sentences containing errors to their correct sentences. In addition, alignment calculations are computationally expensive. We propose GREEN, an alignment-free F-score for GEC evaluation. GREEN treats a sentence as a multiset of n-grams and extracts edits between sentences by set operations instead of computing an alignment. Our experiments confirm that GREEN performs better than existing methods for the corpus-level metrics and comparably for the sentence-level metrics even without computing an alignment. GREEN is available at https://github.com/shotakoyama/green.
+ 2024.inlg-main.25
+ koyama-etal-2024-n-gram
+
+
+ Personalized Cloze Test Generation with Large Language Models: Streamlining MCQ Development and Enhancing Adaptive Learning
+ Chih-HsuanShen
+ Yi-LiKuo
+ Yao-ChungFan
+ 314–319
+ Cloze multiple-choice questions (MCQs) are essential for assessing comprehension in educational settings, but manually designing effective distractors is time-consuming. Addressing this, recent research has automated distractor generation, yet such methods often neglect to adjust the difficulty level to the learner’s abilities, resulting in non-personalized assessments. This study introduces the Personalized Cloze Test Generation (PCGL) Framework, utilizing Large Language Models (LLMs) to generate cloze tests tailored to individual proficiency levels. Our PCGL Framework simplifies test creation by generating both question stems and distractors from a single input word and adjusts the difficulty to match the learner’s proficiency. The framework significantly reduces the effort in creating tests and enhances personalized learning by dynamically adjusting to the needs of each learner.
+ 2024.inlg-main.26
+ 2024.inlg-main.26.Supplementary_Attachment.pdf
+ shen-etal-2024-personalized-cloze
+
+
+ Pipeline Neural Data-to-text with Large Language Models
+ Chinonso CynthiaOsuji
+ BrianTimoney
+ ThiagoCastro Ferreira
+ BrianDavis
+ 320–329
+ Previous studies have highlighted the advantages of pipeline neural architectures over end-to-end models, particularly in reducing text hallucination. In this study, we extend prior research by integrating pretrained language models (PLMs) into a pipeline framework, using both fine-tuning and prompting methods. Our findings show that fine-tuned PLMs consistently generate high-quality text, especially within end-to-end architectures and at intermediate stages of the pipeline across various domains. These models also outperform prompt-based ones on automatic evaluation metrics but lag in human evaluations. Compared to the standard five-stage pipeline architecture, a streamlined three-stage pipeline, which only includes ordering, structuring, and surface realization, achieves superior performance in fluency and semantic adequacy according to the human evaluation.
+ 2024.inlg-main.27
+ osuji-etal-2024-pipeline-neural
+
+
+ Reduction-Synthesis: Plug-and-Play for Sentiment Style Transfer
+ ShengXu
+ FumiyoFukumoto
+ YoshimiSuzuki
+ 330–343
+ Sentiment style transfer (SST), a variant of text style transfer (TST), has recently attracted extensive interest. Some disentangling-based approaches have improved performance, while most still struggle to properly transfer the input as the sentiment style is intertwined with the content of the text. To alleviate the issue, we propose a plug-and-play method that leverages an iterative self-refinement algorithm with a large language model (LLM). Our approach separates the straightforward Seq2Seq generation into two phases: (1) Reduction phase which generates a style-free sequence for a given text, and (2) Synthesis phase which generates the target text by leveraging the sequence output from the first phase. The experimental results on two datasets demonstrate that our transfer strategy is effective for challenging SST cases where the baseline methods perform poorly. Our code is available online.
+ 2024.inlg-main.28
+ 2024.inlg-main.28.Supplementary_Attachment.zip
+ xu-etal-2024-reduction-synthesis
+
+
+ Resilience through Scene Context in Visual Referring Expression Generation
+ SimeonJunker
+ SinaZarrieß
+ 344–357
+ Scene context is well known to facilitate humans’ perception of visible objects. In this paper, we investigate the role of context in Referring Expression Generation (REG) for objects in images, where existing research has often focused on distractor contexts that exert pressure on the generator. We take a new perspective on scene context in REG and hypothesize that contextual information can be conceived of as a resource that makes REG models more resilient and facilitates the generation of object descriptions, and object types in particular. We train and test Transformer-based REG models with target representations that have been artificially obscured with noise to varying degrees. We evaluate how properties of the models’ visual context affect their processing and performance. Our results show that even simple scene contexts make models surprisingly resilient to perturbations, to the extent that they can identify referent types even when visual information about the target is completely missing.
+ 2024.inlg-main.29
+ junker-zarriess-2024-resilience-scene
+
+
+ The Unreasonable Ineffectiveness of Nucleus Sampling on Mitigating Text Memorization
+ LukaBorec
+ PhilippSadler
+ DavidSchlangen
+ 358–370
+ This work analyses the text memorization behavior of large language models (LLMs) when subjected to nucleus sampling. Stochastic decoding methods like nucleus sampling are typically applied to overcome issues such as monotonous and repetitive text generation, which are often observed with maximization-based decoding techniques. We hypothesize that nucleus sampling might also reduce the occurrence of memorization patterns, because it could lead to the selection of tokens outside the memorized sequence. To test this hypothesis we create a diagnostic dataset with a known distribution of duplicates that gives us some control over the likelihood of memorisation of certain parts of the training data. Our analysis of two GPT-Neo models fine-tuned on this dataset interestingly shows that (i) an increase of the nucleus size reduces memorization only modestly, and (ii) even when models do not engage in “hard” memorization – a verbatim reproduction of training samples – they may still display “soft” memorization whereby they generate outputs that echo the training data but without a complete one-by-one resemblance.
+ 2024.inlg-main.30
+ borec-etal-2024-unreasonable-ineffectiveness
+
+
+ CADGE: Context-Aware Dialogue Generation Enhanced with Graph-Structured Knowledge Aggregation
+ ChenTang
+ HongboZhang
+ TylerLoakman
+ BohaoYang
+ StefanGoetze
+ ChenghuaLin
+ 371–383
+ Commonsense knowledge is crucial to many natural language processing tasks. Existing works usually incorporate graph knowledge with conventional graph neural networks (GNNs), resulting in a sequential pipeline that compartmentalizes the encoding processes for textual and graph-based knowledge. This compartmentalization, however, does not fully exploit the contextual interplay between these two types of input knowledge. In this paper, a novel context-aware graph-attention model (Context-aware GAT) is proposed, designed to effectively assimilate global features from relevant knowledge graphs through a context-enhanced knowledge aggregation mechanism. Specifically, the proposed framework employs an innovative approach to representation learning that harmonizes heterogeneous features by amalgamating flattened graph knowledge with text data. The hierarchical application of graph knowledge aggregation within connected subgraphs, complemented by contextual information, to bolster the generation of commonsense-driven dialogues is analyzed. Empirical results demonstrate that our framework outperforms conventional GNN-based language models in terms of performance. Both automated and human evaluations affirm the significant performance enhancements achieved by our proposed model over the concept flow baseline.
+ 2024.inlg-main.31
+ tang-etal-2024-cadge-context
+
+
+ Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning
+ YingjinSong
+ DenisPaperno
+ AlbertGatt
+ 384–401
+ Visual storytelling systems generate multi-sentence stories from image sequences. In this task, capturing contextual information and bridging visual variation bring additional challenges. We propose a simple yet effective framework that leverages the generalization capabilities of pretrained foundation models, only training a lightweight vision-language mapping network to connect modalities, while incorporating context to enhance coherence. We introduce a multimodal contrastive objective that also improves visual relevance and story informativeness. Extensive experimental results, across both automatic metrics and human evaluations, demonstrate that the stories generated by our framework are diverse, coherent, informative, and interesting.
+ 2024.inlg-main.32
+ song-etal-2024-context-aware
+
+
+ Enhancing Editorial Tasks: A Case Study on Rewriting Customer Help Page Contents Using Large Language Models
+ AleksandraGabryszak
+ DanielRöder
+ ArneBinder
+ LucaSion
+ LeonhardHennig
+ 402–411
+ In this paper, we investigate the use of large language models (LLMs) to enhance the editorial process of rewriting customer help pages. We introduce a German-language dataset comprising Frequently Asked Question-Answer pairs, presenting both raw drafts and their revisions by professional editors. On this dataset, we evaluate the performance of four large language models (LLM) through diverse prompts tailored for the rewriting task. We conduct automatic evaluations of content and text quality using ROUGE, BERTScore, and ChatGPT. Furthermore, we let professional editors assess the helpfulness of automatically generated FAQ revisions for editorial enhancement. Our findings indicate that LLMs can produce FAQ reformulations beneficial to the editorial process. We observe minimal performance discrepancies among LLMs for this task, and our survey on helpfulness underscores the subjective nature of editors’ perspectives on editorial refinement.
+ 2024.inlg-main.33
+ gabryszak-etal-2024-enhancing-editorial
+
+
+ Customizing Large Language Model Generation Style using Parameter-Efficient Finetuning
+ XinyueLiu
+ HarshitaDiddee
+ DaphneIppolito
+ 412–426
+ One-size-fits-all large language models (LLMs) are increasingly being used to help people with their writing. However, the style these models are trained to write in may not suit all users or use cases. LLMs would be more useful as writing assistants if their idiolect could be customized to match each user. In this paper, we explore whether parameter-efficient finetuning (PEFT) with Low-Rank Adaptation can effectively guide the style of LLM generations. We use this method to customize LLaMA-2 to ten different authors and show that the generated text has lexical, syntactic, and surface alignment with the target author but struggles with content memorization. Our findings highlight the potential of PEFT to support efficient, user-level customization of LLMs.
+ 2024.inlg-main.34
+ liu-etal-2024-customizing-large
+
+
+ Towards Fine-Grained Citation Evaluation in Generated Text: A Comparative Analysis of Faithfulness Metrics
+ WeijiaZhang
+ MohammadAliannejadi
+ YifeiYuan
+ JiahuanPei
+ Jia-hongHuang
+ EvangelosKanoulas
+ 427–439
+ Large language models (LLMs) often produce unsupported or unverifiable content, known as “hallucinations.” To mitigate this, retrieval-augmented LLMs incorporate citations, grounding the content in verifiable sources. Despite such developments, manually assessing how well a citation supports the associated statement remains a major challenge. Previous studies use faithfulness metrics to estimate citation support automatically but are limited to binary classification, overlooking fine-grained citation support in practical scenarios. To investigate the effectiveness of faithfulness metrics in fine-grained scenarios, we propose a comparative evaluation framework that assesses the metric effectiveness in distinguishing citations between three-category support levels: full, partial, and no support. Our framework employs correlation analysis, classification evaluation, and retrieval evaluation to measure the alignment between metric scores and human judgments comprehensively. Our results show no single metric consistently excels across all evaluations, revealing the complexity of assessing fine-grained support. Based on the findings, we provide practical recommendations for developing more effective metrics.
+ 2024.inlg-main.35
+ zhang-etal-2024-towards-fine-grained
+
+
+ Audio-visual training for improved grounding in video-text LLMs
+ Shivprasad RajendraSagare
+ HemachandranS
+ KinshukSarabhai
+ PrashantUllegaddi
+ RajeshkumarSa
+ 440–445
+ Recent advances in multimodal LLMs have led to several video-text models being proposed for critical video-related tasks. However, most of the previous works support visual input only, essentially muting the audio signal in the video. The few models that support both audio and visual input are not explicitly trained on audio data. Hence, the effect of audio on video understanding is largely unexplored. To this end, we propose a model architecture that handles audio-visual inputs explicitly. We train our model with both audio and visual data from a video instruction-tuning dataset. Comparison with vision-only baselines, and other audio-visual models showcase that training on audio data indeed leads to better grounding of responses. For better evaluation of audio-visual models, we also release a human-annotated benchmark dataset, with audio-aware question-answer pairs.
+ 2024.inlg-main.36
+ sagare-etal-2024-audio-visual
+
+
+ aiXplain SDK: A High-Level and Standardized Toolkit for AI Assets
+ ShreyasSharma
+ LucasPavanelli
+ ThiagoCastro Ferreira
+ MohamedAl-Badrashiny
+ HassanSawaf
+ 446–452
+ The aiXplain SDK is an open-source Python toolkit which aims to simplify the wide and complex ecosystem of AI resources. The toolkit enables access to a wide selection of AI assets, including datasets, models, and metrics, from both academic and commercial sources, which can be selected, executed and evaluated in one place through different services in a standardized format with consistent documentation provided. The study showcases the potential of the proposed toolkit with different code examples and by using it on a user journey where state-of-the-art Large Language Models are fine-tuned on instruction prompt datasets, outperforming their base versions.
+ 2024.inlg-main.37
+ sharma-etal-2024-aixplain-sdk
+
+
+ Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding
+ BramWillemsen
+ GabrielSkantze
+ 453–469
+ We propose an approach to referring expression generation (REG) in visually grounded dialogue that is meant to produce referring expressions (REs) that are both discriminative and discourse-appropriate. Our method constitutes a two-stage process. First, we model REG as a text- and image-conditioned next-token prediction task. REs are autoregressively generated based on their preceding linguistic context and a visual representation of the referent. Second, we propose the use of discourse-aware comprehension guiding as part of a generate-and-rerank strategy through which candidate REs generated with our REG model are reranked based on their discourse-dependent discriminatory power. Results from our human evaluation indicate that our proposed two-stage approach is effective in producing discriminative REs, with higher performance in terms of text-image retrieval accuracy for reranked REs compared to those generated using greedy decoding.
+ 2024.inlg-main.38
+ willemsen-skantze-2024-referring-expression
+
+
+ The Gricean Maxims in NLP - A Survey
+ LeaKrause
+ Piek T.J.M.Vossen
+ 470–485
+ In this paper, we provide an in-depth review of how the Gricean maxims have been used to develop and evaluate Natural Language Processing (NLP) systems. Originating from the domain of pragmatics, the Gricean maxims are foundational principles aimed at optimising communicative effectiveness, encompassing the maxims of Quantity, Quality, Relation, and Manner. We explore how these principles are operationalised within NLP through the development of data sets, benchmarks, qualitative evaluation and the formulation of tasks such as Data-to-text, Referring Expressions, Conversational Agents, and Reasoning with a specific focus on Natural Language Generation (NLG). We further present current works on the integration of these maxims in the design and assessment of Large Language Models (LLMs), highlighting their potential influence on enhancing model performance and interaction capabilities. Additionally, this paper identifies and discusses relevant challenges and opportunities, with a special emphasis on the cultural adaptation and contextual applicability of the Gricean maxims. While the maxims have been widely used in different NLP applications, we present the first comprehensive survey of their impact.
+ 2024.inlg-main.39
+ krause-vossen-2024-gricean-maxims
+
+
+ Leveraging Plug-and-Play Models for Rhetorical Structure Control in Text Generation
+ YukaYokogawa
+ TatsuyaIshigaki
+ HiroyaTakamura
+ YusukeMiyao
+ IchiroKobayashi
+ 486–493
+ We propose a method that extends a BART-based language generator using a plug-and-play model to control the rhetorical structure of generated text. Our approach considers rhetorical relations between clauses and generates sentences that reflect this structure using plug-and-play language models. We evaluated our method using the Newsela corpus, which consists of texts at various levels of English proficiency. Our experiments demonstrated that our method outperforms the vanilla BART in terms of the correctness of output discourse and rhetorical structures. In existing methods, the rhetorical structure tends to deteriorate when compared to the baseline, the vanilla BART, as measured by n-gram overlap metrics such as BLEU. However, our proposed method does not exhibit this significant deterioration, demonstrating its advantage.
+ 2024.inlg-main.40
+ 2024.inlg-main.40.Supplementary_Attachment.pdf
+ yokogawa-etal-2024-leveraging-plug
+
+
+ Multilingual Text Style Transfer: Datasets & Models for Indian Languages
+ SourabrataMukherjee
+ Atul Kr.Ojha
+ AkankshaBansal
+ DeepakAlok
+ John P.McCrae
+ OndrejDusek
+ 494–522
+ Text style transfer (TST) involves altering the linguistic style of a text while preserving its style-independent content. This paper focuses on sentiment transfer, a popular TST subtask, across a spectrum of Indian languages: Hindi, Magahi, Malayalam, Marathi, Punjabi, Odia, Telugu, and Urdu, expanding upon previous work on English-Bangla sentiment transfer. We introduce dedicated datasets of 1,000 positive and 1,000 negative style-parallel sentences for each of these eight languages. We then evaluate the performance of various benchmark models categorized into parallel, non-parallel, cross-lingual, and shared learning approaches, including the Llama2 and GPT-3.5 large language models (LLMs). Our experiments highlight the significance of parallel data in TST and demonstrate the effectiveness of the Masked Style Filling (MSF) approach in non-parallel techniques. Moreover, cross-lingual and joint multilingual learning methods show promise, offering insights into selecting optimal models tailored to the specific language and task requirements. To the best of our knowledge, this work represents the first comprehensive exploration of the TST task as sentiment transfer across a diverse set of languages.
+ 2024.inlg-main.41
+ mukherjee-etal-2024-multilingual-text
+
+
+ Are Large Language Models Actually Good at Text Style Transfer?
+ SourabrataMukherjee
+ Atul Kr.Ojha
+ OndrejDusek
+ 523–539
+ We analyze the performance of large language models (LLMs) on Text Style Transfer (TST), specifically focusing on sentiment transfer and text detoxification across three languages: English, Hindi, and Bengali. Text Style Transfer involves modifying the linguistic style of a text while preserving its core content. We evaluate the capabilities of pre-trained LLMs using zero-shot and few-shot prompting as well as parameter-efficient finetuning on publicly available datasets. Our evaluation using automatic metrics, GPT-4 and human evaluations reveals that while some prompted LLMs perform well in English, their performance on other languages (Hindi, Bengali) remains average. However, finetuning significantly improves results compared to zero-shot and few-shot prompting, making them comparable to the previous state of the art. This underscores the necessity of dedicated datasets and specialized models for effective TST.
+ 2024.inlg-main.42
+ mukherjee-etal-2024-large-language
+
+
+ Towards Effective Long Conversation Generation with Dynamic Topic Tracking and Recommendation
+ TrevorAshby
+ AdithyaKulkarni
+ JingyuanQi
+ MinqianLiu
+ EunahCho
+ VaibhavKumar
+ LifuHuang
+ 540–556
+ During conversations, the human flow of thoughts may result in topic shifts and evolution. In open-domain dialogue systems, it is crucial to track the topics discussed and recommend relevant topics to be included in responses to have effective conversations. Furthermore, topic evolution is needed to prevent stagnation as conversation length increases. Existing open-domain dialogue systems do not pay sufficient attention to topic evolution and shifting, resulting in performance degradation due to ineffective responses as conversation length increases. To address the shortcomings of existing approaches, we propose EvolvConv. EvolvConv conducts real-time conversation topic and user preference tracking and utilizes the tracking information to evolve and shift topics depending on conversation status. We conduct extensive experiments to validate the topic evolving and shifting capabilities of EvolvConv as conversation length increases. The unreferenced evaluation metric UniEval is used to compare EvolvConv with the baselines. Experimental results show that EvolvConv maintains a smooth conversation flow without abruptly shifting topics; the probability of topic shifting ranges between 5%-8% throughout the conversation. EvolvConv recommends 4.77% more novel topics than the baselines, and the topic evolution follows balanced topic groupings. Furthermore, we conduct user surveys to test the practical viability of EvolvConv. User survey results reveal that responses generated by EvolvConv are preferred 47.8% of the time compared to the baselines and come second only to real human responses.
+ 2024.inlg-main.43
+ ashby-etal-2024-towards-effective
+
+
+ Automatic Metrics in Natural Language Generation: A survey of Current Evaluation Practices
+ PatriciaSchmidtova
+ SaadMahamood
+ SimoneBalloccu
+ OndrejDusek
+ AlbertGatt
+ DimitraGkatzia
+ David M.Howcroft
+ OndrejPlatek
+ AdarsaSivaprasad
+ 557–583
+ Automatic metrics are extensively used to evaluate Natural Language Processing systems. However, there has been increasing focus on how they are used and reported by practitioners within the field. In this paper, we have conducted a survey on the use of automatic metrics, focusing particularly on natural language generation tasks. We inspect which metrics are used as well as why they are chosen and how their use is reported. Our findings from this survey reveal significant shortcomings, including inappropriate metric usage, lack of implementation details and missing correlations with human judgements. We conclude with recommendations that we believe authors should follow to enable more rigour within the field.
+ 2024.inlg-main.44
+ schmidtova-etal-2024-automatic-metrics
+
+
+ A Comprehensive Analysis of Memorization in Large Language Models
+ HirokazuKiyomaru
+ IssaSugiura
+ DaisukeKawahara
+ SadaoKurohashi
+ 584–596
+ This paper presents a comprehensive study that investigates memorization in large language models (LLMs) from multiple perspectives. Experiments are conducted with the Pythia and LLM-jp model suites, both of which offer LLMs with over 10B parameters and full access to their pre-training corpora. Our findings include: (1) memorization is more likely to occur with larger model sizes, longer prompt lengths, and frequent texts, which aligns with findings in previous studies; (2) memorization is less likely to occur for texts not trained during the latter stages of training, even if they frequently appear in the training corpus; (3) the standard methodology for judging memorization can yield false positives, and texts that are infrequent yet flagged as memorized typically result from causes other than true memorization.
+ 2024.inlg-main.45
+ kiyomaru-etal-2024-comprehensive-analysis
+
+
+ Generating Attractive Ad Text by Facilitating the Reuse of Landing Page Expressions
+ HidetakaKamigaito
+ SoichiroMurakami
+ PeinanZhang
+ HiroyaTakamura
+ ManabuOkumura
+ 597–608
+ Ad text generation is vital for automatic advertising in various fields through search engine advertising (SEA), avoiding the cost incurred by the laborious human effort of creating ad texts. Even though ad creators create the landing page (LP) for advertising and we can expect its quality, conventional approaches with reinforcement learning (RL) mostly focus on advertising keywords rather than LP information. This work investigates and shows the effective usage of LP information as a reward in RL-based ad text generation through automatic and human evaluations. Our analysis of the actually generated ad text shows that LP information can be a crucial reward by appropriately scaling its value range to improve ad text generation performance.
+ 2024.inlg-main.46
+ kamigaito-etal-2024-generating-attractive
+
+
+ Differences in Semantic Errors Made by Different Types of Data-to-text Systems
+ RudaliHuidrom
+ AnyaBelz
+ MichelaLorandi
+ 609–621
+ In this paper, we investigate how different semantic, or content-related, errors made by different types of data-to-text systems differ in terms of number and type. In total, we examine 15 systems: three rule-based and 12 neural systems including two large language models without training or fine-tuning. All systems were tested on the English WebNLG dataset version 3.0. We use a semantic error taxonomy and the brat annotation tool to obtain word-span error annotations on a sample of system outputs. The annotations enable us to establish how many semantic errors different (types of) systems make and what specific types of errors they make, and thus to get an overall understanding of semantic strengths and weaknesses among various types of NLG systems. Among our main findings, we observe that symbolic (rule and template-based) systems make fewer semantic errors overall, non-LLM neural systems have better fluency and data coverage, but make more semantic errors, while LLM-based systems require improvement particularly in addressing superfluous content.
+ 2024.inlg-main.47
+ huidrom-etal-2024-differences-semantic
+
+
+ Leveraging Large Language Models for Building Interpretable Rule-Based Data-to-Text Systems
+ JędrzejWarczyński
+ MateuszLango
+ OndrejDusek
+ 622–630
+ We introduce a simple approach that uses a large language model (LLM) to automatically implement a fully interpretable rule-based data-to-text system in pure Python. Experimental evaluation on the WebNLG dataset showed that such a constructed system produces text of better quality (according to the BLEU and BLEURT metrics) than the same LLM prompted to directly produce outputs, and produces fewer hallucinations than a BART language model fine-tuned on the same data. Furthermore, at runtime, the approach generates text in a fraction of the processing time required by neural approaches, using only a single CPU.
+ 2024.inlg-main.48
+ 2024.inlg-main.48.Supplementary_Attachment.pdf
+ warczynski-etal-2024-leveraging-large
+
+
+ Explainability Meets Text Summarization: A Survey
+ MahdiDhaini
+ EgeErdogan
+ SmarthBakshi
+ GjergjiKasneci
+ 631–645
+ Summarizing long pieces of text is a principal task in natural language processing with Machine Learning-based text generation models such as Large Language Models (LLMs) being particularly suited to it. Yet these models are often used as black-boxes, making them hard to interpret and debug. This has led to calls by practitioners and regulatory bodies to improve the explainability of such models as they find ever more practical use. In this survey, we present a dual-perspective review of the intersection between explainability and summarization by reviewing the current state of explainable text summarization and also highlighting how summarization techniques are effectively employed to improve explanations.
+ 2024.inlg-main.49
+ dhaini-etal-2024-explainability-meets
+
+
+ Generating Faithful and Salient Text from Multimodal Data
+ TahsinaHashem
+ WeiqingWang
+ Derry TantiWijaya
+ Mohammed EunusAli
+ Yuan-FangLi
+ 646–662
+ While large multimodal models (LMMs) have obtained strong performance on many multimodal tasks, they may still hallucinate while generating text. Their performance on detecting salient features from visual data is also unclear. In this paper, we develop a framework to generate faithful and salient text from mixed-modal data, which includes images and structured data (represented in knowledge graphs or tables). Specifically, we train a vision critic model to identify hallucinated and non-salient features from the image modality. The critic model also generates a list of salient image features. This information is used in the post-editing step to improve the generation quality. Experiments on two datasets show that our framework improves LMMs’ generation quality on both faithfulness and saliency, outperforming recent techniques aimed at reducing hallucination. The dataset and code are available at https://github.com/TahsinaHashem/FaithD2T.
+ 2024.inlg-main.50
+ 2024.inlg-main.50.Supplementary_Attachment.pdf
+ hashem-etal-2024-generating-faithful
+
+
+ Investigating Paraphrase Generation as a Data Augmentation Strategy for Low-Resource AMR-to-Text Generation
+ Marco AntonioSobrevilla Cabezudo
+ Marcio LimaInacio
+ Thiago Alexandre SalgueiroPardo
+ 663–675
+ Abstract Meaning Representation (AMR) is a meaning representation (MR) designed to abstract away from syntax, allowing syntactically different sentences to share the same AMR graph. Unlike other MRs, existing AMR corpora typically link one AMR graph to a single reference. This paper investigates the value of paraphrase generation in low-resource AMR-to-Text generation by testing various paraphrase generation strategies and evaluating their impact. The findings show that paraphrase generation significantly outperforms the baseline and traditional data augmentation methods, even with fewer training instances. Human evaluations indicate that this strategy often produces syntactic-based paraphrases and can exceed the performance of previous approaches. Additionally, the paper releases a paraphrase-extended version of the AMR corpus.
+ 2024.inlg-main.51
+ sobrevilla-cabezudo-etal-2024-investigating-paraphrase
+
+
+ Zooming in on Zero-Shot Intent-Guided and Grounded Document Generation using LLMs
+ PritikaRamu
+ PranshuGaur
+ RishitaEmandi
+ HimanshuMaheshwari
+ DanishJaved
+ AparnaGarimella
+ 676–694
+ Repurposing existing content on-the-fly to suit an author’s goals for creating initial drafts is crucial for document creation. We introduce the task of intent-guided and grounded document generation: given a user-specified intent (e.g., section title) and a few reference documents, the goal is to generate section-level multimodal documents spanning text and images, grounded on the given references, in a zero-shot setting. We present a data curation strategy to obtain general-domain samples from Wikipedia, and collect 1,000 Wikipedia sections consisting of textual and image content along with appropriate intent specifications and references. We propose a simple yet effective planning-based prompting strategy, Multimodal Plan-And-Write (MM-PAW), to prompt LLMs to generate an intermediate plan with text and image descriptions, to guide the subsequent generation. We compare the performances of MM-PAW and a text-only variant of it with those of zero-shot Chain-of-Thought (CoT) using recent closed and open-domain LLMs. Both of them lead to significantly better performances in terms of content relevance, structure, and groundedness to the references, more so in the smaller models (up to 12.5 points increase in Rouge 1-F1) than in the larger ones (up to 4 points increase in R1-F1). They are particularly effective in improving relatively smaller models’ performances, to be on par or higher than those of their larger counterparts for this task.
+ 2024.inlg-main.52
+ 2024.inlg-main.52.Supplementary_Attachment.pdf
+ ramu-etal-2024-zooming-zero
+
+
+ Zero-shot cross-lingual transfer in instruction tuning of large language models
+ NadezhdaChirkova
+ VassilinaNikoulina
+ 695–708
+ Instruction tuning (IT) is widely used to teach pretrained large language models (LLMs) to follow arbitrary instructions, but is under-studied in multilingual settings. In this work, we conduct a systematic study of zero-shot cross-lingual transfer in IT, when an LLM is instruction-tuned on English-only data and then tested on user prompts in other languages. We advocate for the importance of evaluating various aspects of model responses in multilingual instruction following and investigate the influence of different model configuration choices. We find that cross-lingual transfer does happen successfully in IT even if all stages of model training are English-centric, but only if multilinguality is taken into account in hyperparameter tuning and with large enough IT data. English-trained LLMs are capable of generating correct-language, comprehensive and helpful responses in other languages, but suffer from low factuality and may occasionally have fluency errors.
+ 2024.inlg-main.53
+ chirkova-nikoulina-2024-zero-shot
+
+
+
+
+ Proceedings of the 17th International Natural Language Generation Conference: System Demonstrations
+ SaadMahamood
+ Nguyen LeMinh
+ DaphneIppolito
+ Association for Computational Linguistics
+ Tokyo, Japan
+ September
+ 2024
+ 2024.inlg-demos
+ inlg
+
+
+ 2024.inlg-demos.0
+ inlg-2024-demos
+
+
+ Be My Mate: Simulating Virtual Students for collaboration using Large Language Models
+ SergiSolera-Monforte
+ PabloArnau-González
+ MiguelArevalillo-Herráez
+ 1–3
+ Advancements in machine learning, particularly Large Language Models (LLMs), offer new opportunities for enhancing education through personalized assistance. We introduce “Be My Mate,” an agent that leverages LLMs to simulate virtual peer students in online collaborative education. The system includes a subscription module for real-time updates and a conversational module for generating supportive interactions. Key challenges include creating temporally realistic interactions and credible error generation. The initial demonstration shows promise in enhancing student engagement and learning outcomes.
+ 2024.inlg-demos.1
+ solera-monforte-etal-2024-mate-simulating
+
+
+ MTSwitch: A Web-based System for Translation between Molecules and Texts
+ NijiaHan
+ ZimuWang
+ YuqiWang
+ HaiyangZhang
+ DaiyunHuang
+ WeiWang
+ 4–6
+ We introduce MTSwitch, a web-based system for the bidirectional translation between molecules and texts, leveraging various large language models (LLMs). It supports two crucial tasks: molecule captioning (explaining the properties of a molecule) and molecule generation (designing a molecule based on specific properties). To the best of our knowledge, MTSwitch is currently the first accessible system that allows users to translate between molecular representations and descriptive text contents. The system and a screencast can be found at https://github.com/hanninaa/MTSwitch.
+ 2024.inlg-demos.2
+ han-etal-2024-mtswitch-web
+
+
+ VideoRAG: Scaling the context size and relevance for video question-answering
+ Shivprasad RajendraSagare
+ PrashantUllegaddi
+ NachikethK S
+ NavanithR
+ KinshukSarabhai
+ Rajesh KumarS A
+ 7–8
+ Recent advancements have led to the adaptation of several multimodal large language models (LLMs) for critical video-related use cases, particularly in Video Question-Answering (QA). However, most of the previous models sample only a limited number of frames from video due to the context size limit of the backbone LLM. Another approach, applying temporal pooling to compress multiple frames, has also been shown to saturate and does not scale well. These limitations cause videoQA on long videos to perform very poorly. To address this, we present VideoRAG, a system that utilizes the recently popularized Retrieval Augmented Generation (RAG) pipeline to select the top-k frames from video, relevant to the user query. We have observed a qualitative improvement in our experiments, indicating a promising direction to pursue. Additionally, our findings indicate that videoRAG demonstrates superior performance when addressing needle-in-the-haystack questions in long videos. Our extensible system allows for trying multiple strategies for indexing, ranking, and adding QA models.
+ 2024.inlg-demos.3
+ sagare-etal-2024-videorag-scaling
+
+
+ QCET: An Interactive Taxonomy of Quality Criteria for Comparable and Repeatable Evaluation of NLP Systems
+ AnyaBelz
+ SimonMille
+ CraigThomson
+ RudaliHuidrom
+ 9–12
+ Four years on from two papers (Belz et al., 2020; Howcroft et al., 2020) that first called out the lack of standardisation and comparability in the quality criteria assessed in NLP system evaluations, researchers still use widely differing quality criteria names and definitions, meaning that it continues to be unclear when the same aspect of quality is being assessed in two evaluations. While normalised quality criteria were proposed at the time, the list was unwieldy and using it came with a steep learning curve. In this demo paper, our aim is to address these issues with an interactive taxonomy tool that enables quick perusal and selection of the quality criteria, and provides decision support and examples of use at each node.
+ 2024.inlg-demos.4
+ belz-etal-2024-qcet-interactive
+
+
+ factgenie: A Framework for Span-based Evaluation of Generated Texts
+ ZdeněkKasner
+ OndrejPlatek
+ PatriciaSchmidtova
+ SimoneBalloccu
+ OndrejDusek
+ 13–15
+ We present ‘factgenie‘: a framework for annotating and visualizing word spans in textual model outputs. Annotations can capture various span-based phenomena such as semantic inaccuracies or irrelevant text. With ‘factgenie‘, the annotations can be collected both from human crowdworkers and large language models. Our framework consists of a web interface for data visualization and gathering text annotations, powered by an easily extensible codebase.
+ 2024.inlg-demos.5
+ kasner-etal-2024-factgenie-framework
+
+
+ Filling Gaps in Wikipedia: Leveraging Data-to-Text Generation to Improve Encyclopedic Coverage of Underrepresented Groups
+ SimonMille
+ MassimilianoPronesti
+ CraigThomson
+ MichelaLorandi
+ SophieFitzpatrick
+ RudaliHuidrom
+ MohammedSabry
+ AmyO’Riordan
+ AnyaBelz
+ 16–19
+ Wikipedia is known to have systematic gaps in its coverage that correspond to under-resourced languages as well as underrepresented groups. This paper presents a new tool to support efforts to fill in these gaps by automatically generating draft articles and facilitating post-editing and uploading to Wikipedia. A rule-based generator and an input-constrained LLM are used to generate two alternative articles, enabling the often more fluent, but error-prone, LLM-generated article to be content-checked against the more reliable, but less fluent, rule-generated article.
+ 2024.inlg-demos.6
+ mille-etal-2024-filling-gaps
+
+
+
+
+ Proceedings of the 17th International Natural Language Generation Conference: Tutorial Abstract
+ AnyaBelz
+ JoãoSedoc
+ CraigThomson
+ SimonMille
+ RudaliHuidrom
+ Association for Computational Linguistics
+ Tokyo, Japan
+ September
+ 2024
+ 2024.inlg-tutorials
+ inlg
+
+
+ 2024.inlg-tutorials.0
+ inlg-2024-tutorials
+
+
+ The INLG 2024 Tutorial on Human Evaluation of NLP System Quality: Background, Overall Aims, and Summaries of Taught Units
+ AnyaBelz
+ JoãoSedoc
+ CraigThomson
+ SimonMille
+ RudaliHuidrom
+ 1–12
+ Following numerous calls in the literature for improved practices and standardisation in human evaluation in Natural Language Processing over the past ten years, we held a tutorial on the topic at the 2024 INLG Conference. The tutorial addressed the structure, development, design, implementation, execution and analysis of human evaluations of NLP system quality. Hands-on practical sessions were run, designed to facilitate assimilation of the material presented. Slides, lecture recordings, code and data have been made available on GitHub (https://github.com/Human-Evaluation-Tutorial/INLG-2024-Tutorial). In this paper, we provide summaries of the content of the eight units of the tutorial, alongside its research context and aims.
+ 2024.inlg-tutorials.1
+ belz-etal-2024-inlg
+
+
+
+
+ 2024.aiwolfdial-1
+ 2024.practicald2t-1
+
+
+
diff --git a/data/xml/2024.ml4al.xml b/data/xml/2024.ml4al.xml
index ff25726c28..1a034a9a92 100644
--- a/data/xml/2024.ml4al.xml
+++ b/data/xml/2024.ml4al.xml
@@ -17,13 +17,13 @@
Hybrid in Bangkok, Thailand and online
August
2024
- 2024.ml4al-1
+ 2024.ml4al-1
ml4al
ws
- 2024.ml4al-1.0
- ml4al-2024-machine
+ 2024.ml4al-1.0
+ ml4al-2024-1
Challenging Error Correction in Recognised Byzantine Greek
@@ -39,7 +39,7 @@
FranzFischerUniversity of Venice
1-12
Automatic correction of errors in Handwritten Text Recognition (HTR) output poses persistent challenges yet to be fully resolved. In this study, we introduce a shared task aimed at addressing this challenge, which attracted 271 submissions, yielding only a handful of promising approaches. This paper presents the datasets, the most effective methods, and an experimental analysis in error-correcting HTRed manuscripts and papyri in Byzantine Greek, the language that followed Classical and preceded Modern Greek. By using recognised and transcribed data from seven centuries, the two best-performing methods are compared, one based on a neural encoder-decoder architecture and the other based on engineered linguistic rules. We show that the recognition error rate can be reduced by both, up to 2.5 points at the level of characters and up to 15 at the level of words, while also elucidating their respective strengths and weaknesses.
- 2024.ml4al-1.1
+ 2024.ml4al-1.1
pavlopoulos-etal-2024-challenging
@@ -50,7 +50,7 @@
MosheKoppelBar-Ilan University
13-18
Hebrew manuscripts preserve thousands of textual transmissions of post-Biblical Hebrew texts from the first millennium. In many cases, the text in the manuscripts is not fully decipherable, whether due to deterioration, perforation, burns, or otherwise. Existing BERT models for Hebrew struggle to fill these gaps, due to the many orthographical deviations found in Hebrew manuscripts. We have pretrained a new dedicated BERT model, dubbed MsBERT (short for: Manuscript BERT), designed from the ground up to handle Hebrew manuscript text. MsBERT substantially outperforms all existing Hebrew BERT models regarding the prediction of missing words in fragmentary Hebrew manuscript transcriptions in multiple genres, as well as regarding the task of differentiating between quoted passages and exegetical elaborations. We provide MsBERT for free download and unrestricted use, and we also provide an interactive and user-friendly website to allow manuscripts scholars to leverage the power of MsBERT in their scholarly work of reconstructing fragmentary Hebrew manuscripts.
- 2024.ml4al-1.2
+ 2024.ml4al-1.2
shmidman-etal-2024-msbert
@@ -58,7 +58,7 @@
FedericaGambaInstitute of Formal and Applied Linguistics, Charles University Prague
19-29
This paper explores the possibility to exploit different Pretrained Language Models (PLMs) to assist in a manual annotation task consisting in assigning the appropriate sense to verbal predicates in a Latin text. Indeed, this represents a crucial step when annotating data according to the Uniform Meaning Representation (UMR) framework, designed to annotate the semantic content of a text in a cross-linguistic perspective. We approach the study as a Word Sense Disambiguation task, with the primary goal of assessing the feasibility of leveraging available resources for Latin to streamline the labor-intensive annotation process. Our methodology revolves around the exploitation of contextual embeddings to compute token similarity, under the assumption that predicates sharing a similar sense would also share their context of occurrence. We discuss our findings, emphasizing applicability and limitations of this approach in the context of Latin, for which the limited amount of available resources poses additional challenges.
- 2024.ml4al-1.3
+ 2024.ml4al-1.3
gamba-2024-predicate
@@ -70,7 +70,7 @@
JacoboMyerston
30-41
Cuneiform is the oldest writing system used for more than 3,000 years in ancient Mesopotamia. Cuneiform is written on clay tablets, which are hard to date because they often lack explicit references to time periods and their paleographic traits are not always reliable as a dating criterion. In this paper, we systematically analyse cuneiform dating problems using machine learning. We build baseline models for both visual and textual features and identify two major issues: confounds and distribution shift. We apply adversarial regularization and deep domain adaptation to mitigate these issues. On tablets from the same museum collections represented in the training set, we achieve accuracies as high as 84.42%. However, when test tablets are taken from held-out collections, models generalize more poorly. This is only partially mitigated by robust learning techniques, highlighting important challenges for future work.
- 2024.ml4al-1.4
+ 2024.ml4al-1.4
chen-etal-2024-classification
@@ -81,7 +81,7 @@
OrlyGoldwasserHebrew University of Jerusalem
42-47
The complex Ancient Egyptian (AE) writing system was characterised by widespread use of graphemic classifiers (determinatives): silent (unpronounced) hieroglyphic signs clarifying the meaning or indicating the pronunciation of the host word. The study of classifiers has intensified in recent years with the launch and quick growth of the iClassifier project, a web-based platform for annotation and analysis of classifiers in ancient and modern languages. Thanks to the data contributed by the project participants, it is now possible to formulate the identification of classifiers in AE texts as an NLP task. In this paper, we make first steps towards solving this task by implementing a series of sequence-labelling neural models, which achieve promising performance despite the modest amount of training data. We discuss tokenisation and operationalisation issues arising from tackling AE texts and contrast our approach with frequency-based baselines.
- 2024.ml4al-1.5
+ 2024.ml4al-1.5
nikolaev-etal-2024-classifier
@@ -92,7 +92,7 @@
ToshinobuOgisoNational Insititute for Japanese Language and Linguistics
48-55
In Japanese, the natural minimal phrase of a sentence is the “bunsetsu” and it serves as a natural boundary of a sentence for native speakers rather than words, and thus grammatical analysis in Japanese linguistics commonly operates on the basis of bunsetsu units. In contrast, because Japanese does not have delimiters between words, there are two major categories of word definition, namely, Short Unit Words (SUWs) and Long Unit Words (LUWs). Though a SUW dictionary is available, LUW is not. Hence, this study focuses on providing a deep learning-based (or LLM-based) bunsetsu and Long Unit Words analyzer for the Heian period (AD 794-1185) and evaluating its performances. We model the parser as a transformer-based joint sequential labeling model, which combines bunsetsu BI tag, LUW BI tag, and LUW Part-of-Speech (POS) tag for each SUW token. We train our models on corpora of each period including contemporary and historical Japanese. The results range from 0.976 to 0.996 in f1 value for both bunsetsu and LUW reconstruction indicating that our models achieve comparable performance with models for a contemporary Japanese corpus. Through the statistical analysis and diachronic case study, the estimation of bunsetsu could be influenced by the grammaticalization of morphemes.
- 2024.ml4al-1.6
+ 2024.ml4al-1.6
ozaki-etal-2024-long
@@ -101,7 +101,7 @@
JustinBarneyWestern Michigan University
56-60
The Machine-Actionable Ancient Text (MAAT) Corpus is a new resource providing training and evaluation data for restoring lacunae in ancient Greek, Latin, and Coptic texts. Current text restoration systems require large amounts of data for training and task-relevant means for evaluation. The MAAT Corpus addresses this need by converting texts available in EpiDoc XML format into a machine-actionable format that preserves the most textually salient aspects needed for machine learning: the text itself, lacunae, and textual restorations. Structured test cases are generated from the corpus that align with the actual text restoration task performed by papyrologists and epigraphists, enabling more realistic evaluation than the synthetic tasks used previously. The initial 1.0 beta release contains approximately 134,000 text editions, 178,000 text blocks, and 750,000 individual restorations, with Greek and Latin predominating. This corpus aims to facilitate the development of computational methods to assist scholars in accurately restoring ancient texts.
- 2024.ml4al-1.7
+ 2024.ml4al-1.7
fitzgerald-barney-2024-new
@@ -113,7 +113,7 @@
AmirZeldesGeorgetown University
61-70
Ancient manuscripts are frequently damaged, containing gaps in the text known as lacunae. In this paper, we present a bidirectional RNN model for character prediction of Coptic characters in manuscript lacunae. Our best model performs with 72% accuracy on single character reconstruction, but falls to 37% when reconstructing lacunae of various lengths. While not suitable for definitive manuscript reconstruction, we argue that our RNN model can help scholars rank the likelihood of textual reconstructions. As evidence, we use our RNN model to rank reconstructions in two early Coptic manuscripts. Our investigation shows that neural models can augment traditional methods of textual restoration, providing scholars with an additional tool to assess lacunae in Coptic manuscripts.
- 2024.ml4al-1.8
+ 2024.ml4al-1.8
levine-etal-2024-lacuna
@@ -124,7 +124,7 @@
AlessandroLenciUniversity of Pisa
71-86
This work explores the potential of Transformer models focusing on the translation of ancient Egyptian hieroglyphs. We present a novel Hieroglyphic Transformer model, built upon the powerful M2M-100 multilingual translation framework and trained on a dataset we customised from the Thesaurus Linguae Aegyptiae database. Our experiments demonstrate promising results, with the model achieving significant accuracy in translating hieroglyphics into both German and English. This work holds significant implications for Egyptology, potentially accelerating the translation process and unlocking new research approaches.
- 2024.ml4al-1.9
+ 2024.ml4al-1.9
cao-etal-2024-deep
@@ -133,7 +133,7 @@
Eliese-SophiaLinckeFreie Universität Berlin
87-97
We present models for lemmatizing and POS-tagging Earlier Egyptian, Coptic and Demotic to test the performance of our pipeline for the ancient languages of Egypt. Of these languages, Demotic and Egyptian are known to be difficult to annotate due to their high extent of ambiguity. We report lemmatization accuracy of 86%, 91% and 99%, and XPOS-tagging accuracy of 89%, 95% and 98% for Earlier Egyptian, Demotic and Coptic, respectively.
- 2024.ml4al-1.10
+ 2024.ml4al-1.10
sahala-lincke-2024-neural
@@ -146,7 +146,7 @@
XueshanLiHenan Normal University
98-106
Oracle bone script (OBS) is the earliest writing system in China, which is of great value in the improvement of archaeology and Chinese cultural history. However, there are some problems such as the lack of labels and the difficulty of distinguishing the glyphs from the background of OBS, which prevent automatic recognition of OBS in the real world from achieving satisfactory results. In this paper, we propose a character recognition method based on an unsupervised domain adaptive network (UFCNet). Firstly, a convolutional attention fusion module (CAFM) is designed in the encoder to obtain more global features through multi-layer feature fusion. Secondly, we construct a Fourier transform (FT) module that focuses on the differences between glyphs and backgrounds. Finally, to further improve the network’s ability to recognize character edges, we introduce a kernel norm-constrained loss function. Extensive experiments performed on the Oracle-241 dataset show that the proposed method is superior to other adaptive methods. The code will be available at https://github.com/zhouynan/UFCNet.
- 2024.ml4al-1.11
+ 2024.ml4al-1.11
zhou-etal-2024-ufcnet
@@ -158,7 +158,7 @@
XueshanLiHenan Normal University
107-114
Due to their ancient origin, there are many incomplete characters in the unearthed Oracle Bone Inscriptions (OBI), which brings great challenges to recognition and research. In recent years, image inpainting techniques have made remarkable progress. However, these models are unable to adapt to the unique font shape and complex text background of OBI. To meet these aforementioned challenges, we propose a two-stage method for restoring damaged OBI using Generative Adversarial Networks (GAN), which incorporates a dual discriminator structure to capture both global and local image information. In order to accurately restore the image structure and details, the spatial attention mechanism and a novel loss function are proposed. By feeding clear copies of existing OBI and various types of masks into the network, it learns to generate content for the missing regions. Experimental results demonstrate the effectiveness of our proposed method in completing OBI compared to several state-of-the-art techniques.
- 2024.ml4al-1.12
+ 2024.ml4al-1.12
wang-etal-2024-coarse
@@ -167,7 +167,7 @@
DimitriosKosmopoulos
115-129
We investigate the problem of restoring Mycenaean linear B clay tablets, dating from about 1400 B.C. to roughly 1200 B.C., by using text infilling methods based on machine learning models. Our goals here are: first to try to improve the results of the methods used in the related literature by focusing on the characteristics of the Mycenaean Linear B writing system (series D), second to examine the same problem for the first time on series A&B and finally to investigate transfer learning using series D as source and the smaller series A&B as target. Our results are promising in the supervised learning tasks, while further investigation is needed to better exploit the merits of transfer learning.
- 2024.ml4al-1.13
+ 2024.ml4al-1.13
papavassileiou-kosmopoulos-2024-restoring
@@ -180,7 +180,7 @@
RoeyLalazar
130-140
Cuneiform documents, the earliest known form of writing, are prolific textual sources of the ancient past. Experts publish editions of these texts in transliteration using specialized typesetting, but most remain inaccessible for computational analysis in traditional printed books or legacy materials. Off-the-shelf OCR systems are insufficient for digitization without adaptation. We present CuReD (Cuneiform Recognition-Documents), a deep learning-based human-in-the-loop OCR pipeline for digitizing scanned transliterations of cuneiform texts. CuReD has a character error rate of 9% on clean data and 11% on representative scans. We digitized a challenging sample of transliterated cuneiform documents, as well as lexical index cards from the University of Pennsylvania Museum, demonstrating the feasibility of our platform for enabling computational analysis and bolstering machine-readable cuneiform text datasets. Our results provide the first human-in-the-loop pipeline and interface for digitizing transliterated cuneiform sources and legacy materials, enabling the enrichment of digital sources of these low-resource languages.
- 2024.ml4al-1.14
+ 2024.ml4al-1.14
gordin-etal-2024-cured
@@ -188,7 +188,7 @@
FlorianKesslerFriedrich-Alexander Universität Erlangen-Nürnberg
141-151
For the automatic processing of Classical Chinese texts it is highly desirable to normalize variant characters, i.e. characters with different visual forms that are being used to represent the same morpheme, into a single form. However, there are some variant characters that are used interchangeably by some writers but deliberately employed to distinguish between different meanings by others. Hence, in order to avoid losing information in the normalization processes by conflating meaningful distinctions between variants, an intelligent normalization system that takes context into account is needed. Towards the goal of developing such a system, in this study, we describe how a dataset with usage samples of variant characters can be extracted from a corpus of paired editions of multiple texts. Using the dataset, we conduct two experiments, testing whether models can be trained with contextual word embeddings to predict variant characters. The results of the experiments show that while this is often possible for single texts, most conventions learned do not transfer well between documents.
- 2024.ml4al-1.15
+ 2024.ml4al-1.15
kessler-2024-towards
@@ -201,7 +201,7 @@
MargheritaFantoliKU Leuven
152-164
In this paper, we present a study of transformer-based Named Entity Recognition (NER) as applied to Ancient Greek texts, with an emphasis on retrieving personal names. Recent research shows that, while the task remains difficult, the use of transformer models results in significant improvements. We, therefore, compare the performance of four transformer models on the task of NER for the categories of people, locations and groups, and add an out-of-domain test set to the existing datasets. Results on this set highlight the shortcomings of the models when confronted with a random sample of sentences. To be able to more straightforwardly integrate domain and linguistic knowledge to improve performance, we narrow down our approach to the category of people. The task is simplified to a binary PERS/MISC classification on the token level, starting from capitalised words. Next, we test the use of domain and linguistic knowledge to improve the results. We find that including simple gazetteer information as a binary mask has a marginally positive effect on newly annotated data and that treebanks can be used to help identify multi-word individuals if they are scarcely or inconsistently annotated in the available training data. The qualitative error analysis identifies the potential for improvement in both manual annotation and the inclusion of domain and linguistic knowledge in the transformer models.
- 2024.ml4al-1.16
+ 2024.ml4al-1.16
beersmans-etal-2024-gotta
@@ -210,7 +210,7 @@
WouterMercelisKU Leuven
165-176
Natural language processing for Greek and Latin, inflectional languages with small corpora, requires special techniques. For morphological tagging, transformer models show promising potential, but the best approach to use these models is unclear. For both languages, this paper examines the impact of using morphological lexica, training different model types (a single model with a combined feature tag, multiple models for separate features, and a multi-task model for all features), and adding linguistic constraints. We find that, although simply fine-tuning transformers to predict a monolithic tag may already yield decent results, each of these adaptations can further improve tagging accuracy.
- 2024.ml4al-1.17
+ 2024.ml4al-1.17
keersmaekers-mercelis-2024-adapting
@@ -219,9 +219,13 @@
MatthewSwindall
JamesBrusuelas
JohnWallinMiddle Tennessee State University
+ FrancescaMaltomini
+ MariusGerhardt
+ MarziaD’Angelo
+ JohnF. Wallin
177-185
In this paper we present a deep learning pipeline for automatically dating ancient Greek papyrus fragments based solely on fragment images. The overall pipeline consists of several stages, including handwritten text recognition (HTR) to detect and classify characters, filtering and grouping of detected characters, 24 character-level date prediction models, and a fragment-level date prediction model that utilizes the per-character predictions. A new dataset (containing approximately 7,000 fragment images and 778,000 character images) was created by scraping papyrus databases, extracting fragment images with known dates, and running them through our HTR models to obtain labeled character images. Transfer learning was then used to fine-tune separate ResNets to predict dates for individual characters which are then used, in aggregate, to train the fragment-level date prediction model. Experiments show that even though the average accuracies of the character-level dating models are low, between 35% and 45%, the fragment-level model can achieve up to 79% accuracy in predicting a broad, two-century date range for fragments with many characters. We then discuss the limitations of this approach and outline future work to improve temporal resolution and to conduct further testing on additional papyri. This image-based deep learning approach has great potential to assist scholars in the palaeographical analysis and dating of ancient Greek manuscripts.
- 2024.ml4al-1.18
+ 2024.ml4al-1.18
west-etal-2024-deep
@@ -230,7 +234,7 @@
AdamAndersonUniversity of California, Berkeley
186-191
Beginning with the discovery of the cuneiform writing system in 1835, there have been numerous grammars published illustrating the complexities of the Sumerian language. However, the one thing they have in common is their omission of dependency rules for syntax in Sumerian linguistics. For this reason we are working toward a better understanding of Sumerian syntax, by means of dependency-grammar in the Universal Dependencies (UD) framework. Therefore, in this study we articulate the methods and engineering techniques that can address the hardships in annotating dependency relationships in the Sumerian texts in transliteration from the Electronic Text Corpora of Sumerian (ETCSUX). Our code can be found at https://github.com/ancient-world-citation-analysis/UD-ETCSUX.
- 2024.ml4al-1.19
+ 2024.ml4al-1.19
jiang-anderson-2024-ud
@@ -239,8 +243,8 @@
RichardDiehl MartinezUniversity of Cambridge
DanJurafskyStanford University
192-202
- Sumerian transliteration is a conventional system for representing a scholar's interpretation of a tablet in the Latin script. Thanks to visionary digital Assyriology projects such as ETCSL, CDLI, and Oracc, a large number of Sumerian transliterations have been published online, and these data are well-structured for a variety of search and analysis tasks. However, the absence of a comprehensive, accessible dataset pairing transliterations with a digital representation of the tablet's cuneiform glyphs has prevented the application of modern Natural Language Processing (NLP) methods to the task of Sumerian transliteration. To address this gap, we present SumTablets, a dataset pairing Unicode representations of 91,606 Sumerian cuneiform tablets (totaling 6,970,407 glyphs) with the associated transliterations published by Oracc. We construct SumTablets by first preprocessing and standardizing the Oracc transliterations before mapping each reading back to the Unicode representation of the source glyph. Furthermore, we retain parallel structural information (e.g., surfaces, newlines, broken segments) through the use of special tokens. We release SumTablets as a Hugging Face Dataset (CC BY 4.0) and open source data preparation code via GitHub. Additionally, we leverage SumTablets to implement and evaluate two transliteration baselines: (1) weighted sampling from a glyph's possible readings, and (2) fine-tuning an autoregressive language model. Our fine-tuned language model achieves an average transliteration character-level F-score (chrF) of 97.55, demonstrating the immediate potential of transformer-based transliteration models in allowing experts to rapidly verify generated transliterations rather than manually transliterating tablets one-by-one.
- 2024.ml4al-1.20
+ Transliterating Sumerian is a key step in understanding Sumerian texts, but remains a difficult and time-consuming task. With more than 100,000 known texts and comparatively few specialists, manually maintaining up-to-date transliterations for the entire corpus is impractical. While many transliterations have been published online thanks to the dedicated effort of previous projects, the lack of a comprehensive, easily accessible dataset that pairs digital representations of source glyphs with their transliterations has hindered the application of natural language processing (NLP) methods to this task. To address this gap, we present SumTablets, the largest collection of Sumerian cuneiform tablets structured as Unicode glyph–transliteration pairs. Our dataset comprises 91,606 tablets (totaling 6,970,407 glyphs) with associated period and genre metadata. We release SumTablets as a Hugging Face Dataset. To construct SumTablets, we first preprocess and standardize publicly available transliterations. We then map them back to a Unicode representation of their source glyphs, retaining parallel structural information (e.g., surfaces, newlines, broken segments) through the use of special tokens. We leverage SumTablets to implement and evaluate two transliteration approaches: 1) weighted sampling from a glyph’s possible readings, and 2) fine-tuning an autoregressive language model. Our fine-tuned language model achieves an average transliteration character-level F-score (chrF) of 97.55, demonstrating the potential use of deep learning methods in Assyriological research.
+ 2024.ml4al-1.20
simmons-etal-2024-sumtablets
@@ -250,7 +254,7 @@
LaureThompsonPrinceton University and University of Massachusetts, Amherst
203-218
Existing Latin treebanks draw from Latin’s long written tradition, spanning 17 centuries and a variety of cultures. Recent efforts have begun to harmonize these treebanks’ annotations to better train and evaluate morphological taggers. However, the heterogeneity of these treebanks must be carefully considered to build effective and reliable data. In this work, we review existing Latin treebanks to identify the texts they draw from, identify their overlap, and document their coverage across time and genre. We additionally design automated conversions of their morphological feature annotations into the conventions of standard Latin grammar. From this, we build new time-period data splits that draw from the existing treebanks which we use to perform a broad cross-time analysis for POS and morphological feature tagging. We find that BERT-based taggers outperform existing taggers while also being more robust to cross-domain shifts.
- 2024.ml4al-1.21
+ 2024.ml4al-1.21
hudspeth-etal-2024-latin
@@ -259,7 +263,7 @@
IkkiOhmukaiThe University of Tokyo
219-223
This study analyzes the verses of the Rigveda, the oldest Sanskrit text, from a metrical perspective. Based on metrical structures, the verses are represented by four elements: light syllables, heavy syllables, word boundaries, and line boundaries. As a result, it became evident that among verses traditionally categorized under the same metrical name, there are those forming distinct clusters. Furthermore, the study reveals commonalities in metrical structures, such as similar metrical patterns grouping together despite differences in the number of lines. Going forward, it is anticipated that this methodology will enable comparisons across multiple languages within the Indo-European language family.
- 2024.ml4al-1.22
+ 2024.ml4al-1.22
tsukagoshi-ohmukai-2024-metronome
@@ -267,7 +271,7 @@
PriyankaMandikalUniversity of Texas, Austin
224-250
LLMs have revolutionized the landscape of information retrieval and knowledge dissemination. However, their application in specialized areas is often hindered by limitations such as factual inaccuracies and hallucinations, especially in long-tail knowledge distributions. In this work, we explore the potential of retrieval-augmented generation (RAG) models in performing long-form question answering (LFQA) on a specially curated niche and custom knowledge domain. We present VedantaNY-10M, a dataset curated from extensive public discourses on the ancient Indian philosophy of Advaita Vedanta. We develop and benchmark a RAG model against a standard, non-RAG LLM, focusing on transcription, retrieval, and generation performance. A human evaluation involving computational linguists and domain experts, shows that the RAG model significantly outperforms the standard model in producing factual, comprehensive responses having fewer hallucinations. In addition, we find that a keyword-based hybrid retriever that focuses on unique low-frequency words further improves results. Our study provides insights into the future development of real-world RAG models for custom and niche areas of knowledge.
- 2024.ml4al-1.23
+ 2024.ml4al-1.23
mandikal-2024-ancient
@@ -279,7 +283,7 @@
JosephDexter
251-259
In literary critical applications, stylometry can benefit from hand-curated feature sets capturing various syntactic and rhetorical functions. For premodern languages, calculation of such features is hampered by a lack of adequate computational resources for accurate part-of-speech tagging and semantic disambiguation. This paper reports an evaluation of POS-taggers for Latin and their use in augmenting a hand-curated stylometric feature set. Our experiments show that POS-augmented features not only provide more accurate counts than POS-blind features but also perform better on tasks such as genre classification. In the course of this work we introduce POS n-grams as a feature for Latin stylometry.
- 2024.ml4al-1.24
+ 2024.ml4al-1.24
chen-etal-2024-leveraging
@@ -289,8 +293,19 @@
EltonBarkerOpen University
260-268
Past research has statistically modelled the language of the Homeric poems, assessing the degree of surprisal for each verse through diverse metrics and resulting in the HoLM resource. In this study we utilise the HoLM resource to explore cross-poem affinity at the verse level, looking at Iliadic verses and passages that are less surprising to the Odyssean model than to the Iliadic one and vice versa. Using the same tool, we investigate verses that evoke greater surprise when assessed by a local model trained solely on their source book, compared to a global model trained on the entire source poem. Investigating the distribution of such verses across the Homeric poems in more depth, we employ machine learning text classification to further analyse cross-poem affinity quantitatively in selected books.
- 2024.ml4al-1.25
+ 2024.ml4al-1.25
konstantinidou-etal-2024-exploring
+
+ Detecting Narrative Patterns in Biblical Hebrew and Greek
+ HopeMcGovern
+ HaleSirin
+ TomLippincottDepartment of Computer Science, Whiting School of Engineering
+ AndrewCainesUniversity of Cambridge
+ 269-279
+ We present a novel approach to extracting recurring narrative patterns, or type-scenes, in Biblical Hebrew and Biblical Greek with an information retrieval network. We use cross-references to train an encoder model to create similar representations for verses linked by a cross-reference. We then query our trained model with phrases informed by humanities scholarship and designed to elicit particular kinds of narrative scenes. Our models can surface relevant instances in the top-10 ranked candidates in many cases. Through manual error analysis and discussion, we address the limitations and challenges inherent in our approach. Our findings contribute to the field of Biblical scholarship by offering a new perspective on narrative analysis within ancient texts, and to computational modeling of narrative with a genre-agnostic approach for pattern-finding in long, literary texts.
+ 2024.ml4al-1.26
+ mcgovern-etal-2024-detecting
+
diff --git a/data/xml/2024.naacl.xml b/data/xml/2024.naacl.xml
index e8f65fc41a..b63c7ea8f9 100644
--- a/data/xml/2024.naacl.xml
+++ b/data/xml/2024.naacl.xml
@@ -8681,7 +8681,7 @@
Shears: Unstructured Sparsity with Neural Low-rank Adapter Search
- JuanMunozIntel
+ J. PabloMuñozIntel
JinjieYuanIntel
NileshJainIntel
395-405
diff --git a/data/xml/2024.practicald2t.xml b/data/xml/2024.practicald2t.xml
new file mode 100644
index 0000000000..20fdd60b65
--- /dev/null
+++ b/data/xml/2024.practicald2t.xml
@@ -0,0 +1,52 @@
+
+
+
+
+ Proceedings of the 2nd Workshop on Practical LLM-assisted Data-to-Text Generation
+ SimoneBalloccu
+ ZdeněkKasner
+ OndřejPlátek
+ PatríciaSchmidtová
+ KristýnaOnderková
+ MateuszLango
+ OndřejDušek
+ LucieFlek
+ EhudReiter
+ DimitraGkatzia
+ SimonMille
+ Association for Computational Linguistics
+ Tokyo, Japan
+ September
+ 2024
+ 2024.practicald2t-1
+ practicald2t
+ ws
+
+
+ 2024.practicald2t-1.0
+ practicald2t-2024-1
+
+
+ Beyond the Hype: Identifying and Analyzing Math Word Problem-Solving Challenges for Large Language Models
+ Romina SoledadAlbornoz-De Luise
+ DavidArnau
+ PabloArnau-González
+ MiguelArevalillo-Herráez
+ 1–6
+ Despite not being explicitly trained for this purpose, models like Mistral and LLaMA have demonstrated impressive results across numerous tasks, including generating solutions to Mathematical Word Problems (MWPs). An MWP involves translating a textual description into a mathematical model or equation that solves it. However, these models face challenges in accurately interpreting and utilizing the numerical information present in MWP statements, which can lead to errors in the generated solutions. To better understand the limitations of LLMs, we analyzed the MWPs where models failed to accurately solve problems from the SVAMP dataset. By categorizing these MWPs, we identify specific types of problems where the models are most prone to errors, providing insights into the underlying challenges faced by LLMs in problem-solving scenarios and opening new modeling opportunities. By understanding the expected errors, researchers can design strategies to model problems more effectively and choose the most suitable LLM for solving them, taking into account each model’s strengths and weaknesses.
+ 2024.practicald2t-1.1
+ albornoz-de-luise-etal-2024-beyond
+
+
+ Enhancing Situation Awareness through Model-Based Explanation Generation
+ KonstantinosGavriilidis
+ IoannisKonstas
+ HelenHastie
+ WeiPang
+ 7–16
+ Robots are often deployed in remote locations for tasks such as exploration, where users cannot directly perceive the agent and its environment. For Human-In-The-Loop applications, operators must have a comprehensive understanding of the robot’s current state and its environment to take necessary actions and effectively assist the agent. In this work, we compare different explanation styles to determine the most effective way to convey real-time updates to users. Additionally, we formulate these explanation styles as separate fine-tuning tasks and assess the effectiveness of large language models in delivering in-mission updates to maintain situation awareness. The code and dataset for this work are available at:———
+ 2024.practicald2t-1.2
+ gavriilidis-etal-2024-enhancing
+
+
+
diff --git a/data/xml/2024.repl4nlp.xml b/data/xml/2024.repl4nlp.xml
index a984f8f7ad..3d128059db 100644
--- a/data/xml/2024.repl4nlp.xml
+++ b/data/xml/2024.repl4nlp.xml
@@ -222,14 +222,6 @@
2024.repl4nlp-1.19
ki-etal-2024-mitigating
-
- On the Pathological Path-star Task for Language Models (Extended Abstract)
- ArvidFrydenlund
- 274-284
- The recently introduced path-star task is a minimal toy task designed to exemplify limitations to the abilities of language models (Bachmann and Nagarajan, 2024). It involves a path-star graph where multiple arms radiate from a single starting node and each node is unique. Then, given the start node and a specified target node which ends one of the arms, the task is to generate the arm containing that target node. This is straightforward for a human but surprisingly difficult for a language model, which they found failed to predict above chance.They hypothesized this is due to a deficiency in teacher-forcing and next-token prediction paradigm. In this extended abstract, we demonstrate that the task is learnable using teacher-forcing in alternative settings and that the issue is (partially) due to representation. We analyze situations when the models fail to solve the task which leads us to introduce a regularization technique where we pack each training batch with multiple instances of the same graph but with differing target nodes to prevent overfitting. Initial results indicate this helps in solving the task.
- 2024.repl4nlp-1.20
- frydenlund-2024-pathological
-
Whitening Not Recommended for Classification Tasks in LLMs
AliForooghiUniversity of Windsor
@@ -240,16 +232,5 @@
2024.repl4nlp-1.21
forooghi-etal-2024-whitening
-
- LLM Circuit Analyses Are Consistent Across Training and Scale
- CurtTiggesEleutherAI Institute
- MichaelHannaUniversity of Amsterdam
- QinanYu
- StellaBidermanEleutherAI and Booz Allen Hamilton
- 290-303
- Most currently deployed large language models (LLMs) undergo continuous training or additional finetuning. By contrast, most research into LLMs’ internal mechanisms focuses on models at one snapshot in time (the end of pre-training), raising the question of whether their results generalize to real-world settings. Existing studies of mechanisms over time focus on encoder-only or toy models, which differ significantly from most deployed models. In this study, we track how model mechanisms, operationalized as circuits, emerge and evolve across 300 billion tokens of training in decoder-only LLMs, in models ranging from 70 million to 2.8 billion parameters. We find that task abilities and the functional components that support them emerge consistently at similar token counts across scale. Moreover, although such components may be implemented by different attention heads over time, the overarching algorithm that they implement remains. Surprisingly, both these algorithms and the types of components involved therein tend to replicate across model scale. Finally, we find that circuit size correlates with model size and can fluctuate considerably over time even when the same algorithm is implemented. These results suggest that circuit analyses conducted on small models at the end of pre-training can provide insights that still apply after additional training and over model scale.
- 2024.repl4nlp-1.22
- tigges-etal-2024-llm
-
diff --git a/data/xml/2024.sigdial.xml b/data/xml/2024.sigdial.xml
new file mode 100644
index 0000000000..8aecc67d7a
--- /dev/null
+++ b/data/xml/2024.sigdial.xml
@@ -0,0 +1,754 @@
+
+
+
+
+ Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue
+ TatsuyaKawahara
+ VeraDemberg
+ StefanUltes
+ KojiInoue
+ ShikibMehri
+ DavidHowcroft
+ KazunoriKomatani
+ Association for Computational Linguistics
+ Kyoto, Japan
+ September
+ 2024
+ 2024.sigdial-1
+ sigdial
+
+
+ 2024.sigdial-1.0
+ sigdial-2024-1
+
+
+ Dialogue Discourse Parsing as Generation: A Sequence-to-Sequence LLM-based Approach
+ ChuyuanLi
+ YuweiYin
+ GiuseppeCarenini
+ 1–14
+ Existing works on dialogue discourse parsing mostly utilize encoder-only models and sophisticated decoding strategies to extract structures. Despite recent advances in Large Language Models (LLMs), there has been little work directly applying these models to discourse parsing. To fully utilize the rich semantic and discourse knowledge in LLMs, we explore the feasibility of transforming discourse parsing into a generation task using a text-to-text paradigm. Our approach is intuitive and requires no modification of the LLM architecture. Experimental results on STAC and Molweni datasets show that a sequence-to-sequence model such as T0 can perform reasonably well. Notably, our improved transition-based sequence-to-sequence system achieves new state-of-the-art performance on Molweni, demonstrating the effectiveness of the proposed method. Furthermore, our systems can generate richer discourse structures such as directed acyclic graphs, whereas previous methods are limited to trees.
+ 2024.sigdial-1.1
+ li-etal-2024-dialogue
+
+
+ Rhetorical Strategies in the UN Security Council: Rhetorical Structure Theory and Conflicts
+ KarolinaZaczynska
+ ManfredStede
+ 15–28
+ More and more corpora are being annotated with Rhetorical Structure Theory (RST) trees, often in a multi-layer scenario, as analyzing RST annotations in combination with other layers can lead to a deeper understanding of texts. To date, however, prior work on RST for the analysis of diplomatic language is scarce. We are interested in political speeches and investigate what rhetorical strategies diplomats use to communicate critique or deal with disputes. To this end, we present a new dataset with RST annotations of 82 diplomatic speeches aligned to existing Conflict annotations (UNSC-RST). We explore ways of using rhetorical trees to analyze an annotated multi-layer corpus, looking at both the relation distribution and the tree structure of speeches. In preliminary analyses we already see patterns that are characteristic for particular topics or countries.
+ 2024.sigdial-1.2
+ zaczynska-stede-2024-rhetorical
+
+
+ Elaborative Simplification for German-Language Texts
+ FreyaHewett
+ HadiAsghari
+ ManfredStede
+ 29–39
+ There are many strategies used to simplify texts. In this paper, we focus specifically on the act of inserting information or elaborative simplification. Adding information is done for various reasons, such as providing definitions for concepts, making relations between concepts more explicit, and providing background information that is a prerequisite for the main content. As all of these reasons have the main goal of ensuring coherence, we first conduct a corpus analysis of simplified German-language texts that have been annotated with Rhetorical Structure Theory (RST). We focus specifically on how additional information is incorporated into the RST annotation for a text. We then transfer these insights to automatic simplification using Large Language Models (LLMs), as elaborative simplification is a nuanced task which LLMs still seem to struggle with.
+ 2024.sigdial-1.3
+ hewett-etal-2024-elaborative
+
+
+ Examining Gender and Power on Wikipedia through Face and Politeness
+ AdilSoubki
+ Shyne E.Choi
+ OwenRambow
+ 40–50
+ We propose a framework for analyzing discourse by combining two interdependent concepts from sociolinguistic theory: face acts and politeness. While politeness has robust existing tools and data, face acts are less resourced. We introduce a new corpus created by annotating Wikipedia talk pages with face acts and we use this to train a face act tagger. We then employ our framework to study how face and politeness interact with gender and power in discussions between Wikipedia editors. Among other findings, we observe that female Wikipedians are not only more polite, which is consistent with prior studies, but that this difference corresponds with significantly more language directed at humbling aspects of their own face. Interestingly, the distinction nearly vanishes once limiting to editors with administrative power.
+ 2024.sigdial-1.4
+ soubki-etal-2024-examining
+
+
+ ReALM: Reference Resolution as Language Modeling
+ Joel Ruben AntonyMoniz
+ SoundaryaKrishnan
+ MelisOzyildirim
+ PrathameshSaraf
+ Halim CagriAtes
+ YuanZhang
+ HongYu
+ 51–65
+ Reference resolution is an important problem, one that is essential to understand and successfully handle contexts of different kinds. This context includes both previous turns and context that pertains to non-conversational entities, such as entities on the user’s screen or those running in the background. While LLMs have been shown to be extremely powerful for a variety of tasks, their use in reference resolution, particularly for non-conversational entities, remains underutilized. This paper demonstrates how LLMs can be used to create an effective system to resolve references of various types, by showing how reference resolution can be converted into a language modeling problem, despite involving forms of entities like those on screen that are not traditionally conducive to being reduced to a text-only modality. We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references. We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it.
+ 2024.sigdial-1.5
+ moniz-etal-2024-realm
+
+
+ Dialog Flow Induction for Constrainable LLM-Based Chatbots
+ StutiAgrawal
+ PranavPillai
+ NishiUppuluri
+ RevanthGangi Reddy
+ ShaLi
+ GokhanTur
+ DilekHakkani-Tur
+ HengJi
+ 66–77
+ LLM-driven dialog systems are used in a diverse set of applications, ranging from healthcare to customer service. However, given their generalization capability, it is difficult to ensure that these chatbots stay within the boundaries of the specialized domains, potentially resulting in inaccurate information and irrelevant responses. This paper introduces an unsupervised approach for automatically inducing domain-specific dialog flows that can be used to constrain LLM-based chatbots. We introduce two variants of dialog flow based on the availability of in-domain conversation instances. Through human and automatic evaluation over 24 dialog domains, we demonstrate that our high-quality data-guided dialog flows achieve better domain coverage, thereby overcoming the need for extensive manual crafting of such flows.
+ 2024.sigdial-1.6
+ agrawal-etal-2024-dialog
+
+
+ Knowledge-Grounded Dialogue Act Transfer using Prompt-Based Learning for Controllable Open-Domain NLG
+ AlainVazquez Risco
+ Angela MariaRamirez
+ NehaPullabhotla
+ NanQiang
+ HaoranZhang
+ MarilynWalker
+ Maria InesTorres
+ 78–91
+ Open domain spoken dialogue systems need to controllably generate many different dialogue acts (DAs) to allow Natural Language Generation (NLG) to create interesting and engaging conversational interactions with users. We aim to create an NLG engine that can produce a variety of DAs that make substantive knowledge-grounded contributions to a conversation. Training such an NLG typically requires dialogue corpora that are labelled for DAs, which are expensive to produce and vulnerable to quality issues. Here, we present a prompt-based learning approach to transfer DAs from one domain, video games, to 7 new domains. For each novel domain, we first crawl WikiData to create Meaning Representations that systematically vary both the number of attributes and hops on the WikiData Knowledge Graph. The proposed method involves a self-training step to create prompt examples for each domain followed by an overgeneration and ranking step. The result is a novel, high-quality dataset, Wiki-Dialogue, of 71K knowledge-grounded utterances, covering 9 DAs and the Art, Movies, Music, Sports, TV, Animal, and Boardgames domains, whose combined DA and semantic accuracy is 89%. We assess the corpus quality using both automatic and human evaluations and find it high. The corpus is found to be safe, lexically rich, and large in vocabulary, when compared to similar datasets.
+ 2024.sigdial-1.7
+ vazquez-risco-etal-2024-knowledge
+
+
+ Incremental Learning for Knowledge-Grounded Dialogue Systems in Industrial Scenarios
+ IzaskunFernandez
+ CristinaAceta
+ CristinaFernandez
+ Maria InesTorres
+ AitorEtxalar
+ ArianeMendez
+ MaiaAgirre
+ ManuelTorralbo
+ ArantzaDel Pozo
+ JosebaAgirre
+ EgoitzArtetxe
+ IkerAltuna
+ 92–102
+ In today’s industrial landscape, seamless collaboration between humans and machines is essential and requires a shared knowledge of the operational domain. In this framework, the technical knowledge for operator assistance has traditionally been derived from static sources such as technical documents. However, experienced operators hold invaluable know-how that can significantly contribute to supporting other operators. This work focuses on enhancing operator assistance tasks in the manufacturing industry by leveraging spoken natural language interaction. More specifically, a Human-in-the-Loop (HIL) incremental learning approach is proposed to integrate this expertise into a domain knowledge graph (KG) dynamically, along with the use of in-context learning for Large Language Models (LLMs) to benefit other capabilities of the system. Preliminary results of the experimentation carried out in an industrial scenario, where the graph size was increased by 25%, demonstrate that incrementally enhancing the KG benefits the dialogue system’s performance.
+ 2024.sigdial-1.8
+ fernandez-etal-2024-incremental
+
+
+ Anticipating Follow-Up Questions in Exploratory Information Search
+ GrahamWilcock
+ 103–109
+ The paper describes methods for anticipating follow-up questions in exploratory information search. There are two main cases: information stored in knowledge graphs, and information in unstructured texts such as Wikipedia. In the first case, follow-up questions are anticipated by extracting subgraphs relevant to user queries, passing the subgraphs to an LLM to generate responses. In the second case, entities and their relationships are extracted from the texts and added to short-term knowledge graphs relevant to initial queries. Follow-up questions are then anticipated by extracting subgraphs relevant to subsequent queries and passing the subgraphs to the LLM, as in the first case. The short-term graphs in dialogue memory are often sufficient to answer follow-up questions. If they are not, the described steps are repeated as required.
+ 2024.sigdial-1.9
+ wilcock-2024-anticipating
+
+
+ Bridging Information Gaps in Dialogues with Grounded Exchanges Using Knowledge Graphs
+ PhillipSchneider
+ NektariosMachner
+ KristiinaJokinen
+ FlorianMatthes
+ 110–120
+ Knowledge models are fundamental to dialogue systems for enabling conversational interactions, which require handling domain-specific knowledge. Ensuring effective communication in information-providing conversations entails aligning user understanding with the knowledge available to the system. However, dialogue systems often face challenges arising from semantic inconsistencies in how information is expressed in natural language compared to how it is represented within the system’s internal knowledge. To address this problem, we study the potential of large language models for conversational grounding, a mechanism to bridge information gaps by establishing shared knowledge between dialogue participants. Our approach involves annotating human conversations across five knowledge domains to create a new dialogue corpus called BridgeKG. Through a series of experiments on this dataset, we empirically evaluate the capabilities of large language models in classifying grounding acts and identifying grounded information items within a knowledge graph structure. Our findings offer insights into how these models use in-context learning for conversational grounding tasks and common prediction errors, which we illustrate with examples from challenging dialogues. We discuss how the models handle knowledge graphs as a semantic layer between unstructured dialogue utterances and structured information items.
+ 2024.sigdial-1.10
+ schneider-etal-2024-bridging
+
+
+ “Keep up the good work!”: Using Constraints in Zero Shot Prompting to Generate Supportive Teacher Responses
+ E. MargaretPerkoff
+ Angela MariaRamirez
+ Seanvon Bayern
+ MarilynWalker
+ JamesMartin
+ 121–138
+ Educational dialogue systems have been used to support students and teachers for decades. Such systems rely on explicit pedagogically motivated dialogue rules. With the ease of integrating large language models (LLMs) into dialogue systems, applications have been arising that directly use model responses without the use of human-written rules, raising concerns about their use in classroom settings. Here, we explore how to constrain LLM outputs to generate appropriate and supportive teacher-like responses. We present results comparing the effectiveness of different constraint variations in a zero-shot prompting setting on a large mathematics classroom corpus. Generated outputs are evaluated with human annotation for Fluency, Relevance, Helpfulness, and Adherence to the provided constraints. Including all constraints in the prompt led to the highest values for Fluency and Helpfulness, and the second highest value for Relevance. The annotation results also demonstrate that the prompts that result in the highest adherence to constraints do not necessarily indicate higher perceived scores for Fluency, Relevance, or Helpfulness. In a direct comparison, all of the non-baseline LLM responses were ranked higher than the actual teacher responses in the corpus over 50% of the time.
+ 2024.sigdial-1.11
+ perkoff-etal-2024-keep
+
+
+ HelloThere: A Corpus of Annotated Dialogues and Knowledge Bases of Time-Offset Avatars
+ AlbertoChierici
+ NizarHabash
+ 139–148
+ A Time-Offset Interaction Application (TOIA) is a software system that allows people to engage in face-to-face dialogue with previously recorded videos of other people. There are two TOIA usage modes: (a) creation mode, where users pre-record video snippets of themselves representing their answers to possible questions someone may ask them, and (b) interaction mode, where other users of the system can choose to interact with created avatars. This paper presents the HelloThere corpus that has been collected from two user studies involving several people who recorded avatars and many more who engaged in dialogues with them. The interactions with avatars are annotated by people asking them questions through three modes (card selection, text search, and voice input) and rating the appropriateness of their answers on a 1 to 5 scale. The corpus, made available to the research community, comprises 26 avatars’ knowledge bases and 317 dialogues between 64 interrogators and the avatars in text format.
+ 2024.sigdial-1.12
+ chierici-habash-2024-hellothere
+
+
+ It Couldn’t Help but Overhear: On the Limits of Modelling Meta-Communicative Grounding Acts with Supervised Learning
+ BrielenMadureira
+ DavidSchlangen
+ 149–158
+ Active participation in a conversation is key to building common ground, since understanding is jointly tailored by producers and recipients. Overhearers are deprived of the privilege of performing grounding acts and can only conjecture about intended meanings. Still, data generation and annotation, modelling, training and evaluation of NLP dialogue models place reliance on the overhearing paradigm. How much of the underlying grounding processes are thereby forfeited? As we show, there is evidence pointing to the impossibility of properly modelling human meta-communicative acts with data-driven learning models. In this paper, we discuss this issue and provide a preliminary analysis on the variability of human decisions for requesting clarification. Most importantly, we wish to bring this topic back to the community’s table, encouraging discussion on the consequences of having models designed to only “listen in”.
+ 2024.sigdial-1.13
+ madureira-schlangen-2024-couldnt
+
+
+ Data Augmentation Integrating Dialogue Flow and Style to Adapt Spoken Dialogue Systems to Low-Resource User Groups
+ ZhiyangQi
+ MichimasaInaba
+ 159–171
+ This study addresses the interaction challenges encountered by spoken dialogue systems (SDSs) when engaging with users who exhibit distinct conversational behaviors, particularly minors, in scenarios where data are scarce. We propose a novel data augmentation framework to enhance SDS performance for user groups with limited resources. Our approach leverages a large language model (LLM) to extract speaker styles and a pre-trained language model (PLM) to simulate dialogue act history. This method generates enriched and personalized dialogue data, facilitating improved interactions with unique user demographics. Extensive experiments validate the efficacy of our methodology, highlighting its potential to foster the development of more adaptive and inclusive dialogue systems.
+ 2024.sigdial-1.14
+ qi-inaba-2024-data
+
+
+ StyEmp: Stylizing Empathetic Response Generation via Multi-Grained Prefix Encoder and Personality Reinforcement
+ YahuiFu
+ ChenhuiChu
+ TatsuyaKawahara
+ 172–185
+ Recent approaches for empathetic response generation mainly focus on emotional resonance and user understanding, without considering the system’s personality. Consistent personality is evident in real human expression and is important for creating trustworthy systems. To address this problem, we propose StyEmp, which aims to stylize the empathetic response generation with a consistent personality. Specifically, it incorporates a multi-grained prefix mechanism designed to capture the intricate relationship between a system’s personality and its empathetic expressions. Furthermore, we introduce a personality reinforcement module that leverages contrastive learning to calibrate the generation model, ensuring that responses are both empathetic and reflective of a distinct personality. Automatic and human evaluations on the EMPATHETICDIALOGUES benchmark show that StyEmp outperforms competitive baselines in terms of both empathy and personality expressions. Our code is available at https://github.com/fuyahuii/StyEmp.
+ 2024.sigdial-1.15
+ fu-etal-2024-styemp
+
+
+ Multi-Criteria Evaluation Framework of Selecting Response-worthy Chats in Live Streaming
+ ZhantaoLai
+ KosukeSato
+ 186–191
+ Live streaming, a dynamic medium that merges real-time audiovisual content with interactive text-based chat, presents unique challenges for maintaining viewer engagement and ensuring streamers’ well-being. This study introduces a multi-criteria evaluation framework designed to identify response-worthy chats during live streaming. We propose a system that evaluates chats based on sentiment polarity and intensity, contextual relevance, and topic uniqueness. We also constructed a dataset annotated by human reviewers to validate the framework, demonstrating a closer alignment with human preferences compared to single-criterion baselines. This framework not only supports the development of more responsive and engaging live streaming environments but also contributes to the broader field of dialog systems by highlighting the distinct needs of real-time, large-scale conversational contexts.
+ 2024.sigdial-1.16
+ lai-sato-2024-multi
+
+
+ Generating Unexpected yet Relevant User Dialog Acts
+ LucieGalland
+ CatherinePelachaud
+ FlorianPecune
+ 192–203
+ The demand for mental health services has risen substantially in recent years, leading to challenges in meeting patient needs promptly. Virtual agents capable of emulating motivational interviews (MI) have emerged as a potential solution to address this issue, offering immediate support that is especially beneficial for therapy modalities requiring multiple sessions. However, developing effective patient simulation methods for training MI dialog systems poses challenges, particularly in generating syntactically and contextually correct, and diversified dialog acts while respecting existing patterns and trends in therapy data. This paper investigates data-driven approaches to simulate patients for training MI dialog systems. We propose a novel method that leverages time series models to generate diverse and contextually appropriate patient dialog acts, which are then transformed into utterances by a conditioned large language model. Additionally, we introduce evaluation measures tailored to assess the quality and coherence of simulated patient dialog. Our findings highlight the effectiveness of dialog act-conditioned approaches in improving patient simulation for MI, offering insights for developing virtual agents to support mental health therapy.
+ 2024.sigdial-1.17
+ galland-etal-2024-generating
+
+
+ Training LLMs to Recognize Hedges in Dialogues about Roadrunner Cartoons
+ AmiePaige
+ AdilSoubki
+ JohnMurzaku
+ OwenRambow
+ Susan E.Brennan
+ 204–215
+ Hedges allow speakers to mark utterances as provisional, whether to signal non-prototypicality or “fuzziness”, to indicate a lack of commitment to an utterance, to attribute responsibility for a statement to someone else, to invite input from a partner, or to soften critical feedback in the service of face management needs. Here we focus on hedges in an experimentally parameterized corpus of 63 Roadrunner cartoon narratives spontaneously produced from memory by 21 speakers for co-present addressees, transcribed to text (Galati and Brennan, 2010). We created a gold standard of hedges annotated by human coders (the Roadrunner-Hedge corpus) and compared three LLM-based approaches for hedge detection: fine-tuning BERT, and zero and few-shot prompting with GPT-4o and LLaMA-3. The best-performing approach was a fine-tuned BERT model, followed by few-shot GPT-4o. After an error analysis on the top performing approaches, we used an LLM-in-the-Loop approach to improve the gold standard coding, as well as to highlight cases in which hedges are ambiguous in linguistically interesting ways that will guide future research. This is the first step in our research program to train LLMs to interpret and generate collateral signals appropriately and meaningfully in conversation.
+ 2024.sigdial-1.18
+ paige-etal-2024-training
+
+
+ On the Controllability of Large Language Models for Dialogue Interaction
+ NicolasWagner
+ StefanUltes
+ 216–221
+ This paper investigates the enhancement of Dialogue Systems by integrating the creative capabilities of Large Language Models. While traditional Dialogue Systems focus on understanding user input and selecting appropriate system actions, Language Models excel at generating natural language text based on prompts. Therefore, we propose to improve controllability and coherence of interactions by guiding a Language Model with control signals that enable explicit control over the system behaviour. To address this, we tested and evaluated our concept in 815 conversations with over 3600 dialogue exchanges on a dataset. Our experiment examined the quality of generated system responses using two strategies: An unguided strategy where task data was provided to the models, and a controlled strategy in which a simulated Dialogue Controller provided appropriate system actions. The results show that the average BLEU score and the classification of dialogue acts improved in the controlled Natural Language Generation.
+ 2024.sigdial-1.19
+ wagner-ultes-2024-controllability
+
+
+ Divide and Conquer: Rethinking Ambiguous Candidate Identification in Multimodal Dialogues with Pseudo-Labelling
+ BhathiyaHemanthage
+ ChristianDondrup
+ HakanBilen
+ OliverLemon
+ 222–227
+ Ambiguous Candidate Identification (ACI) in multimodal dialogue is the task of identifying all potential objects that a user’s utterance could be referring to in a visual scene, in cases where the reference cannot be uniquely determined. End-to-end models are the dominant approach for this task, but have limited real-world applicability due to unrealistic inference-time assumptions such as requiring predefined catalogues of items. Focusing on a more generalized and realistic ACI setup, we demonstrate that a modular approach, which first emphasizes language-only reasoning over dialogue context before performing vision-language fusion, significantly outperforms end-to-end trained baselines. To mitigate the lack of annotations for training the language-only module (student), we propose a pseudo-labelling strategy with a prompted Large Language Model (LLM) as the teacher.
+ 2024.sigdial-1.20
+ hemanthage-etal-2024-divide
+
+
+ Self-Emotion Blended Dialogue Generation in Social Simulation Agents
+ QiangZhang
+ JasonNaradowsky
+ YusukeMiyao
+ 228–247
+ When engaging in conversations, dialogue agents in a virtual simulation environment may exhibit their own emotional states that are unrelated to the immediate conversational context, a phenomenon known as self-emotion. This study explores how such self-emotion affects the agents’ behaviors in dialogue strategies and decision-making within a large language model (LLM)-driven simulation framework. In a dialogue strategy prediction experiment, we analyze the dialogue strategy choices employed by agents both with and without self-emotion, comparing them to those of humans. The results show that incorporating self-emotion helps agents exhibit more human-like dialogue strategies. In an independent experiment comparing the performance of models fine-tuned on GPT-4 generated dialogue datasets, we demonstrate that self-emotion can lead to better overall naturalness and humanness. Finally, in a virtual simulation environment where agents have free discussions, we show that self-emotion of agents can significantly influence the decision-making process of the agents, leading to approximately a 50% change in decisions.
+ 2024.sigdial-1.21
+ zhang-etal-2024-self-emotion
+
+
+ Enhancing Model Transparency: A Dialogue System Approach to XAI with Domain Knowledge
+ IsabelFeustel
+ NiklasRach
+ WolfgangMinker
+ StefanUltes
+ 248–258
+ Explainable artificial intelligence (XAI) is a rapidly evolving field that seeks to create AI systems that can provide human-understandable explanations for their decision-making processes. However, these explanations rely on model and data-specific information only. To support better human decision-making, integrating domain knowledge into AI systems is expected to enhance understanding and transparency. In this paper, we present an approach for combining XAI explanations with domain knowledge within a dialogue system. We concentrate on techniques derived from the field of computational argumentation to incorporate domain knowledge and corresponding explanations into human-machine dialogue. We implement the approach in a prototype system for an initial user evaluation, where users interacted with the dialogue system to receive predictions from an underlying AI model. The participants were able to explore different types of explanations and domain knowledge. Our results indicate that users tend to more effectively evaluate model performance when domain knowledge is integrated. On the other hand, we found that domain knowledge was not frequently requested by the user during dialogue interactions.
+ 2024.sigdial-1.22
+ feustel-etal-2024-enhancing
+
+
+ Affect Recognition in Conversations Using Large Language Models
+ ShutongFeng
+ GuangzhiSun
+ NurulLubis
+ WenWu
+ ChaoZhang
+ MilicaGasic
+ 259–273
+ Affect recognition, encompassing emotions, moods, and feelings, plays a pivotal role in human communication. In the realm of conversational artificial intelligence, the ability to discern and respond to human affective cues is a critical factor for creating engaging and empathetic interactions. This study investigates the capacity of large language models (LLMs) to recognise human affect in conversations, with a focus on both open-domain chit-chat dialogues and task-oriented dialogues. Leveraging three diverse datasets, namely IEMOCAP (Busso et al., 2008), EmoWOZ (Feng et al., 2022), and DAIC-WOZ (Gratch et al., 2014), covering a spectrum of dialogues from casual conversations to clinical interviews, we evaluate and compare LLMs’ performance in affect recognition. Our investigation explores the zero-shot and few-shot capabilities of LLMs through in-context learning as well as their model capacities through task-specific fine-tuning. Additionally, this study takes into account the potential impact of automatic speech recognition errors on LLM predictions. With this work, we aim to shed light on the extent to which LLMs can replicate human-like affect recognition capabilities in conversations.
+ 2024.sigdial-1.23
+ feng-etal-2024-affect
+
+
+ Sentiment-Aware Dialogue Flow Discovery for Interpreting Communication Trends
+ Patrícia Sofia PereiraFerreira
+ IsabelCarvalho
+ AnaAlves
+ CatarinaSilva
+ Hugo GonçaloOliveira
+ 274–288
+ Customer-support services increasingly rely on automation, whether fully or with human intervention. Despite optimising resources, this may result in mechanical protocols and lack of human interaction, thus reducing customer loyalty. Our goal is to enhance interpretability and provide guidance in communication through novel tools for easier analysis of message trends and sentiment variations. Monitoring these contributes to more informed decision-making, enabling proactive mitigation of potential issues, such as protocol deviations or customer dissatisfaction. We propose a generic approach for dialogue flow discovery that leverages clustering techniques to identify dialogue states, represented by related utterances. State transitions are further analyzed to detect prevailing sentiments. Hence, we discover sentiment-aware dialogue flows that offer an interpretability layer to artificial agents, even those based on black-boxes, ultimately increasing trustworthiness. Experimental results demonstrate the effectiveness of our approach across different dialogue datasets, covering both human-human and human-machine exchanges, applicable in task-oriented contexts but also to social media, highlighting its potential impact across various customer-support settings.
+ 2024.sigdial-1.24
+ ferreira-etal-2024-sentiment
+
+
+ Analyzing and Enhancing Clarification Strategies for Ambiguous References in Consumer Service Interactions
+ ChanglingLi
+ YujianGan
+ ZhenrongYang
+ YouyangChen
+ XinxuanQiu
+ YanniLin
+ MatthewPurver
+ MassimoPoesio
+ 289–296
+ When customers present ambiguous references, service staff typically need to clarify the customers’ specific intentions. To advance research in this area, we collected 1,000 real-world consumer dialogues with ambiguous references. This dataset will be used for subsequent studies to identify ambiguous references and generate responses. Our analysis of the dataset revealed common strategies employed by service staff, including directly asking clarification questions (CQ) and listing possible options before asking a clarification question (LCQ). However, we found that merely using CQ often fails to fully satisfy customers. In contrast, using LCQ, as well as recommending specific products after listing possible options, proved more effective in resolving ambiguous references and enhancing customer satisfaction.
+ 2024.sigdial-1.25
+ li-etal-2024-analyzing
+
+
+ Coherence-based Dialogue Discourse Structure Extraction using Open-Source Large Language Models
+ GaetanoCimino
+ ChuyuanLi
+ GiuseppeCarenini
+ VincenzoDeufemia
+ 297–316
+ Despite the challenges posed by data sparsity in discourse parsing for dialogues, unsupervised methods have been underexplored. Leveraging recent advances in Large Language Models (LLMs), in this paper we investigate an unsupervised coherence-based method to build discourse structures for multi-party dialogues using open-source LLMs fine-tuned on conversational data. Specifically, we propose two algorithms that extract dialogue structures by identifying their most coherent sub-dialogues: DS-DP employs a dynamic programming strategy, while DS-FLOW applies a greedy approach. Evaluation on the STAC corpus demonstrates a micro-F1 score of 58.1%, surpassing prior unsupervised methods. Furthermore, on a cleaned subset of the Molweni corpus, the proposed method achieves a micro-F1 score of 74.7%, highlighting its effectiveness across different corpora.
+ 2024.sigdial-1.26
+ cimino-etal-2024-coherence
+
+
+ Transforming Slot Schema Induction with Generative Dialogue State Inference
+ James D.Finch
+ BoxinZhao
+ Jinho D.Choi
+ 317–324
+ The challenge of defining a slot schema to represent the state of a task-oriented dialogue system is addressed by Slot Schema Induction (SSI), which aims to automatically induce slots from unlabeled dialogue data. Whereas previous approaches induce slots by clustering value spans extracted directly from the dialogue text, we demonstrate the power of discovering slots using a generative approach. By training a model to generate slot names and values that summarize key dialogue information with no prior task knowledge, our SSI method discovers high-quality candidate information for representing dialogue state. These discovered slot-value candidates can be easily clustered into unified slot schemas that align well with human-authored schemas. Experimental comparisons on the MultiWOZ and SGD datasets demonstrate that Generative Dialogue State Inference (GenDSI) outperforms the previous state-of-the-art on multiple aspects of the SSI task.
+ 2024.sigdial-1.27
+ finch-etal-2024-transforming
+
+
+ Using Respiration for Enhancing Human-Robot Dialogue
+ TakaoObi
+ KotaroFunakoshi
+ 325–328
+ This paper presents the development and capabilities of a spoken dialogue robot that uses respiration to enhance human-robot dialogue. By employing a respiratory estimation technique that uses video input, the dialogue robot captures user respiratory information during dialogue. This information is then used to prevent speech collisions between the user and the robot and to present synchronized pseudo-respiration with the user, thereby enhancing the smoothness and engagement of human-robot dialogue.
+ 2024.sigdial-1.28
+ obi-funakoshi-2024-using
+
+
+ Interactive Dialogue Interface for Personalized News Article Comprehension
+ TomoyaHiguchi
+ MichimasaInaba
+ 329–332
+ We developed an interface to explain news articles through dialogue by considering the user’s comprehension level. The interface generates several pertinent questions based on the ongoing dialogue and news article, and users advance the conversation by selecting a question. Based on the user’s selected questions, the interface estimates their comprehension level of the news article and adjusts the difficulty of the generated questions accordingly. This enables a personalized dialogue tailored to each user’s comprehension needs. The results of the baseline comparison experiments confirmed the usefulness of the interface.
+ 2024.sigdial-1.29
+ higuchi-inaba-2024-interactive
+
+
+ Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning
+ WonjunLee
+ SanKim
+ Gary GeunbaeLee
+ 333–343
+ Recent dialogue systems typically operate through turn-based spoken interactions between users and agents. These systems heavily depend on accurate Automatic Speech Recognition (ASR), as transcription errors can significantly degrade performance in downstream dialogue tasks. To alleviate this challenge, robust ASR is required, and one effective method is to utilize the dialogue context from user and agent interactions for transcribing the subsequent user utterance. This method incorporates the transcription of the user’s speech and the agent’s response as model input, using the accumulated context generated by each turn. However, this context is susceptible to ASR errors because the ASR model generates it auto-regressively. Such noisy context can further degrade the benefits of context input, resulting in suboptimal ASR performance. In this paper, we introduce context noise representation learning to enhance robustness against noisy context, ultimately improving dialogue speech recognition accuracy. To maximize the advantage of context awareness, our approach involves decoder pre-training with text-based dialogue data and noise representation learning for a context encoder. Evaluated on DSTC11 (MultiWoZ 2.1 audio dialogues), it achieves a 24% relative reduction in Word Error Rate (WER) compared to wav2vec2.0 baselines and a 13% reduction compared to Whisper-large-v2. Notably, in noisy environments where user speech is barely audible, our method proves its effectiveness by utilizing contextual information for accurate transcription. Tested on audio data with strong noise level (Signal Noise Ratio of 0dB), our approach shows up to a 31% relative WER reduction compared to the wav2vec2.0 baseline, providing a reassuring solution for real-world noisy scenarios.
+ 2024.sigdial-1.30
+ lee-etal-2024-enhancing-dialogue
+
+
+ Local Topology Measures of Contextual Language Model Latent Spaces with Applications to Dialogue Term Extraction
+ Benjamin MatthiasRuppik
+ MichaelHeck
+ Carelvan Niekerk
+ RenatoVukovic
+ Hsien-chinLin
+ ShutongFeng
+ MarcusZibrowius
+ MilicaGasic
+ 344–356
+ A common approach for sequence tagging tasks based on contextual word representations is to train a machine learning classifier directly on these embedding vectors. This approach has two shortcomings. First, such methods consider single input sequences in isolation and are unable to put an individual embedding vector in relation to vectors outside the current local context of use. Second, the high performance of these models relies on fine-tuning the embedding model in conjunction with the classifier, which may not always be feasible due to the size or inaccessibility of the underlying feature-generation model. It is thus desirable, given a collection of embedding vectors of a corpus, i.e. a datastore, to find features of each vector that describe its relation to other, similar vectors in the datastore. With this in mind, we introduce complexity measures of the local topology of the latent space of a contextual language model with respect to a given datastore. The effectiveness of our features is demonstrated through their application to dialogue term extraction. Our work continues a line of research that explores the manifold hypothesis for word embeddings, demonstrating that local structure in the space carved out by word embeddings can be exploited to infer semantic properties.
+ 2024.sigdial-1.31
+ ruppik-etal-2024-local
+
+
+ Adaptive Open-Set Active Learning with Distance-Based Out-of-Distribution Detection for Robust Task-Oriented Dialog System
+ Sai KeerthanaGoruganthu
+ Roland R.Oruche
+ PrasadCalyam
+ 357–369
+ The advancements in time-efficient data collection techniques such as active learning (AL) have become salient for user intent classification performance in task-oriented dialog systems (TODS). In realistic settings, however, traditional AL techniques often fail to efficiently select targeted in-distribution (IND) data when encountering newly acquired out-of-distribution (OOD) user intents in the unlabeled pool. In this paper, we introduce a novel AL framework, viz. AOSAL, for TODS that combines a distance-based OOD detector using an adaptive false positive rate threshold with an informativeness measure (e.g., entropy) to strategically select informative IND data points in the unlabeled pool. Specifically, we utilize the adaptive OOD detector to classify and filter out OOD samples from the unlabeled pool, then prioritize the acquisition of classified IND instances based on their informativeness scores. To validate our approach, we conduct experiments that display our framework’s flexibility and performance over multiple distance-based approaches and informativeness measures against deep AL baselines on benchmark text datasets. The results suggest that our AOSAL approach consistently outperforms the baselines on IND classification and OOD detection, advancing knowledge on improving the robustness of task-oriented dialog systems.
+ 2024.sigdial-1.32
+ goruganthu-etal-2024-adaptive
+
+
+ Dialogue Ontology Relation Extraction via Constrained Chain-of-Thought Decoding
+ RenatoVukovic
+ DavidArps
+ Carelvan Niekerk
+ Benjamin MatthiasRuppik
+ Hsien-chinLin
+ MichaelHeck
+ MilicaGasic
+ 370–384
+ State-of-the-art task-oriented dialogue systems typically rely on task-specific ontologies for fulfilling user queries. The majority of task-oriented dialogue data, such as customer service recordings, comes without ontology and annotation. Such ontologies are normally built manually, limiting the application of specialised systems. Dialogue ontology construction is an approach for automating that process and typically consists of two steps: term extraction and relation extraction. In this work, we focus on relation extraction in a transfer learning set-up. To improve the generalisation, we propose an extension to the decoding mechanism of large language models. We adapt Chain-of-Thought (CoT) decoding, recently developed for reasoning problems, to generative relation extraction. Here, we generate multiple branches in the decoding space and select the relations based on a confidence threshold. By constraining the decoding to ontology terms and relations, we aim to decrease the risk of hallucination. We conduct extensive experimentation on two widely used datasets and find improvements in performance on target ontology for source fine-tuned and one-shot prompted large language models.
+ 2024.sigdial-1.33
+ vukovic-etal-2024-dialogue
+
+
+ InteLLA: Intelligent Language Learning Assistant for Assessing Language Proficiency through Interviews and Roleplays
+ MaoSaeki
+ HiroakiTakatsu
+ FumaKurata
+ ShungoSuzuki
+ MasakiEguchi
+ RyukiMatsuura
+ KotaroTakizawa
+ SadahiroYoshikawa
+ YoichiMatsuyama
+ 385–399
+ In this paper, we propose a multimodal dialogue system designed to elicit spontaneous speech samples from second language learners for reliable oral proficiency assessment. The primary challenge in utilizing dialogue systems for language testing lies in obtaining ratable speech samples that demonstrate the user’s full capabilities of interactional skill. To address this, we developed a virtual agent capable of conducting extended interactions, consisting of a 15-minute interview and 10-minute roleplay. The interview component is a system-led dialogue featuring questions that aim to elicit specific language functions from the user. The system dynamically adjusts the topic difficulty based on real-time assessments to provoke linguistic breakdowns as evidence of their upper limit of proficiency. The roleplay component is a mixed-initiative, collaborative conversation aimed at evaluating the user’s interactional competence. Two experiments were conducted to evaluate our system’s reliability in assessing oral proficiency. In experiment 1, we collected a total of 340 interview sessions, 45-72% of which successfully elicited the upper linguistic limit by adjusting the topic difficulty levels. In experiment 2, based on the roleplay dataset of 75 speakers, the interactional speech elicited by our system was found to be as ratable as those by human examiners, as indicated by the reliability index of interactional ratings. These results demonstrate that our system can elicit ratable interactional performances comparable to those elicited by human interviewers. Finally, we report on the deployment of our system with over 10,000 university students in a real-world testing scenario.
+ 2024.sigdial-1.34
+ saeki-etal-2024-intella
+
+
+ Curriculum-Driven Edubot: A Framework for Developing Language Learning Chatbots through Synthesizing Conversational Data
+ YuLi
+ ShangQu
+ JiliShen
+ ShangchaoMin
+ ZhouYu
+ 400–419
+ Chatbots have become popular in educational settings, revolutionizing how students interact with material and how teachers teach. We present Curriculum-Driven EduBot, a framework for developing a chatbot that combines the interactive features of chatbots with the systematic material of English textbooks to assist students in enhancing their conversational skills. We begin by extracting pertinent topics from textbooks and using large language models to generate dialogues related to these topics. We then fine-tune an open-source LLM using our generated conversational data to create our curriculum-driven chatbot. User studies demonstrate that EduBot outperforms ChatGPT in leading curriculum-based dialogues and adapting its dialogue to match the user’s English proficiency level. By combining traditional textbook methodologies with conversational AI, our approach offers learners an interactive tool that aligns with their curriculum and provides user-tailored conversation practice. This facilitates meaningful student-bot dialogues and enriches the overall learning experience within the curriculum’s pedagogical framework.
+ 2024.sigdial-1.35
+ li-etal-2024-curriculum
+
+
+ Going beyond Imagination! Enhancing Multi-modal Dialogue Agents with Synthetic Visual Descriptions
+ HaolanZhan
+ SameenMaruf
+ IngridZukerman
+ GholamrezaHaffari
+ 420–427
+ Building a dialogue agent that can seamlessly interact with humans in multi-modal regimes requires two fundamental abilities: (1) understanding emotion and dialogue acts within situated user scenarios, and (2) grounding perceived visual cues to dialogue contexts. However, recent works have uncovered shortcomings of existing dialogue agents in understanding emotions and dialogue acts, and in grounding visual cues effectively. In this work, we investigate whether additional dialogue data with only visual descriptions can help dialogue agents effectively align visual and textual features, and enhance the ability of dialogue agents to ground perceived visual cues to dialogue contexts. To this end, in the absence of a suitable dataset, we propose a synthetic visual description generation pipeline, and contribute a large-scale synthetic visual description dataset. In addition, we propose a general training procedure for effectively leveraging these synthetic data. We conduct comprehensive analyses to evaluate the impact of synthetic data on two benchmarks: MELD and IEMOCAP. Our findings suggest that synthetic visual descriptions can serve as an effective way to enhance dialogue agents’ grounding ability, and that the training scheme affects the extent to which these descriptions improve the agent’s performance.
+ 2024.sigdial-1.36
+ zhan-etal-2024-going
+
+
+ User Review Writing via Interview with Dialogue Systems
+ YoshikiTanaka
+ MichimasaInaba
+ 428–439
+ User reviews on e-commerce and review sites are crucial for making purchase decisions, although creating detailed reviews is time-consuming and labor-intensive. In this study, we propose a novel use of dialogue systems to facilitate user review creation by generating reviews from information gathered during interview dialogues with users. To validate our approach, we implemented our system using GPT-4 and conducted comparative experiments from the perspectives of system users and review readers. The results indicate that participants who used our system rated their interactions positively. Additionally, reviews generated by our system required less editing to achieve user satisfaction compared to those by the baseline. We also evaluated the reviews from the readers’ perspective and found that our system-generated reviews are more helpful than those written by humans. Despite challenges with the fluency of the generated reviews, our method offers a promising new approach to review writing.
+ 2024.sigdial-1.37
+ tanaka-inaba-2024-user
+
+
+ Conversational Feedback in Scripted versus Spontaneous Dialogues: A Comparative Analysis
+ IldikoPilan
+ LaurentPrévot
+ HendrikBuschmeier
+ PierreLison
+ 440–457
+ Scripted dialogues such as movie and TV subtitles constitute a widespread source of training data for conversational NLP models. However, there are notable linguistic differences between these dialogues and spontaneous interactions, especially regarding the occurrence of communicative feedback such as backchannels, acknowledgments, or clarification requests. This paper presents a quantitative analysis of such feedback phenomena in both subtitles and spontaneous conversations. Based on conversational data spanning eight languages and multiple genres, we extract lexical statistics, classifications from a dialogue act tagger, expert annotations and labels derived from a fine-tuned Large Language Model (LLM). Our main empirical findings are that (1) communicative feedback is markedly less frequent in subtitles than in spontaneous dialogues and (2) subtitles contain a higher proportion of negative feedback. We also show that dialogues generated by standard LLMs lie much closer to scripted dialogues than spontaneous interactions in terms of communicative feedback.
+ 2024.sigdial-1.38
+ pilan-etal-2024-conversational
+
+
+ Exploring the Use of Natural Language Descriptions of Intents for Large Language Models in Zero-shot Intent Classification
+ TaesukHong
+ YoubinAhn
+ DongkyuLee
+ JoongboShin
+ SeungpilWon
+ JanghoonHan
+ Stanley JungkyuChoi
+ JungyunSeo
+ 458–465
+ In task-oriented dialogue systems, intent classification is crucial for accurately understanding user queries and providing appropriate services. This study explores the use of intent descriptions with large language models for unseen domain intent classification. By examining the effects of description quality, quantity, and input length management, we identify practical guidelines for optimizing performance. Our experiments using FLAN-T5 3B demonstrate that 1) high-quality descriptions for both training and testing significantly improve accuracy, 2) diversity in training descriptions doesn’t greatly affect performance, and 3) off-the-shelf rankers selecting around ten intent options reduce input length without compromising performance. We emphasize that high-quality testing descriptions have a greater impact on accuracy than training descriptions. These findings provide practical guidelines for using intent descriptions with large language models to achieve effective and efficient intent classification in low-resource settings.
+ 2024.sigdial-1.39
+ hong-etal-2024-exploring
+
+
+ Voice and Choice: Investigating the Role of Prosodic Variation in Request Compliance and Perceived Politeness Using Conversational TTS
+ EvaSzekely
+ JeffHigginbotham
+ FrancescoPossemato
+ 466–476
+ As conversational Text-to-Speech (TTS) technologies become increasingly realistic and expressive, understanding the impact of prosodic variation on speech perception and social dynamics is crucial for enhancing conversational systems. This study explores the influence of prosodic features on listener responses to indirect requests using a specifically designed conversational TTS engine capable of controlling prosody, and generating speech across three different speaker profiles: female, male, and gender-ambiguous. We conducted two experiments to analyse how naturalistic variations in speech rate and vocal energy (projection) impact the likelihood of request compliance and perceived politeness. In the first experiment, we examined how prosodic modifications affect the perception of politeness in permission and service requests. In the second experiment, participants compared pairs of spoken requests, each rendered with different prosodic features, and chose which they were more likely to grant. Results indicate that both faster speech rates and higher projection increased the willingness to comply, though the extent of this influence varied by speaker gender. Higher projection in service requests increases the chance of being granted more than in permission requests. Politeness has a demonstrated positive impact on the likelihood of requests being granted; this effect is stronger for the male voice compared to female and gender-ambiguous voices.
+ 2024.sigdial-1.40
+ szekely-etal-2024-voice
+
+
+ A Dialogue Game for Eliciting Balanced Collaboration
+ IsidoraJeknic
+ DavidSchlangen
+ AlexanderKoller
+ 477–489
+ Collaboration is an integral part of human dialogue. Typical task-oriented dialogue games assign asymmetric roles to the participants, which limits their ability to elicit naturalistic role-taking in collaboration and its negotiation. We present a novel and simple online setup that favors balanced collaboration: a two-player 2D object placement game in which the players must negotiate the goal state themselves. We show empirically that human players exhibit a variety of role distributions, and that balanced collaboration improves task performance. We also present an LLM-based baseline agent which demonstrates that automatic playing of our game is an interesting challenge for artificial systems.
+ 2024.sigdial-1.41
+ jeknic-etal-2024-dialogue
+
+
+ Improving Speech Recognition with Jargon Injection
+ Minh-TienNguyen
+ Dat PhuocNguyen
+ Tuan-HaiLuu
+ Xuan-QuangNguyen
+ Tung-DuongNguyen
+ JeffYang
+ 490–499
+ This paper introduces a new method that improves the performance of automatic speech recognition (ASR) engines, e.g., Whisper, in practical cases. Different from prior methods that usually require both speech data and its transcription for decoding, our method only uses jargon as the context for decoding. To do that, the method first represents the jargon in a trie tree structure for efficient storage and traversal. The method next forces the decoding of Whisper to focus more on the jargon by adjusting the probability of generated tokens with the use of the trie tree. To further improve the performance, the method utilizes the prompting method that uses the jargon as the context. Final tokens are generated based on the combination of prompting and decoding. Experimental results on Japanese and English datasets show that the proposed method helps to improve the performance of Whisper, especially for domain-specific data. The method is simple but effective and can be deployed to any encoder-decoder ASR engine in actual cases. The code and data are also accessible (https://shorturl.at/nMsaY).
+ 2024.sigdial-1.42
+ nguyen-etal-2024-improving
+
+
+ Optimizing Code-Switching in Conversational Tutoring Systems: A Pedagogical Framework and Evaluation
+ ZhengyuanLiu
+ Stella XinYin
+ NancyChen
+ 500–515
+ Large language models demonstrate remarkable proficiency in various tasks across multiple languages. However, their potential in code-switching remains underexplored, particularly in cultural and educational contexts. Code-switching or translanguaging plays a crucial role in bilingual education, facilitating comprehension and engagement among students with varied linguistic proficiencies. In this work, we present a pedagogy-inspired framework that introduces traditional classroom practices of code-switching to intelligent tutoring systems. Specifically, we develop fine-grained instructional strategies tailored to multilingual and educational needs. We conduct experiments involving both LLM-based evaluation and expert analysis to assess the effectiveness of translanguaging in tutoring dialogues. Our experimental results indicate that strategic code-switching can significantly enhance the learning experience. This work not only advances dialogic tutors in language learning, but also extends LLMs to better accommodate multilingual interaction.
+ 2024.sigdial-1.43
+ liu-etal-2024-optimizing
+
+
+ ECoh: Turn-level Coherence Evaluation for Multilingual Dialogues
+ JohnMendonca
+ IsabelTrancoso
+ AlonLavie
+ 516–532
+ Despite being heralded as the new standard for dialogue evaluation, the closed-source nature of GPT-4 poses challenges for the community. Motivated by the need for lightweight, open source, and multilingual dialogue evaluators, this paper introduces GenResCoh (Generated Responses targeting Coherence). GenResCoh is a novel LLM generated dataset comprising over 130k negative and positive responses and accompanying explanations seeded from XDailyDialog and XPersona covering English, French, German, Italian, and Chinese. Leveraging GenResCoh, we propose ECoh (Evaluation of Coherence), a family of evaluators trained to assess response coherence across multiple languages. Experimental results demonstrate that ECoh achieves multilingual detection capabilities superior to the teacher model (GPT-3.5-Turbo) on GenResCoh, despite being based on a much smaller architecture. Furthermore, the explanations provided by ECoh closely align in terms of quality with those generated by the teacher model.
+ 2024.sigdial-1.44
+ mendonca-etal-2024-ecoh
+
+
+ An Investigation into Explainable Audio Hate Speech Detection
+ JinmyeongAn
+ WonjunLee
+ YejinJeon
+ JungseulOk
+ YunsuKim
+ Gary GeunbaeLee
+ 533–543
+ Research on hate speech has predominantly revolved around detection and interpretation from textual inputs, leaving verbal content largely unexplored. Moreover, while there has been some limited exploration into hate speech detection within verbal acoustic speech inputs, the aspect of interpretability has been overlooked. As such, we introduce a new task within the audio hate speech detection domain - we specifically aim to identify specific time frames of hate speech within audio utterances. Towards this, we propose two different approaches, cascading and End-to-End (E2E). The first cascading approach initially converts audio to transcripts, identifies hate speech within these transcripts, and subsequently locates the corresponding audio time frames. Conversely, the second E2E approach processes audio utterances directly, which allows it to pinpoint hate speech within specific time frames. Moreover, due to the lack of explainable audio hate speech datasets that include frame-level rationales, we curated a synthetic audio dataset to train our models. We further validate these models on actual human speech utterances and we find that the E2E approach outperforms the cascading method in terms of the audio frame Intersection over Union (IoU) metric. Furthermore, we observe that the inclusion of frame-level rationales significantly enhances hate speech detection accuracy for both E2E and cascading approaches.
+ 2024.sigdial-1.45
+ an-etal-2024-investigation
+
+
+ Mhm... Yeah? Okay! Evaluating the Naturalness and Communicative Function of Synthesized Feedback Responses in Spoken Dialogue
+ CarolFigueroa
+ Marcelde Korte
+ MagalieOchs
+ GabrielSkantze
+ 544–553
+ To create conversational systems with human-like listener behavior, generating short feedback responses (e.g., “mhm”, “ah”, “wow”) appropriate for their context is crucial. These responses convey their communicative function through their lexical form and their prosodic realization. In this paper, we transplant the prosody of feedback responses from human-human U.S. English telephone conversations to a target speaker using two synthesis techniques (TTS and signal processing). Our evaluation focuses on perceived naturalness, contextual appropriateness and preservation of communicative function. Results indicate TTS-generated feedback were perceived as more natural than signal-processing-based feedback, with no significant difference in appropriateness. However, the TTS did not consistently convey the communicative function of the original feedback.
+ 2024.sigdial-1.46
+ figueroa-etal-2024-mhm
+
+
+ Generalizing across Languages and Domains for Discourse Relation Classification
+ PeterBourgonje
+ VeraDemberg
+ 554–565
+ The availability of corpora annotated for discourse relations is limited and discourse relation classification performance varies greatly depending on both language and domain. This is a problem for downstream applications that are intended for a language (i.e., not English) or a domain (i.e., not financial news) with comparatively low coverage for discourse annotations. In this paper, we experiment with a state-of-the-art model for discourse relation classification, originally developed for English, extend it to a multi-lingual setting (testing on Italian, Portuguese and Turkish), and employ a simple, yet effective method to mark out-of-domain training instances. By doing so, we aim to contribute to better generalization and more robust discourse relation classification performance across both language and domain.
+ 2024.sigdial-1.47
+ bourgonje-demberg-2024-generalizing
+
+
+ BoK: Introducing Bag-of-Keywords Loss for Interpretable Dialogue Response Generation
+ SuvodipDey
+ Maunendra SankarDesarkar
+ 566–578
+ The standard language modeling (LM) loss by itself has been shown to be inadequate for effective dialogue modeling. As a result, various training approaches, such as auxiliary loss functions and leveraging human feedback, are being adopted to enrich open-domain dialogue systems. One such auxiliary loss function is Bag-of-Words (BoW) loss, defined as the cross-entropy loss for predicting all the words/tokens of the next utterance. In this work, we propose a novel auxiliary loss named Bag-of-Keywords (BoK) loss to capture the central thought of the response through keyword prediction and leverage it to enhance the generation of meaningful and interpretable responses in open-domain dialogue systems. BoK loss upgrades the BoW loss by predicting only the keywords or critical words/tokens of the next utterance, intending to estimate the core idea rather than the entire response. We incorporate BoK loss in both encoder-decoder (T5) and decoder-only (DialoGPT) architecture and train the models to minimize the weighted sum of BoK and LM (BoK-LM) loss. We perform our experiments on two popular open-domain dialogue datasets, DailyDialog and Persona-Chat. We show that the inclusion of BoK loss improves the dialogue generation of backbone models while also enabling post-hoc interpretability. We also study the effectiveness of BoK-LM loss as a reference-free metric and observe comparable performance to the state-of-the-art metrics on various dialogue evaluation datasets.
+ 2024.sigdial-1.48
+ dey-desarkar-2024-bok
+
+
+ Cross-lingual Transfer and Multilingual Learning for Detecting Harmful Behaviour in African Under-Resourced Language Dialogue
+ Tunde OluwaseyiAjayi
+ MihaelArcan
+ PaulBuitelaar
+ 579–589
+ Most harmful dialogue detection models are developed for high-resourced languages. Consequently, users who speak under-resourced languages cannot fully benefit from these models in terms of usage, development, detection and mitigation of harmful dialogue utterances. Our work aims at detecting harmful utterances in under-resourced African languages. We leverage transfer learning using pretrained models trained with multilingual embeddings to develop a cross-lingual model capable of detecting harmful content across various African languages. We first fine-tune a harmful dialogue detection model on a selected African dialogue dataset. Additionally, we fine-tune a model on a combined dataset in some African languages to develop a multilingual harmful dialogue detection model. We then evaluate the cross-lingual model’s ability to generalise to an unseen African language by performing harmful dialogue detection in an under-resourced language not present during pretraining or fine-tuning. We evaluate our models on the test datasets. We show that our best performing models achieve impressive results in terms of F1 score. Finally, we discuss the results and limitations of our work.
+ 2024.sigdial-1.49
+ ajayi-etal-2024-cross
+
+
+ A Few-shot Approach to Task-oriented Dialogue Enhanced with Chitchat
+ ArmandStricker
+ PatrickParoubek
+ 590–602
+ Large language models (LLMs) tuned for chat have recently been adopted for few-shot end-to-end task-oriented dialogue (TOD), with some success. To further assess this method, we conduct experiments on two, more complex, task-oriented benchmarks that integrate elements of chitchat into the conversation. We enhance a few-shot baseline by adding zero-shot chitchat detection and implementing function calling for dialogue state tracking (DST). We focus on this step in the task-oriented pipeline as it comes first, and errors due to added chitchat at this stage have the most impact on end-to-end performance. We find that this prompting method shows increased resilience to mixed-mode inputs and our enhanced pipeline allows for natural inter-mode conversations, as assessed through human evaluation. Our findings also suggest that the performance gap between few-shot prompting for TOD and supervised task-specific models is narrowing.
+ 2024.sigdial-1.50
+ stricker-paroubek-2024-shot
+
+
+ Exploration of Human Repair Initiation in Task-oriented Dialogue: A Linguistic Feature-based Approach
+ AnhNgo
+ DirkHeylen
+ NicolasRollet
+ CatherinePelachaud
+ ChloéClavel
+ 603–609
+ In daily conversations, people often encounter problems prompting conversational repair to enhance mutual understanding. By employing an automatic coreference solver, alongside examining repetition, we identify various linguistic features that distinguish turns when the addressee initiates repair from those when they do not. Our findings reveal distinct patterns that characterize the repair sequence and each type of repair initiation.
+ 2024.sigdial-1.51
+ ngo-etal-2024-exploration
+
+
+ Comparing Pre-Trained Embeddings and Domain-Independent Features for Regression-Based Evaluation of Task-Oriented Dialogue Systems
+ KallirroiGeorgila
+ 610–623
+ We use Gaussian Process Regression to predict different types of ratings provided by users after interacting with various task-oriented dialogue systems. We compare the performance of domain-independent dialogue features (e.g., duration, number of filled slots, number of confirmed slots, word error rate) with pre-trained dialogue embeddings. These pre-trained dialogue embeddings are computed by averaging over sentence embeddings in a dialogue. Sentence embeddings are created using various models based on sentence transformers (appearing on the Hugging Face Massive Text Embedding Benchmark leaderboard) or by averaging over BERT word embeddings (varying the BERT layers used). We also compare pre-trained embeddings extracted from human transcriptions with pre-trained embeddings extracted from speech recognition outputs, to determine the robustness of these models to errors. Our results show that overall, for most types of user satisfaction ratings and advanced/recent (or sometimes less advanced/recent) pre-trained embedding models, using only pre-trained embeddings outperforms using only domain-independent features. However, this pattern varies depending on the type of rating and the embedding model used. Also, pre-trained embeddings are found to be robust to speech recognition errors, more advanced/recent embedding models do not always perform better than less advanced/recent ones, and larger models do not necessarily outperform smaller ones. The best prediction performance is achieved by combining pre-trained embeddings with domain-independent features.
+ 2024.sigdial-1.52
+ georgila-2024-comparing
+
+
+ Question Type Prediction in Natural Debate
+ ZlataKikteva
+ AlexanderTrautsch
+ SteffenHerbold
+ AnnetteHautli-Janisz
+ 624–630
+ In spontaneous natural debate, questions play a variety of crucial roles: they allow speakers to introduce new topics, seek other speakers’ opinions or indeed confront them. A three-class question typology has previously been demonstrated to effectively capture details pertaining to the nature of questions and the different functions associated with them in a debate setting. We adopt this classification and investigate the performance of several machine learning approaches on this task by incorporating various sets of lexical, dialogical and argumentative features. We find that BERT demonstrates the best performance on the task, followed by a Random Forest model enriched with pragmatic features.
+ 2024.sigdial-1.53
+ kikteva-etal-2024-question
+
+
+ MemeIntent: Benchmarking Intent Description Generation for Memes
+ JeongsikPark
+ Khoi P. N.Nguyen
+ TerrenceLi
+ SuyeshShrestha
+ Megan KimVu
+ Jerry YiningWang
+ VincentNg
+ 631–643
+ While recent years have seen a surge of interest in the automatic processing of memes, much of the work in this area has focused on determining whether a meme contains malicious content. This paper proposes the new task of intent description generation: generating a description of the author’s intentions when creating the meme. To stimulate future work on this task, we (1) annotated a corpus of memes with the intents being perceived by the reader as well as the background knowledge needed to infer the intents and (2) established baseline performance on the intent description generation task using state-of-the-art large language models. Our results suggest the importance of background knowledge retrieval in intent description generation for memes.
+ 2024.sigdial-1.54
+ park-etal-2024-memeintent
+
+
+ Automating PTSD Diagnostics in Clinical Interviews: Leveraging Large Language Models for Trauma Assessments
+ SichangTu
+ AbigailPowers
+ NatalieMerrill
+ NegarFani
+ SierraCarter
+ StephenDoogan
+ Jinho D.Choi
+ 644–663
+ The shortage of clinical workforce presents significant challenges in mental healthcare, limiting access to formal diagnostics and services. We aim to tackle this shortage by integrating a customized large language model (LLM) into the workflow, thus promoting equity in mental healthcare for the general population. Although LLMs have showcased their capability in clinical decision-making, their adaptation to severe conditions like Post-traumatic Stress Disorder (PTSD) remains largely unexplored. Therefore, we collect 411 clinician-administered diagnostic interviews and devise a novel approach to obtain high-quality data. Moreover, we build a comprehensive framework to automate PTSD diagnostic assessments based on interview contents by leveraging two state-of-the-art LLMs, GPT-4 and Llama-2, with potential for broader clinical diagnoses. Our results illustrate strong promise for LLMs, tested on our dataset, to aid clinicians in diagnostic validation. To the best of our knowledge, this is the first AI system that fully automates assessments for mental illness based on clinician-administered interviews.
+ 2024.sigdial-1.55
+ tu-etal-2024-automating
+
+
+ DialBB: A Dialogue System Development Framework as an Educational Material
+ MikioNakano
+ KazunoriKomatani
+ 664–668
+ We demonstrate DialBB, a dialogue system development framework, which we have been building as an educational material for dialogue system technology. Building a dialogue system requires the adoption of an appropriate architecture depending on the application and the integration of various technologies. However, this is not easy for those who have just started learning dialogue system technology. Therefore, there is a demand for educational materials that integrate various technologies to build dialogue systems, because traditional dialogue system development frameworks were not designed for educational purposes. DialBB enables the development of dialogue systems by combining modules called building blocks. After understanding sample applications, learners can easily build simple systems using built-in blocks and can build advanced systems using their own developed blocks.
+ 2024.sigdial-1.56
+ nakano-komatani-2024-dialbb
+
+
+ A Multimodal Dialogue System to Lead Consensus Building with Emotion-Displaying
+ ShinnosukeNozue
+ YutoNakano
+ ShojiMoriya
+ TomokiAriyama
+ KazumaKokuta
+ SuchunXie
+ KaiSato
+ ShusakuSone
+ RyoheiKamei
+ ReinaAkama
+ YuichirohMatsubayashi
+ KeisukeSakaguchi
+ 669–673
+ The evolution of large language models has enabled fluent dialogue, increasing interest in the coexistence of humans and avatars. An essential aspect of achieving this coexistence involves developing sophisticated dialogue systems that can influence user behavior. In this background, we propose an effective multimodal dialogue system designed to promote consensus building with humans. Our system employs a slot-filling strategy to guide discussions and attempts to influence users with suggestions through emotional expression and intent conveyance via its avatar. These innovations have resulted in our system achieving the highest performance in a competition evaluating consensus building between humans and dialogue systems. We hope that our research will promote further discussion on the development of dialogue systems that enhance consensus building in human collaboration.
+ 2024.sigdial-1.57
+ nozue-etal-2024-multimodal
+
+
+ PersonaCLR: Evaluation Model for Persona Characteristics via Contrastive Learning of Linguistic Style Representation
+ MichimasaInaba
+ 674–685
+ Persona-aware dialogue systems can improve the consistency of the system’s responses, users’ trust and user enjoyment. Filtering nonpersona-like utterances is important for constructing persona-aware dialogue systems. This paper presents the PersonaCLR model for capturing a given utterance’s intensity of persona characteristics. We trained the model with contrastive learning based on the sameness of the utterances’ speaker. Contrastive learning enables PersonaCLR to evaluate the persona characteristics of a given utterance, even if the target persona is not included in training data. For training and evaluating our model, we also constructed a new dataset of 2,155 character utterances from 100 Japanese online novels. Experimental results indicated that our model outperforms existing methods and a strong baseline using a large language model. Our source code, pre-trained model, and dataset are available at https://github.com/1never/PersonaCLR.
+ 2024.sigdial-1.58
+ inaba-2024-personaclr
+
+
+ DiagESC: Dialogue Synthesis for Integrating Depression Diagnosis into Emotional Support Conversation
+ SeungyeonSeo
+ Gary GeunbaeLee
+ 686–698
+ Dialogue systems for mental health care aim to provide appropriate support to individuals experiencing mental distress. While extensive research has been conducted to deliver adequate emotional support, existing studies cannot identify individuals who require professional medical intervention and cannot offer suitable guidance. We introduce the Diagnostic Emotional Support Conversation task for an advanced mental health management system. We develop the DESC dataset to assess depression symptoms while maintaining user experience by utilizing task-specific utterance generation prompts and a strict filtering algorithm. Evaluations by professional psychological counselors indicate that DESC has a superior ability to diagnose depression than existing data. Additionally, conversational quality evaluation reveals that DESC maintains fluent, consistent, and coherent dialogues.
+ 2024.sigdial-1.59
+ seo-lee-2024-diagesc
+
+
+ Infusing Emotions into Task-oriented Dialogue Systems: Understanding, Management, and Generation
+ ShutongFeng
+ Hsien-chinLin
+ ChristianGeishauser
+ NurulLubis
+ Carelvan Niekerk
+ MichaelHeck
+ Benjamin MatthiasRuppik
+ RenatoVukovic
+ MilicaGasic
+ 699–717
+ Emotions are indispensable in human communication, but are often overlooked in task-oriented dialogue (ToD) modelling, where task success is the primary focus. While existing works have explored user emotions or similar concepts in some ToD tasks, none has so far incorporated emotion modelling into a fully-fledged ToD system or conducted interaction with human or simulated users. In this work, we incorporate emotion into the complete ToD processing loop, involving understanding, management, and generation. To this end, we extend the EmoWOZ dataset (Feng et al., 2022) with system affective behaviour labels. Through interactive experimentation involving both simulated and human users, we demonstrate that our proposed framework significantly enhances the user’s emotional experience as well as the task success.
+ 2024.sigdial-1.60
+ feng-etal-2024-infusing
+
+
+ Estimating the Emotional Valence of Interlocutors Using Heterogeneous Sensors in Human-Human Dialogue
+ JingjingJiang
+ AoGuo
+ RyuichiroHigashinaka
+ 718–727
+ Dialogue systems need to accurately understand the user’s mental state to generate appropriate responses, but accurately discerning such states solely from text or speech can be challenging. To determine which information is necessary, we first collected human-human multimodal dialogues using heterogeneous sensors, resulting in a dataset containing various types of information including speech, video, physiological signals, gaze, and body movement. Additionally, for each time step of the data, users provided subjective evaluations of their emotional valence while reviewing the dialogue videos. Using this dataset and focusing on physiological signals, we analyzed the relationship between the signals and the subjective evaluations through Granger causality analysis. We also investigated how sensor signals differ depending on the polarity of the valence. Our findings revealed several physiological signals related to the user’s emotional valence.
+ 2024.sigdial-1.61
+ jiang-etal-2024-estimating
+
+
+ The Gap in the Strategy of Recovering Task Failure between GPT-4V and Humans in a Visual Dialogue
+ RyosukeOshima
+ SeitaroShinagawa
+ ShigeoMorishima
+ 728–745
+ Goal-oriented dialogue systems interact with humans to accomplish specific tasks. However, sometimes these systems fail to establish a common ground with users, leading to task failures. In such cases, it is crucial not to just end with failure but to correct and recover the dialogue to turn it into a success for building a robust goal-oriented dialogue system. Effective recovery from task failures in a goal-oriented dialogue involves not only successful recovery but also accurately understanding the situation of the failed task to minimize unnecessary interactions and avoid frustrating the user. In this study, we analyze the capabilities of GPT-4V in recovering from task failures by comparing its performance with that of humans using the Guess What?! game. The results show that GPT-4V employs less efficient recovery strategies, such as asking additional unnecessary questions, than humans. We also found that while humans can occasionally ask questions that doubt the accuracy of the interlocutor’s answer during task recovery, GPT-4V lacks this capability.
+ 2024.sigdial-1.62
+ oshima-etal-2024-gap
+
+
+ MindDial: Enhancing Conversational Agents with Theory-of-Mind for Common Ground Alignment and Negotiation
+ ShuwenQiu
+ MingdianLiu
+ HengliLi
+ Song-ChunZhu
+ ZilongZheng
+ 746–759
+ Humans talk in daily conversations while aligning and negotiating the expressed meanings or common ground. Despite the impressive conversational abilities of large generative language models, they do not consider the individual differences in contextual understanding in a shared situated environment. In this work, we propose MindDial, a novel conversational framework that can generate situated free-form responses to align and negotiate common ground. We design an explicit mind module that can track three-level beliefs – the speaker’s belief, the speaker’s prediction of the listener’s belief, and the belief gap between the first two. Then the next response is generated to resolve the belief difference and take task-related action. Our framework is applied to both prompting and fine-tuning-based models, and is evaluated across scenarios involving both common ground alignment and negotiation. Experiments show that models with mind modeling can generate more human-like responses when aligning and negotiating common ground. The ablation study further validates that the three-level belief design can aggregate information and improve task outcomes in both cooperative and negotiating settings.
+ 2024.sigdial-1.63
+ qiu-etal-2024-minddial
+
+
+ An Open Intent Discovery Evaluation Framework
+ GrantAnderson
+ EmmaHart
+ DimitraGkatzia
+ IanBeaver
+ 760–769
+ In the development of dialog systems, the discovery of the set of target intents to identify is a crucial first step that is often overlooked. Most intent detection works assume that a labelled dataset already exists; however, creating these datasets is no trivial task and usually requires humans to manually analyse utterances, decide on intent labels, and tag accordingly. The field of Open Intent Discovery addresses this problem by automating the process of grouping utterances and providing the user with the discovered intents. Our Open Intent Discovery framework allows the user to choose from a range of different techniques for each step in the discovery process, including the ability to extend previous works with a human-readable label generation stage. We also provide an analysis of the relationship between dataset features and the optimal combination of techniques for each step, to help others choose without having to explore every possible combination for their unlabelled data.
+ 2024.sigdial-1.64
+ anderson-etal-2024-open
+
+
+ Toximatics: Towards Understanding Toxicity in Real-Life Social Situations
+ MayukhDas
+ Wolf-TiloBalke
+ 770–785
+ The proliferation of social media has increased the visibility and effects of hate speech. To address this, NLP solutions have been developed to identify both explicit and implicit forms of hate speech. Typically, these approaches evaluate the toxicity of utterances in isolation, ignoring the context. Drawing on pragmatics, our study examines how contextual factors can influence the perceived toxicity of utterances, thereby anchoring assessments in a more nuanced semantic framework. We present Toximatics, a dataset that includes context-dependent utterances and their toxicity scores. We also introduce a novel synthetic data generation pipeline designed to create context-utterance pairs at scale with controlled polarity. This pipeline can enhance existing hate speech datasets by adding contextual information to utterances, either preserving or altering their polarity, and can also generate completely new pairs from seed statements. We utilised both features to create Toximatics. To address biases in state-of-the-art hate speech datasets, which often skew towards specific sensitive topics such as politics, race, and gender, we propose a method to generate neutral utterances typical of various social settings. These are then contextualized to show how neutrality can shift to toxicity or benignity depending on the surrounding context. The evaluation results clearly indicate that the current models are underperforming on this dataset.
+ 2024.sigdial-1.65
+ das-balke-2024-toximatics
+
+
+
diff --git a/data/xml/2024.wassa.xml b/data/xml/2024.wassa.xml
index 1ba88b73ca..dc77e28df1 100644
--- a/data/xml/2024.wassa.xml
+++ b/data/xml/2024.wassa.xml
@@ -33,11 +33,11 @@
SEC: Context-Aware Metric Learning for Efficient Emotion Recognition in Conversation
BarbaraGendronUniversity of Lorraine
- GaelGuibonGaelGuibonUniversity of Lorraine
+ GaëlGuibonUniversity of Lorraine
11-22
The advent of deep learning models has made a considerable contribution to the achievement of Emotion Recognition in Conversation (ERC). However, this task still remains an important challenge due to the plurality and subjectivity of human emotions. Previous work on ERC provides predictive models using mostly graph-based conversation representations. In this work, we propose a way to model the conversational context that we incorporate into a metric learning training strategy, with a two-step process. This allows us to perform ERC in a flexible classification scenario and end up with a lightweight yet efficient model. Using metric learning through a Siamese Network architecture, we achieve 57.71 in macro F1 score for emotion classification in conversation on DailyDialog dataset, which outperforms the related work. This state-of-the-art result is promising in terms of the use of metric learning for emotion recognition, yet perfectible compared to the micro F1 score obtained.
2024.wassa-1.2
- gendron-gaelguibon-2024-sec
+ gendron-guibon-2024-sec
Modeling Complex Interactions in Long Documents for Aspect-Based Sentiment Analysis
diff --git a/data/xml/W17.xml b/data/xml/W17.xml
index a6ebf7bb3f..cb4b3a9998 100644
--- a/data/xml/W17.xml
+++ b/data/xml/W17.xml
@@ -3387,7 +3387,6 @@ is able to handle phenomena related to scope by means of an higher-order type th
10.18653/v1/W17-1810
This paper presents an open-source toolkit for negation detection. It identifies negation cues and their corresponding scope in either raw or parsed text using maximum-margin classification. The system design draws on best practice from the existing literature on negation detection, aiming for a simple and portable system that still achieves competitive performance. Pre-trained models and experimental results are provided for English.
enger-etal-2017-open
- marenger/negtool
diff --git a/data/xml/W19.xml b/data/xml/W19.xml
index 602d659c22..09e5292de3 100644
--- a/data/xml/W19.xml
+++ b/data/xml/W19.xml
@@ -2262,6 +2262,7 @@
10.18653/v1/W19-1909
alsentzer-etal-2019-publicly
EmilyAlsentzer/clinicalBERT
+ MIMIC-III
A General-Purpose Annotation Model for Knowledge Discovery: Case Study in Spanish Clinical Text
diff --git a/data/yaml/name_variants.yaml b/data/yaml/name_variants.yaml
index 93342e530e..e7a67908e7 100644
--- a/data/yaml/name_variants.yaml
+++ b/data/yaml/name_variants.yaml
@@ -10637,3 +10637,11 @@
- canonical: {first: Genta Indra, last: Winata}
variants:
- {first: Genta, last: Winata}
+- canonical: {first: Juan Pablo, last: Munoz}
+ id: juan-pablo-munoz
+ variants:
+ - {first: J. Pablo, last: Munoz}
+ - {first: J. P., last: Munoz}
+ - {first: Juan P., last: Munoz}
+ - {first: Juan Pablo, last: Muñoz}
+ - {first: J. Pablo, last: Muñoz}
diff --git a/data/yaml/sigs/sigdial.yaml b/data/yaml/sigs/sigdial.yaml
index fbac942bf2..01bfa3d963 100644
--- a/data/yaml/sigs/sigdial.yaml
+++ b/data/yaml/sigs/sigdial.yaml
@@ -2,6 +2,8 @@ Name: ACL/ISCA Special Interest Group on Discourse and Dialogue
ShortName: SIGDIAL
URL: http://www.aclweb.org/sigdial
Meetings:
+ - 2024:
+ - 2024.sigdial-1 # Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue
- 2023:
- 2023.sigdial-1
- 2022:
diff --git a/data/yaml/sigs/siggen.yaml b/data/yaml/sigs/siggen.yaml
index 9ab9d7efcb..98c7fe2a85 100644
--- a/data/yaml/sigs/siggen.yaml
+++ b/data/yaml/sigs/siggen.yaml
@@ -2,6 +2,10 @@ Name: Special Interest Group on Natural Language Generation (SIGGEN)
ShortName: SIGGEN
URL: https://aclweb.org/aclwiki/SIGGEN
Meetings:
+ - 2024:
+ - 2024.inlg-main
+ - 2024.inlg-demos
+ - 2024.inlg-tutorials # Proceedings of the 17th International Natural Language Generation Conference: Tutorial Abstract
- 2023:
- 2023.inlg-main
- 2023.inlg-demos
diff --git a/data/yaml/venues/aiwolfdial.yaml b/data/yaml/venues/aiwolfdial.yaml
new file mode 100644
index 0000000000..0717230d25
--- /dev/null
+++ b/data/yaml/venues/aiwolfdial.yaml
@@ -0,0 +1,2 @@
+acronym: AIWolfDial
+name: The 2nd International AIWolfDial Workshop
diff --git a/data/yaml/venues/practicald2t.yaml b/data/yaml/venues/practicald2t.yaml
new file mode 100644
index 0000000000..90406eec81
--- /dev/null
+++ b/data/yaml/venues/practicald2t.yaml
@@ -0,0 +1,2 @@
+acronym: PracticalD2T
+name: The 2nd Workshop on Practical LLM-assisted Data-to-Text Generation