diff --git a/data/xml/2024.acl.xml b/data/xml/2024.acl.xml
index f20229ebe1..8e53e8ad6d 100644
--- a/data/xml/2024.acl.xml
+++ b/data/xml/2024.acl.xml
@@ -11145,8 +11145,10 @@
       <author><first>Arman</first><last>Cohan</last><affiliation>Yale University</affiliation></author>
       <pages>16103-16120</pages>
       <abstract>Recent LLMs have demonstrated remarkable performance in solving exam-like math word problems. However, the degree to which these numerical reasoning skills are effective in real-world scenarios, particularly in expert domains, is still largely unexplored. This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning capabilities of LLMs in the context of understanding and analyzing financial documents containing both text and tables. We evaluate a wide spectrum of 27 LLMs, including those specialized in math, coding and finance, with Chain-of-Thought and Program-of-Thought prompting methods. We found that even the current best-performing system (i.e., GPT-4) still significantly lags behind human experts in solving complex numerical reasoning problems grounded in long contexts. We believe DocMath-Eval can be used as a valuable benchmark to evaluate LLMs’ capabilities to solve challenging numerical reasoning problems in expert domains.</abstract>
-      <url hash="…">2024.acl-long.852</url>
+      <url hash="…">2024.acl-long.852</url>
       <bibkey>zhao-etal-2024-docmath</bibkey>
+      <revision id="1" href="2024.acl-long.852v1" hash="…"/>
+      <revision id="2" href="2024.acl-long.852v2" hash="…">Included experimental results.</revision>
     </paper>
     <paper id="853">
       <title>Unintended Impacts of <fixed-case>LLM</fixed-case> Alignment on Global Representation</title>