http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate good-quality question answering (QA) datasets
that incorporate the knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensive, in-depth
evaluation of the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing a clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensive, in-depth
evaluation of the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing a clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce the SynCPKL Pipeline, which
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensive, in-depth
evaluation of the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing a clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM.
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge with a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
shintaro-ozaki/cs_bot
About
An automated system updates the survey information in the README every day.