This project compares two groups of second-language learners: native English speakers learning Korean and native Korean speakers learning English.
This project comprises two parts: an intergroup comparison and an intragroup comparison. The intergroup comparison focuses on L2 proficiency at the same level in each group. The intragroup comparison focuses on how much proficiency increases from one level to the next within each group.
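As a rough illustration of the intragroup comparison, the sketch below computes the change in a mean measure between consecutive proficiency levels. It assumes each text has a numeric level and some numeric measure (a score is used here as a stand-in); the function name `level_gains` and the sample values are hypothetical and are not taken from either corpus or from the project's actual analysis.

```python
from statistics import mean


def level_gains(records):
    """Return {(lower, upper): difference in mean measure} for consecutive levels.

    `records` is an iterable of (level, measure) pairs. This is an assumed
    input shape, not the format of either corpus.
    """
    by_level = {}
    for level, measure in records:
        by_level.setdefault(level, []).append(measure)

    levels = sorted(by_level)
    means = {lvl: mean(by_level[lvl]) for lvl in levels}

    # Pair each level with the next one up and subtract the means.
    return {(lo, hi): means[hi] - means[lo] for lo, hi in zip(levels, levels[1:])}


# Hypothetical (level, measure) pairs for illustration only.
sample = [(1, 50.0), (1, 60.0), (2, 70.0), (3, 80.0), (3, 90.0)]
gains = level_gains(sample)
print(gains)
```

Running the same function once per learner group keeps the two intragroup analyses separate, while the per-level means can be reused for the intergroup comparison.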
This corpus comes from Dr. Park's GitHub repository:
https://github.com/jungyeul/korean-learner-corpus/tree/main/data
The corpus contains each participant's unique ID, nationality, gender, the topic of the text, the raw text, POS-tagged morphemes, Korean proficiency level, and essay score.
Since the aim of this project is to compare English-speaking learners of Korean with Korean-speaking learners of English, the data is filtered by participant nationality. Details are given in final_report.md.
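A minimal sketch of selecting rows by nationality might look like the following. It assumes the data has been loaded into a CSV-like table; the column names (`id`, `nationality`, `level`, `text`), the nationality values, and the helper `filter_rows` are all hypothetical, so check the actual files in the repository before adapting it.

```python
import csv
import io


def filter_rows(rows, column, value):
    """Keep rows whose `column` matches `value`, ignoring case and outer spaces."""
    target = value.strip().lower()
    return [r for r in rows if (r.get(column) or "").strip().lower() == target]


# Hypothetical in-memory sample; real column names and values may differ.
sample_csv = """id,nationality,level,text
K001,USA,1,sample text a
K002,Japan,2,sample text b
K003,usa,3,sample text c
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
english_l1 = filter_rows(rows, "nationality", "USA")
print([r["id"] for r in english_l1])
```

Nationality is only a proxy for first language, so a filter like this may need a list of several nationality values rather than a single one.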
This learner corpus was collected at the University of Pittsburgh English Language Institute and is available on GitHub:
https://github.com/ELI-Data-Mining-Group/PELIC-dataset
It is a longitudinal corpus, which is significant because it offers "greater opportunity for tracking development in a natural classroom setting" (PELIC readme.md, 1. overview).
The corpus contains 1,177 participants and 4,250,703 tokens in total. To match the purpose of this project, only the data from participants whose mother tongue is Korean is selected. Details are given in final_report.md.
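To check how large the Korean-L1 subset is, a summary along these lines could be used. This is only a sketch: the field names `anon_id`, `L1`, and `text` are assumptions about the table layout, the helper `subset_summary` is hypothetical, and tokens are counted by splitting on whitespace, which will not reproduce the corpus's own token counts.

```python
def subset_summary(rows, l1):
    """Count distinct learners, texts, and whitespace tokens for one L1.

    `rows` is a list of dicts with assumed keys "anon_id", "L1", and "text".
    """
    subset = [r for r in rows if r["L1"] == l1]
    learners = {r["anon_id"] for r in subset}
    # Whitespace splitting is a simplification, not the corpus tokenization.
    tokens = sum(len(r["text"].split()) for r in subset)
    return {"learners": len(learners), "texts": len(subset), "tokens": tokens}


# Hypothetical rows for illustration only.
sample_rows = [
    {"anon_id": "a1", "L1": "Korean", "text": "I like to study English"},
    {"anon_id": "a1", "L1": "Korean", "text": "My class is fun"},
    {"anon_id": "b2", "L1": "Arabic", "text": "Hello there"},
    {"anon_id": "c3", "L1": "Korean", "text": "We went home"},
]

summary = subset_summary(sample_rows, "Korean")
print(summary)
```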