This term project aims to carry out explanatory and linguistic analyses of Kazakh-Russian code-switching based on a conversational dataset. Kazakh-Russian code-switching is extremely common in daily communication, since the majority of Kazakhs are bilingual. Both inter-sentential and intra-sentential code-switching are practiced, and intra-word code-switching is also observed.
This project uses the IARPA Babel Program Kazakh language collection release IARPA-babel302b-v1.0a. It contains approximately 203 hours of Kazakh conversational and scripted telephone speech collected in 2013 and 2014 along with corresponding transcripts. The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech. The Kazakh speech in this release represents that spoken in the Northeastern and Southern dialect regions of Kazakhstan. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 64 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.
The primary goal is to measure the frequency of code-switching within conversations and to analyze its morphological features through linguistic annotation. The project plan is as follows:
- Screening the dataset for occurrences of Kazakh-Russian code-switching: create general characteristics of code-switching examples (inter-sentential, intra-sentential, and intra-word);
- Literature review on syntactic and morphological annotation types and structures, and a review of research on computational approaches to code-switching;
- Creating an annotation scheme that can capture Kazakh and Russian morphology (and possibly syntax);
- Choosing an appropriate annotation structure from the following options: inline annotation of plain text; inline XML; column-based annotations (offline); standoff and hybrid standoff annotations (offline);
- Choosing annotation tools and annotating text files (50 conversations, tentative);
- Analyzing the annotated texts for frequency and types of code-switching, syntactic-morphological features of each occurrence, etc.;
- Sampling the Babel collection and building a data frame for explanatory purposes (age, gender, region of speakers, etc.);
- Writing a summary of the analyses;
- Identifying limitations;
- Preparing a presentation based on the analysis.
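
To make the column-based option and the frequency analysis more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration and not a decision of this project: the three-column layout (token, language tag, morphological tag), the `kk`/`ru` language codes, the `ru+kk` notation for a word that mixes a Russian stem with a Kazakh suffix, the tag names, and the function names `parse_utterances` and `count_switches`. The sample utterances are invented and are not taken from the Babel transcripts. As a simplification, a mixed word is assigned to the language of its final morpheme when it is compared with neighbouring words, so that an intra-word switch is not also counted as an intra-sentential one.

```python
from collections import Counter

# Invented sample in a hypothetical column-based format:
# token <TAB> language tag <TAB> morphological tag; a blank line ends an utterance.
SAMPLE = """\
мен\tkk\tPRON
бүгін\tkk\tADV
магазинге\tru+kk\tNOUN+DAT
бардым\tkk\tVERB

всё\tru\tPRON
нормально\tru\tADV

жақсы\tkk\tADJ
конечно\tru\tADV
"""


def parse_utterances(text):
    """Return a list of utterances; each is a list of (token, langs, morph)."""
    utterances, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current utterance.
            if current:
                utterances.append(current)
                current = []
            continue
        token, lang, morph = line.split("\t")
        # "ru+kk" becomes ["ru", "kk"]: one language per morpheme group.
        current.append((token, lang.split("+"), morph))
    if current:
        utterances.append(current)
    return utterances


def count_switches(utterances):
    """Count intra-word, intra-sentential, and inter-sentential switches."""
    counts = Counter()
    prev_utt_lang = None
    for utt in utterances:
        # Assumed convention: a word's language is that of its final morpheme.
        word_langs = [langs[-1] for _, langs, _ in utt]
        if prev_utt_lang is not None and word_langs[0] != prev_utt_lang:
            counts["inter-sentential"] += 1
        for i, (_, langs, _) in enumerate(utt):
            if len(set(langs)) > 1:
                counts["intra-word"] += 1
            if i > 0 and word_langs[i] != word_langs[i - 1]:
                counts["intra-sentential"] += 1
        prev_utt_lang = word_langs[-1]
    return counts


counts = count_switches(parse_utterances(SAMPLE))
for kind in ("inter-sentential", "intra-sentential", "intra-word"):
    print(kind, counts[kind])
```

On this invented sample the sketch reports two inter-sentential switches, one intra-sentential switch, and one intra-word switch. A standoff or XML structure would need a different parser, but the counting logic could stay the same as long as each token carries a language tag.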