Language identification to map language relatedness

Sonia Cromp, [email protected]

Sonia's Data Science for Linguists term project: a classifier that detects the language of a text sample, plus an analysis of which languages the classifier commonly confuses and why.

The data used is derived from Wikipedia server dumps. Here are all Wikipedia contributors.

Directory

Administrivia

project_plan.md lays out the ideas I had at the start of the project, which I largely adhered to
progress_report.md gives progress updates written as I worked on the project. I tried to make the reports detailed and easy to follow
final_report.md contains the final report
final-pres.pdf is the slideshow I used for the final presentation
LICENSE.md contains licensing information
README.md is what you're reading right now
.gitignore should be ignored

Code

1-data-explanation.ipynb explains the data gathering process. Since GitHub mangles section links, here's the same notebook through Jupyter's nbviewer.
2-dataAnonAndPrep.ipynb anonymizes all languages' writing systems, then reformats the anonymized and non-anonymized datasets to be all ready for machine learning. Since GitHub mangles section links, here's the same notebook through Jupyter's nbviewer.
3-naivebayes.ipynb does the language identification and relatedness mapping. Since GitHub mangles section links, here's the same notebook through Jupyter's nbviewer.
4-clusterfun.ipynb does a bit more relatedness mapping and analysis. Since GitHub mangles section links, here's the same notebook through Jupyter's nbviewer.
nbgrid.py was used to perform grid search with the classifier on CRC
nbgrid.sh is the corresponding slurm script to run nbgrid.py
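
To make the idea behind the classification and confusion analysis concrete, here is a minimal sketch in plain Python. It is not the code from the notebooks: the class `TinyNaiveBayes`, the helpers `char_ngrams` and `confusion_counts`, the choice of character bigrams with add-one smoothing, and the toy sentences are all assumptions made for illustration. The notebooks themselves describe the real features, data, and settings.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class TinyNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character n-grams."""

    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.docs = Counter()               # language -> number of samples
        self.vocab = set()                  # every n-gram seen in training

    def fit(self, samples):
        for text, lang in samples:
            grams = char_ngrams(text, self.n)
            self.counts[lang].update(grams)
            self.docs[lang] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        best_lang, best_score = None, -math.inf
        for lang in sorted(self.docs):
            # Add-one smoothing: every n-gram gets one extra pseudo-count.
            denom = sum(self.counts[lang].values()) + len(self.vocab)
            score = math.log(self.docs[lang] / total_docs)  # log prior
            for gram in char_ngrams(text, self.n):
                score += math.log((self.counts[lang][gram] + 1) / denom)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


def confusion_counts(model, labelled_samples):
    """Count (true language, predicted language) pairs."""
    return Counter((lang, model.predict(text)) for text, lang in labelled_samples)


# Toy data invented for this sketch; the project uses Wikipedia text.
train = [
    ("the cat sat on the mat", "en"),
    ("the dog and the bird", "en"),
    ("el gato y el perro", "es"),
    ("la casa del perro", "es"),
]
held_out = [("the bird", "en"), ("el perro", "es")]

model = TinyNaiveBayes(n=2).fit(train)
confusion = confusion_counts(model, held_out)
print(confusion)
```

Off-diagonal entries in `confusion` (pairs where the true and predicted languages differ) are the raw material for a relatedness analysis: with many languages, pairs that are frequently mistaken for each other are candidates for being closely related.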

Subdirectories

data_chunked contains the non-anonymized dataset (i.e. the final product of 1-data-explanation.ipynb)
data_samples contains intermediate examples of the data during the data-gathering process, which are discussed in 1-data-explanation.ipynb
datagather contains the actual scripts used to gather the data.
- get-data-wiki.py is the main gathering script
- get-data-wiki.sh contains some utilities used by the Python script
- crc contains the files needed to run the data-gathering process on Pitt's Center for Research Computing (CRC) resources
  - get-data-wiki.py and get-data-wiki.sh correspond to the same scripts in the parent directory, but with modifications for CRC
  - guide.md explains how to set up the script and all networking configurations on CRC, your personal computer, and a secondary storage computer
  - paramiko-tuto.py is a tutorial/test script for experimenting with the networking setup used in the data-gathering process
  - ex.txt is just a text file, sent across the network by paramiko-tuto.py
figs contains figures, most of which are in the final report

The guestbook is here.

License

The data is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported license (CC BY-SA 3.0) and the GNU Free Documentation License (GFDL). See the README.md in that directory for more information.

All other parts of this repository are licensed under the GNU General Public License v3.0.

Data-Science-for-Linguists-2021/languageID-relatedconfusion
