Sonia's Data Science for Linguists term project, identifying languages and mapping their relatedness

Data-Science-for-Linguists-2021/languageID-relatedconfusion

Language identification to map language relatedness

Sonia Cromp

Sonia's Data Science for Linguists term project: A classifier to detect the language of a text sample, and an analysis of which languages are commonly confused by the classifier and why

The data used is derived from Wikipedia server dumps. Here are all Wikipedia contributors.
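As a rough illustration of the "commonly confused languages" analysis described above, here is a minimal sketch of tallying which language pairs a classifier mixes up. The function names and the toy labels are hypothetical and invented for this example; they are not taken from the project's notebooks or data.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) language pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def most_confused_pairs(true_labels, predicted_labels, top_n=3):
    """Return the most frequent unordered pairs of distinct languages
    that were mistaken for one another, in either direction."""
    pair_counts = Counter()
    for (true_lang, pred_lang), n in confusion_counts(true_labels, predicted_labels).items():
        if true_lang != pred_lang:
            # frozenset merges (es -> pt) and (pt -> es) into a single pair
            pair_counts[frozenset((true_lang, pred_lang))] += n
    return [(tuple(sorted(pair)), n) for pair, n in pair_counts.most_common(top_n)]


# Toy labels, invented for illustration only (not project results)
true_labels = ["es", "es", "pt", "pt", "de", "nl", "de", "es"]
pred_labels = ["es", "pt", "es", "pt", "nl", "nl", "de", "pt"]

print(most_confused_pairs(true_labels, pred_labels))
```

Treating pairs as unordered is one possible design choice; keeping the direction of each error would instead show whether one language is more often mistaken for the other than the reverse.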

Directory

Administrivia

  • project_plan.md lays out the ideas I had at the start of the project, which I largely adhered to
  • progress_report.md gives progress updates written as I worked on the project. I tried to make the reports fairly detailed and easy to follow
  • final_report.md contains the final report
  • final_pres.pdf is the slideshow I used for the final presentation
  • LICENSE.md contains licensing information
  • README.md is what you're reading right now
  • .gitignore should be ignored

Code

Subdirectories

  • data_chunked contains the non-anonymized dataset (i.e. the final product of 1-data-explanation.ipynb)
  • data_samples contains intermediate examples of the data from the data gathering process, which are discussed in 1-data-explanation.ipynb
  • datagather contains the actual scripts used to gather the data.
    • get-data-wiki.py is the main gathering script
    • get-data-wiki.sh contains some utilities used by the Python script
    • crc contains the files needed to run the data gathering process on Pitt's Computing Research Center (CRC) resources
      • get-data-wiki.py and get-data-wiki.sh correspond with the same scripts in the parent directory, but with modifications for CRC
      • guide.md explains how to set up the script and all networking configurations on CRC, your personal computer, and a secondary storage computer
      • paramiko-tuto.py is a tutorial/test script for experimenting with the networking setup used in the data gathering process
      • ex.txt is just a text file, sent across the network by paramiko-tuto.py
  • figs contains figures, most of which are in the final report

The guestbook is here.

License

The data is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0) and the GNU Free Documentation License (GFDL). See the README.md in that directory for more information.

All other parts of this repository are licensed under the GNU General Public License v3.0.
