Margaret Jones (mmj32)
3 October 2017
Data Science for Linguists
Final Project
- Cats vs. Dogs on Twitter: Who is more popular?
- Comparing the number of likes, as well as how neologisms for and about cats and dogs are used.
- neologisms that will be compared:
- dogs: "doggo" and "doge"
- cats: "kitteh" and "toe beans"
- Plural forms must also be taken into account. Because it is unclear how Twitter handles plurals, both the singular and plural forms will be collected.
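As a minimal sketch of how the singular and plural forms could be collected together, the snippet below builds one search string per group. The helper names (`with_plurals`, `build_query`) are hypothetical, the pluralization rule is deliberately naive (add or strip a final "s"), and it assumes the search endpoint accepts `OR` and quoted phrases, which is how Twitter's standard search operators work.

```python
# Hypothetical helpers; the term lists come from the plan above.
NEOLOGISMS = {
    "dogs": ["doggo", "doge"],
    "cats": ["kitteh", "toe beans"],
}


def with_plurals(term):
    """Return the term followed by its naive singular/plural counterpart."""
    if term.endswith("s"):
        return [term, term[:-1]]   # "toe beans" -> also "toe bean"
    return [term, term + "s"]      # "doggo" -> also "doggos"


def build_query(terms):
    """Join terms with OR, quoting multi-word phrases so they match exactly."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return " OR ".join(quoted)


def group_query(group):
    """Build one query covering every form of every neologism in a group."""
    forms = []
    for term in NEOLOGISMS[group]:
        forms.extend(with_plurals(term))
    return build_query(forms)
```

Calling `group_query("dogs")` gives a single string that could be passed to whichever tweepy search call is used, so each group needs only one request pattern.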
- I will use tweepy to gather data from Twitter on posts people make about cats and dogs.
- Then I will use numpy and matplotlib to compute statistics and visualize the comparisons.
- Compare number of retweets and favorites.
- Compare location, number of posts, and time zone, to see whether a particular place is more likely than others to post under a given hashtag.
- Average length of the posts will also be compared.
- This will be completed in a Jupyter notebook, which will also be found in this repository.
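The retweet, favorite, and post-length comparisons listed above could be sketched roughly as follows. This assumes each tweet has already been reduced to a dict with `text`, `retweet_count`, and `favorite_count` keys (the field names used by Twitter's v1.1 tweet objects); the function names are hypothetical, and the input list is assumed to be non-empty.

```python
import numpy as np


def summarize(tweets):
    """Return mean retweets, favorites, and text length for a list of tweet dicts."""
    retweets = np.array([t["retweet_count"] for t in tweets], dtype=float)
    favorites = np.array([t["favorite_count"] for t in tweets], dtype=float)
    lengths = np.array([len(t["text"]) for t in tweets], dtype=float)
    return {
        "mean_retweets": float(retweets.mean()),
        "mean_favorites": float(favorites.mean()),
        "mean_length": float(lengths.mean()),
    }


def plot_comparison(summaries, metric):
    """Draw a bar chart of one metric, given {group label: summary dict}."""
    import matplotlib.pyplot as plt  # imported here so summarize() works without it

    labels = list(summaries)
    values = [summaries[label][metric] for label in labels]
    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_ylabel(metric)
    ax.set_title(f"{metric} by group")
    return fig
```

In the notebook, something like `plot_comparison({"dogs": summarize(dog_tweets), "cats": summarize(cat_tweets)}, "mean_favorites")` would then put the two groups side by side, where `dog_tweets` and `cat_tweets` stand in for the collected data.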
- Twitter seems to be a little tricky to work with when it comes to data sharing. According to what is found here, Twitter allows people to store ONLY tweet IDs in a dataset for distribution. However, even this has limitations on size and time scope, as well as licensing issues, so sharing my data isn't going to work very well. To stay on the safe side, I will not post my dataset of stored tweets anywhere on GitHub; what can be seen on GitHub will only be small clips from my dataset. That way, you can see what the data looks like and what the code is doing with it, but the full dataset will ultimately be read in from a local file on my machine.
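If an ID-only file ever did need to be produced, a minimal sketch might look like the one below. It assumes each stored tweet is a dict with an `id_str` key (the string ID field in Twitter's v1.1 tweet objects); the function names and the one-ID-per-line layout are illustrative choices, not something the plan specifies.

```python
from pathlib import Path


def ids_only(tweets):
    """Extract tweet IDs as strings, dropping duplicates but keeping order."""
    return list(dict.fromkeys(t["id_str"] for t in tweets))


def format_id_file(ids):
    """Render the IDs as plain text, one ID per line."""
    return "".join(f"{tweet_id}\n" for tweet_id in ids)


def save_ids(tweets, path):
    """Write only the tweet IDs to a local file; no text or user data is stored."""
    Path(path).write_text(format_id_file(ids_only(tweets)), encoding="utf-8")
```

Keeping the IDs as strings avoids any precision problems with very large integers if the file is later read by other tools.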