# smapp-toolkit
This is a user-friendly python package for interfacing with large collections of tweets. Developed at the SMaPP lab at New York University.
- MongoTweetCollection
- BSONTweetCollection
- Shared Collection Functions
- containing
- count
- texts
- term_counts
- sample
- apply_labels
- since
- until
- language
- user_lang_contains
- excluding_retweets
- user_location_containing
- field_containing
- geo_enabled
- non_geo_enabled
- limit
- top_hashtags
- top_unigrams / top_bigrams / top_trigrams
- top_urls
- top_images
- top_mentions
- top_links
- top_user_locations
- top_geolocation_names
- top_entities
- top_X to_csv
- group_by
- dump_csv
- dump_bson_topath
- dump_bson
- dump_json
- MongoTweetCollection Only Functions
- BSONTweetCollection Only Functions
Supports Python 2.7
Simplest: using pip:
pip install smapp-toolkit
To update to the latest version, if you have an older one installed:
pip install -U smapp-toolkit
Or download the source code using git
git clone https://github.com/SMAPPNYU/smapp-toolkit
cd smapp-toolkit
python setup.py install
or download the tarball and install.
The smapp-toolkit depends on the following packages, which will be automatically installed when installing smapp-toolkit:
- pymongo, the Python MongoDB driver
- smappPy, a utility library from SMaPP
- networkx, a library for building and analyzing graphs
- pandas, a Python data analysis library
- simplejson
## Pushing to PyPi
To bump the version and push to GitHub, run bash yvanbump.sh.
To bump the version, push to GitHub, and upload to PyPI, run bash upload+to_pypi.sh.
To upload to pypi you need:
- to be added with the right permissions to pypi
- a .pypirc file in your ~ directory
- to follow this guide.
This allows you to plug into a running live MongoDB database and run toolkit methods on the resulting collection object. Abstract:
from smapp_toolkit.twitter import MongoTweetCollection
collection = MongoTweetCollection(address='MONGODB-HOSTNAME',
port='MONGODB-PORT',
username='MONGO-DATABASE-USER',
password='MONGO-DATABASE-PASSWORD',
dbname='MONGO-DATABASE')
Practical:
from smapp_toolkit.twitter import MongoTweetCollection
collection = MongoTweetCollection(address='superhost.bio.nyu.edu',
port='27017',
username='readWriteUser',
password='readwritePassword',
dbname='GermanyElectionDatabase')
MONGODB-HOSTNAME
is the domain name or ip address of the server that is hosting the database.
MONGODB-PORT
is the port on which the running MongoDB instance is accessible on the server.
MONGO-DATABASE-USER
is the user on the database that can at least read the database.
MONGO-DATABASE-PASSWORD
is the password for that user on that particular database.
MONGO-DATABASE
is the name of the database running on the mongo instance.
Returns an iterable collection object that can be used like so:
for tweet in collection: print tweet
This allows you to plug in a bson file and run toolkit methods on the resulting collection object.
Abstract:
from smapp_toolkit.twitter import BSONTweetCollection
collection = BSONTweetCollection('/PATH/TO/FILE.bson')
Practical:
from smapp_toolkit.twitter import BSONTweetCollection
collection = BSONTweetCollection('/home/toolkituser/datafolder/file.bson')
/PATH/TO/FILE.bson
the path on your computer's filesystem / disk to the bson file.
Returns an iterable collection object that can be used like so:
for tweet in collection: print tweet
Gets the tweets that contain one or more terms.
Abstract:
collection.containing('TERM-ONE', 'TERM-TWO', 'ETC')
Practical:
collection.containing('#bieber', '#sexy')
Returns a collection object with a filter applied to it to only return tweet objects where the tweet text contains those terms.
Counts the number of tweets in a collection. Can be called on a collection object directly or chained after another method.
Abstract:
collection.count()
Practical:
collection.count()
Chained:
collection.containing('#bieber').count()
Returns the number of tweets in a collection object.
Gets the texts from a collection object or a collection object with a chained method applied.
Abstract:
texts = collection.texts()
Chained:
texts = collection.containing('#bieber').texts()
Allows you to count particular terms and split up the counts by a particular time period.
Abstract:
collection.term_counts(['TERM-TWO', 'TERM-ONE'], count_by='TIME-DELIMITER', plot=BOOLEAN)
Practical:
collection.term_counts(['justin', 'miley'], count_by='days', plot=False)
count_by
can be in days, hours, or minutes.
plot
is a True or False variable.
Returns a dictionary where each key is the date and each value is another dictionary.
In the sub-dictionary the keys are the terms you chose, plus potentially a _total
field (whose exact meaning is unclear; it is not the total number of tweets). The dictionary looks like so:
{
'2015-04-01': {'justin': 1312, 'miley': 837},
'2015-04-02': {'justin': 3287, 'miley': 932}
}
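For example, here is a minimal sketch of iterating the returned dictionary (it assumes the date-keyed structure shown above):
counts = collection.term_counts(['justin', 'miley'], count_by='days', plot=False)
for day in sorted(counts):
    # each value is a dict mapping a term (and possibly '_total') to a count
    print('{0}: justin={1}, miley={2}'.format(day, counts[day].get('justin', 0), counts[day].get('miley', 0)))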
WARNING DOES NOT WORK
Gets a random sample of tweets.
Abstract:
collection.sample(FRACTION-OF-1-TO-SAMPLE)
Practical:
collection.sample(0.33)
Chained:
collection.containing('#bieber').sample(0.33).texts()
Returns a collection object with a filter applied so that it only returns a random sample of tweets, the size of which is determined by the fraction given.
Applies a set of named labels to objects from a collection if certain fields in the collection meet certain criteria. It then outputs a bson file where tweets that matched the filter have an extra labels field in them with the appropriate labels.
Abstract:
collection.apply_labels(
list_of_labels
,list_of_fields
,list_for_values
,bsonoutputpath
)
Practical:
collection.apply_labels(
[['religious_rank', 'religious_rank', 'political_rank'], ['imam', 'cleric', 'politician']]
,['user.screen_name', 'user.id']
,[['Obama', 'Hillary'], ['1234567', '7654321']]
,'outputfolder/bsonoutput.bson'
)
NOTE: ['1234567', '7654321'] are not the actual ids of any Twitter users; they are just dummy numbers.
list_of_labels
is a list with two lists inside it where the first list contains names for labels and the second list
contains the labels themselves. For example: religious_rank
and imam
would be a label called religious_rank for the label value imam.
Each field in the list_of_fields
array is a string that takes dot notation. user.screen_name would be the screen_name
entry in the user entry in the collection object. You can nest these for as many levels as you have in the collection
object.
list_for_values
is a list that contains as many lists as there are fields to match. Each of these lists (inside list_for_values) is a list of the values you would like that field to match. So if you want the user.screen_name to match 'obama', 'hillary', or 'lessig' then you would use:
list_of_fields = ['user.screen_name']
list_for_values = [['obama', 'hillary', 'lessig']]
as inputs.
bsonoutputpath
is the path, relative to where you run the script, of the output file with the new labels.
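Putting these pieces together, a minimal sketch of a call using the list_of_fields and list_for_values above might look like this (the label name, label value, and output path are hypothetical):
collection.apply_labels(
    [['political_rank'], ['politician']]
    ,['user.screen_name']
    ,[['obama', 'hillary', 'lessig']]
    ,'outputfolder/labeled_output.bson'
)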
After you run this method each tweet object in your output BSON will now have a field called 'labels' like so:
{
.
.
.
'labels' : {
'1': {'name': 'religious_rank', 'type': 'cleric'},
'2': {'name': 'religious_rank', 'type': 'imam'},
'3': {'name': 'eye_color', 'type': 'brown'}
}
.
.
.
}
Gets tweets created after a given datetime.
Abstract:
collection.since(DATETIME)
Practical:
collection.since(datetime(2014,1,30))
Chained:
from datetime import datetime
collection.since(datetime(2014,1,30)).count()
collection.since(datetime(2014,2,16)).until(datetime(2014,2,19)).containing('obama').texts()
Returns a collection object with the added filter that it will only return objects after a certain date.
Check out a reference on datetime here.
Gets tweets created before a given datetime.
Abstract:
collection.until(DATETIME)
Practical:
collection.until(datetime(2014,1,30))
Chained:
from datetime import datetime
collection.until(datetime(2014,1,30)).count()
collection.since(datetime(2014,2,16)).until(datetime(2014,2,19)).containing('obama').texts()
Note that both 'since(...)' and 'until(...)' are exclusive (i.e., they are GT/> and LT/<, respectively, not GTE/>= or LTE/<=). This means that since(datetime(2014, 12, 24)) will return tweets after EXACTLY 12/24/2014 00:00:00 (M/D/Y H:M:S). Datetimes may be specified to the second: datetime(2014, 12, 24, 6, 30, 25) is 6:30:25 AM, Universal Time. If the time (hours, minutes, etc.) is not specified, it defaults to 00:00:00.
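For example, a short sketch combining second-level precision with chained since and until (the dates are arbitrary):
from datetime import datetime
# tweets strictly after 2014-12-24 06:30:25 UTC and strictly before 2014-12-25 00:00:00 UTC
collection.since(datetime(2014, 12, 24, 6, 30, 25)).until(datetime(2014, 12, 25)).count()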
Gets all tweets tagged by Twitter (and not the users themselves) with a certain language. That is, it gets all tweets Twitter thinks are in language X (French, English, etc.) and returns a new collection object with those tweets.
Abstract:
collection.language(LANGUAGE-CODE)
Practical:
collection.language('en')
Chained:
collection.language('en').texts()
collection.language('ru', 'uk') # get tweets in Russian or Ukrainian
Returns a collection with an added filter such that all the tweets returned from the collection will be tweets tagged with that language code.
LANGUAGE-CODE
You can check out the various language codes on twitter's API page here.
This gets all tweets where the user has marked their own language preference. So it would look for all users who marked language X (French, English, etc.) as their language on their user profile and then get tweets from those users that are present in the collection object.
Abstract:
collection.user_lang_contains('LANGUAGE-CODE', 'LANGUAGE-CODE')
Practical:
collection.user_lang_contains('de')
collection.user_lang_contains('de', 'fr')
Chained:
collection.user_lang_contains('de', 'fr').texts()
collection.user_lang_contains('de', 'fr') # get tweets from users whose profile language is German or French
Returns a collection with an added filter such that all the tweets returned from the collection will be from users whose profile language contains that language code.
Abstract:
collection.excluding_retweets()
Chained:
collection.excluding_retweets().count()
Returns a collection object filtered to exclude retweets.
This gets tweets where the user locations contains certain location names.
Abstract:
collection.user_location_containing('PLACE-NAME', 'PLACE-WORD')
Practical:
collection.user_location_containing('new york', 'nyc')
Chained:
collection.user_location_containing('new york', 'nyc').texts()
Returns a collection object filtered to only include tweets where the user's location field matches or contains one of the given terms.
This method can be used to query a particular field on a tweet object by using dot notation to dig into each sub object or field.
Abstract:
collection.field_containing('user.description', 'TERM', 'TERM', 'TERM')
Practical:
collection.field_containing('user.description', 'kittens', 'imgur', 'internet')
Chained:
collection.field_containing('user.description', 'kittens', 'imgur', 'internet').texts()
You can see the fields and tweet structure here.
Adds a filter to a collection object that only returns geo tweets.
Abstract:
collection.geo_enabled()
Chained:
collection.geo_enabled().texts()
Returns a collection object that only contains tweets that have geo location enabled.
Abstract:
collection.non_geo_enabled()
Chained:
collection.non_geo_enabled().texts()
Returns a collection object that only contains tweets that do not have geo location enabled.
Abstract:
collection.limit(NUMBER-TO-LIMIT)
Practical:
collection.limit(10)
Chained:
collection.sort('timestamp',-1).limit(10).texts()
Returns a collection object that only contains the number of tweets specified by the limit. This is not a random sample; it simply returns the first N tweets.
Gets the top hashtags
Abstract:
counts = collection.top_hashtags(n=NUMBEROFHASHTAGS)
Practical:
counts = collection.top_hashtags(n=10)
Chained:
counts = collection.since(datetime(2015,1,1)).until(datetime(2015,1,2)).top_hashtags(n=10)
Returns a pandas data series that contains the top hashtags.
Abstract:
counts = collection.top_unigrams(n=NUMBER-UNIGRAMS)
# or
counts = collection.top_bigrams(n=NUMBER-BIGRAMS)
# or
counts = collection.top_trigrams(n=NUMBER-TRIGRAMS)
Practical:
counts = collection.top_unigrams(n=5)
# or
counts = collection.top_bigrams(n=5)
# or
counts = collection.top_trigrams(n=5)
Chained:
counts = collection.since(datetime(2015,1,1)).until(datetime(2015,1,2)).top_unigrams(n=5)
# or
counts = collection.since(datetime(2015,1,1)).until(datetime(2015,1,2)).top_bigrams(n=5)
# or
counts = collection.since(datetime(2015,1,1)).until(datetime(2015,1,2)).top_trigrams(n=5)
Returns a pandas data series that contains the top unigrams, bigrams, or trigrams.
Gets the urls from the entities field of a tweet object. The difference between this and top_links is that top links gets both urls and media references.
counts = collection.top_urls(n=NUMBERURLS)
Practical:
counts = collection.top_urls(n=10)
Chained:
counts = collection.since(datetime(2015,1,1)).until(datetime(2015,1,2)).top_urls(n=10)
Returns a pandas data series that contains the top urls.
counts = collection.top_images(n=NUMBERIMAGES)
Practical:
counts = collection.top_images(n=10)
Chained:
counts = collection.since(datetime(2015,1,1)).until(datetime(2015,1,2)).top_images(n=10)
Returns a pandas data series that contains the top images.
Gets the top Twitter mentions, i.e. other Twitter screen names marked with @ symbols in front of them. It returns the X most-mentioned users in a collection.
counts = collection.top_mentions(n=NUMBERMENTIONS)
Practical:
counts = collection.top_mentions(n=10)
Chained:
counts = collection.since(datetime(2015,1,1)).until(datetime(2015,1,2)).top_mentions(n=10)
Returns a pandas data series that contains the top mentions.
Gets the top retweeted tweet objects from a collection.
Abstract:
top_retweets = collection.top_retweets(n=NUMBER-TOP-RETWEETS, rt_columns=['FIELD-ONE', 'FIELD-TWO', 'ETC'])
# or
top_retweets = collection.top_retweets(n=NUMBER-TOP-RETWEETS)
Practical:
top_retweets = collection.top_retweets(n=10, rt_columns=['user.screen_name', 'user.location', 'created_at', 'text'])
# or
top_retweets = collection.top_retweets(n=10)
Chained:
top_retweets = collection.since(datetime.utcnow()-timedelta(hours=1)).top_retweets(n=10, rt_columns=['user.screen_name', 'user.location', 'created_at', 'text'])
Output:
id count
123456789 350
123456444 330
987654321 305
987654329 266
987654323 244
554286237 236
554286238 236
231379283 226
874827344 185
482387489 185
rt_columns
is a python list where each element is a field on a tweet object or a nested/compound field on the tweet object. Specify which columns / fields (of the original tweet) to include in the result by passing the rt_columns argument. If no rt_columns argument is passed to the function, the default columns are ['user.screen_name', 'created_at', 'text'].
Returns a pandas data frame, which is like a pandas data series except that it is not one-dimensional. The data frame has the columns id and count, plus any extra columns you specified in your rt_columns input parameter if there is one.
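As a sketch, the returned data frame can be inspected and saved like any other pandas object (the output filename is hypothetical):
top_retweets = collection.top_retweets(n=10)
print(top_retweets['count'])  # retweet counts, one row per retweeted tweet id
top_retweets.to_csv('top_retweets.csv', encoding='utf8')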
Gets the urls and media references from the entities field of a tweet object. The difference between this and top_urls is that top urls gets only urls.
counts = collection.top_links(n=NUMBERLINKS)
Practical:
counts = collection.top_links(n=10)
Chained:
counts = collection.since(datetime(2015,1,1)).until(datetime(2015,1,2)).top_links(n=10)
Returns a pandas data series that contains the top links.
Gets the top locations mentioned in the user's location field in the user object inside each tweet object.
counts = collection.top_user_locations(n=NUMBERLOCATIONS)
Practical:
counts = collection.top_user_locations(n=10)
Chained:
counts = collection.since(datetime(2015,1,1)).until(datetime(2015,1,2)).top_user_locations(n=10)
Returns a pandas data series that contains the top user locations.
If the place field exists inside a tweet object, this will return the top X place names for geolocated tweets.
counts = collection.top_geolocation_names(n=NUMBERLOCATIONS)
Practical:
counts = collection.top_geolocation_names(n=10)
Chained:
counts = collection.since(datetime(2015,1,1)).until(datetime(2015,1,2)).top_geolocation_names(n=10)
Returns a pandas data series that contains the top geolocation names.
Top entities lets you do multiple top_x methods in one go and have them all returned in one data structure.
Abstract:
top_entities_returned = collection.top_entities(n=NUMBERENTITIES, urls=TRUE/FALSE, images=TRUE/FALSE, hts=TRUE/FALSE, mentions=TRUE/FALSE, geolocation_names=TRUE/FALSE, user_locations=TRUE/FALSE, ngrams=(1,2), ngram_stopwords=[], ngram_hashtags=TRUE/FALSE, ngram_mentions=TRUE/FALSE, ngram_rts=TRUE/FALSE, ngram_mts=TRUE/FALSE, ngram_https=TRUE/FALSE)
Practical 1:
# get the top unigrams, bigrams, and trigrams and return in a dict()
top_entities_returned = collection.top_entities(ngrams=(1,2,3))
Output:
print top_entities_returned['2-grams']
فيديو قوات 350
الطوارى السعودية 330
قوات الطوارى 305
#السعودية #saudi 266
#ksa #السعودية 244
قوات الطوارئ 236
الطوارئ السعودية 236
#saudi #الرياض 226
يقبضون على 185
السعودية يقبضون 185
dtype: int64
Practical 2:
# get the top 2 hashtags and return in a dict()
top_entities_returned = collection.top_entities(n=2, hts=True)
Output:
print top_entities_returned['hts']
#obamaisoursavior #oregonmilitia
Returns a python dictionary object with pandas.Series objects for each top entity list in the dictionary.
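A minimal sketch of walking through that dictionary (the keys, such as 'hts' and '2-grams', follow the examples above):
top_entities_returned = collection.top_entities(n=5, hts=True, mentions=True)
for entity_type, series in top_entities_returned.items():
    # each value is a pandas.Series of counts for that entity type
    print(entity_type)
    print(series)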
For exporting the results of top_X methods:
- top_hashtags
- top_unigrams / top_bigrams / top_trigrams
- top_urls
- top_images
- top_mentions
- top_links
- top_user_locations
- top_geolocation_names
and each sub-dictionary returned by top_entities.
All top_x()
methods return pandas.Series objects. The only one that doesn't is top_retweets,
which returns a pandas data frame instead. Both can be exported to csv like so:
Abstract:
hashtags = collection.top_hashtags(n=NUMBER-HASHTAGS)
hashtags.to_csv('/path/to/my/output.csv', encoding='utf8')
Practical:
hashtags = collection.top_hashtags(n=5)
hashtags.to_csv('~/hashtags-output.csv', encoding='utf8')
Use the group_by method to group tweets by time slices. Supported time slices are days, hours, minutes, and seconds.
Abstract:
collection.group_by('TIME-UNIT')
Practical:
# counting by time slice
for time, tweets in collection.group_by('hours'):
print('{time}: {count}'.format(time=time, count=len(list(tweets))))
which outputs:
2015-01-12 17:00:00: 13275
2015-01-12 18:00:00: 23590
Chaining 1: (not sure if this works, MAY NOT WORK)
#counting by time slice
print collection.since(datetime(2015,6,18,12)).until(datetime(2015,6,18,15)).group_by('hours').count()
which outputs:
2015-06-18 12:00:00 164181
2015-06-18 13:00:00 167129
2015-06-18 14:00:00 165057
Chaining 2: (not sure if this works, MAY NOT WORK)
# counting user locations by time slice
print collection.since(datetime(2015,6,1)).group_by('days').top_user_locations(n=5)
which outputs:
# London London, UK Manchester Scotland UK
# 2015-06-1 4 2 1 1 2
# 2015-06-2 11 4 9 3 3
# 2015-06-3 14 1 5 NaN 4
# 2015-06-4 17 1 5 1 6
# 2015-06-5 10 3 3 3 3
Chaining 3: (not sure if this works, MAY NOT WORK)
print collection.group_by('hours').entities_counts()
which outputs:
_total url image mention hashtag geo_enabled retweet
2015-01-12 17:00:00 13275 881 1428 6612 2001 10628 15
2015-01-12 18:00:00 23590 1668 2509 12091 3575 19019 36
Chaining 4: (not sure if this works, MAY NOT WORK)
# counting tweet languages over time slice
print collection.since(datetime.utcnow()-timedelta(minutes=10)).until(datetime.utcnow()).group_by('minutes').language_counts(langs=['en', 'es', 'other'])
which outputs:
en es other
2015-06-18 21:23:00 821 75 113
2015-06-18 21:24:00 2312 228 339
2015-06-18 21:25:00 2378 196 339
2015-06-18 21:26:00 2352 233 295
2015-06-18 21:27:00 2297 239 344
2015-06-18 21:28:00 1776 173 247
2015-06-18 21:29:00 1825 162 269
2015-06-18 21:30:00 2317 237 326
2015-06-18 21:31:00 2305 233 342
2015-06-18 21:32:00 2337 235 308
2015-06-18 21:33:00 1508 136 228
Chaining 5: (not sure if this works, MAY NOT WORK)
# counting number of unique users per time slice
unique_users = collection.group_by('minutes').unique_users()
tweets = collection.group_by('minutes').count()
unique_users['total tweets'] = tweets['count']
unique_users
which outputs:
unique_users total tweets
2015-04-16 17:01:00 377 432
2015-04-16 17:02:00 432 582
2015-04-16 17:03:00 442 610
2015-04-16 17:04:00 393 531
2015-04-16 17:05:00 504 756
2015-04-16 17:06:00 264 365
Note: there is no (or minimal) chaining after this method. Doing so can create bugs or crashes, because the function does not return a collection but a generator.
Returns a generator that can be iterated through in a for loop. Each iteration yields two parts: a time stamp and a list of tweets. So if you group a collection with tweets spanning an entire day by hours, this generator loop should fire 24 times (24 hours in a day), producing 24 time stamps and 24 lists of tweets, each containing the tweets from one 1-hour slice. The same logic applies to any time slice.
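For example, a sketch that collects the per-slice counts into a dictionary instead of printing them:
# build a dict mapping each time slice to its tweet count
counts_per_slice = {}
for time, tweets in collection.group_by('hours'):
    counts_per_slice[time] = len(list(tweets))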
Takes a collection and dumps its contents to a csv.
Abstract:
collection.dump_csv('/path/to/output.csv')
Practical:
collection.dump_csv('~/my_tweets.csv')
# or
# the desired columns may be specified in the `columns=` named argument.
collection.dump_csv('my_tweets.csv', columns=['id_str', 'user.screen_name', 'user.location', 'user.description', 'text'])
#or
#If the filename specified ends with `.gz`, the output file will be gzipped.
collection.dump_csv('my_tweets.csv.gz')
Writes a csv file to disk. The default columns in this csv should be ['id_str', 'user.screen_name', 'timestamp', 'text'].
This will dump whole tweets in MongoDB's BSON format into a specified file. Note that BSON is a 'binary' format (it will look a little funny if opened in a text editor). This is the native format for MongoDB's mongodump program. The file is NOT line-separated.
Abstract:
collection.dump_bson_topath('/path/to/output.bson')
Practical:
collection.dump_bson_topath('~/output.bson')
This will dump a bson file of tweets. Once you have this bson you can convert it to JSON formatted bson (a file with a json object on each line) with the bsondump tool (if you have it) like so:
bsondump output.bson > output.json
Returns a bson file. This is a binary file and is not human readable.
Dumps JSON-formatted BSON. This is not a binary file; it is a list of JSON objects stored line by line. This is why we have both dump_bson
and dump_bson_topath:
the dump_bson method (this method) was not dumping actual binary BSON files.
Abstract:
collection.dump_bson('/path/to/output.bson')
Practical:
collection.dump_bson('~/output.bson')
# or
# to append BSON tweets to the given filename (if file already has tweets)
collection.dump_bson('~/output.bson', append=True)
Writes a file to disk with a JSON object on each line. This is human readable.
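A sketch of reading such a line-delimited file back into Python (it assumes every line is a complete JSON object; the filename is hypothetical):
import json

with open('output.bson') as f:
    for line in f:
        tweet = json.loads(line)
        print(tweet['text'])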
MAY NOT WORK
This will dump whole tweets in JSON format into a specified file, one tweet per line.
Abstract:
collection.dump_json('/path/to/output.json')
Practical:
collection.dump_json('~/output.json')
# or
# to append tweets in the collection to an existing file
collection.dump_json('~/output.json', append=True)
# to write JSON into pretty, line-broken and properly indented format (takes more space)
collection.dump_json('~/output.json', pretty=True)
collection.dump_json('~/output.json', pretty=True, append=True)
Returns a file with a json object on each line that is written to disk. It is human readable.
Sorts tweets inside a collection by a particular field.
Abstract:
collection.sort('FIELD', ORDER)
Practical:
collection.sort('timestamp',-1)
collection.sort('timestamp', 1)
Chained:
collection.sort('timestamp',-1).limit(10).texts()
Returns a collection where the tweets are sorted by the given field.
You can check out the ORDER
here.
-1 means sort in DESCENDING order. 1 means sort in ASCENDING order.
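If you prefer named constants, pymongo defines ASCENDING and DESCENDING as 1 and -1, so they should be interchangeable with the integer values:
import pymongo

collection.sort('timestamp', pymongo.DESCENDING)  # same as -1
collection.sort('timestamp', pymongo.ASCENDING)   # same as 1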
##----- none for now in BSONTweetCollection Only Functions -----
The smapp_toolkit.plotting
module has functions that can make canned visualizations of the data generated by the functions above.
For more examples, see the examples folder.
See examples in the gallery.
from datetime import datetime
import matplotlib.pyplot as plt
from smapp_toolkit.plotting import stacked_bar_plot
data = col.since(datetime(2015,6,18,12)).until(datetime(2015,6,18,12,10)).group_by('minutes').entities_counts()
data['original tweet'] = data['_total'] - data['retweet']
plt.figure(figsize=(10,10))
stacked_bar_plot(data, ['retweet', 'original tweet'], x_tick_date_format='%H:%M', colors=['salmon', 'lightgrey'])
plt.title('Retweet proportion', fontsize=24)
plt.tight_layout()
data = col.since(datetime(2015,6,18,12)).until(datetime(2015,6,18,12,10)).group_by('minutes').top_user_locations()
stacked_bar_plot(data, ['London', 'New York'], x_tick_date_format='%H:%M')
plt.title('Tweets from London and New York users', fontsize=18)
plt.tight_layout()
See more examples in the gallery.
The following functions make plots by first getting data from collection and then making the plots. Their use is discouraged as getting the data can sometimes be slow. Always prefer to get the data and make plots separately, saving the data first.
bins, counts = collection.containing('#sexy').tweets_over_time_figure(
start_time,
step_size=timedelta(minutes=1),
num_steps=60,
show=False)
plt.title("Tweets containing '#sexy'")
plt.show()
collection.term_counts(['justin', 'miley'], count_by='days', plot=True, plot_total=True)
plt.show()
collection.since(datetime(2015,6,1)).tweet_retweet_figure(group_by='days')
You may set group_by= to days, hours, minutes, or seconds.
collection.since(datetime(2015,6,1)).geocoded_tweets_figure()
collection.tweets_with_urls_figure()
collection.tweets_with_images_figure()
collection.tweets_with_mentions_figure()
collection.tweets_with_hashtags_figure()
for tweet in collection.containing('#nyc'):
print(tweet['text'])
Here are functions for exporting data from collections to different formats.
For geolocated tweets, in order to get the geolocation out in the csv, add coordinates.coordinates
to the columns list. This will put the coordinates in GeoJSON (long, lat) in the column.
Alternatively, add coordinates.coordinates.0
and coordinates.coordinates.1
to the columns list. This will add two columns with the longitude and latitude in them respectively.
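For example, a sketch of both options (filenames are hypothetical):
# GeoJSON (long, lat) pairs in a single column
collection.dump_csv('geo_tweets.csv', columns=['id_str', 'text', 'coordinates.coordinates'])
# or longitude and latitude in two separate columns
collection.dump_csv('geo_tweets.csv', columns=['id_str', 'text', 'coordinates.coordinates.0', 'coordinates.coordinates.1'])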
The toolkit supports exporting a retweet graph using the networkx
library. In the exported graph users are nodes, retweets are directed edges.
If the collection result includes non-retweets as well, users with no retweets will also appear in the graph as isolated nodes. Only retweets are edges in the resulting graph.
Exporting a retweet graph is done as follows:
import networkx as nx
digraph = collection.containing('#AnyoneButHillary').only_retweets().retweet_network()
nx.write_graphml(digraph, '/path/to/outputfile.graphml')
Nodes and edges have attributes attached to them, which are customizable using the user_metadata
and tweet_metadata
arguments.
user_metadata
is a list of fields from the User object that will be included as attributes of the nodes.tweet_metadata
is a list of the fields from the Tweet object that will be included as attributes of the edges.
The defaults are
user_metadata=['id_str', 'screen_name', 'location', 'description']
tweet_metadata=['id_str', 'retweeted_status.id_str', 'timestamp', 'text', 'lang']
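Passing these defaults explicitly should be equivalent to calling retweet_network() with no arguments; a sketch:
import networkx as nx

digraph = collection.only_retweets().retweet_network(
    user_metadata=['id_str', 'screen_name', 'location', 'description'],
    tweet_metadata=['id_str', 'retweeted_status.id_str', 'timestamp', 'text', 'lang'])
nx.write_graphml(digraph, '/path/to/outputfile.graphml')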
For large graphs where the structure is interesting but the tweet text itself is not, it is advisable to omit most of the metadata. This will make the resulting file smaller, and is done as follows:
import networkx as nx
digraph = collection.containing('#AnyoneButHillary').only_retweets().retweet_network(user_metadata=['screen_name'], tweet_metadata=[''])
nx.write_graphml(digraph, '/path/to/outputfile.graphml')
The .graphml
file may then be opened in graph analysis/visualization programs such as Gephi or Pajek.
The networkx
library also provides algorithms for visualization and analysis.
Smapp-toolkit has some built-in plotting functionality. See the example scripts, and check out the gallery!
Currently implemented:
- barchart of tweets per time-unit (tweets_over_time_figure(...))
- barchart by language by day (languages_per_day_figure(...))
- line chart (tweets per day) with vertical event annotations (tweets_per_day_with_annotations_figure(...))
- geolocation names by time (geolocation_names_by_day_figure(...))
- user locations by time (user_locations_by_day_figure(...))
In order to get these to work, some extra packages (not automatically installed) need to be installed:
- matplotlib
- seaborn
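They can be installed with pip, for example:
pip install matplotlib seaborn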
SMAPP stores tweets in MongoDB databases, and splits the tweets across multiple MongoDB collections, because this gives better performance than a single large MongoDB collection. The MongoDB Database needs to have a smapp_metadata
collection with a single smapp-tweet-collection-metadata
document in it, which specifies the names of the tweet collections.
The smapp-tweet-collection-metadata
document has the following form:
{
'document': 'smapp-tweet-collection-metadata',
'tweet_collections': [
'tweets_1',
'tweets_2',
'tweets_3',
]
}
The MongoTweetCollection
object may still be used if the metadata collection and document have different names:
collection = MongoTweetCollection(..., metadata_collection='smapp_metadata', metadata_document='smapp-tweet-collection-metadata')
All you need to do is insert the following collection and document into your MongoDB database:
(from the mongo shell)
db.smapp_metadata.save({
'document': 'smapp-tweet-collection-metadata',
'tweet_collections': [ 'tweets' ]
})
and the default behavior will work as advertised.
Code and documentation © 2014 New York University. Released under the GPLv2 license.