VoterFraud2020 is a multi-modal Twitter dataset with 7.6M tweets and 25.6M retweets from 2.6M users related to voter fraud claims.
- The Associated Paper on arXiv
- voterfraud2020.io, interactive web application for exploring the dataset
- Figshare dataset publication with digital object identifier (DOI) 10.6084/m9.figshare.13571084
- github/sTechLab/VoterFraud2020-analysis, the code behind the data analysis
- github/vegetable68/streaming-candidates-2020, the Twitter streaming code
The tweets and user objects in the dataset can be hydrated using Twarc or Hydrator.
Note: tweets from suspended users will not be available for hydration. We believe it's in the public interest to make these tweets available. We will share those tweets with published academic researchers; email us for details.
Hydrating using Hydrator (GUI)
Navigate to the Hydrator GitHub repository and follow the installation instructions in its README. To use the GUI, first extract the tweet IDs from the CSVs in this repository into a tweet ID file.
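The extraction step can be sketched as follows. This is a minimal example, not part of the repository: the function name `extract_tweet_ids` and the output file name are hypothetical, and it assumes each CSV has a `tweet_id` column as described in the tweets table. IDs are kept as strings so that large values are not altered by numeric conversion.

```python
import csv


def extract_tweet_ids(csv_paths, out_path, id_column="tweet_id"):
    """Write one tweet ID per line to out_path and return how many were written."""
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for csv_path in csv_paths:
            with open(csv_path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    tweet_id = row[id_column].strip()
                    # Skip blanks and IDs already written from an earlier chunk.
                    if tweet_id and tweet_id not in seen:
                        seen.add(tweet_id)
                        out.write(tweet_id + "\n")
    return len(seen)
```

For example, `extract_tweet_ids(["chunk_1.csv", "chunk_2.csv"], "tweet_ids.txt")` would produce a file that can be loaded into the GUI; the chunk file names here are placeholders.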
Hydrating using Twarc (CLI, python 3)
First, install Twarc and tqdm:
pip3 install twarc tqdm
Configure Twarc with your Twitter API tokens (you must first apply for a Twitter developer account to obtain them). If you cannot configure Twarc through the CLI, you can set the API tokens in the script instead.
twarc configure
Run the script. The hydrated tweets will be saved as a compressed JSONL file in the same folder as the tweet ID file.
python3 hydrate.py
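For readers who want to see roughly what such a hydration step involves, here is a minimal sketch. It is not the contents of `hydrate.py`: the function names are hypothetical, and it assumes the v1-style `Twarc` client (whose `hydrate` method takes an iterable of tweet IDs and yields tweet dictionaries), credentials already saved by `twarc configure`, and working API access.

```python
import gzip
import json


def write_jsonl_gz(tweets, out_path):
    """Write an iterable of tweet dicts to a gzip-compressed JSONL file."""
    count = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for tweet in tweets:
            out.write(json.dumps(tweet) + "\n")
            count += 1
    return count


def hydrate_id_file(id_path, out_path):
    """Hydrate the IDs listed in id_path (one per line) into out_path."""
    # Imported here so write_jsonl_gz is usable without twarc installed.
    from twarc import Twarc

    client = Twarc()  # assumption: credentials were saved by `twarc configure`
    with open(id_path, encoding="utf-8") as f:
        ids = (line.strip() for line in f if line.strip())
        # Tweets that are no longer available are simply absent from the output.
        return write_jsonl_gz(client.hydrate(ids), out_path)
```

Streaming the tweets straight into the compressed file keeps memory use flat, which matters when hydrating millions of IDs.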
This guide was inspired by the #Election2020 Dataset Repository.
The columns in the data are described below. See the paper for more details, or explore the project website for additional descriptive statistics.
Total count: 7,603,103
Original tweets: 3,781,524
Quote tweets: 3,821,579
The tweets are split into daily chunks.
Data Column | Description |
---|---|
tweet_id | The ID of the tweet. |
user_community | The community of the tweet's author in the retweet graph, which is found using the Infomap community detection algorithm with default parameters. Values: 0, 1, 2, 3, 4, null |
user_active_status | The active status of the tweet's author (as of January 10th). Values: 'active', 'suspended', 'deleted' (not found) |
retweet_count_metadata | The number of retweets the tweet has received according to the tweet object's metadata (as of December 16th). |
quote_count_metadata | The number of quotes the tweet has received according to the tweet object's metadata (as of December 16th). |
retweet_count_by_community_X | The number of retweets the tweet received from users in community X (X=0-4). |
quote_count_by_community_X | The number of quotes the tweet received from users in community X (X=0-4). |
retweet_count_by_suspended_users | The number of retweets the tweet received from suspended users. |
quote_count_by_suspended_users | The number of quotes the tweet received from suspended users. |
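
As an illustration of how the per-community columns can be used, the sketch below sums retweets by community over the rows of one daily chunk. It assumes the `X` placeholder expands to the suffixes 0 through 4 (for example `retweet_count_by_community_0`) and treats empty cells as zero; the function names are hypothetical.

```python
import csv

COMMUNITIES = range(5)  # communities 0-4, per the column description


def read_rows(csv_path):
    """Yield each CSV row as a dict keyed by column name."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)


def retweets_by_community(rows):
    """Sum retweet_count_by_community_X over tweet rows, for X = 0..4."""
    totals = {c: 0 for c in COMMUNITIES}
    for row in rows:
        for c in COMMUNITIES:
            # Assumption: missing or empty cells mean "no retweets".
            value = row.get(f"retweet_count_by_community_{c}") or 0
            totals[c] += int(float(value))
    return totals
```

Calling `retweets_by_community(read_rows(path))` on a daily chunk returns a dictionary keyed by community number.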
Total count: 25,566,698
The retweets are split into daily chunks.
Data Column | Description |
---|---|
retweeted_id | The ID of the retweeted tweet. |
user_id | The ID of the user that retweeted. |
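
Because each row of this table is one retweet, counting rows per `retweeted_id` gives the number of retweets observed in the dataset for each tweet. A minimal sketch (the function name is hypothetical, and rows are assumed to be dictionaries such as those produced by `csv.DictReader`):

```python
from collections import Counter


def top_retweeted(rows, n=10):
    """Return the n most retweeted tweet IDs with their in-dataset retweet counts."""
    return Counter(row["retweeted_id"] for row in rows).most_common(n)
```

These counts cover only the retweets captured in the dataset, so they may differ from `retweet_count_metadata`.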
Total count: 2,559,018
The users are split into 5 chunks, sorted by user id (ascending).
Data Column | Description |
---|---|
user_id | The ID of the user. |
user_community | The community of the user in the retweet graph, which is found using the Infomap community detection algorithm with default parameters. Values: 0, 1, 2, 3, 4, null |
user_active_status | The active status of the user (as of January 10th). Values: 'active', 'suspended', 'deleted' (not found) |
closeness_centrality_detractor_cluster | Normalized closeness centrality of the top 10,000 users in the detractor cluster (computed using Networkit). |
closeness_centrality_promoter_cluster | Normalized closeness centrality of the top 10,000 users in the promoter cluster (computed using Networkit). |
retweet_count_by_community_X | Aggregated count of the retweets the user received from other users in community X (X=0-4). |
quote_count_by_community_X | Aggregated count of the quotes the user received from other users in community X (X=0-4). |
retweet_count_by_suspended_users | Aggregated count of the retweets the user received from suspended users. |
quote_count_by_suspended_users | Aggregated count of the quotes the user received from suspended users. |
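
The users table can be joined to the retweets table on `user_id` to break retweets down by the retweeter's community. The sketch below is one way to do that under stated assumptions: rows are dictionaries keyed by column name, a missing community is serialized as an empty cell, `null`, or `nan`, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def community_of(users):
    """Map user_id -> community number, or None when no community was assigned."""
    mapping = {}
    for user in users:
        label = (user.get("user_community") or "").strip().lower()
        # Assumption: these three spellings all mean "no community".
        missing = label in ("", "null", "nan")
        mapping[user["user_id"]] = None if missing else int(float(label))
    return mapping


def retweets_per_community(retweets, user_community):
    """Count, for each retweeted tweet, the retweets coming from each community."""
    counts = defaultdict(Counter)
    for rt in retweets:
        # Retweeters absent from the mapping are grouped under None.
        counts[rt["retweeted_id"]][user_community.get(rt["user_id"])] += 1
    return counts
```

The result maps each tweet ID to a `Counter` keyed by community, which can be compared against the `retweet_count_by_community_X` columns.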
Total count: 167,696
The image perceptual hash values were calculated using the ImageHash python package.
Data Column | Description |
---|---|
unique_id | Unique identifier of the image. |
tweet_id | The ID of the tweet that contained the image. |
a_hash | The Average hash of the image. |
p_hash | The Perceptual hash of the image. |
w_hash | The Wavelet hash of the image. |
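
Perceptual hashes are typically compared by Hamming distance: the fewer bits that differ, the more visually similar two images are likely to be. The sketch below computes that distance directly from the stored values, assuming the hash columns contain hexadecimal strings (the format ImageHash produces when a hash is converted to a string). The function name is hypothetical.

```python
def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two equal-length hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")
```

A distance of 0 means the two hashes are identical; the threshold for treating images as near-duplicates is a judgment call and is not specified here.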
Data Column | Description |
---|---|
url | The URL. |
domain | The domain of the URL. |
tweet_count | Aggregated count of the tweets that contained the URL. |
retweet_count_metadata | Aggregated count of the retweets that tweets containing the URL received according to the tweet object's metadata (as of December 16th). |
quote_count_metadata | Aggregated count of the quotes that tweets containing the URL received according to the tweet object's metadata (as of December 16th). |
tweet_count_by_community_X | Aggregated count of tweets that contained the URL by users in community X (X=0-4). |
retweet_count_by_community_X | Aggregated count of the retweets that tweets containing the URL received from users in community X (X=0-4). |
quote_count_by_community_X | Aggregated count of the quotes that tweets containing the URL received from users in community X (X=0-4). |
tweet_count_by_suspended_users | Aggregated count of tweets that contained the URL by suspended users. |
retweet_count_by_suspended_users | Aggregated count of the retweets that tweets containing the URL received from suspended users. |
quote_count_by_suspended_users | Aggregated count of the quotes that tweets containing the URL received from suspended users. |
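
Since several URLs can share a domain, the per-URL counts can be rolled up to the domain level. The sketch below computes, for each domain, the share of its tweets that were posted by suspended users. It assumes rows are dictionaries keyed by column name and that empty cells mean zero; the function names are hypothetical.

```python
from collections import defaultdict


def _to_int(value):
    """Parse a count cell, treating missing or empty values as zero."""
    return int(float(value)) if value not in (None, "") else 0


def suspended_share_by_domain(rows):
    """Return {domain: share of its tweets posted by suspended users}."""
    tweets = defaultdict(int)
    suspended = defaultdict(int)
    for row in rows:
        domain = row["domain"]
        tweets[domain] += _to_int(row.get("tweet_count"))
        suspended[domain] += _to_int(row.get("tweet_count_by_suspended_users"))
    # Domains with no tweets are left out to avoid dividing by zero.
    return {d: suspended[d] / n for d, n in tweets.items() if n > 0}
```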
Data Column | Description |
---|---|
video_id | ID of the YouTube video. |
video_title | Title of the video (as of January 1st). |
channel_id | Channel ID of the channel where the video was posted. |
channel_title | Channel title of the channel where the video was posted (as of January 1st). |
published_at | Timestamp of when the video was published. |
tweet_count | Aggregated count of the tweets that contained the video. |
retweet_count_metadata | Aggregated count of the retweets that tweets containing the video received according to the tweet object's metadata (as of December 16th). |
quote_count_metadata | Aggregated count of the quotes that tweets containing the video received according to the tweet object's metadata (as of December 16th). |
tweet_count_by_community_X | Aggregated count of tweets that contained the video by users in community X (X=0-4). |
retweet_count_by_community_X | Aggregated count of the retweets that tweets containing the video received from users in community X (X=0-4). |
quote_count_by_community_X | Aggregated count of the quotes that tweets containing the video received from users in community X (X=0-4). |
tweet_count_by_suspended_users | Aggregated count of tweets that contained the video by suspended users. |
retweet_count_by_suspended_users | Aggregated count of the retweets that tweets containing the video received from suspended users. |
quote_count_by_suspended_users | Aggregated count of the quotes that tweets containing the video received from suspended users. |