how to scrape a large body of tweets (e.g. a million)? #261

Open
aliiabbasi opened this issue Mar 5, 2020 · 8 comments

Comments

@aliiabbasi

What exactly are the criteria for the extracted top tweets? Is it the number of likes or the number of retweets?
Also, is it possible to extract a large body of tweets (e.g. a million tweets) using this library?
I tried it for almost 800K tweets, but it neither responded nor raised an error!

@julienbeisel

hey @aliiabbasi

I am currently trying to scrape tweets containing the word "coronavirus" for sentiment analysis, and I can explain my workaround; maybe it will give you some ideas :)

I tried to download every tweet since January, but it didn't work well... 24h wasn't enough. Splitting the request into one request per day is probably the right way to do it, but I don't want to wait approximately 2 days (the full request for one day took 1h!). So what I am doing now is scraping 1000 "top tweets" for each day.

What I'm observing is:

  • When I set the date to January 22, I first get tweets from late at night on January 22, then tweets from earlier that day, back to 00:00.
  • When I set the topTweets argument to "True", I only get tweets with high retweet and favorite counts, and they still come in the same order.

Example:

image

  • When there are no top tweets left to scrape, the remaining results are simply all the tweets the request would find if I didn't specify "topTweets".

image

I'm still trying to find ways to analyze covid-19 related tweets, though. If you find better ways to scrape a lot of tweets, don't hesitate to comment.

@aliiabbasi changed the title from "what are the criteria for top tweets?" to "how to scrape a large body of tweets (e.g. a million)?" on Mar 15, 2020
@anshumanchak

anshumanchak commented May 21, 2020

How are the topTweets criteria calculated? Is it a direct sort by retweets, or a combination of retweets, favourites and replies?
@aliiabbasi

@victorperezc

@anshumanchak The topTweets criteria are determined by Twitter. You can ask Twitter to show you the tweets it considers most relevant; these are normally the ones with high engagement.

@anshumanchak

Thanks @Victorpc98

@AshtonCoop

I'm trying to do similar things. For 1000 tweets, it took me 2 min 35 s. So imagine 1 million tweets...
Two ways that I can think of:

  1. use multiprocessing
  2. use setTopTweets to limit results
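
As a back-of-the-envelope check on the timing above, the sketch below extrapolates 2 min 35 s (155 s) per 1000 tweets to one million tweets. It assumes the rate stays constant and that work splits evenly across processes, which is optimistic because throttling on Twitter's side may break both assumptions. The function name and the worker count are illustrative only and are not part of the library.

```python
def estimated_hours(total_tweets, tweets_per_batch=1000,
                    seconds_per_batch=155, workers=1):
    """Extrapolate scraping time in hours.

    Assumes a constant rate per batch and ideal (linear) scaling
    across workers; both are simplifying assumptions.
    """
    batches = total_tweets / tweets_per_batch
    return batches * seconds_per_batch / workers / 3600


# Single process: 1000 batches * 155 s = 155000 s
print(round(estimated_hours(1_000_000), 1))             # 43.1
# Four workers with ideal scaling (hypothetical)
print(round(estimated_hours(1_000_000, workers=4), 1))  # 10.8
```

So at the observed rate, a million tweets would take roughly 43 hours in a single process, which fits the reports above of requests that appear to hang.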

@SRaina11

Hi @julienbeisel

I'm currently trying to scrape tweets for sentiment analysis. I'm also using setNear("County in Ireland") and setQuerySearch("Covid19" or "covid" or "corona" or ......), but I'm not getting many tweets. Is there a way to scrape more tweets?
Thanks

@julienbeisel

Hi @SRaina11

When I did this project I had the same issue. If I remember correctly, you get more tweets if you query short time periods. For example, instead of requesting the tweets from the last 6 months in one query, you can split it into 6x4 queries, one per week, or into 6x4x7 queries, one per day. Sometimes this didn't work well and it stopped returning results after a few queries; you can try waiting before each query so that the API does not block your requests.

I hope it'll help you!
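
The splitting idea above could be sketched roughly like this. `daily_windows` and `scrape_in_windows` are hypothetical helper names, the 5 second pause is an arbitrary guess, and `fetch` stands for whatever function actually runs one query. The commented-out `fetch` uses `setSince`, `setUntil` and `setMaxTweets` as I recall them from the GetOldTweets3 README, so please check them against your installed version.

```python
import time
from datetime import date, timedelta


def daily_windows(start, end):
    """Yield (since, until) ISO date strings, one day at a time, for [start, end)."""
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        yield day.isoformat(), next_day.isoformat()
        day = next_day


def scrape_in_windows(fetch, windows, pause_seconds=5.0):
    """Call fetch(since, until) for each window, pausing between queries."""
    results = []
    for i, (since, until) in enumerate(windows):
        if i > 0:
            time.sleep(pause_seconds)  # wait so the queries are less likely to be blocked
        results.extend(fetch(since, until))
    return results


# Hypothetical fetch (assumption, untested here):
# import GetOldTweets3 as got
# def fetch(since, until):
#     criteria = (got.manager.TweetCriteria()
#                 .setQuerySearch("coronavirus")
#                 .setSince(since)
#                 .setUntil(until)
#                 .setMaxTweets(1000))
#     return got.manager.TweetManager.getTweets(criteria)

print(list(daily_windows(date(2020, 1, 22), date(2020, 1, 25))))
```

Passing `fetch` in as a parameter keeps the windowing and waiting logic independent of the scraping library, so the same loop works for weekly windows or for a different scraper.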

@SRaina11

Thank you for your response @julienbeisel. Would you know if we can use the OR operator to find tweets matching multiple keywords? For example: tweetCriteria.setQuerySearch("Covid19" or "covid" or "corona")?

Cheers
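
One thing worth checking with the call above: in Python, `"Covid19" or "covid" or "corona"` is evaluated before the function is called and yields only the first non-empty string, so the library would only ever see "Covid19". A minimal sketch of the difference is below. Putting an uppercase `OR` inside a single query string follows Twitter's search syntax; whether the library forwards it unchanged is an assumption to verify.

```python
keywords = ["Covid19", "covid", "corona"]

# Python's `or` returns its first truthy operand, so the other
# keywords are silently dropped before any function call happens.
python_or = "Covid19" or "covid" or "corona"
print(python_or)  # Covid19

# Build ONE query string that contains the OR operator instead.
query = " OR ".join(keywords)
print(query)  # Covid19 OR covid OR corona

# Assumed usage (not verified here):
# tweetCriteria.setQuerySearch(query)
```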
