
404 HTTP status code is not handled or not allowed #92

Closed
Sebastokratos42 opened this issue Sep 18, 2020 · 19 comments

Comments

@Sebastokratos42

Hello!

I used TweetScraper without any problems yesterday, but today the following issue keeps coming up:

2020-09-18 08:59:36 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://twitter.com/i/search/timeline?l=&f=tweets&q=%23dfl%20since%3A2020-09-16&src=typed&max_position=>: HTTP status code is not handled or not allowed

I already changed my IP address via VPN and changed the user agent, but the issue remains. How can I solve this problem? Did Twitter change the search URL? Opening the URL "https://twitter.com/i/search/timeline?l=&f=tweets&q=%23dfl%20since%3A2020-09-16&src=typed&max_position" directly also leads to a 404 response. Thanks for your help!

@Buaasinong

me too

@imfht

imfht commented Sep 18, 2020

me too.

@yangyangdotcom

same here

@adriprmn

same here.

@adriprmn

any luck?

@kujbika

kujbika commented Sep 20, 2020

I also face this issue.

@Spaskich

When trying to access https://twitter.com/search-home, it redirects to https://twitter.com/explore. I think Twitter has removed this functionality altogether.

@ASIAMI

ASIAMI commented Sep 21, 2020

There is now https://twitter.com/search-advanced?lang=en-GB. I tried to change the code, but I am new to this, so...
I got the error ValueError: Expecting value: line 1 column 1 (char 0) - it comes from the decode line. I think my items are null, so the XPath probably needs to be adapted. After changing the URL, when I inspect the response locally I do get results, so the request itself works. What doesn't work is the items.

@irwanOyong

Twitter changed things a little bit, so we need to alter the URL in the TweetScraper class to self.url = "https://twitter.com/search/?lang={}".format(lang)

But unfortunately there is still the error @ASIAMI mentioned above:

File "/home/airone/external-git/TweetScraper/TweetScraper/spiders/TweetCrawler.py", line 44, in parse_page
    data = json.loads(response.body.decode("utf-8"))
File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

meaning that they might have also changed the structure a little bit. Any workaround?
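
A minimal sketch of a more defensive parse_page, assuming only what the traceback above shows (the rest of the spider is not reproduced here, so treat names and structure as illustrative):

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_page(self, response):
    # Decode the body as before, but don't assume it is valid JSON:
    # if Twitter now returns an HTML page (or an empty body), log what
    # actually came back instead of dying with a JSONDecodeError.
    body = response.body.decode("utf-8")
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        logger.warning("Non-JSON response (HTTP %s), first bytes: %.200s",
                       response.status, body)
        return
    # ... continue with the original handling of `data` here ...
```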

@ASIAMI

ASIAMI commented Sep 22, 2020

Working on it, but I still have the same error. The XPath must be changed - I think it is easy to fix, but it needs someone with experience in it.

@shiwenThu

shiwenThu commented Sep 23, 2020 via email

@shiwenThu

Working on it, but I still have the same error. The XPath must be changed - I think it is easy to fix, but it needs someone with experience in it.

I have some experience with XPath, but I am not familiar with how the URL should be changed.
I have some time to work on it tomorrow. Can you share your progress and problems with me?

@shiwenThu

Twitter changed things a little bit, so we need to alter the URL in the TweetScraper class to self.url = "https://twitter.com/search/?lang={}".format(lang)

But unfortunately there is still the error @ASIAMI mentioned above:
File "/home/airone/external-git/TweetScraper/TweetScraper/spiders/TweetCrawler.py", line 44, in parse_page
    data = json.loads(response.body.decode("utf-8"))
File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
meaning that they might have also changed the structure a little bit. Any workaround?

I agree that the URL should be changed, but I do not think "https://twitter.com/search/?lang={}".format(lang) is the right answer, although it is the URL displayed when users search for tweets in a browser. The right URL should respond with data containing 'min_position'.
In fact, I am curious how @jonbakerfish discovered the old URL, which may help us deal with the current situation.

@jonbakerfish
Owner

Hi guys, the way I found out the URL for searching was by using Chrome's DevTools to monitor the network activity.

Twitter changed the URL for searching and the format of the returned data, so you need to modify the code accordingly. Here I give you an example, and PRs are welcome.

When you search (e.g., for abc), you will find that a URL like this is issued:

https://api.twitter.com/2/search/adaptive.json?
include_profile_interstitial_type=1
&include_blocking=1
&include_blocked_by=1
&include_followed_by=1
&include_want_retweets=1
&include_mute_edge=1
&include_can_dm=1
&include_can_media_tag=1
&skip_status=1
&cards_platform=Web-12
&include_cards=1
&include_ext_alt_text=true
&include_quote_count=true
&include_reply_count=1
&tweet_mode=extended
&include_entities=true
&include_user_entities=true
&include_ext_media_color=true
&include_ext_media_availability=true
&send_error_codes=true
&simple_quoted_tweet=true
&q=abc
&count=20
&query_source=typed_query
&pc=1
&spelling_corrections=1
&ext=mediaStats%2ChighlightedLabel

When you scroll down to the bottom of the page, the same URL is issued again, but with a cursor this time, which is used for loading more tweets from the servers:

https://api.twitter.com/2/search/adaptive.json?
include_profile_interstitial_type=1
&include_blocking=1
&include_blocked_by=1
&include_followed_by=1
&include_want_retweets=1
&include_mute_edge=1
&include_can_dm=1
&include_can_media_tag=1
&skip_status=1
&cards_platform=Web-12
&include_cards=1
&include_ext_alt_text=true
&include_quote_count=true
&include_reply_count=1
&tweet_mode=extended
&include_entities=true
&include_user_entities=true
&include_ext_media_color=true
&include_ext_media_availability=true
&send_error_codes=true
&simple_quoted_tweet=true
&q=abc
&count=20
&query_source=typed_query
&cursor=scroll%3AthGAVUV0VFVBYBFoDA5vOhzpWqJBIY1AISY8LrAAAB9D-AYk3S8an8AAAAKBIqQzEPVYAAEioifQPWoAISKjCkGxfQABIqRohJVgACEipDLymXcAISKi4vkpfQAxIp9qV-V9ADEioDygCXsAASKi_YX1ewABIqQ24Yl3AAEinrgDJXYAASKgwYq9eQAhIp-fq_FpAHEiomnoUWoAISKiYu2dawAhIqQazWl4ABEioVvr6WsAISKgbowZdwABIqRmtB12AFEio2DxKXcAYSKj_m_ReAARIqP3OwV4ABEiozbLXXgAASKkSyQZfQARIqRLolVrAAEioTWJoXkAISKggPHNaQBhIqFdDFlpAAEioRojMXcAgSKjdOLpawBhIqAajZV7ADEinyOp2XsAASKeST6xYAAxIqRLVBl5ABEipGHaaWoAASKkPnk9eQARIqEoqcF4ACEipGeS3XgAASKh1fvhawARIqRd1117ABJQAVACUAERW4gnoVgIl6GARVU0VSFQAVABVQFQIVAAA%3D
&pc=1
&spelling_corrections=1
&ext=mediaStats%2ChighlightedLabel
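
For illustration, here is a rough sketch of how that query string could be assembled in Python. The function and constant names are my own, and only a subset of the flags above is included; the full list can be copied from the requests DevTools shows:

```python
from urllib.parse import urlencode

SEARCH_API = "https://api.twitter.com/2/search/adaptive.json"


def build_search_url(query, count=20, cursor=None):
    """Reproduce (a subset of) the adaptive search URL captured above."""
    params = {
        "tweet_mode": "extended",
        "include_entities": "true",
        "include_quote_count": "true",
        "include_reply_count": "1",
        "send_error_codes": "true",
        "simple_quoted_tweet": "true",
        "q": query,
        "count": count,
        "query_source": "typed_query",
        "pc": "1",
        "spelling_corrections": "1",
        "ext": "mediaStats,highlightedLabel",
    }
    if cursor is not None:
        # The second request above is identical except for this cursor,
        # which asks the server for the next batch of tweets.
        params["cursor"] = cursor
    return SEARCH_API + "?" + urlencode(params)


# e.g. build_search_url("abc") matches the shape of the first request above.
```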

The response to this URL looks like this; in it you can find all the tweets and the cursor for the next request:

{globalObjects: {tweets: {,…},…}, timeline: {id: "search-6714704651894409473",…}}
globalObjects: {tweets: {,…},…}
broadcasts: {}
cards: {}
lists: {}
media: {}
moments: {}
places: {}
topics: {}
tweets: {,…}
    1308836102560768000: {created_at: "Wed Sep 23 18:30:18 +0000 2020", id: 1308836102560768000, id_str: "1308836102560768000",…}
    1308843500293828608: {created_at: "Wed Sep 23 18:59:42 +0000 2020", id: 1308843500293828600, id_str: "1308843500293828608",…}
    1308848357377560579: {created_at: "Wed Sep 23 19:19:00 +0000 2020", id: 1308848357377560600, id_str: "1308848357377560579",…}
    1308852022070906887: {created_at: "Wed Sep 23 19:33:34 +0000 2020", id: 1308852022070907000, id_str: "1308852022070906887",…}
    1308862807832768512: {created_at: "Wed Sep 23 20:16:25 +0000 2020", id: 1308862807832768500, id_str: "1308862807832768512",…}
    1308866238454657024: {created_at: "Wed Sep 23 20:30:03 +0000 2020", id: 1308866238454657000, id_str: "1308866238454657024",…}
    1308877845746331654: {created_at: "Wed Sep 23 21:16:10 +0000 2020", id: 1308877845746331600, id_str: "1308877845746331654",…}
    1308878030044098568: {created_at: "Wed Sep 23 21:16:54 +0000 2020", id: 1308878030044098600, id_str: "1308878030044098568",…}
    1308879028238123010: {created_at: "Wed Sep 23 21:20:52 +0000 2020", id: 1308879028238123000, id_str: "1308879028238123010",…}
    1308882628116910080: {created_at: "Wed Sep 23 21:35:11 +0000 2020", id: 1308882628116910000, id_str: "1308882628116910080",…}
    1308896562035204098: {created_at: "Wed Sep 23 22:30:33 +0000 2020", id: 1308896562035204000, id_str: "1308896562035204098",…}
    1308911248063574016: {created_at: "Wed Sep 23 23:28:54 +0000 2020", id: 1308911248063574000, id_str: "1308911248063574016",…}
    1308928407816863745: {created_at: "Thu Sep 24 00:37:05 +0000 2020", id: 1308928407816863700, id_str: "1308928407816863745",…}
    1308928903025754113: {created_at: "Thu Sep 24 00:39:03 +0000 2020", id: 1308928903025754000, id_str: "1308928903025754113",…}
    1308930852294983681: {created_at: "Thu Sep 24 00:46:48 +0000 2020", id: 1308930852294983700, id_str: "1308930852294983681",…}
    1308935459171708929: {created_at: "Thu Sep 24 01:05:06 +0000 2020", id: 1308935459171709000, id_str: "1308935459171708929",…}
    1308936068184629253: {created_at: "Thu Sep 24 01:07:32 +0000 2020", id: 1308936068184629200, id_str: "1308936068184629253",…}
    1308937253486579712: {created_at: "Thu Sep 24 01:12:14 +0000 2020", id: 1308937253486579700, id_str: "1308937253486579712",…}
    1308938288925937664: {created_at: "Thu Sep 24 01:16:21 +0000 2020", id: 1308938288925937700, id_str: "1308938288925937664",…}
    1308938422069977089: {created_at: "Thu Sep 24 01:16:53 +0000 2020", id: 1308938422069977000, id_str: "1308938422069977089",…}
users: {3135241: {id: 3135241, id_str: "3135241", name: "RedState", screen_name: "RedState",…},…}
timeline: {id: "search-6714704651894409473",…}
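
And a rough sketch of a Scrapy callback that could consume a response shaped like this: it pulls the tweets out of globalObjects.tweets and then looks for the next cursor somewhere inside timeline. The preview above only shows created_at/id_str per tweet and does not show exactly where the cursor lives, so the other field names and the cursor lookup are assumptions:

```python
import json

import scrapy
from w3lib.url import add_or_replace_parameter  # w3lib ships with Scrapy


def parse_search(self, response):
    """Meant as a method on the spider, registered as the search callback."""
    data = json.loads(response.text)

    # Every tweet on this page sits under globalObjects.tweets, keyed by id.
    # Fields other than created_at / id_str (e.g. full_text, which comes
    # with tweet_mode=extended) are assumptions, not shown in the preview.
    tweets = data["globalObjects"]["tweets"]
    for tweet_id, tweet in tweets.items():
        yield {
            "id": tweet["id_str"],
            "created_at": tweet["created_at"],
            "text": tweet.get("full_text"),
        }

    # The preview does not show where the next cursor lives inside
    # `timeline`, so walk the structure and take the first cursor value
    # found -- an assumption, not the confirmed layout.
    cursor = _find_cursor(data.get("timeline", {}))
    if cursor and tweets:
        yield scrapy.Request(
            add_or_replace_parameter(response.url, "cursor", cursor),
            callback=self.parse_search,
        )


def _find_cursor(node):
    """Recursively look for a {'cursor': {'value': ...}} shaped entry."""
    if isinstance(node, dict):
        cursor = node.get("cursor")
        if isinstance(cursor, dict) and "value" in cursor:
            return cursor["value"]
        for value in node.values():
            found = _find_cursor(value)
            if found:
                return found
    elif isinstance(node, list):
        for item in node:
            found = _find_cursor(item)
            if found:
                return found
    return None
```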

I'm currently working on some other projects and may update the code later. Hope this can help and PRs are welcome.

@ASIAMI

ASIAMI commented Sep 24, 2020

OK, with this new URL I got my response, but should I use bs4, since the response is not JSON? According to the information from the Scrapy docs:

Handling different response formats

Once you have a response with the desired data, how you extract the desired data from it depends on the type of response:

If the response is HTML or XML, use selectors as usual.

If the response is JSON, use json.loads() to load the desired data from response.text.
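
Following that documentation, a small sketch of a callback that branches on the response type instead of assuming JSON. Checking the Content-Type header is my assumption; the thread above does not show what Twitter actually sends:

```python
import json


def parse(self, response):
    """Meant as a spider callback; branch on the response type."""
    content_type = response.headers.get("Content-Type", b"").decode()
    if "application/json" in content_type:
        # JSON response: load it directly, as the Scrapy docs suggest.
        data = json.loads(response.text)
        # ... handle `data` ...
    else:
        # HTML/XML response: fall back to selectors.
        for title in response.css("title::text").getall():
            self.logger.info("Got an HTML page titled %r", title)
```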

@irwanOyong

Has anyone finished the update?

I tried it myself but haven't managed a successful run yet.

@ASIAMI

ASIAMI commented Sep 26, 2020

Not yet. I've tried with bs4 and got something, but I'm still working on it.

@ASIAMI

ASIAMI commented Sep 26, 2020

@jonbakerfish Can you help me? I can see everything you wrote about, but the api.twitter.com URL is not allowed. How do I get at this data?

@AndyAsare

Same problem here!
