This repository has been archived by the owner on Mar 5, 2022. It is now read-only.

Googler no results on 3.9 #306

Closed
amitai opened this issue Nov 22, 2019 · 15 comments · Fixed by #307

Comments

@amitai

amitai commented Nov 22, 2019

Output of googler -d:

$ googler hello --debug
[DEBUG] googler version 3.9
[DEBUG] Python version 3.6.8
[DEBUG] Connecting to new host www.google.com
[DEBUG] Fetching URL /search?ie=UTF-8&oe=UTF-8&q=hello&sei=MMvdbA1wEeq7OH1_4X723w
[DEBUG] Cookie: 1P_JAR=2019-11-22-21
[DEBUG] Response body written to '/tmp/googler-response-_sjb5zc5.html'.
No results.
googler (? for help) 

Link to the response body: https://gist.github.com/amitai/c840955133e1938d4369eafdbd1232a7

Details of operating system, Python version used, terminal emulator and shell:
Python 3.6.8, Ubuntu 18.04.3, bash 4.4.20(1)

@jarun
Owner

jarun commented Nov 22, 2019

@zmwangx I've been noticing this today too.

@webctrl

webctrl commented Nov 23, 2019

I'm having the same results.

@zmwangx
Collaborator

zmwangx commented Nov 23, 2019

This is in fact the same problem as #299, and it's getting a bit ridiculous. The markup is pretty damn hard to parse as discussed before.

Again, we wait for maybe 48hrs. If things don't go back to normal by then, we move to a modern UA, and update the parser.

Until then, here's a patch (with modern UA) that works:

diff --git a/googler b/googler
index 460350e..20698c7 100755
--- a/googler
+++ b/googler
@@ -102,7 +102,7 @@ COLORMAP = {k: '\x1b[%sm' % v for k, v in {
     'x': '0', 'X': '1', 'y': '7', 'Y': '7;1',
 }.items()}
 
-USER_AGENT = 'googler/%s (like MSIE)' % _VERSION_
+USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
 
 text_browsers = ['elinks', 'links', 'lynx', 'w3m', 'www-browser']
 
@@ -2192,13 +2192,18 @@ class GoogleParser(object):
                 # Skip smart cards.
                 continue
             try:
-                h3 = div_g.select('h3.r')
-                a = h3.select('a')
-                title = a.text
-                mime = div_g.select('.mime')
-                if mime:
-                    title = mime.text + ' ' + title
-                url = self.unwrap_link(a.attr('href'))
+                h3 = div_g.select('div.r h3')
+                if h3:
+                    title = h3.text
+                    url = self.unwrap_link(h3.parent.attr('href'))
+                else:
+                    h3 = div_g.select('h3.r')
+                    a = h3.select('a')
+                    title = a.text
+                    mime = div_g.select('.mime')
+                    if mime:
+                        title = mime.text + ' ' + title
+                    url = self.unwrap_link(a.attr('href'))
                 matched_keywords = []
                 abstract = ''
                 for childnode in div_g.select('.st').children:
@@ -2233,10 +2238,12 @@ class GoogleParser(object):
         # Search instead for ...
         spell_orig = tree.select("span.spell_orig")
         if spell_orig:
-            self.autocorrected = True
-            self.showing_results_for = next(
+            showing_results_for_link = next(
                 filter(lambda el: el.tag == "a", spell_orig.previous_siblings()), None
-            ).text
+            )
+            if showing_results_for_link:
+                self.autocorrected = True
+                self.showing_results_for = showing_results_for_link.text
 
         # No results found for ...
         # Results for ...:
@@ -2252,14 +2259,14 @@ class GoogleParser(object):
         self.filtered = tree.select('p#ofr') is not None
 
     # Unwraps /url?q=http://...&sa=...
-    # May raise ValueError.
+    # TODO: don't unwrap if URL isn't in this form.
     @staticmethod
     def unwrap_link(link):
         qs = urllib.parse.urlparse(link).query
         try:
             url = urllib.parse.parse_qs(qs)['q'][0]
         except KeyError:
-            raise ValueError(link)
+            return link
         else:
             if "://" in url:
                 return url

If it doesn't work, show me the markup and I'll fix it.
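
For readers who want to try the `unwrap_link` change from the last hunk in isolation, here is a minimal standalone sketch. It copies the lines visible in the diff. The diff cuts off after `return url`, so the final `return None` is an assumption added only to make the sketch complete, not necessarily what googler does.

```python
import urllib.parse


def unwrap_link(link):
    """Unwrap /url?q=http://...&sa=... and pass other links through."""
    qs = urllib.parse.urlparse(link).query
    try:
        url = urllib.parse.parse_qs(qs)['q'][0]
    except KeyError:
        # Patched behaviour: no 'q' parameter, so the link is treated as
        # already unwrapped and returned unchanged (previously ValueError).
        return link
    else:
        if "://" in url:
            return url
        # ASSUMPTION: this branch is not shown in the diff; None is a
        # placeholder for "not an external result link".
        return None


print(unwrap_link('/url?q=http://example.com/&sa=U'))
print(unwrap_link('https://example.com/page'))
```

Note that a direct link that happens to carry its own `q` parameter would still be misread as a wrapped link, which appears to be what the TODO in the patch refers to.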

@jarun
Owner

jarun commented Nov 23, 2019

The patch works fine for me. Is there a way to auto-detect if the results are in markup?

What if we use the FF user agent and this patch? Looks like we are detecting whether the results are in new markup or earlier.

@ajithkumar-natarajan

The patch provided by @zmwangx works for vanilla searches. Can you also please provide the patch for retrieving news (-N argument) results? It gives the same "No results" error.

Thank you.

@zmwangx
Collaborator

zmwangx commented Nov 25, 2019

Problem still not resolved. I'll turn the patch into a PR soonish and we'll probably need to cut a release.


@jarun

> Is there a way to auto-detect if the results are in markup?

> Looks like we are detecting whether the results are in new markup or earlier.

If you're talking about the if h3 conditional: both layouts could appear in a modern UA response.

Yeah, we can possibly maintain compatibility with the older layout we were targeting, but since the older layout appears to be gone, there's no point.
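
To illustrate the "try the newer layout, fall back to the older one" idea behind the `if h3` conditional in the patch, here is a rough sketch. It does not use googler's own parser or its `select` API; it uses the standard library's `xml.etree.ElementTree` on small well-formed snippets as a stand-in, and the function name `extract_title_and_href` and the sample markup are hypothetical. The `.mime` prefix handling from the patch is left out.

```python
import xml.etree.ElementTree as ET


def extract_title_and_href(div_g):
    """Return (title, href) for one result block, or None if neither layout matches."""
    # Newer layout: <div class="r"><a href="..."><h3>Title</h3></a></div>
    for a in div_g.findall(".//div[@class='r']/a"):
        h3 = a.find("h3")
        if h3 is not None:
            return "".join(h3.itertext()), a.get("href")
    # Older layout: <h3 class="r"><a href="...">Title</a></h3>
    a = div_g.find(".//h3[@class='r']/a")
    if a is not None:
        return "".join(a.itertext()), a.get("href")
    return None


new_layout = ET.fromstring(
    '<div class="g"><div class="r">'
    '<a href="https://example.com/"><h3>Example</h3></a>'
    '</div></div>'
)
old_layout = ET.fromstring(
    '<div class="g"><h3 class="r">'
    '<a href="/url?q=https://example.org/&amp;sa=U">Old <b>title</b></a>'
    '</h3></div>'
)
print(extract_title_and_href(new_layout))
print(extract_title_and_href(old_layout))
```

Because the check is made per result block, a single response that mixes both layouts is still handled.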

Note that we used this googler (like MSIE) user agent with the assumption that Google would serve a classic, stable layout to that UA, instead of frequently doing A/B testing on modern browsers. However, that assumption seems broken beyond repair now, so no point in using a non-modern UA now.

> we use the FF user agent

I propose we use a Chrome UA. It is said that FF is more likely to be reCAPTCHA'ed than Chrome (although it's not clear whether that's based on UA detection).
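
As a rough sketch of what sending a Chrome UA looks like outside googler, the snippet below builds a request with the UA string from the patch posted earlier in this thread. It uses `urllib.request` purely for illustration, which is an assumption and not how googler itself connects; `build_request` is a hypothetical helper, and nothing is sent over the network here.

```python
import urllib.parse
import urllib.request

# UA string taken from the patch in this thread.
USER_AGENT = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
)


def build_request(query):
    """Build (but do not send) a search request carrying the Chrome UA."""
    params = urllib.parse.urlencode({'q': query, 'ie': 'UTF-8', 'oe': 'UTF-8'})
    url = 'https://www.google.com/search?' + params
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


req = build_request('hello world')
print(req.full_url)
print(req.get_header('User-agent'))
```

Passing the request to `urllib.request.urlopen` would perform the actual fetch; whether the response then contains the newer or older layout is up to Google.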


@ajithkumar-natarajan I did test my patch with -N and it was working for me, and it still does. Please use --debug and share the markup like OP did.

@jarun
Owner

jarun commented Nov 25, 2019

Please go ahead. The Chrome UA sounds good.

@jarun
Owner

jarun commented Nov 25, 2019

I'll make a release this evening if things are good.

@jarun
Owner

jarun commented Nov 25, 2019

Tracking update: the patch works for me so far.

@amitai
Author

amitai commented Nov 25, 2019

Hi, I know I opened this ticket but I will not have access to my affected workstation until late in the week. Just want to make sure you don't wait on me for testing! :-D

@jarun
Owner

jarun commented Nov 25, 2019

No problem! Looks like it's reproducible globally. Just came across a post on HN saying that Google no longer works in Lynx.

@bpalmer7440

Just FYI, the patch works for me too, running OS X.
Thanks!

@mikeaich

The patch works here on Ubuntu 18.04 as well.

zmwangx added a commit to zmwangx/googler that referenced this issue Nov 26, 2019
Fixes jarun#306, hopefully.

Not refined (even left a TODO), not extensively tested against edge cases.
@zmwangx
Collaborator

zmwangx commented Nov 26, 2019

Turns out I have more pressing matters and didn't have time to refine and test the patch... Instead of delaying the fix further, I just pushed the patch to #307.

I'll refine it and rewrite our currently useless testing system later, but let's have a working release first...

@jarun
Copy link
Owner

jarun commented Nov 26, 2019

I'll make a release today.

Repository owner locked as resolved and limited conversation to collaborators Dec 9, 2019