Code Challenge

Intro

Original description of the challenge placed here

I guess it's a good challenge so I play with that for a bit. I tried different styles of parsing and added solutions for actual responses (initial response is 5 years old I guess), including a few types of carousel and one type of masonry (for the same request Google can return different layouts sometimes).

Yeah, I know "you shouldn't use regexp for parsing html"

I'm agreed with two exceptions:

you need parse something as fast as possible (Nokogiri and alternatives will spent some time to parse large html tree even if you need only a few elements)
regexp working good for custom components w/o nested levels (ok, it's possible to use recursive regexp but that's not so easy)
Looks like string scanner is good and fast for parsing script elements. It will be interesting to compare that with any of javascript syntax tree parsers.

Results

Average results based on parsing of 1000 html files (see /benchmarks/time.rb).

FIRST VARIANT 
(van-gogh-paintings.html)
51 entries
RegexpBased:        2.66
NokogiriBased:      7.74 (nokogiri spents 5sec to parse 1000 html files)
NokolexborBased:    2.95
StringScannerBased: 1.32

SECOND VARIANT (us-presidents.html, tries to find FIRST VARIANT then switch to SECOND)
46 entries
RegexpBased:        3.26
NokogiriBased:      8.26
NokolexborBased:    3.31
StringScannerBased: 1.62

THIRD VARIANT 
(dune-actors.html, tries FIRST, then SECOND, then THIRD)
11 entries
RegexpBased:        1.37
NokogiriBased:      7.73
NokolexborBased:    2.78
StringScannerBased: 1.04

FOURTH VARIANT 
(colorado-cities.html, tries to find FIRST VARIANT then switch to SECOND)
51 entries
RegexpBased:        3.84
NokogiriBased:      9.04
NokolexborBased:    3.73
StringScannerBased: 1.98

MASONRY VARIANT 
(van-gogh-paintings.html, tries to find carousel then switch to Masonry)
46 entries
RegexpBased:        4.76
NokogiriBased:     11.04
NokolexborBased:    4.42
StringScannerBased: 1.97

UNKNOWN VARIANT
(unknown-variant.html, tries to find Carousel, then tries to find Masonry, then fail)
It's useful to check how much time we need to fail so I removed some chars from carousel tag.
For example Nokogiri and Nokolexbor will spent time to build html tree
even without correct variants.

RegexpBased:        2.60
NokogiriBased:      7.29
NokolexborBased:    2.60
StringScannerBased: 0.69

Notes

See ya, Nokogiri.
I tried to not use generated classes. It is hard is some cases and looks like Google have a plan to use them a lot. I guess it's fine to use them if you have a hourly/daily jobs that checking used class has not be refreshed or you have a stats that changes have been never happened in a five years.
String scanner based parsing is fastest but it strictly depends on order of elements/attrs and can produce garbage data more often than other solutions. Also I don't know how it will work if you need to convert full page to JSON result (I imagine this variant can be powered with parallel workers). Complexity and time of creation also are bigger.

Installation

You need to install some gems (nokogiri, nokolexbor, memory_profiler, rspec) with a bundler. I used ruby-3.2.4.

bunle install

Index

/benchmarks - run this scripts to check the average time or memory allocated.
/files - google search result html files, its prettified versions, expected parsing result and screenshots.
/specs - a lot of tests general for each variant of parsing.
/src - ruby code.

Actual vs Original

Van gogh paintings Masonry

The page does not contain custom components like 'g-carousel' and contains only generated classes basically so it's difficult to parse something with the Regular expressions or StringScanner (I'm not master of the recursive regexp unfortunately).

Then current version is not so safe and good.

US presidents

Cards

Link element <a> doesn't have klitem class attribute. It has klitem-tr class attribute and contains <div class='klitem'> child instead.
href attributed placed before aria-label attribute it can be important for code that based on explicit order.

<a
  jscontroller="e8Ezlf"
  jsname="I4kCu"
  CLASS="ct5Ked klitem-tr PZPZlf"
  aria-selected="false"
  aria-setsize="46"
  data-entityid="/m/012gx2"
  data-entityname="Joe Biden"
  HREF="/search?something"
  role="tab"
  title="Joe Biden (2021-)"
  ARIA-LABEL="Joe Biden"
  jsdata="Cls7rd;_;B+9CXg"
  jsaction="rcuQ6b:npT2md"
  data-hveid="CCMQBw"
  data-ved="2ahUKEwj82tmG4pCHAxV3MTQIHQ5xAZYQ-BZ6BAgjEAc"
>

Thumbnails

Each thumbnail blob placed inside own <script> element instead one big <script> in the original example.

<script nonce="aX9jIq1cdIXQinWOtdsHrA">
  window._setImagesSrc...
</script>
<script nonce="aX9jIq1cdIXQinWOtdsHrA">
  (function () {
    var s =
      "data:image/jpeg;base64,...";
    var ii = ["dimg_1"];
    _setImagesSrc(ii, s);
  })();
</script>
...

Images

A half of carousel cards don't contain <img> (for optimization purposes I guess). Ideally we should find a way to fetch it but I will skip this functional for now.

<a
  jscontroller="e8Ezlf"
  jsname="I4kCu"
  class="ct5Ked klitem-tr PZPZlf"
  aria-selected="false"
  aria-setsize="46"
  data-entityid="/m/034rd"
  data-entityname="George Washington"
  href="/search?something"
  id="_JluIZvyvF_fi0PEPjuKFsAk_109"
  role="tab"
  title="George Washington (1789-1797)"
  aria-label="George Washington"
  jsdata="Cls7rd;_;B+9CXg"
  jsaction="rcuQ6b:npT2md"
  data-hveid="CCMQjgE"
  data-ved="2ahUKEwj82tmG4pCHAxV3MTQIHQ5xAZYQ-BZ6BQgjEI4B"
></a>

Extensions

Extension's <div> doesn't have a klmeta class. It has a generated not uniqal class FozYP and placed in the last <div> before link close tag </a>

...
      <div jsname="QUk9kd" class="WGwSK SoZvjb">
        <div class="dAassd">
          <div><div class="FozYP">Joe Biden</div></div>
          <div class="cp7THd"><div class="FozYP">2021-</div></div>
        </div>
      </div>
    </div>
  </div>
</a>

A lot of data placed in a scripts (for lazy loading maybe?). I will skip it for now.

<a
  jscontroller="e8Ezlf"
  jsname="I4kCu"
  class="ct5Ked klitem-tr PZPZlf"
  aria-selected="false"
  aria-setsize="46"
  data-entityid="/m/02yy8"
  data-entityname="Franklin D. Roosevelt"
  href="/search?something"
  id="_JluIZvyvF_fi0PEPjuKFsAk_47"
  role="tab"
  title="Franklin D. Roosevelt (1933-1945)"
  aria-label="Franklin D. Roosevelt"
  jsdata="Cls7rd;_;B+9CXg"
  jsaction="rcuQ6b:npT2md"
  data-hveid="CCMQMQ"
  data-ved="2ahUKEwj82tmG4pCHAxV3MTQIHQ5xAZYQ-BZ6BAgjEDE"
></a>

// see a[id]
(function () {
  window.jsl.dh(
    "_JluIZvyvF_fi0PEPjuKFsAk_47",
    "\x3cdiv jsname\x3d\x22tc98Ke\x22 aria-hidden\x3d\x22true\x22 class\x3d\x22klitem EsIlz\x22 tabindex\x3d\x22-1\x22\x3e\x3cdiv jsname\x3d\x22vgzRDe\x22 class\x3d\x22UnFsfe SoZvjb\x22 aria-hidden\x3d\x22true\x22\x3e\x3cdiv jsname\x3d\x22mfX1bc\x22 class\x3d\x22keP9hb\x22\x3e\x3c/div\x3e\x3cdiv jsname\x3d\x22fmfJsc\x22 class\x3d\x22XAOBve\x22\x3e\x3cg-img class\x3d\x22ZGomKf\x22 style\x3d\x22height:120px;width:120px;background-color:#27211B\x22\x3e\x3cimg data-src\x3d\x22https://encrypted-tbn0.gstatic.com/images?q\x3dtbn:ANd9GcSPJ_0t1v3fLwEbR5-vVgMp1YAdfRLdeIV_gB5UNuXR0plsBUX-KERc\x26amp;s\x3d0\x22 src\x3d\x22data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw\x3d\x3d\x22 jscontroller\x3d\x22TSZEqd\x22 jsaction\x3d\x22load:K1e2pe;BUYwVb:eNYuKb;LsLGHf:KpWyKc;rcuQ6b:npT2md\x22 class\x3d\x22YQ4gaf\x22 height\x3d\x22120\x22 width\x3d\x22120\x22 alt\x3d\x22Franklin D. Roosevelt\x22\x3e\x3c/g-img\x3e\x3c/div\x3e\x3cdiv jsname\x3d\x22QUk9kd\x22 class\x3d\x22WGwSK SoZvjb\x22\x3e\x3cdiv class\x3d\x22dAassd\x22\x3e\x3cdiv\x3e\x3cdiv class\x3d\x22FozYP\x22\x3eFranklin D. \x3c/div\x3e\x3cdiv class\x3d\x22FozYP\x22\x3eRoosevelt\x3c/div\x3e\x3c/div\x3e\x3cdiv class\x3d\x22cp7THd\x22\x3e\x3cdiv class\x3d\x22FozYP\x22\x3e1933-1945\x3c/div\x3e\x3c/div\x3e\x3c/div\x3e\x3c/div\x3e\x3c/div\x3e\x3c/div\x3e"
  );
})();

Fields Naming

Since it is not artworks anymore we should rename this field to neutral or dynamically change the file name based on queries. I will skip this issue for now.

Dune actors

Cards

Link elements don't have aria-label attribute. We can use title attribute instead.

<a
  class="ttwCMe"
  style="width: 92px;"
  title="Christopher Walken"
  href="/search?something"
>

Carousel ended with a link element that not a card.

<div class="VgWAtf">
  <a
    class="LcmtUb oRJe3d"
    jscontroller="PCqCoe"
    data-ti="FilmCast"
    role="link"
    tabindex="0"
    jsaction="rcuQ6b:npT2md;KjsqPd"
    data-ved="2ahUKEwjHqJ-64ZCHAxWmweYEHb3nAHkQ9JYCKAt6BAhJEBk"
  >
    <div>
      <div class="VDgVie btku5b k0Jjg fCrZyc NQYJvc x2GJWb OJeuxf PrjL8c">
        ...
        <div class="QuU3Wb sjVJQd">View all</div>
      </div>
    </div>
  </a>
</div>

Important links placed inside of <div role="listitem"> elements.

<div
  jsname="ibnC6b"
  data-attrid="kc:/film/film:cast"
  role="listitem"
  jscontroller="qlogIf"
  class="PZPZlf FQ7nIf nZWEZc ojFefc Si5xMe"
  jsaction="rcuQ6b:npT2md"
  data-item-card="true"
  data-hveid="CEkQFw"
  data-ved="2ahUKEwjHqJ-64ZCHAxWmweYEHb3nAHkQ_uMBKAp6BAhJEBc"
>
  <a ...

Extensions

Extension's <div> doesn't have a klmeta class. It has a few generated classes and ellip class and placed in the last <div> before link close tag </a>

      <div class="Bo9xMe">
        <div class="oyj2db OSrXXb" aria-level="3" role="heading">Dave Bautista</div>
        <div class="wwLdc"><div class="MVXjze ellip buTsJf">Rabban</div></div>
      </div>
    </div>
  </div>
</a>

Images

images don't placed inside <g-img> elements
a half of images don't have id attribute for thumbnails search (We can download images through data-src to fetch it but I will skip this functional for now).

Fields Naming

Since it is not artworks anymore we should rename this field to neutral or dynamically change the file name based on queries. I will skip this issue for now.

Colorado Cities

Cards

Unfortunately, cards contains city name info inside the same FozYP element that we used to find a meta. Then we can't simple use this class or last div.

There is a two solutions I see:

we can check that name doesn't include meta
or we can use a['title'] - a['aria-label']

        </g-img>
      </div>
      <div jsname="QUk9kd" class="WGwSK ghJsNe">
        <div class="dAassd">
          <div>
            <div class="FozYP">Colorado</div>
            <div class="FozYP">Springs</div>
          </div>
        </div>
      </div>
    </div>
  </div>
</a>

TODO

What exactly will happens if carousel is found, a.klmeta link is found but order of attributes was changed or aria-label attribute is missing?
- we need add more specs for each component and false cases (see commented specs)
- there is a two exceptions should exists
  - ElementNotFound means we need to try masonry
  - ElemenInvalid means we need to stop with the error (for example the carousel card format is invalid)
Both layouts - carousel and masonry require the same thumbnails map. So if this part is missing we need to stop execution with the error instead trying next associated layout.
'artworks' field name is not correct for all cases
the quotes are matter, so single instead double will destroy the city
we need to check more masonry examples
there is a good chance to play with the RBS files
constants overloading for carousel cards can be refactored
parallel parsing of differnet parts/styles instead sequential one. For example one process can try to find carousel and other can try to find masonry.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
files		files
specs		specs
src		src
.gitattributes		.gitattributes
.gitignore		.gitignore
.ruby-version		.ruby-version
CHALLENGE.md		CHALLENGE.md
Gemfile		Gemfile
Gemfile.lock		Gemfile.lock
LICENSE		LICENSE
README.md		README.md
dummy_knowledge_graph.rb		dummy_knowledge_graph.rb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Code Challenge

Intro

Results

Notes

Installation

Index

Actual vs Original

Van gogh paintings Masonry

US presidents

Cards

Thumbnails

Images

Extensions

Fields Naming

Dune actors

Cards

Extensions

Images

Fields Naming

Colorado Cities

Cards

TODO

About

Releases

Packages

Languages

License

Krugloff/serpapi-code-challenge

Folders and files

Latest commit

History

Repository files navigation

Code Challenge

Intro

Results

Notes

Installation

Index

Actual vs Original

Van gogh paintings Masonry

US presidents

Cards

Thumbnails

Images

Extensions

Fields Naming

Dune actors

Cards

Extensions

Images

Fields Naming

Colorado Cities

Cards

TODO

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages