fix: return empty noindex webpage when crawlers hit specific pages #8744

Merged (4 commits) on Jul 28, 2023

Conversation

raphael0202 (Contributor) commented Jul 27, 2023

An analysis of nginx logs showed that 6% of our traffic came from Bing bot, and that half of these queries were "facet" queries, which involve aggregate MongoDB queries and consume a lot of resources.
See the discussion at https://openfoodfacts.slack.com/archives/C1FPYCWM7/p1690454042958259 for more context.

As a result, we decided to prevent known crawlers from crawling nested facet pages (2 facets). This should drastically limit the number of crawlable pages and reduce server load (especially on the DB server).

I checked locally with a custom User-Agent: nested facet pages are blocked as expected.

alexgarel (Member) left a comment:

Perfect, but could you add a test?

When a crawler hits nested facets (e.g.
/category/popcorn-with-caramel/data-quality-error/nutrition-value-total-over-105),
we return a blank HTML page with a noindex directive to prevent the
crawler from overloading our servers.
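
For readers following along, the behaviour described in this commit message can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code merged in this PR: the crawler patterns, the `tagtype2` key for the second facet, and all function names are hypothetical. Only `is_crawl_bot` and `tagtype` appear as request keys in the diff.

```perl
use strict;
use warnings;

# Hypothetical crawler User-Agent patterns; the project's real list is not shown in this PR.
my @CRAWLER_PATTERNS = (qr/bingbot/i, qr/googlebot/i);

# Return 1 if the User-Agent matches a known crawler pattern, 0 otherwise.
sub is_crawl_bot {
    my ($user_agent) = @_;
    return 0 unless defined $user_agent;
    foreach my $pattern (@CRAWLER_PATTERNS) {
        return 1 if $user_agent =~ $pattern;
    }
    return 0;
}

# Return 1 for a nested facet request (two facets), 0 otherwise.
# Assumption: the second facet is stored under a `tagtype2` key.
sub is_nested_facet {
    my ($request_ref) = @_;
    return (defined($request_ref->{tagtype}) && defined($request_ref->{tagtype2})) ? 1 : 0;
}

# Blank HTML page whose only content is a robots noindex directive.
sub noindex_page {
    return qq{<!DOCTYPE html>\n<html><head><meta name="robots" content="noindex"></head><body></body></html>\n};
}

# Return the noindex page for crawlers on nested facets, undef for everything else.
sub crawler_response {
    my ($request_ref) = @_;
    if ((($request_ref->{is_crawl_bot} // 0) == 1) and is_nested_facet($request_ref)) {
        return noindex_page();
    }
    return;
}
```

A caller would set `is_crawl_bot` once from the User-Agent header, then serve the normal page whenever `crawler_response` returns undef.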
@raphael0202 raphael0202 force-pushed the add-no-index branch 2 times, most recently from a88949f to c1f3cdc Compare July 28, 2023 12:22

codecov-commenter commented Jul 28, 2023

Codecov Report

Merging #8744 (0e82a6f) into main (b128504) will decrease coverage by 0.06%.
Report is 4 commits behind head on main.
The diff coverage is 3.84%.

@@            Coverage Diff             @@
##             main    #8744      +/-   ##
==========================================
- Coverage   48.78%   48.73%   -0.06%     
==========================================
  Files         117      117              
  Lines       21882    21908      +26     
  Branches     4869     4872       +3     
==========================================
+ Hits        10676    10677       +1     
- Misses       9903     9927      +24     
- Partials     1303     1304       +1     
Files Changed                   Coverage Δ
lib/ProductOpener/Display.pm    9.94% <0.00%> (-0.05%) ⬇️
tests/unit/routing.t            91.66% <ø> (ø)
lib/ProductOpener/Routing.pm    29.68% <16.66%> (-0.43%) ⬇️

Crawling bots can't visit all pages and crawl OFF continuously.
We want to limit crawlers to interesting pages, so we return a noindex
HTML page on most facet pages (except the most interesting ones, such as
brand, category, ...)
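
The rule in this commit message amounts to an allow-list check on the facet type. Here is a rough sketch, assuming an allow-list keyed by facet type. The hash name, the function name, and the exact key spellings (`brand`, `category`) are illustrative assumptions; the full list of exempted facets is not given on this page.

```perl
use strict;
use warnings;

# Illustrative allow-list: only the two facet types named in the commit message.
# The key names are assumptions; the full list is not given in this PR page.
my %INDEXABLE_FACETS = (brand => 1, category => 1);

# Return 1 if a crawler should get the noindex page for this request, 0 otherwise.
sub crawler_gets_noindex {
    my ($request_ref) = @_;
    # Non-crawlers always get the real page.
    return 0 unless ($request_ref->{is_crawl_bot} // 0) == 1;
    # Requests without a tag type are not facet pages.
    return 0 unless defined $request_ref->{tagtype};
    # Facet pages are blocked unless their type is on the allow-list.
    return (exists $INDEXABLE_FACETS{$request_ref->{tagtype}}) ? 0 : 1;
}
```

An allow-list keeps the default restrictive: a newly added facet type is hidden from crawlers until someone decides it is worth indexing.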

sonarcloud bot commented Jul 28, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

No Coverage information
No Duplication information

@@ -564,6 +569,24 @@ sub analyze_request ($request_ref) {
$request_ref->{text} = 'index-pro';
}

# Return noindex empty HTML page for web crawlers that crawl specific facet pages
if (($request_ref->{is_crawl_bot} eq 1) and (defined $request_ref->{tagtype})) {
Contributor

The "eq" operator converts both values to strings before comparing; it's best to use == for numbers.
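
To make the difference concrete, here is a small standalone sketch (the helper names are made up for illustration). Both operators agree when the flag is exactly 1, but they diverge as soon as the value is numerically equal to 1 yet written differently as a string:

```perl
use strict;
use warnings;

# String comparison: both operands are stringified, so "1.0" is compared with "1".
sub flag_is_set_string {
    my ($flag) = @_;
    return ($flag eq 1) ? 1 : 0;
}

# Numeric comparison: both operands are converted to numbers, so 1.0 is compared with 1.
sub flag_is_set_numeric {
    my ($flag) = @_;
    return ($flag == 1) ? 1 : 0;
}

print flag_is_set_string("1.0"), "\n";     # prints 0
print flag_is_set_numeric("1.0"), "\n";    # prints 1
print flag_is_set_string(1), "\n";         # prints 1
print flag_is_set_numeric(1), "\n";        # prints 1
```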
