[docs] simple web crawler example (ray-project#31900)

Signed-off-by: Edward Oakes <[email protected]>
edoakes · Mar 22, 2023 · a8ca410
1 parent f2bb9a8

Showing 9 changed files with 322 additions and 92 deletions.
diff --git a/doc/source/_static/css/custom.css b/doc/source/_static/css/custom.css
@@ -316,6 +316,10 @@ img.horizontal-scroll {
     float: right;
 }
 
+.card-body {
+    padding: 0.5rem !important;
+}
+
 /* Wrap code blocks instead of horizontal scrolling. */
 pre {
     white-space: pre-wrap;
@@ -325,6 +329,7 @@ pre {
 .cell .cell_output {
     max-height: 250px;
     overflow-y: auto;
+    font-weight: bold;
 }
 
 /* Yellow doesn't render well on light background */

diff --git a/doc/source/_static/js/custom.js b/doc/source/_static/js/custom.js
@@ -28,6 +28,24 @@ window.addEventListener("scroll", loadVisibleTermynals);
 createTermynals();
 loadVisibleTermynals();
 
+
+document.addEventListener("DOMContentLoaded", function() {
+  let images = document.getElementsByClassName("fixed-height-img");
+  let maxHeight = 0;
+
+  for (let i = 0; i < images.length; i++) {
+    if (images[i].height > maxHeight) {
+      maxHeight = images[i].height;
+    }
+  }
+
+  for (let i = 0; i < images.length; i++) {
+    let margin = Math.floor((maxHeight - images[i].height) / 2);
+    images[i].style.cssText = "margin-top: " + margin + "px !important;" +
+        "margin-bottom: " + margin + "px !important;"
+  }
+});
+
 // Remember the scroll position when the page is unloaded.
 window.onload = function() {
     let sidebar = document.querySelector("#bd-docs-nav");

diff --git a/doc/source/_toc.yml b/doc/source/_toc.yml
@@ -32,6 +32,7 @@ parts:
               - file: ray-core/examples/batch_prediction
               - file: ray-core/examples/batch_training
               - file: ray-core/examples/automl_for_time_series
+              - file: ray-core/examples/web-crawler
           - file: ray-core/api
 
       - file: cluster/getting-started

diff --git a/doc/source/custom_directives.py b/doc/source/custom_directives.py
@@ -313,10 +313,10 @@ def build_gallery(app):
         ---
         :img-top: {item["image"]}
 
-        {item["description"]}
-
         {gh_stars}
 
+        {item["description"]}
+
         +++
         .. link-button:: {item["website"]}
             {ref}

diff --git a/doc/source/ray-air/user-guides.rst b/doc/source/ray-air/user-guides.rst
@@ -12,14 +12,13 @@ AIR User Guides
 .. panels::
     :container: text-center
     :column: col-md-4 px-2 py-2
-    :img-top-cls: pt-5 w-75 d-block mx-auto
+    :img-top-cls: pt-5 w-75 d-block mx-auto fixed-height-img
 
     ---
     :img-top:  /ray-air/images/preprocessors.svg
 
     .. https://docs.google.com/drawings/d/1ZIbsXv5vvwTVIEr2aooKxuYJ_VL7-8VMNlRinAiPaTI/edit
 
-    +++
     .. link-button:: /ray-air/preprocessors
         :type: ref
         :text: Using Preprocessors
@@ -30,7 +29,6 @@ AIR User Guides
 
     .. https://docs.google.com/drawings/d/15SXGHbKPWdrzx3aTAIFcO2uh_s6Q7jLU03UMuwKSzzM/edit
 
-    +++
     .. link-button:: trainer
         :type: ref
         :text: Using Trainers
@@ -41,7 +39,6 @@ AIR User Guides
 
     .. https://docs.google.com/drawings/d/10GZE_6s6ss8PSxLYyzcbj6yEalWO4N7MS7ao8KO7ne0/edit
 
-    +++
     .. link-button:: air-ingest
         :type: ref
         :text: Configuring Training Datasets
@@ -52,7 +49,6 @@ AIR User Guides
 
     .. https://docs.google.com/drawings/d/1yMd12iMkyo6DGrFoET1TIlKfFnXX9dfh2u3GSdTz6W4/edit
 
-    +++
     .. link-button:: /ray-air/tuner
         :type: ref
         :text: Configuring Hyperparameter Tuning
@@ -63,7 +59,6 @@ AIR User Guides
 
     .. https://docs.google.com/presentation/d/1jfkQk0tGqgkLgl10vp4-xjcbYG9EEtlZV_Vnve_NenQ/edit#slide=id.g131c21f5e88_0_549
 
-    +++
     .. link-button:: predictors
         :type: ref
         :text: Using Predictors for Inference
@@ -74,7 +69,6 @@ AIR User Guides
 
     .. https://docs.google.com/drawings/d/1-rg77bV-vEMURXZw5_mIOUFM3FObIIYbFOiYzFJW_68/edit
 
-    +++
     .. link-button:: /ray-air/examples/serving_guide
         :type: ref
         :text: Deploying Predictors with Serve
@@ -85,7 +79,6 @@ AIR User Guides
 
     .. https://docs.google.com/drawings/d/1ja1RfNCEFn50B9FHWSemUzwhtPAmVyoak1JqEJUmxs4/edit
 
-    +++
     .. link-button:: air-deployment
         :type: ref
         :text: How to Deploy AIR

diff --git a/doc/source/ray-core/examples/web-crawler.ipynb b/doc/source/ray-core/examples/web-crawler.ipynb
@@ -0,0 +1,244 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "source": [
+    "# Speed up your web crawler by parallelizing it with Ray"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "In this example we'll quickly demonstrate how to build a simple web scraper in Python and\n",
+    "parallelize it with Ray Tasks with minimal code changes.\n",
+    "\n",
+    "To run this example locally on your machine, please first install `ray` and `beautifulsoup4` with\n",
+    "\n",
+    "```\n",
+    "pip install \"beautifulsoup4==4.11.1\" \"ray>=2.2.0\"\n",
+    "```\n",
+    "\n",
+    "First, we'll define a function called `find_links` which takes a starting page (`start_url`) to crawl,\n",
+    "and we'll take the Ray documentation as an example of such a starting point.\n",
+    "Our crawler simply extracts all available links from the starting URL that contain a given `base_url`\n",
+    "(e.g. in our example we only want to follow links on `https://docs.ray.io`, not any external links).\n",
+    "The `find_links` function is then called recursively with all the links we found this way, until a\n",
+    "certain depth is reached.\n",
+    "\n",
+    "To extract the links from HTML elements on a site, we define a little helper function called\n",
+    "`extract_links`, which turns relative URLs into absolute ones by prefixing `base_url` and sets a limit on the\n",
+    "number of links returned from a site (`max_results`) to control the runtime of the crawler more easily.\n",
+    "\n",
+    "Here's the full implementation:"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 154,
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "from bs4 import BeautifulSoup\n",
+    "\n",
+    "def extract_links(elements, base_url, max_results=100):\n",
+    "    links = []\n",
+    "    for e in elements:\n",
+    "        url = e[\"href\"]\n",
+    "        if \"https://\" not in url:\n",
+    "            url = base_url + url\n",
+    "        if base_url in url:\n",
+    "            links.append(url)\n",
+    "    return set(links[:max_results])\n",
+    "\n",
+    "\n",
+    "def find_links(start_url, base_url, depth=2):\n",
+    "    if depth == 0:\n",
+    "        return set()\n",
+    "\n",
+    "    page = requests.get(start_url)\n",
+    "    soup = BeautifulSoup(page.content, \"html.parser\")\n",
+    "    elements = soup.find_all(\"a\", href=True)\n",
+    "    links = extract_links(elements, base_url)\n",
+    "\n",
+    "    for url in links:\n",
+    "        new_links = find_links(url, base_url, depth-1)\n",
+    "        links = links.union(new_links)\n",
+    "    return links"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Let's define a starting and base URL and crawl the Ray docs to a `depth` of 2."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 162,
+   "outputs": [],
+   "source": [
+    "base = \"https://docs.ray.io/en/latest/\"\n",
+    "docs = base + \"index.html\""
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 163,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 19.3 s, sys: 340 ms, total: 19.7 s\n",
+      "Wall time: 25.8 s\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": "591"
+     },
+     "execution_count": 163,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "%time len(find_links(docs, base))"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "As you can see, crawling the documentation root recursively like this returns a\n",
+    "total of `591` pages and the wall time comes in at around 25 seconds.\n",
+    "\n",
+    "Crawling pages can be parallelized in many ways.\n",
+    "Probably the simplest way is to start with multiple starting URLs and call\n",
+    "`find_links` in parallel for each of them.\n",
+    "We can do this with [Ray Tasks](https://docs.ray.io/en/latest/ray-core/tasks.html) in a straightforward way.\n",
+    "We simply use the `ray.remote` decorator to wrap the `find_links` function in a task called `find_links_task` like this:"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 157,
+   "outputs": [],
+   "source": [
+    "import ray\n",
+    "\n",
+    "@ray.remote\n",
+    "def find_links_task(start_url, base_url, depth=2):\n",
+    "    return find_links(start_url, base_url, depth)"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "To use this task to kick off a parallel call, the only thing you have to do is use\n",
+    "`find_links_task.remote(...)` instead of calling the underlying Python function directly.\n",
+    "\n",
+    "Here's how you run six crawlers in parallel: the first three (redundantly) crawl\n",
+    "`docs.ray.io` again, the other three crawl the main entry points of the Ray RLlib,\n",
+    "Tune, and Serve libraries, respectively:"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 160,
+   "outputs": [],
+   "source": [
+    "links = [find_links_task.remote(f\"{base}{lib}/index.html\", base)\n",
+    "         for lib in [\"\", \"\", \"\", \"rllib\", \"tune\", \"serve\"]]"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 161,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "591\n",
+      "591\n",
+      "105\n",
+      "204\n",
+      "105\n",
+      "CPU times: user 65.5 ms, sys: 47.8 ms, total: 113 ms\n",
+      "Wall time: 27.2 s\n"
+     ]
+    }
+   ],
+   "source": [
+    "%time for res in ray.get(links): print(len(res))"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This parallel run crawls around four times the number of pages in roughly the same time as the initial, sequential run.\n",
+    "Note the use of `ray.get` in the timed run to retrieve the results from Ray: each `remote` call returns immediately with an object reference (a future), and `ray.get` blocks until the results are available.\n",
+    "\n",
+    "Of course, there are much smarter ways to build a crawler and parallelize it efficiently, but this example\n",
+    "gives you a starting point to work from."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
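
Note on the notebook above: `extract_links` builds absolute URLs by concatenating `base_url` and the raw `href`. The following is a minimal sketch of one alternative, not part of this commit, that resolves links against the page they were found on using the standard library's `urllib.parse.urljoin` and `urldefrag`. The helper name `normalize_links` and its `page_url` parameter are hypothetical, and it takes plain `href` strings rather than BeautifulSoup elements so that it runs without extra dependencies.

```python
from urllib.parse import urldefrag, urljoin


def normalize_links(hrefs, page_url, base_url, max_results=100):
    """Resolve raw href strings against page_url; keep those under base_url.

    Hypothetical helper for illustration, not the notebook's extract_links.
    """
    links = []
    seen = set()
    for href in hrefs:
        # Stop once the per-page limit is reached.
        if len(links) >= max_results:
            break
        # urljoin resolves relative paths such as "../index.html" against
        # the current page and leaves absolute URLs unchanged.
        absolute = urljoin(page_url, href)
        # Strip "#fragment" so anchors within one page collapse to one URL.
        absolute, _fragment = urldefrag(absolute)
        # Keep only in-scope links, preserving first-seen order.
        if absolute.startswith(base_url) and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


if __name__ == "__main__":
    base = "https://docs.ray.io/en/latest/"
    page = base + "ray-core/walkthrough.html"
    hrefs = [
        "tasks.html",
        "../index.html#ray",
        "https://github.com/ray-project/ray",
        "tasks.html#more",
    ]
    for link in normalize_links(hrefs, page, base):
        print(link)
```

Because the limit is applied after deduplication, this sketch can return up to `max_results` distinct links, whereas the notebook slices the list first and deduplicates afterwards. Swapping it into the notebook would likely change the recorded link counts.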
diff --git a/doc/source/ray-overview/eco-gallery.yml b/doc/source/ray-overview/eco-gallery.yml
@@ -1,8 +1,8 @@
 meta:
-  section-titles: true
-  container: container pb-12
-  column: col-md-12 px-2 py-2
-  img-top-cls: pt-10 w-50 d-block mx-auto
+  section-titles: false
+  container: container pb-4
+  column: col-md-4 px-1 py-1
+  img-top-cls: p-2 w-75 d-block mx-auto fixed-height-img
 
 buttons:
   classes: btn-outline-info btn-block
@@ -146,7 +146,7 @@ projects:
       random forests, gradient boosting, k-means and DBSCAN, and is designed to
       interoperate with the Python numerical and scientific libraries NumPy and SciPy.
     website: https://docs.ray.io/en/master/joblib.html
-    repo: https://docs.ray.io/en/master/joblib.html
+    repo: https://github.com/scikit-learn/scikit-learn
     image: ../images/scikit.png
   - name: Seldon Alibi Integration
     section_title: Seldon Alibi