[docs] simple web crawler example (#31900)
maxpumperla authored Jan 28, 2023
1 parent c889349 commit 80d13d1
Showing 9 changed files with 322 additions and 92 deletions.
5 changes: 5 additions & 0 deletions doc/source/_static/css/custom.css
Expand Up @@ -316,6 +316,10 @@ img.horizontal-scroll {
float: right;
}

.card-body {
padding: 0.5rem !important;
}

/* Wrap code blocks instead of horizontal scrolling. */
pre {
white-space: pre-wrap;
Expand All @@ -325,6 +329,7 @@ pre {
.cell .cell_output {
max-height: 250px;
overflow-y: auto;
font-weight: bold;
}

/* Yellow doesn't render well on light background */
18 changes: 18 additions & 0 deletions doc/source/_static/js/custom.js
Expand Up @@ -28,6 +28,24 @@ window.addEventListener("scroll", loadVisibleTermynals);
createTermynals();
loadVisibleTermynals();


document.addEventListener("DOMContentLoaded", function() {
    let images = document.getElementsByClassName("fixed-height-img");
    let maxHeight = 0;

    for (let i = 0; i < images.length; i++) {
        if (images[i].height > maxHeight) {
            maxHeight = images[i].height;
        }
    }

    for (let i = 0; i < images.length; i++) {
        let margin = Math.floor((maxHeight - images[i].height) / 2);
        images[i].style.cssText = "margin-top: " + margin + "px !important;" +
            "margin-bottom: " + margin + "px !important;"
    }
});

// Remember the scroll position when the page is unloaded.
window.onload = function() {
let sidebar = document.querySelector("#bd-docs-nav");
1 change: 1 addition & 0 deletions doc/source/_toc.yml
Expand Up @@ -32,6 +32,7 @@ parts:
- file: ray-core/examples/batch_prediction
- file: ray-core/examples/batch_training
- file: ray-core/examples/automl_for_time_series
- file: ray-core/examples/web-crawler
- file: ray-core/api

- file: cluster/getting-started
4 changes: 2 additions & 2 deletions doc/source/custom_directives.py
Expand Up @@ -313,10 +313,10 @@ def build_gallery(app):
---
:img-top: {item["image"]}
{item["description"]}
{gh_stars}
{item["description"]}
+++
.. link-button:: {item["website"]}
{ref}
9 changes: 1 addition & 8 deletions doc/source/ray-air/user-guides.rst
Expand Up @@ -12,14 +12,13 @@ AIR User Guides
.. panels::
:container: text-center
:column: col-md-4 px-2 py-2
:img-top-cls: pt-5 w-75 d-block mx-auto
:img-top-cls: pt-5 w-75 d-block mx-auto fixed-height-img

---
:img-top: /ray-air/images/preprocessors.svg

.. https://docs.google.com/drawings/d/1ZIbsXv5vvwTVIEr2aooKxuYJ_VL7-8VMNlRinAiPaTI/edit
+++
.. link-button:: /ray-air/preprocessors
:type: ref
:text: Using Preprocessors
Expand All @@ -30,7 +29,6 @@ AIR User Guides

.. https://docs.google.com/drawings/d/15SXGHbKPWdrzx3aTAIFcO2uh_s6Q7jLU03UMuwKSzzM/edit
+++
.. link-button:: trainer
:type: ref
:text: Using Trainers
Expand All @@ -41,7 +39,6 @@ AIR User Guides

.. https://docs.google.com/drawings/d/10GZE_6s6ss8PSxLYyzcbj6yEalWO4N7MS7ao8KO7ne0/edit
+++
.. link-button:: air-ingest
:type: ref
:text: Configuring Training Datasets
Expand All @@ -52,7 +49,6 @@ AIR User Guides

.. https://docs.google.com/drawings/d/1yMd12iMkyo6DGrFoET1TIlKfFnXX9dfh2u3GSdTz6W4/edit
+++
.. link-button:: /ray-air/tuner
:type: ref
:text: Configuring Hyperparameter Tuning
Expand All @@ -63,7 +59,6 @@ AIR User Guides

.. https://docs.google.com/presentation/d/1jfkQk0tGqgkLgl10vp4-xjcbYG9EEtlZV_Vnve_NenQ/edit#slide=id.g131c21f5e88_0_549
+++
.. link-button:: predictors
:type: ref
:text: Using Predictors for Inference
Expand All @@ -74,7 +69,6 @@ AIR User Guides

.. https://docs.google.com/drawings/d/1-rg77bV-vEMURXZw5_mIOUFM3FObIIYbFOiYzFJW_68/edit
+++
.. link-button:: /ray-air/examples/serving_guide
:type: ref
:text: Deploying Predictors with Serve
Expand All @@ -85,7 +79,6 @@ AIR User Guides

.. https://docs.google.com/drawings/d/1ja1RfNCEFn50B9FHWSemUzwhtPAmVyoak1JqEJUmxs4/edit
+++
.. link-button:: air-deployment
:type: ref
:text: How to Deploy AIR
244 changes: 244 additions & 0 deletions doc/source/ray-core/examples/web-crawler.ipynb
@@ -0,0 +1,244 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Speed up your web crawler by parallelizing it with Ray"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"In this example we'll quickly demonstrate how to build a simple web scraper in Python and\n",
"parallelize it with Ray Tasks with minimal code changes.\n",
"\n",
"To run this example locally on your machine, please first install `ray` and `beautifulsoup` with\n",
"\n",
"```\n",
"pip install \"beautifulsoup4==4.11.1\" \"ray>=2.2.0\"\n",
"```\n",
"\n",
"First, we'll define a function called `find_links` which takes a starting page (`start_url`) to crawl,\n",
"and we'll take the Ray documentation as an example of such a starting point.\n",
"Our crawler simply extracts all available links from the starting URL that contain a given `base_url`\n",
"(e.g. in our example we only want to follow links on `https://docs.ray.io`, not any external links).\n",
"The `find_links` function is then called recursively with all the links we found this way, until a\n",
"certain depth is reached.\n",
"\n",
"To extract the links from HTML elements on a site, we define a little helper function called\n",
"`extract_links`, which takes care of handling relative URLs properly and sets a limit on the\n",
"number of links returned from a site (`max_results`) to control the runtime of the crawler more easily.\n",
"\n",
"Here's the full implementation:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 154,
"outputs": [],
"source": [
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"def extract_links(elements, base_url, max_results=100):\n",
"    links = []\n",
"    for e in elements:\n",
"        url = e[\"href\"]\n",
"        if \"https://\" not in url:\n",
"            url = base_url + url\n",
"        if base_url in url:\n",
"            links.append(url)\n",
"    return set(links[:max_results])\n",
"\n",
"\n",
"def find_links(start_url, base_url, depth=2):\n",
"    if depth == 0:\n",
"        return set()\n",
"\n",
"    page = requests.get(start_url)\n",
"    soup = BeautifulSoup(page.content, \"html.parser\")\n",
"    elements = soup.find_all(\"a\", href=True)\n",
"    links = extract_links(elements, base_url)\n",
"\n",
"    for url in links:\n",
"        new_links = find_links(url, base_url, depth-1)\n",
"        links = links.union(new_links)\n",
"    return links"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"Let's define a starting and base URL and crawl the Ray docs to a `depth` of 2."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 162,
"outputs": [],
"source": [
"base = \"https://docs.ray.io/en/latest/\"\n",
"docs = base + \"index.html\""
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 163,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 19.3 s, sys: 340 ms, total: 19.7 s\n",
"Wall time: 25.8 s\n"
]
},
{
"data": {
"text/plain": "591"
},
"execution_count": 163,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%time len(find_links(docs, base))"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"As you can see, crawling the documentation root recursively like this returns a\n",
"total of `591` pages and the wall time comes in at around 25 seconds.\n",
"\n",
"Crawling pages can be parallelized in many ways.\n",
"Probably the simplest way is to simply start with multiple starting URLs and call\n",
"`find_links` in parallel for each of them.\n",
"We can do this with [Ray Tasks](https://docs.ray.io/en/latest/ray-core/tasks.html) in a straightforward way.\n",
"We simply use the `ray.remote` decorator to wrap the `find_links` function in a task called `find_links_task` like this:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 157,
"outputs": [],
"source": [
"import ray\n",
"\n",
"@ray.remote\n",
"def find_links_task(start_url, base_url, depth=2):\n",
"    return find_links(start_url, base_url, depth)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"To use this task to kick off a parallel call, the only thing you have to do is use\n",
"`find_links_task.remote(...)` instead of calling the underlying Python function directly.\n",
"\n",
"Here's how you run six crawlers in parallel. The first three (redundantly) crawl\n",
"`docs.ray.io` again, while the other three crawl the main entry points of the Ray RLlib,\n",
"Tune, and Serve libraries, respectively:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 160,
"outputs": [],
"source": [
"links = [find_links_task.remote(f\"{base}{lib}/index.html\", base)\n",
" for lib in [\"\", \"\", \"\", \"rllib\", \"tune\", \"serve\"]]"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 161,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"591\n",
"591\n",
"105\n",
"204\n",
"105\n",
"CPU times: user 65.5 ms, sys: 47.8 ms, total: 113 ms\n",
"Wall time: 27.2 s\n"
]
}
],
"source": [
"%time for res in ray.get(links): print(len(res))"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"This parallel run crawls around four times the number of pages in roughly the same time as the initial, sequential run.\n",
"Note the use of `ray.get` in the timed run to retrieve the results from Ray (the `remote` call promise gets resolved with `get`).\n",
"\n",
"Of course, there are much smarter ways to create a crawler and efficiently parallelize it, and this example\n",
"gives you a starting point to work from."
],
"metadata": {
"collapsed": false
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
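
A side note on the `extract_links` helper in the notebook above: it builds absolute URLs by concatenating `base_url` and the `href`, which may not resolve hrefs found on nested pages (for example `../index.html`). Below is a minimal sketch of one possible alternative, assuming the standard library's `urllib.parse.urljoin` and `urldefrag` are acceptable here. The function name `extract_links_sketch` and the `page_url` parameter are hypothetical and not part of this commit. To stay runnable without network access, the sketch takes plain `href` strings rather than BeautifulSoup elements.

```python
from urllib.parse import urldefrag, urljoin


def extract_links_sketch(hrefs, page_url, base_url, max_results=100):
    """Hypothetical variant of the notebook's extract_links helper.

    Takes plain href strings (an assumption made for this sketch) so it
    runs without network access or third-party packages.
    """
    links = []
    seen = set()
    for href in hrefs:
        # Resolve relative hrefs against the page they were found on.
        absolute = urljoin(page_url, href)
        # Drop "#fragment" suffixes so anchors on the same page count once.
        absolute, _fragment = urldefrag(absolute)
        # Keep only links that stay under the base URL, without duplicates.
        if absolute.startswith(base_url) and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return set(links[:max_results])


if __name__ == "__main__":
    base = "https://docs.ray.io/en/latest/"
    page = base + "ray-core/walkthrough.html"
    hrefs = [
        "tasks.html",
        "../index.html#top",
        "https://github.com/ray-project/ray",
    ]
    for link in sorted(extract_links_sketch(hrefs, page, base)):
        print(link)
```

In the notebook's `find_links`, such a helper could be fed with `[e["href"] for e in elements]` and `start_url` as `page_url`. Whether the extra normalization is worth it for a deliberately simple example is a judgment call.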
10 changes: 5 additions & 5 deletions doc/source/ray-overview/eco-gallery.yml
@@ -1,8 +1,8 @@
meta:
section-titles: true
container: container pb-12
column: col-md-12 px-2 py-2
img-top-cls: pt-10 w-50 d-block mx-auto
section-titles: false
container: container pb-4
column: col-md-4 px-1 py-1
img-top-cls: p-2 w-75 d-block mx-auto fixed-height-img

buttons:
classes: btn-outline-info btn-block
Expand Down Expand Up @@ -146,7 +146,7 @@ projects:
random forests, gradient boosting, k-means and DBSCAN, and is designed to
interoperate with the Python numerical and scientific libraries NumPy and SciPy.
website: https://docs.ray.io/en/master/joblib.html
repo: https://docs.ray.io/en/master/joblib.html
repo: https://github.com/scikit-learn/scikit-learn
image: ../images/scikit.png
- name: Seldon Alibi Integration
section_title: Seldon Alibi