How can node-scrapy be used recursively to crawl a site? #23
Hi, would you please add a recursive example? Thanks!
Answered by stefanmaric, Dec 14, 2020
Replies: 1 comment
Hi @mariusa,
Even though we use node-scrapy for crawling at Eeshi, it is focused on the scraping part. It doesn't provide anything at the network layer, and even less for the crawling logic. For HTTP fetching there's a plethora of options (request, got, axios, node-fetch, etc.). Here's a quick example I put together with node-fetch:
const fs = require('fs')
const path = require('path')
// these need to be installed in the project
const fetch = require('node-fetch')
const { extract } = require('node-scrapy')
const wait = () =>
  new Promise((resolve) => {
    setTimeout(resolve, Math.round(Math.random() * 10000))
  })
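// Doubles as the visited set (keys are URLs already crawled) and the result store
// (values are the links found on each page)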
const LINKS_STORE = {}
const START_URL = 'https://en.wikipedia.org/wiki/Printmaking'
const crawl = async (url) => {
  console.log(`Fetching: ${url}`)

  // Random delay so we don't choke Wikipedia's servers; in a real crawler this must be
  // replaced by an actual queue with a parallelism limit (see the sketch below)
  await wait()

  const response = await fetch(url, {
    headers: {
      // Wikipedia requires a User-Agent header, otherwise it blocks requests right away
      'User-Agent':
        'Mariusa/1.0 (http://mariusa.github.io/crawler/; [email protected]) used-base-library/1.0',
    },
  })

  if (!response.ok) {
    console.log(`Failed to fetch: ${url}`)
    console.dir(response)
    return
  }

  const body = await response.text()
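  // node-scrapy queries take the form 'CSS selector (attribute | filter | filter:arg)':
  // here we take each matched link's href, normalize its whitespace, trim it, and prefix
  // the Wikipedia origin to turn relative paths into absolute URLs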
  const links = extract(body, [
    // Get links only from the right-hand sidebars (there are two kinds of them)
    `.infobox a[href^="/wiki/"], .vertical-navbox a[href^="/wiki/"] (href | normalizeWhitespace | trim | prefix:"https://en.wikipedia.org")`,
  ])

  if (!links) {
    console.log(`No links found in page: ${url}`)
    return
  }

  console.log(`${links.length} links found at: ${url}`)

  LINKS_STORE[url] = links

  for (const link of links) {
    if (link in LINKS_STORE) {
      console.log(`Skipping URL because it was crawled already: ${link}`)
    } else {
      crawl(link)
    }
  }
}
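// Flush the results to disk exactly once, whether the process ends normally or via Ctrl+C;
// the IIFE keeps the `writing` flag private so repeated exit events are ignored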
const writeResults = (() => {
  let writing

  return () => {
    if (writing) {
      return
    }

    writing = true

    const filename = path.join(__dirname, 'result.json')
    fs.writeFileSync(filename, JSON.stringify(LINKS_STORE, null, 2), 'utf-8')
    console.log(`Results saved to ${filename}`)
    process.exit(process.exitCode)
  }
})()
process.on('exit', writeResults)
process.on('SIGINT', writeResults)
crawl(START_URL)
Hope this helps you.
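For reference, here is a minimal sketch of what the comment inside crawl alludes to: a queue with a fixed parallelism limit instead of random delays. It is not part of the original answer; the MAX_PARALLEL value and the enqueue/pump/crawlOne names, along with the visited/results bookkeeping, are illustrative assumptions, while the fetch and extract calls mirror the example above.
const fetch = require('node-fetch')
const { extract } = require('node-scrapy')

const MAX_PARALLEL = 5 // assumption: keep at most 5 requests in flight at once
const START_URL = 'https://en.wikipedia.org/wiki/Printmaking'

const visited = new Set() // every URL ever queued, so in-flight pages aren't queued twice
const results = {}
const queue = []
let active = 0

const enqueue = (url) => {
  if (visited.has(url)) return
  visited.add(url)
  queue.push(url)
  pump()
}

const pump = () => {
  // Start new crawls until the parallelism limit is reached or the queue is empty
  while (active < MAX_PARALLEL && queue.length > 0) {
    const url = queue.shift()
    active += 1
    crawlOne(url)
      .catch((error) => console.error(`Failed to crawl ${url}:`, error.message))
      .finally(() => {
        active -= 1
        pump()
      })
  }
}

const crawlOne = async (url) => {
  const response = await fetch(url, {
    headers: {
      'User-Agent':
        'Mariusa/1.0 (http://mariusa.github.io/crawler/; [email protected]) used-base-library/1.0',
    },
  })

  if (!response.ok) {
    console.log(`Failed to fetch: ${url}`)
    return
  }

  const body = await response.text()
  const links = extract(body, [
    `.infobox a[href^="/wiki/"], .vertical-navbox a[href^="/wiki/"] (href | normalizeWhitespace | trim | prefix:"https://en.wikipedia.org")`,
  ])

  if (!links) return

  results[url] = links
  links.forEach(enqueue)
}

enqueue(START_URL)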
Answer selected by mariusa