HTML scraper with templates
Most HTML pages are generated using templates. Why not use templates too for scraping HTML pages? As for a template language, let's use HTML plus a few keywords. That way, the workflow with weakscraper
is the following :
- Get the source of a HTML page you want to scrap.
- Using a few keywords, edit the HTML to select which information is of interest and which parts to discard.
- If complex processing is required, write additional callbacks in python.
- Run
weakscraper
on the template and on the HTML.
- Observes the rule of least power. A declarative language helps to focus on what to extract. How the information is scrapped is the job of the library.
- The template covers the whole page. If
weakscraper
encounters a part it is not aware of, it will throw an exception, allowing you to complete the template. There is little chance to miss information, even if a few pages contain adiv
you are not aware of at first.
- To extract a bit of information, the whole parent list from the node root must be specified in the template. If you want to extract a few bits of information buried deep into the HTML tree, you probably want to use tools like
Beautiful Soup
.
pip install weakscraper
A weakscraper
template is like a regular HTML file with some keywords to tell which parts are to be kept.
wp-name="name"
: This tag names an element in the output object. Optional ifwp-function
is set.wp-recursive
: This attribute signals that everything under this tag is to be kept.wp-list
: This attribute signals that this tag may be found zero, one or several time. This outputs a list.wp-function="f"
: This attribute enables to process the output of the tag with a callback.wp-optional
: This attribute signals that this tag can be found zero or one time.wp-ignore
: As a tag, signals that everything should be ignored until the parent end tag. As an attribute, same aswp-ignore-attrs
andwp-ignore-content
.wp-until="important-tag"
: As an attribute to the tagwp-ignore
, signal to ignore everything until animportant-tag
is found.wp-ignore-attrs
: This attribute prevents to spawn an exception if an unlisted attribute is found.wp-ignore-content
: This attribute signals that the content of this tag should be ignored.wp-name-attrs="name"
: This attribute that the attribute of the tag should be outputed with the namename
. Optional ifwp-function-attrs
is set.wp-function-attrs="f"
: This attribute enables to process the attributes dictionary with a callback.
template.html
<!DOCTYPE html>
<head>
<title>Title</title>
</head>
<body attr1="val1" wp-name="body">
<div wp-name="content"/>
</body>
</html>
webpage.html
<!DOCTYPE html>
<head>
<title>Title</title>
</head>
<body attr1="val1">
<div>Hi !</div>
</body>
</html>
scraper.py
from weakscraper import WeakScraper
f = open('template.html', 'r')
template_string = f.read()
f.close()
scraper = WeakScraper(template_string)
f = open('webpage.html', 'r')
html_string = f.read()
f.close()
result_data = scraper.scrap(html_string)
f = open('output.json', 'w')
f.write(json.dumps(result_data, ident=2))
f.close()
output.json
{
"body": {
"content": "Hi !"
}
}
See the tests for more examples.
- Class
TemplateParser
useshtml.parser
to parse the template string to a tree of dicts. - Class
HtmlParser
useshtml.parser
to parse the HTML string to a tree of dicts. - A tree of objects of class
Template
if recursively created from the output of aTemplateParser
. Then it compares recursively it's own structure with the output ofHtmlParser
. - Class
WeakScraper
glues the pieces together.