forked from fxsjy/jparser

A readability parser that extracts the title, content, and images from HTML pages.


qjfoidnh/jparser

jparser


Install:

pip install jparser

   (requirement: lxml)

Usage Example:

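from __future__ import print_function  # lets the print() calls below run on Python 2 as well

try:
    from urllib2 import urlopen          # Python 2
except ImportError:
    from urllib.request import urlopen   # Python 3

from jparser import PageModel

# Fetch the page and decode the raw bytes before handing the text to the parser.
html = urlopen("http://news.sohu.com/20170512/n492734045.shtml").read().decode('gb18030')
pm = PageModel(html)
result = pm.extract()

print("==title==")
print(result['title'])
print("==content==")
# Each content item is tagged with a type: 'text' or 'image'.
for x in result['content']:
    if x['type'] == 'text':
        print(x['data'])
    elif x['type'] == 'image':
        print("[IMAGE]", x['data']['src'])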
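
The loop above suggests that extract() returns a dict with a 'title' string and a 'content' list whose items carry a 'type' ('text' or 'image') and a 'data' payload. As a minimal sketch under that assumption, the helper below flattens such a result into one plain-text string. Both render_result and the sample dict are hypothetical and not part of jparser; the sample only imitates the shape used in the example, so it runs without network access or lxml.

```python
def render_result(result):
    """Flatten a jparser-style result dict into one plain-text string.

    Assumes the shape used in the usage example: result['title'] is a string
    and result['content'] is a list of {'type': ..., 'data': ...} items.
    """
    lines = [result.get('title', '')]
    for block in result.get('content', []):
        if block['type'] == 'text':
            lines.append(block['data'])
        elif block['type'] == 'image':
            # Image blocks keep their attributes in a dict; only 'src' is used here.
            lines.append("[IMAGE] " + block['data']['src'])
        # Any other block type is skipped.
    return "\n".join(lines)


# Hypothetical sample data that imitates the structure, not real extractor output.
sample = {
    'title': 'Example title',
    'content': [
        {'type': 'text', 'data': 'First paragraph.'},
        {'type': 'image', 'data': {'src': 'http://example.com/a.png'}},
    ],
}

if __name__ == "__main__":
    print(render_result(sample))
```

With a real page, the same helper could be called as render_result(pm.extract()), provided the result keeps the shape assumed here.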

Demo:

http://jparser.duapp.com/
