subhendusethi/nytimes-article-crawler
Dependencies:

  • Python 3.7
  • BeautifulSoup 4.5.1

HOW TO EXECUTE

$ python crawler_usage.py <start-url-string> <number-of-documents-to-crawl> <results-directory-path>

Note:
  <start-url-string> : Root of the BFS tree of article document URLs
  
  <number-of-documents-to-crawl> : Number of article documents to crawl
  
  <results-directory-path> : Result directory path without a trailing "/".
  The crawled documents will be stored in this directory as files named
  DOC_<ID>.txt
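The crawl described above is a breadth-first traversal rooted at <start-url-string> that stops once the requested number of documents has been visited. A minimal sketch of such a BFS routine, using a stubbed link-extraction function over a fake link graph instead of real HTTP requests (the URLs and function names here are illustrative, not the repository's actual code):

```python
from collections import deque

# Illustrative link graph standing in for fetched article pages.
FAKE_LINKS = {
    "https://example.com/root": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": ["https://example.com/root"],  # cycle back to root
    "https://example.com/c": [],
}

def extract_links(url):
    """Stub for fetching a page and extracting its article links."""
    return FAKE_LINKS.get(url, [])

def bfs_crawl(start_url, max_docs):
    """Breadth-first crawl: visit pages level by level from start_url."""
    visited = set()
    order = []
    queue = deque([start_url])
    while queue and len(order) < max_docs:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)  # a real crawler would save DOC_<ID>.txt here
        queue.extend(extract_links(url))
    return order

print(bfs_crawl("https://example.com/root", 3))
# → ['https://example.com/root', 'https://example.com/a', 'https://example.com/b']
```

The visited set guards against cycles (article pages routinely link back to each other), and the max_docs check on the saved-document count, rather than on the queue, matches the <number-of-documents-to-crawl> argument above.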

EXTRACTED DOCUMENT FORMAT

The format of the retrieved article files:

Name:

DOC_<ID>.txt

Content:

URL
TITLE
META-KEYWORDS
DATE
DOC ID
CONTENT
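A small helper that writes and reads back a record in this six-line layout (the field handling below is a sketch of the format described above; the repository's actual serialization code may differ):

```python
import os
import tempfile

# One field per line, in the order listed in the format description.
FIELDS = ["URL", "TITLE", "META-KEYWORDS", "DATE", "DOC ID", "CONTENT"]

def write_doc(directory, doc_id, record):
    """Write one crawled article as DOC_<ID>.txt, one field per line."""
    path = os.path.join(directory, "DOC_{}.txt".format(doc_id))
    with open(path, "w", encoding="utf-8") as fh:
        for field in FIELDS:
            fh.write(record[field] + "\n")
    return path

def read_doc(path):
    """Parse a DOC_<ID>.txt file back into a field dict."""
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()
    return dict(zip(FIELDS, lines))

with tempfile.TemporaryDirectory() as d:
    rec = {
        "URL": "https://www.nytimes.com/example",
        "TITLE": "Example headline",
        "META-KEYWORDS": "news, example",
        "DATE": "2016-11-01",
        "DOC ID": "7",
        "CONTENT": "Article body text...",
    }
    path = write_doc(d, 7, rec)
    assert read_doc(path) == rec
```

Note that this simple layout assumes each field fits on a single line; a multi-line CONTENT field would need an escaping or delimiter scheme on top of it.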

IMPROVEMENTS

Possible improvements:

  • Add a depth limit to the BFS routine.
  • Add more documentation.
  • Add more functionality, e.g. crawling specific types of content (music, crime, politics, etc.).
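A depth limit in the BFS routine, as suggested above, amounts to tracking each URL's distance from the start page and not following links from pages at the cutoff. A sketch under that assumption (the function and parameter names are illustrative; extract_links is passed in so the routine can be exercised against a fake link graph):

```python
from collections import deque

def bfs_crawl_with_depth(start_url, max_docs, max_depth, extract_links):
    """BFS crawl that also stops expanding pages deeper than max_depth."""
    visited = {start_url}
    order = []
    queue = deque([(start_url, 0)])  # (url, depth from the start page)
    while queue and len(order) < max_docs:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links from pages at the depth cutoff
        for link in extract_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

Because BFS visits pages in order of distance from the root, the depth check is enough to bound the whole traversal; pages beyond max_depth are simply never enqueued.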

License

MIT

About

Crawl data from articles of the New York Times website
