digitalTranshumant/real-time-wiki-covid-tracker

real-time-page-tracker

This repo contains the code used to run this service: https://covid-data.wmflabs.org/

The main components are:

  • PageCrawler.py: Given a non-empty set of Wikidata items as seeds (that is, len(set(seeds)) > 0), it discovers related Wikidata items and uses their sitelinks to return a list of pages related to the seed(s). The output is written to an SQLite database, in two tables:
      • pagesPerProjectTable: Information about the Wikipedia articles, such as project, page, url, wikilink, wikidataItem.
      • itemsInfoTable: Information about the Wikidata items and their relation to the seeds.

  • getEdits.py: A rudimentary crawler that counts the number of edits to each of the pages in pagesPerProjectTable. It saves every edit, with user, timestamp, page_title, and project, in the 'revisions' table of the SQLite database.

  • app.py: A Flask server that runs https://covid-data.wmflabs.org/. It offers some endpoints to download the data as JSON, as well as some visualizations and statistics about the data.
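
As a rough illustration of how these components share one SQLite database, the sketch below creates the three tables named above in an in-memory database and counts edits per tracked page. The column types, the columns of itemsInfoTable, the helper names (`create_tables`, `edits_per_page`), and the sample rows are assumptions made for illustration; the actual schema is whatever PageCrawler.py and getEdits.py create.

```python
import sqlite3

# Assumed schema: the column names of pagesPerProjectTable and revisions are
# taken from the README; the column types and the itemsInfoTable columns are
# guesses for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pagesPerProjectTable (
    project TEXT, page TEXT, url TEXT, wikilink TEXT, wikidataItem TEXT
);
CREATE TABLE IF NOT EXISTS itemsInfoTable (
    wikidataItem TEXT, seed TEXT, relation TEXT
);
CREATE TABLE IF NOT EXISTS revisions (
    user TEXT, timestamp TEXT, page_title TEXT, project TEXT
);
"""


def create_tables(conn):
    """Create the three tables if they do not exist yet."""
    conn.executescript(SCHEMA)


def edits_per_page(conn):
    """Return (project, page, edit_count) for every tracked page."""
    query = """
        SELECT p.project, p.page, COUNT(r.timestamp) AS n_edits
        FROM pagesPerProjectTable AS p
        LEFT JOIN revisions AS r
            ON r.project = p.project AND r.page_title = p.page
        GROUP BY p.project, p.page
        ORDER BY n_edits DESC, p.project, p.page
    """
    return conn.execute(query).fetchall()


# Made-up sample rows, only to exercise the query.
conn = sqlite3.connect(":memory:")
create_tables(conn)
conn.executemany(
    "INSERT INTO pagesPerProjectTable VALUES (?, ?, ?, ?, ?)",
    [
        ("enwiki", "COVID-19", "https://en.wikipedia.org/wiki/COVID-19",
         "[[COVID-19]]", "Q84263196"),
        ("eswiki", "COVID-19", "https://es.wikipedia.org/wiki/COVID-19",
         "[[COVID-19]]", "Q84263196"),
    ],
)
conn.executemany(
    "INSERT INTO revisions VALUES (?, ?, ?, ?)",
    [
        ("ExampleUser", "2020-04-01T12:00:00Z", "COVID-19", "enwiki"),
        ("AnotherUser", "2020-04-01T12:05:00Z", "COVID-19", "enwiki"),
    ],
)
counts = edits_per_page(conn)
```

The LEFT JOIN keeps pages that have no recorded revisions, so they show up with a count of zero instead of disappearing from the result.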

To understand the methodology behind the crawler, please follow this notebook.
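
Independently of the notebook, the sitelink step described for PageCrawler.py can be sketched as follows: given the `sitelinks` object of a Wikidata entity (the shape returned by the Wikidata API's wbgetentities action), keep the Wikipedia entries and turn them into page rows. The function name `sitelinks_to_pages`, the row layout, the exclusion list, and the rule for deriving a URL from a site id are assumptions for illustration, not the repository's code.

```python
from urllib.parse import quote

# Site ids that end in "wiki" but are not Wikipedia language editions.
# Illustrative only, not an exhaustive list.
NON_WIKIPEDIA_SITES = {
    "commonswiki", "specieswiki", "metawiki",
    "wikidatawiki", "mediawikiwiki", "sourceswiki",
}


def sitelinks_to_pages(item_id, sitelinks):
    """Turn a Wikidata `sitelinks` mapping into a list of Wikipedia page rows."""
    rows = []
    for site, link in sorted(sitelinks.items()):
        # Keep only Wikipedia editions, e.g. "enwiki" but not "enwikinews".
        if not site.endswith("wiki") or site in NON_WIKIPEDIA_SITES:
            continue
        # Assumed rule: "zh_min_nanwiki" -> "zh-min-nan.wikipedia.org".
        lang = site[: -len("wiki")].replace("_", "-")
        title = link["title"]
        url = f"https://{lang}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"
        rows.append(
            {"project": site, "page": title, "url": url, "wikidataItem": item_id}
        )
    return rows


# Hand-written sample input in the shape of a wbgetentities response.
sample_sitelinks = {
    "enwiki": {"site": "enwiki", "title": "COVID-19 pandemic"},
    "eswiki": {"site": "eswiki", "title": "Pandemia de COVID-19"},
    "commonswiki": {"site": "commonswiki", "title": "Category:COVID-19 pandemic"},
    "enwikinews": {"site": "enwikinews", "title": "Category:COVID-19"},
}
pages = sitelinks_to_pages("Q81068910", sample_sitelinks)
```

Here `pages` contains one row for enwiki and one for eswiki; the Commons and Wikinews entries are filtered out.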

TODOs:

  • Replace/improve getEdits.py with a direct connection to the Wiki Replicas.
  • Add more interactive visualizations.
  • Move the seeds list to a configuration file.
  • Clean the code ;)
