# Wikipedia metrics generation on comparison table pages

This project aims to retrieve as much information as possible about the creation and edit history of matrices (comparison tables) across Wikipedia page revisions, in order to give visibility and drive the functional development of the main project, OpenCompare.
## Limits
As wikitext is an ambiguous, human-edited language, there are some limitations to identifying matrices inside a page. For example, when a matrix is moved to a different section (sections are used to name matrices), it becomes impossible to identify and compare the same matrix across revisions. The position inside the page can also vary, but the main problem remains the same.

Another difficulty is identifying which pages contain matrices: the Wikipedia API offers no tool for this, and parsing all page revisions to search for matrices would consume a huge amount of server resources (especially in an international context).
SQLite was adopted, first, to prevent issues and data loss with the original CSV file being accessed through threads; second, because custom SELECT queries make metrics processing easier to manage than a CSV file.

Before each launch, the database is backed up to a copy postfixed with the current date.
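As a minimal sketch of this backup step (in Python for illustration, since the project itself is Scala/Java; the database filename is a hypothetical example):

```python
import shutil
from datetime import date
from pathlib import Path

def backup_database(db_path: str) -> Path:
    """Copy the SQLite database file, postfixing the copy with today's date."""
    src = Path(db_path)
    # e.g. stats.db -> stats_2015-08-17.db
    dst = src.with_name(f"{src.stem}_{date.today().isoformat()}{src.suffix}")
    shutil.copyfile(src, dst)
    return dst
```

Keeping one dated copy per day means a faulty run can always be rolled back to the state before launch.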
## Functional processes
### Main launcher
The main class is `org.opencompare.stats.Launcher`, which launches the two main processes sequentially, as follows.

This task is done by the `org.opencompare.stats.processes.Revisions` class. It reads a custom CSV page list to know which pages to parse.
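The exact format of that page list is not documented here; assuming a simple CSV with one page title in the first column of each row (a hypothetical layout), reading it might look like this Python sketch:

```python
import csv

def load_page_list(path: str) -> list[str]:
    """Read the custom CSV page list and return the page titles to parse."""
    with open(path, newline="", encoding="utf-8") as f:
        # Keep the first column of each non-empty row as a page title.
        return [row[0] for row in csv.reader(f) if row]
```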
This process uses several other classes, such as:

- `org.opencompare.stats.utils.RevisionsParser` provides an easier way to retrieve revision metadata
- `org.opencompare.io.wikipedia.io.MediaWikipediaApi` provides a Scala interface to the Wikipedia API
Due to performance issues (tested on a quad-core AMD processor with 8 GB of RAM, the process lasts 7 hours), it is preferable to add the following JVM options (see http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html for further explanation) to avoid Java stack and GC limitations (`org.opencompare.stats.exceptions`):
- -Xmx5120m
- -XX:-UseParallelOldGC
- -XX:InitiatingHeapOccupancyPercent=10
To spare Wikipedia server resources (and processing time), once a revision has been processed, its associated wikitext is not retrieved again on subsequent launches.

If the output folder `metrics` is deleted, the script regenerates all the necessary files before processing.
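That regeneration check can be sketched as follows (Python for illustration; the folder name comes from the text, the regeneration body is a placeholder):

```python
from pathlib import Path

def ensure_metrics_folder(output_dir: str = "metrics") -> bool:
    """Recreate the output folder if it was deleted.

    Returns True when the folder had to be regenerated, False if it
    already existed and processing can resume with the cached files.
    """
    out = Path(output_dir)
    if out.exists():
        return False
    out.mkdir(parents=True)
    # ... regenerate the necessary files here before processing ...
    return True
```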
This task is done by the `org.opencompare.stats.processes.Metrics` class.
This process uses several other classes, such as:

- `org.opencompare.stats.utils.RevisionsComparator` returns the comparison between all matrices across revisions
- `org.opencompare.io.wikipedia.io.WikiTextLoader` returns a list of `org.opencompare.api.java.PCMContainer` containing the revision matrices to be compared
As with the wikitext grabbing process, a previously created line is not reprocessed. Database access is reduced by a CONSTRAINT on several fields (which currently generates a great number of `org.opencompare.stats.exceptions` across threads and has to be fixed soon) and by using a simple SELECT instead of an INSERT.
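The SELECT-before-INSERT pattern described above can be illustrated with a minimal Python/SQLite sketch (the table and column names are hypothetical, not the project's actual schema):

```python
import sqlite3

# UNIQUE(...) plays the role of the CONSTRAINT mentioned above.
SCHEMA = """CREATE TABLE IF NOT EXISTS metrics (
    page_title  TEXT,
    revision_id INTEGER,
    value       REAL,
    UNIQUE (page_title, revision_id)
)"""

def record_metric(conn, page_title, revision_id, value):
    """Insert a metrics row only if no row with the same key fields exists.

    The cheap SELECT avoids re-INSERTing already-processed lines and thus
    avoids triggering the UNIQUE constraint on every rerun.
    """
    cur = conn.execute(
        "SELECT 1 FROM metrics WHERE page_title = ? AND revision_id = ?",
        (page_title, revision_id),
    )
    if cur.fetchone() is not None:
        return False  # already processed on a previous launch
    conn.execute(
        "INSERT INTO metrics VALUES (?, ?, ?)",
        (page_title, revision_id, value),
    )
    return True
```

The constraint remains the safety net; the SELECT is only an optimization so that reruns do not pay the cost (and exceptions) of rejected INSERTs.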
R is used to produce the graphical representation of the metrics. The scripts are located in `r-metrics`.

Once the main processes have finished, launch the R scripts to generate the graphs:
cd r-metrics
Rscript metrics_log.R => to have a full logarithmic metrics view
Rscript metrics_revisions.R => to have a linear comparison between changes and revisions
Rscript revisions_matrices.R => to have a linear comparison between revisions and matrices
The R scripts depend on the RSQLite package to read the database.