Six Degrees of Wikipedia is a captivating concept inspired by the theory of six degrees of separation from social networks: any two Wikipedia articles can be connected within six clicks or fewer. This project focuses on finding the shortest path between articles on the English Wikipedia, exploring the vast web of interconnected knowledge on the platform.
This project also lets you build a clean SQLite3 database with an adjacency list and a partitioned graph, suitable for easy traversal and reuse in your own projects.
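At its core, finding such a connection is a shortest-path search over the article link graph. The sketch below shows the idea with a breadth-first search over a small in-memory adjacency map; the function and variable names are illustrative only, since the project itself reads its adjacency data from the serialized lists and SQLite3 database described below.

```go
package main

import "fmt"

// findShortestPath runs a breadth-first search over a title -> linked titles
// adjacency map and returns the shortest click path from src to dst, or nil
// if no path exists. Illustrative only; the real project stores its adjacency
// data in serialized lists and SQLite3 rather than an in-memory map.
func findShortestPath(graph map[string][]string, src, dst string) []string {
	if src == dst {
		return []string{src}
	}
	parent := map[string]string{src: ""} // doubles as the visited set
	queue := []string{src}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, next := range graph[cur] {
			if _, seen := parent[next]; seen {
				continue
			}
			parent[next] = cur
			if next == dst {
				// Rebuild the path by walking parents back to src.
				path := []string{dst}
				for p := cur; p != ""; p = parent[p] {
					path = append([]string{p}, path...)
				}
				return path
			}
			queue = append(queue, next)
		}
	}
	return nil
}

func main() {
	graph := map[string][]string{
		"Six Degrees of Separation": {"Social network"},
		"Social network":            {"Wikipedia"},
		"Wikipedia":                 {"Encyclopedia"},
	}
	fmt.Println(findShortestPath(graph, "Six Degrees of Separation", "Encyclopedia"))
}
```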
Tool | Explanation for why it's needed |
---|---|
Python 3.x | Used for dump processing |
Go 1.20 | Used to build serialized adjacency lists and webserver |
NodeJS v18+ | Needed for development only |
Docker | Build and run webserver |
In order to build the Wikipedia link adjacency lists, you first need to download a Wikipedia dump file from here (~30 GB). The file to download from the Wikipedia archives is named: enwiki-xxxxxxxx-pages-articles-multistream.xml.bz2.
Once downloaded, set the following environment variables:
Name | Description |
---|---|
OUT_DIR | Path of the directory where the link adjacency lists will be written |
WIKI_XML_DUMP | Wikipedia XML dump file path |
SQLITE3_DB_PATH | Path to a SQLite3 database for information about adjacency and articles |
ADJACENCY_LIST_PATH | Path to the serialized adjacency list (should be equal to OUT_DIR; optional if you're not planning to run the webserver) |
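As a rough illustration of how a processing step could consume these variables (this is not the project's actual validation code), a small Go check might look like this:

```go
package main

import (
	"fmt"
	"os"
)

// Sketch only: checks that the variables described above are set before the
// dump-processing steps run. The real scripts may validate them differently.
func main() {
	required := []string{"OUT_DIR", "WIKI_XML_DUMP", "SQLITE3_DB_PATH"}
	for _, name := range required {
		if os.Getenv(name) == "" {
			fmt.Fprintf(os.Stderr, "missing required environment variable %s\n", name)
			os.Exit(1)
		}
	}
	// ADJACENCY_LIST_PATH is optional unless the webserver will be run.
	if os.Getenv("ADJACENCY_LIST_PATH") == "" {
		fmt.Println("ADJACENCY_LIST_PATH not set; fine only if you won't run the webserver")
	}
}
```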
To run all the dump-processing steps, run this command in a terminal:
make dump-processing
The following section is a step-by-step guide to building Wikipedia link adjacency lists, both as raw (unprocessed) CSV files and as a processed SQLite3 database.
The SQLite3 schema is available here
Once the environment variables are set, the first step in building the Wikipedia link adjacency lists is to parse the dumps and write them to CSV files (Note: they are written to CSV for faster parsing). This is done by running the following command in a terminal:
make step1-dp
This will create 3 CSV files (article.csv, redirect.csv, pagesmentioned.csv) in the directory set by OUT_DIR:

- article.csv: one column, the article title (string)
- redirect.csv: two string columns: the Article A title, and the Article B title that A actually redirects to
- pagesmentioned.csv (Article A has a link to Article B): two string columns: the Article A title, and the Article B title that A links to
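If you want to use these CSVs directly, a minimal Go sketch for streaming pagesmentioned.csv into an in-memory adjacency map could look like the following. It assumes two columns and no header row (check against the files step 1 actually produces), and keep in mind the full English Wikipedia link graph will not fit comfortably in memory on small machines.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"os"
	"path/filepath"
)

// loadLinks streams a two-column CSV (Article A title, Article B title that A
// links to) into an adjacency map. Assumes no header row; adjust if the files
// produced by step 1 differ.
func loadLinks(path string) (map[string][]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	adjacency := make(map[string][]string)
	r := csv.NewReader(f)
	for {
		rec, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		if len(rec) < 2 {
			continue
		}
		adjacency[rec[0]] = append(adjacency[rec[0]], rec[1])
	}
	return adjacency, nil
}

func main() {
	path := filepath.Join(os.Getenv("OUT_DIR"), "pagesmentioned.csv")
	adjacency, err := loadLinks(path)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("articles with outgoing links:", len(adjacency))
}
```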
Once the dumps have been processed by step 1, they need to be written to a SQLite3 database in order to perform some data manipulation: deleting articles that don't exist, removing redirect loops and chains, identifying which articles are simply aliases of another article, and partitioning the graph in step 3.
This is done by running the following command in a terminal:
make step2-dp
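To illustrate what removing redirect loops and chains means, the following standalone sketch collapses a redirect map so each alias points to its final target and drops redirects caught in a loop. The real step performs this cleanup inside the SQLite3 database; the names here are illustrative only.

```go
package main

import "fmt"

// resolveRedirects collapses redirect chains (A -> B -> C becomes A -> C) and
// drops redirects that are part of a loop (A -> B -> A).
func resolveRedirects(redirects map[string]string) map[string]string {
	resolved := make(map[string]string, len(redirects))
	for start := range redirects {
		seen := map[string]bool{start: true}
		target := redirects[start]
		for {
			next, isRedirect := redirects[target]
			if !isRedirect {
				resolved[start] = target // reached a real article
				break
			}
			if seen[next] {
				break // loop detected; drop this redirect
			}
			seen[next] = true
			target = next
		}
	}
	return resolved
}

func main() {
	redirects := map[string]string{
		"NYC":       "New York City", // simple redirect
		"Big Apple": "NYC",           // chain: Big Apple -> NYC -> New York City
		"Loop A":    "Loop B",        // loop, will be dropped
		"Loop B":    "Loop A",
	}
	fmt.Println(resolveRedirects(redirects))
}
```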
Once step 2 is done, the final required step is partitioning the graph. This is done to reduce execution time for requests between articles that have no connecting path.
NOTE: Because this is a directed graph, partitioning does not identify every pair of articles without a path, but it does catch a significant number of them.
This is done using the following command:
make step3-dp
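The sketch below shows the kind of check partitioning enables, assuming the partitions correspond to weakly connected components (link direction ignored): articles in different components cannot be connected in either direction, so such requests can be rejected immediately, while articles in the same component still require a full search, which is why not every no-path pair is caught. The union-find code is illustrative, not the project's actual implementation.

```go
package main

import "fmt"

// unionFind assigns each article to a partition such that two articles in
// different partitions can never be connected, even ignoring link direction.
type unionFind struct {
	parent map[string]string
}

func newUnionFind() *unionFind {
	return &unionFind{parent: make(map[string]string)}
}

func (u *unionFind) find(x string) string {
	if _, ok := u.parent[x]; !ok {
		u.parent[x] = x
	}
	if u.parent[x] != x {
		u.parent[x] = u.find(u.parent[x]) // path compression
	}
	return u.parent[x]
}

func (u *unionFind) union(a, b string) {
	u.parent[u.find(a)] = u.find(b)
}

func main() {
	// Directed edges (Article A links to Article B), treated as undirected
	// for the purpose of partitioning.
	edges := [][2]string{
		{"Alan Turing", "Computer science"},
		{"Computer science", "Mathematics"},
		{"Obscure stub", "Another stub"},
	}
	uf := newUnionFind()
	for _, e := range edges {
		uf.union(e[0], e[1])
	}
	// Different partitions: no path can exist in either direction.
	fmt.Println(uf.find("Alan Turing") == uf.find("Obscure stub")) // false
	// Same partition: a path may exist, so a real search is still required.
	fmt.Println(uf.find("Alan Turing") == uf.find("Mathematics")) // true
}
```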
NOTE: Don't run this if you want to use the adjacency lists for your own projects.
make step4-dp
After step 3, the database set by SQLITE3_DB_PATH contains some large tables, such as article_link_edge_directed, that will not be used by the webserver. They can be removed by running
make step5-dp
NOTE: Running VACUUM on the SQLite3 database will also reduce its size.
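For reference, the following is roughly what dropping an unused table and vacuuming amount to if done by hand. It assumes the github.com/mattn/go-sqlite3 driver, which may differ from what the project uses, and the actual step 5 may drop additional tables.

```go
package main

import (
	"database/sql"
	"log"
	"os"

	_ "github.com/mattn/go-sqlite3" // assumed SQLite3 driver choice
)

func main() {
	db, err := sql.Open("sqlite3", os.Getenv("SQLITE3_DB_PATH"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Drop a large table the webserver doesn't need.
	if _, err := db.Exec("DROP TABLE IF EXISTS article_link_edge_directed"); err != nil {
		log.Fatal(err)
	}
	// VACUUM rebuilds the database file so freed pages are actually released.
	if _, err := db.Exec("VACUUM"); err != nil {
		log.Fatal(err)
	}
}
```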
This is for production only.
Environment variables needed:
Name | Description |
---|---|
ADJACENCY_LIST_PATH | Path to serialized adjacency list directory generated in section 1 |
SQLITE3_DB_DIR | Path to SQLITE3 database directory generated in section 1 |
CAPTCHA_ENABLED | Determines whether the captcha is enabled (default: 1) |
CAPTCHA_SECRET | Google reCAPTCHA secret (Optional) |
CAPTCHA_SITEKEY | Google reCAPTCHA site key (Optional) |
Requirement: Docker and Docker Compose
- Build the webserver image:
make build-image
- Run the server:
make run-webapp
Environment variables needed:
Name | Description |
---|---|
ADJACENCY_LIST_PATH | Path to serialized adjacency list directory generated in section 1 |
SQLITE3_DB_PATH | Path to SQLITE3 database file generated in section 1 |
CAPTCHA_ENABLED | Determines whether the captcha is enabled (default: 1) |
CAPTCHA_SECRET | Captcha secret (Optional) |
CAPTCHA_SITEKEY | Captcha site key (Optional) |
- Run webpack to enable hot reloading of the webpage:
make webpack-watch (or: cd shortestpath/webapp/client && npm run watch)
- Host a Redis server. Environment variables needed:
Name | Description |
---|---|
REDIS_HOST | Hostname where Redis is running |
REDIS_PORT | Port where Redis is running |
- Run the webserver. The environment variable WIKISP_DEBUG must be set to 1:
make run-webpp-dev (or: cd shortestpath/webapp && go run main.go)