Python scripts for downloading Wikipedia data dump files. You can download the Wikipedia dump files listed in target.dat using manipulate.py.
You can also parse an XML article dump file (that is, pages-articles.xml) and import it into your MySQL database using manipulate.py.
- Python 3.X
- Python libraries
  - requests
  - mysqlclient
  - click
  - tqdm
- Download all files in this repository
- Install the Python libraries specified in requirements.txt
- Make a directory (DATA_DIR) to hold the downloaded Wikipedia dump files.
- Specify and confirm the files you want to download in target.dat
- Run manipulate.py. The following describes how to use manipulate.py to download Wikipedia dump files. For example, to download the Japanese Wikipedia dump files into the DATA_DIR directory, run the command: python manipulate.py download --lang ja DATA_DIR
Usage: manipulate.py download [OPTIONS] [DATA_DIR]

  This script downloads Wikipedia data dump files. You can download the
  files listed in target.dat into the directory DATA_DIR.

Options:
  --lang [ja|en|de|fr|zh|pl|pt|it|ru|es]
                                  target language for Wikipedia. Default value
                                  is 'ja'.
  --help                          Show this message and exit.
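As a rough illustration of what the download step involves, the sketch below builds the URL of one dump file and streams it into DATA_DIR. It is not the code in manipulate.py. The helper names build_dump_url and download_file are hypothetical, the URL follows the public "latest" layout on dumps.wikimedia.org (an assumption about what target.dat refers to), and it uses the standard library's urllib in place of requests and tqdm so that it runs without extra packages.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumption: the public "latest" layout on dumps.wikimedia.org.
BASE_URL = "https://dumps.wikimedia.org"


def build_dump_url(lang: str, suffix: str) -> str:
    """Return the URL of a 'latest' dump file for one language edition.

    suffix is the file name part after '<lang>wiki-latest-',
    for example 'pages-articles.xml.bz2'.
    """
    return f"{BASE_URL}/{lang}wiki/latest/{lang}wiki-latest-{suffix}"


def download_file(url: str, data_dir: str) -> Path:
    """Stream url into data_dir and return the local path (hypothetical helper)."""
    target_dir = Path(data_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / url.rsplit("/", 1)[-1]
    # Copy in chunks so a large dump file is never held in memory at once.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

For example, download_file(build_dump_url("ja", "pages-articles.xml.bz2"), "DATA_DIR") would fetch the Japanese article dump. Streaming matters here because article dumps can be many gigabytes.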
Modify and run import_sql.sh in the misc directory.
- Before running the script, download a table structure file from here. Then create the tables in your MySQL database.
- Run manipulate.py as described below. For example, suppose you want to import an XML article dump file (data_dir/pages-articles.xml) into your MySQL database (hostname: localhost, port: 3306, user: ja_wikipedia, password: ja_wikipedia, charset: utf8). Then run the command: python manipulate.py import_page_article_xml --db_host localhost --db_port 3306 --db_user ja_wikipedia --db_passwd ja_wikipedia --db_charset utf8 data_dir/pages-articles.xml
Usage: manipulate.py import_page_article_xml [OPTIONS] FILE_PATH

  This script parses a page article XML file at FILE_PATH and imports it into
  a MySQL database.

Options:
  --extract_count INTEGER  number of articles whose XML content will be
                           imported. If not specified, all articles will be
                           imported.
  --db_host TEXT           MySQL host name (default: localhost).
  --db_port INTEGER        MySQL port number (default: 3306).
  --db_name TEXT           MySQL database name (default: ja_wikipedia).
  --db_user TEXT           MySQL user name (default: ja_wikipedia).
  --db_passwd TEXT         MySQL user password (default: ja_wikipedia).
  --db_charset TEXT        character set on MySQL (default: utf8).
  --help                   Show this message and exit.
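To give a feel for the parsing half of this command, here is a minimal sketch that streams pages out of a pages-articles.xml file with the standard library. It is an illustration under assumptions, not the parser used by manipulate.py: the function names local_name and iter_pages are hypothetical, the extract_count argument only mirrors the meaning of the --extract_count option, and the database insert is left out because the table structure is not described here.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Optional, Tuple


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(
    source: BinaryIO, extract_count: Optional[int] = None
) -> Iterator[Tuple[int, str, str]]:
    """Yield (page_id, title, text) per <page>; stop after extract_count pages."""
    if extract_count is not None and extract_count <= 0:
        return
    yielded = 0
    # iterparse reads incrementally, so the whole dump is never loaded at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        page_id, title, text = 0, "", ""
        # Only direct children are inspected, so the revision <id> is not
        # mistaken for the page <id>.
        for child in elem:
            name = local_name(child.tag)
            if name == "id":
                page_id = int(child.text)
            elif name == "title":
                title = child.text or ""
            elif name == "revision":
                for sub in child:
                    if local_name(sub.tag) == "text":
                        text = sub.text or ""
        elem.clear()  # release the finished page's subtree
        yield page_id, title, text
        yielded += 1
        if yielded == extract_count:
            return


# Tiny made-up sample in the MediaWiki export layout.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Alpha</title>
    <id>1</id>
    <revision><id>10</id><text>First article</text></revision>
  </page>
  <page>
    <title>Beta</title>
    <id>2</id>
    <revision><id>20</id><text>Second article</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE), extract_count=1))
print(pages)  # [(1, 'Alpha', 'First article')]
```

Stripping the namespace from each tag keeps the sketch independent of the export schema version declared in the dump. With a real file, you would pass open("data_dir/pages-articles.xml", "rb") instead of the in-memory sample.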