CRITICAL NOTE: in both get_counts.py and update_graphml.py, there is a line in the code to toggle between using the "official wikidata endpoint" and a local copy. The default is currently to use a live wikidata endpoint.
Search mode takes a set of seed node types, counts how many nodes belong to each type, finds the type and number of external identifiers attached to those nodes, and searches outward for other connected node types, returning both edge types and counts.
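To illustrate the first step of search mode (counting the nodes of a seed type), here is a minimal sketch of how such a count could be phrased as a SPARQL query. It is not taken from get_counts.py: the function names are hypothetical, and it assumes that "node type" corresponds to Wikidata's "instance of" property (P31). Only the query string and request URL are built; sending the request is left out.

```python
import urllib.parse

# Public Wikidata SPARQL endpoint (the live endpoint, as opposed to a local copy).
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_type_count_query(type_qid: str) -> str:
    """Return a SPARQL query counting items that are an instance of `type_qid`.

    Hypothetical helper: assumes node types are expressed with P31.
    """
    if not (type_qid.startswith("Q") and type_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item identifier: {type_qid!r}")
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?count) "
        f"WHERE {{ ?item wdt:P31 wd:{type_qid} . }}"
    )


def build_request_url(query: str, endpoint: str = WIKIDATA_ENDPOINT) -> str:
    """Encode the query as a GET URL that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return endpoint + "?" + params
```

Passing a different `endpoint` value is how a local copy could be targeted instead of the live service.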
- Run python3 get_counts.py
- Download & install yEd
- In yEd: "File -> Open" output of step 1
- Edit -> Properties Manager -> Imports additional configurations: select "prop_mapper_config.cnfx"
- Click "apply" for each of the node and edge configurations
The original output is messy and needs some manual editing. However, with editing in yEd, the following graph can be produced.
Update mode does an in-place update of all the counts of node types, external identifiers, and relationships between nodes. No new relationships are added, and all previous manual edits to the graph shape and structure remain intact.
Update mode now has two distinct methods of updating the graph.
- In-place update: Update counts for only the properties that currently exist on the graph.
- Property-search update: Update properties, potentially discovering new ones or removing old ones. Properties will then be filtered according to the variables min_counts and filt_props.
- Run python3 update_graphml.py <inputfilename> -o <outputfilename>
inputfilename should be a graphml file. Output will be written to outputfilename. If no outputfilename is specified, an autogenerated one will be created based on the inputfilename.
Use flag -p to run a property-search update.
Additional Optional Flags:
-h, --help
    show this help message and exit
-c MIN_COUNTS, --min_counts MIN_COUNTS
    The minimum number of counts a new property must have to be included (default 200)
-f FILT_PROPS, --filt_props FILT_PROPS
    The fraction of the total number of counts for a node that a property must have to be included (default 0.05)
-e ENDPOINT, --endpoint ENDPOINT
    Use a wikibase endpoint other than standard wikidata
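The interaction of the two filtering thresholds can be sketched as follows. This is an illustration, not the code in update_graphml.py: the function name and the input dictionary are hypothetical, and it assumes a property must pass both the absolute threshold (min_counts) and the relative one (filt_props) to be kept. The counts in the usage note are made up.

```python
def filter_properties(prop_counts, node_total, min_counts=200, filt_props=0.05):
    """Keep properties whose count passes both thresholds.

    prop_counts: mapping of property id -> count on one node type (hypothetical input)
    node_total:  total number of counts for that node type
    Assumption: the two thresholds are combined with AND.
    """
    # A property must reach the larger of the absolute and relative cutoffs.
    threshold = max(min_counts, filt_props * node_total)
    return {prop: count for prop, count in prop_counts.items() if count >= threshold}
```

For example, with made-up counts `{"P351": 9000, "P353": 450, "P2548": 150}` on a node type with 10000 counts, the relative cutoff is 500, so only `P351` is kept. On a node type with 1000 counts, the absolute cutoff of 200 dominates and both `P351` and `P353` are kept.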
A new feature finds counts for the reference information attached to the statements represented in the graph. This is done by running two scripts in sequence.
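As a rough picture of what counting reference information can look like, the sketch below builds a SPARQL query that groups the references on one kind of statement by reference property. It relies on the Wikidata RDF model, where `p:` links an item to a statement node and `prov:wasDerivedFrom` links a statement to its reference nodes. The function name is hypothetical, the use of P31 for the subject type is an assumption, and the actual queries issued by the scripts may differ.

```python
def build_reference_count_query(subject_qid: str, pid: str) -> str:
    """Return a SPARQL query counting statements per reference property.

    Hypothetical helper: counts, for items that are instances of `subject_qid`,
    how many `pid` statements carry each reference property.
    """
    return f"""
SELECT ?refProp (COUNT(DISTINCT ?statement) AS ?count) WHERE {{
  ?item wdt:P31 wd:{subject_qid} ;
        p:{pid} ?statement .
  ?statement prov:wasDerivedFrom ?ref .
  ?ref ?refProp ?value .
}}
GROUP BY ?refProp
""".strip()
```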
- Run python3 parse_graphml_connectivity.py <input_filename> -o <output_filename>
input_filename should be a graphml file.
The output csv will be written to output_filename. If none is given, the name query_info.csv is used.
Additional Optional Flags:
-e ENDPOINT, --endpoint ENDPOINT
Use a wikibase endpoint other than standard wikidata
The endpoint is used solely to get accurate mappings from property P-identifiers to English names.
- Run python3 get_prov_counts.py <input_filename> -o <output_filename>
input_filename is the .csv output of parse_graphml_connectivity.py.
The output csv will be written to output_filename. If none is given, prov_counts.csv will be used as the default.
This script will automatically write failed queries to a logfile.
Additional optional command-line arguments:
-l LOGFILE, --logfile LOGFILE
    Filename for log of failed queries. A unique filename will be used if none is passed
-a, --agg_objects
    Aggregate results on object; if True, only unique subjects and predicates will be queried
-m ABSOLUTE_MIN, --absolute_min ABSOLUTE_MIN
    The minimum number of counts a reference must have for a given group to be included (default 10)
-f FILT_LEVEL, --filt_level FILT_LEVEL
    The fraction of the max counts for a group that a reference must have to be included (default 0.05)
-e ENDPOINT, --endpoint ENDPOINT
    Use a wikibase endpoint other than standard wikidata
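The effect of aggregating on object can be sketched as collapsing rows that differ only in their object, so each subject and predicate pair is queried once. This is an assumption about what -a does, based on its help text. The column names `subject`, `predicate`, and `object` are hypothetical (the real layout of query_info.csv may differ), and the sample rows are made up.

```python
import csv
import io


def unique_subject_predicates(rows):
    """Return (subject, predicate) pairs in first-seen order, ignoring the object."""
    seen = set()
    pairs = []
    for row in rows:
        key = (row["subject"], row["predicate"])
        if key not in seen:
            seen.add(key)
            pairs.append(key)
    return pairs


# Made-up sample in the assumed column layout.
SAMPLE_CSV = """subject,predicate,object
Q7187,P688,Q8054
Q7187,P688,Q417841
Q8054,P702,Q7187
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
pairs = unique_subject_predicates(rows)
```

Here the first two rows share a subject and predicate, so `pairs` contains two entries rather than three.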