amidict: wikisparql

petermr edited this page Aug 6, 2020 · 6 revisions

Linking amidict to the output of a Wikidata SPARQL query.

A manually run Wikidata query can save its results as an XML file. This page describes how to process that output into an AMI dictionary.

background

We assume you have run a SPARQL query that outputs a list of Items (Q-values) to be used as input to a dictionary.

NOTE: insert example of query here

sparql XML

Typical SPARQL output (truncated to 3 results):

<sparql xmlns='http://www.w3.org/2005/sparql-results#'>
	<head>
		<variable name='DiseaseLabel'/>
		<variable name='instanceofLabel'/>
		<variable name='DiseaseAltLabel'/>
		<variable name='Disease'/>
		<variable name='ICDcode'/>
	</head>
	<results>
		<result>
			<binding name='Disease'>
				<uri>http://www.wikidata.org/entity/Q12135</uri>
			</binding>
			<binding name='ICDcode'>
				<literal>F00-F99</literal>
			</binding>
			<binding name='DiseaseLabel'>
				<literal xml:lang='en'>mental disorder</literal>
			</binding>
			<binding name='instanceofLabel'>
				<literal xml:lang='en'>disease</literal>
			</binding>
			<binding name='DiseaseAltLabel'>
				<literal xml:lang='en'>disease of mental health, disorder of mental process, mental dysfunction, mental illness, mental or behavioural disorder, psychiatric condition, psychiatric disease, psychiatric disorder, mental disorders</literal>
			</binding>
		</result>
		<result>
			<binding name='Disease'>
				<uri>http://www.wikidata.org/entity/Q12198</uri>
			</binding>
			<binding name='ICDcode'>
				<literal>A64</literal>
			</binding>
			<binding name='DiseaseLabel'>
				<literal xml:lang='en'>sexually transmitted infection</literal>
			</binding>
			<binding name='instanceofLabel'>
				<literal xml:lang='en'>type of pathogen transmission</literal>
			</binding>
			<binding name='DiseaseAltLabel'>
				<literal xml:lang='en'>sexually transmitted disease, sexually transmitted diseases, STD, STI, VD, venereal disease</literal>
			</binding>
		</result>
		<result>
			<binding name='Disease'>
				<uri>http://www.wikidata.org/entity/Q16495</uri>
			</binding>
			<binding name='ICDcode'>
				<literal>E06.5</literal>
			</binding>
			<binding name='DiseaseLabel'>
				<literal xml:lang='en'>Riedel's fibrosing thyroiditis</literal>
			</binding>
			<binding name='instanceofLabel'>
				<literal xml:lang='en'>disease</literal>
			</binding>
			<binding name='DiseaseAltLabel'>
				<literal xml:lang='en'>Riedel disease, Riedel fibrosing thyroiditis, Riedel thyroiditis, Riedel's struma, Riedel's thyroiditis</literal>
			</binding>
		</result>
	</results>
</sparql>

variables

These names are chosen by the query creator. We assume that there are no constraints on them and we map the values later. Many names end with Label because Wikidata can be asked to generate such variables for names and descriptive strings. They need not be the names used in the dictionary.

result/s

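Each result is a separate XML <result> element whose children are <binding> elements, each identified by a name attribute that matches one of the variables. As the example above shows, the bindings need not appear in the same order as the variables in <head>, and the SPARQL results format omits the binding for any variable that is unbound in a given result. Each binding has a single child (<literal> or <uri>; the format also allows <bnode>, which we do not handle). We use the string value of that child as an attribute of the dictionary entry. For <uri> we strip everything up to and including the last slash to give the Q-number (e.g. http://www.wikidata.org/entity/Q12135 gives Q12135).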
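
The reading of results described above can be sketched in Java using only the JDK's DOM parser. This is a minimal illustration, not the amidict implementation: the class SparqlResultReader and its methods are hypothetical names, and it assumes that every <uri> is a Wikidata entity URI whose last path segment is the identifier.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of amidict.
class SparqlResultReader {

    static final String NS = "http://www.w3.org/2005/sparql-results#";

    /** Returns one map (variable name -> string value) per <result>. */
    static List<Map<String, String>> read(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<Map<String, String>> rows = new ArrayList<>();
            NodeList results = doc.getElementsByTagNameNS(NS, "result");
            for (int i = 0; i < results.getLength(); i++) {
                Element result = (Element) results.item(i);
                Map<String, String> row = new LinkedHashMap<>();
                NodeList bindings = result.getElementsByTagNameNS(NS, "binding");
                for (int j = 0; j < bindings.getLength(); j++) {
                    Element binding = (Element) bindings.item(j);
                    Element value = firstChildElement(binding);
                    if (value == null) {
                        continue; // empty binding: nothing to record
                    }
                    String text = value.getTextContent().trim();
                    if ("uri".equals(value.getLocalName())) {
                        // ASSUMPTION: keep only the part after the last slash (the Q-number)
                        text = text.substring(text.lastIndexOf('/') + 1);
                    }
                    row.put(binding.getAttribute("name"), text);
                }
                rows.add(row);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse SPARQL results XML", e);
        }
    }

    /** Returns the first element child of parent, skipping whitespace text nodes. */
    static Element firstChildElement(Element parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                return (Element) n;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String xml = "<sparql xmlns='http://www.w3.org/2005/sparql-results#'><results><result>"
                + "<binding name='Disease'><uri>http://www.wikidata.org/entity/Q12135</uri></binding>"
                + "<binding name='DiseaseLabel'><literal xml:lang='en'>mental disorder</literal></binding>"
                + "</result></results></sparql>";
        System.out.println(read(xml));
    }
}
```

Keying each row by the binding's name attribute, rather than by position, means the result does not depend on the order in which the bindings appear.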

synonyms

The Wikidata altLabel gives a list of comma-separated alternatives (approximate synonyms), which we split at commas. Unfortunately the list has been flattened from discrete lines (separated by line-ends), so a word or phrase that itself contains a comma produces false splits. We currently have no fix for this.
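
The comma split, and the false split it can produce, can be sketched as follows. The class SynonymSplitter is a hypothetical name used only for illustration, and the second input in main is an invented label chosen because it contains a comma.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of amidict.
class SynonymSplitter {

    /** Splits a flattened altLabel string at commas and trims each piece. */
    static List<String> split(String altLabel) {
        List<String> synonyms = new ArrayList<>();
        for (String part : altLabel.split(",")) {
            String synonym = part.trim();
            if (!synonym.isEmpty()) {
                synonyms.add(synonym);
            }
        }
        return synonyms;
    }

    public static void main(String[] args) {
        // Taken from the example results above: splits cleanly into four synonyms.
        System.out.println(split("STD, STI, VD, venereal disease"));
        // Invented label containing a comma: wrongly split into two pieces.
        System.out.println(split("1,2-dichloroethane poisoning"));
    }
}
```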

mapping names

AMIDict has the following names (more may be added in future):

  • term. The word or phrase which defines the concept. This is the only mandatory (MUST) attribute.
  • name. A human-readable word or phrase naming the concept. Often the same as term but might be an expansion of an acronym, etc. SHOULD be present.
  • wikidata. The Q-name or P-name that uniquely identifies the Wikidata item or property. SHOULD be present.
  • description. A phrase used to describe the Wikidata entry (often a sentence). SHOULD be present.

P- and Q- values

The user may also retrieve other name-value pairs. Wikidata P-values are properties (predicates) that take values, and either those values or the P-name may be retrieved. Q-values are always identifiers. The ami-names have the syntax [pq]_\d+_[a-z0-9]+, e.g. p_31_instanceof. The current amidict preserves these but does not use them; later versions might.

unknown values

Users can retrieve other values; these are flagged and preserved by giving them a name with a leading underscore (e.g. _icd10).
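
The three kinds of name described so far (the fixed names, P/Q names, and underscore-prefixed unknowns) can be told apart with a few checks. This is only a sketch: the class AmiNameClassifier and its category strings are hypothetical, and the pattern assumes underscore separators (as in p_31_instanceof) and accepts mixed case (as in p_31_instanceOf, used in the commands on this page).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of amidict.
class AmiNameClassifier {

    // The four names listed under "mapping names".
    static final Set<String> CORE =
            new HashSet<>(Arrays.asList("term", "name", "wikidata", "description"));

    // ASSUMPTION: underscore separators and letters of either case in the suffix.
    static final Pattern PQ_NAME = Pattern.compile("[pq]_\\d+_[A-Za-z0-9]+");

    /** Returns a hypothetical category label for an ami-name. */
    static String classify(String amiName) {
        if (CORE.contains(amiName)) {
            return "core";
        }
        if (PQ_NAME.matcher(amiName).matches()) {
            return "wikidata-pq";
        }
        if (amiName.startsWith("_")) {
            return "unknown-preserved";
        }
        return "unrecognised";
    }

    public static void main(String[] args) {
        for (String n : new String[] {"term", "p_31_instanceOf", "_icd10"}) {
            System.out.println(n + " -> " + classify(n));
        }
    }
}
```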

the mapping

This has to be done by the user who created the labels in the query. Typically we map amiNames to sparqlNames: ami1=sparql1,ami2=sparql2...

The picocli option --sparqlmap is a Map and takes a comma-separated list of name=value pairs. NOTE: commas only and NO SPACES.

--sparqlmap wikidata=Disease,p_31_instanceOf=instanceofLabel,term=DiseaseLabel,name=DiseaseLabel,_icd10=ICDcode
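
To make the meaning of the pairs concrete, here is a hand-written sketch of how such an argument breaks down into an amiName-to-sparqlName map. In amidict the parsing is left to picocli; the class SparqlMapParser below is a hypothetical stand-in used only to illustrate the format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; amidict relies on picocli for this.
class SparqlMapParser {

    /** Parses "ami1=sparql1,ami2=sparql2" into an ordered amiName -> sparqlName map. */
    static Map<String, String> parse(String arg) {
        Map<String, String> amiToSparql = new LinkedHashMap<>();
        for (String pair : arg.split(",")) {
            int eq = pair.indexOf('=');
            if (eq <= 0 || eq == pair.length() - 1) {
                throw new IllegalArgumentException(
                        "expected amiName=sparqlName but found '" + pair + "'");
            }
            if (!pair.equals(pair.trim()) || pair.contains(" ")) {
                throw new IllegalArgumentException("no spaces allowed in '" + pair + "'");
            }
            amiToSparql.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return amiToSparql;
    }

    public static void main(String[] args) {
        System.out.println(parse(
                "wikidata=Disease,p_31_instanceOf=instanceofLabel,"
                + "term=DiseaseLabel,name=DiseaseLabel,_icd10=ICDcode"));
    }
}
```

The map is keyed by amiName because several amiNames may draw on the same SPARQL variable, as term and name both do with DiseaseLabel here.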

The amiNames are listed above. In addition there may be synonyms, which point to a field in the query output (usually an *AltLabel).

    --synonyms=DiseaseAltLabel

The synonyms field is given with its own option (--synonyms), separate from --sparqlmap.

full input

This can be put on a single line. If you split it, end each continued line with a backslash on Unix/Mac (on Windows, cmd.exe uses ^ and PowerShell uses a backtick instead). A continuation line inside the --sparqlmap value must start in the first column, otherwise the leading whitespace splits the value. Values in <...> must be replaced by your own values.

amidict -vv --dictionary <mydictionary> --directory=<myOutputDir> --input=<myinputFile> \
	 create \
	 --informat=wikisparqlxml \
	 --sparqlmap \
wikidata=Disease,p_31_instanceOf=instanceofLabel,wikidataAltLabel=DiseaseAltLabel,\
term=DiseaseLabel,name=DiseaseLabel,_icd10=ICDcode \
	 --synonyms=DiseaseAltLabel

### Latest 2020-08-06

	@Test
	public void testCreateFromWikidataQueryMapTransform() throws IOException {
		String dictionary = "disease4";
		File queryFile = new File(TEST_DICTIONARY, dictionary + ".sparql");
		File outputDir = TARGET_DICTIONARY;
		String cmd = "-vvv"
				+ " --dictionary " + dictionary
				+ " --directory=" + outputDir
				+ " create"
				+ " --informat=wikisparqlxml"
				+ " --sparqlquery "+queryFile
				+ " --sparqlmap "
				+ "wikidataURL=wikidata,"
				+ "wikipediaURL=wikipedia,"
				+ "description=wikidataDescription,"
				+ "wikidataAltLabel=wikidataAltLabel,"
				+ "term=wikidataLabel,"
				+ "name=wikidataLabel"
				+ " --transformName wikidataID=EXTRACT(wikidataURL,.*/(.*))"
				+ ""
				+ " --synonyms=wikidataAltLabel"
				;

Note: the --synonyms option will be moved to the amidict update subcommand.
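
The --transformName argument in the test above applies the regular expression .*/(.*) to wikidataURL. The sketch below shows what that expression captures from a Wikidata entity URL. The semantics assumed here for EXTRACT (match the whole value, return the first capture group) are a guess, and the class ExtractSketch is a hypothetical name, not amidict code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the real EXTRACT semantics may differ.
class ExtractSketch {

    /** Returns capture group 1 if regex matches the whole value, else null. */
    static String extract(String value, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(value);
        if (matcher.matches() && matcher.groupCount() >= 1) {
            return matcher.group(1);
        }
        return null;
    }

    public static void main(String[] args) {
        // The leading .* is greedy, so the group holds the text after the LAST slash.
        System.out.println(extract("http://www.wikidata.org/entity/Q12135", ".*/(.*)"));
    }
}
```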

complex Queries

organisms including viruses

Roderic Page: https://iphylo.blogspot.com/2017/01/displaying-taxonomic-classifications.html