# Initial Design
## Approach
The new approach is to use the triple store as a backend to build a persistent cache with all the term information that the annotator needs, except parents and mappings. The annotation would run entirely from this cache, without querying the triple store, unless the annotation is expanded with mappings and/or parents. When the annotation needs mappings and/or parents, SPARQL queries will be executed against the triple store.
The cache of terms would get rebuilt daily, and we expect this process to be reasonably faster than the current one.
## Label IDs
With mgrep supporting a 64-bit key space, we can simply hash the label value to calculate the string IDs. This simplifies the process since we do not need to keep track of an incremental counter.
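For illustration, a minimal sketch of such a hash; the choice of truncated SHA-1 is an assumption, since any well-distributed hash that fits mgrep's 64-bit key space would do:

```ruby
require 'digest'

# Hash a label into a deterministic 64-bit integer ID:
# take the first 8 bytes (16 hex chars) of the SHA-1 digest.
def label_to_id(label)
  Digest::SHA1.hexdigest(label)[0, 16].to_i(16)
end

label_to_id("melanoma") # => same 64-bit ID on every run, no counter needed
```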
## Dictionary

### First pass
Iterate over the latest submission of every ontology to generate the dictionary file. Class pagination should be used to traverse all classes. Some approximation of the Ruby code to do this:
```ruby
size = 2500 # page size (arbitrary)
Ontology.all.each do |ont|
  # skip if summary only
  last = ont.last_submission
  page = 1
  begin
    class_page = Class.page submission: last, page: page, size: size
    class_page.each do |cls|
      generate_annotator_entry(cls)
    end
    page = class_page.next_page
  end while !page.nil?
end
```
The call to `generate_annotator_entry` should record the mapping between labels (pref/syn) and their hashes. Something like the following data structure approximates that:
```
9876543210 => { term1 => "syn,ACR1", term2 => "pref,ACR2" }
```
This is a hash where the keys are the string IDs and the values are another hash where the keys are term IDs and the values are a pref/syn flag plus the ontology acronym. A simple comma-separated string is a good approach to optimize performance, no JSON here. For mgrep we need another hash that maps each hash number to its label:

```
9876543210 => "melanoma"
```
Redis seems like a suitable backend for this. We can keep two tables (for instance in separate Redis databases or under distinct key prefixes, since a single Redis key cannot hold both a string and a hash):

- IdMappingTable: `9876543210 => "melanoma"`
- IdHashTable: `9876543210 => { term1 => "syn,ACR1", term2 => "pref,ACR2" }`
To record the data in the Redis cache one can run the following operations:

```
SET 987654321 "melanoma"
HMSET 987654321 term1 "syn,acr1" term2 "pref,acr2"
```
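A minimal sketch of what `generate_annotator_entry` could record, using the redis-rb client and the `label_to_id` helper sketched above; the class accessors (`prefLabel`, `synonym`, `id`) and the extra arguments are assumptions about the surrounding data model:

```ruby
require 'redis'

id_mapping = Redis.new(db: 0)  # IdMappingTable: string ID => label
id_hash    = Redis.new(db: 1)  # IdHashTable:   string ID => {term => "flag,acronym"}

# Record the pref label and every synonym of a class in both tables.
def generate_annotator_entry(cls, acronym, id_mapping, id_hash)
  labels = [[cls.prefLabel, "pref"]]
  labels += (cls.synonym || []).map { |s| [s, "syn"] }
  labels.each do |label, flag|
    string_id = label_to_id(label)
    id_mapping.set(string_id, label)
    id_hash.hset(string_id, cls.id.to_s, "#{flag},#{acronym}")
  end
end
```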
For retrieval:

```
HKEYS 987654321    # term1, term2: for when the request does not filter on type of label or ontology
HGETALL 987654321  # returns the complete hash; it can be used to filter on type of label and/or ontology
```
mgrep often annotates hundreds of terms, so we must batch these look-ups. The Redis protocol allows pipelining, and the Ruby client supports it:
```ruby
redis.pipelined do
  redis.hgetall "987654321"
  redis.hgetall "999999999"
  # (...)
  redis.hgetall "000000000"
end
```
In this way the client sends all the commands to the Redis server in a single round trip. Some more tricks here.
The dictionary file would be built out of the Redis server. We can traverse all keys and append a line to the dictionary file for each one:

```
987654321 "melanoma"
...
000000000 "label 0"
```

Note: Redis allows traversing all keys with the command `KEYS *`.
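As a sketch, the dump could use `SCAN` instead, which avoids blocking the server on a large keyspace; the exact dictionary format expected by mgrep (tab-separated ID/label pairs here) is an assumption:

```ruby
require 'redis'

id_mapping = Redis.new(db: 0)  # the IdMappingTable database (assumed layout)

# Dump every string ID and its label into the mgrep dictionary file.
# scan_each iterates with SCAN, a non-blocking alternative to KEYS *.
File.open("dictionary.txt", "w") do |f|
  id_mapping.scan_each do |key|
    f.puts "#{key}\t#{id_mapping.get(key)}"
  end
end
```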
### Incremental updates
Every time we parse an ontology we can update the Redis server with the new labels. Once a day we can regenerate the mgrep dictionary and restart the mgrep servers.
Deleting old terms becomes more complicated. If generating the Redis cache from the triple store is fast enough, we can simply repopulate the Redis data entirely once a week.
Annotation without the triple store covers the predominant usage of the annotator. The mapping/parents annotation expansion can be addressed as a final step of the implementation.
The initial annotation done by mgrep would be translated into term IRIs using the data in the Redis cache. That process would also filter out alt labels when only pref label annotations are requested, as well as ontologies that are not part of the targeted set, if that set is not empty.
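A sketch of that translation step, assuming mgrep returns the string IDs it matched and the `"flag,acronym"` value format described above:

```ruby
# string_ids: 64-bit IDs matched by mgrep in the input text.
# Returns the term IDs that survive the pref-label and ontology filters.
def translate_matches(redis, string_ids, pref_only: false, ontologies: [])
  # One round trip for all look-ups, as in the pipelining example above.
  entries = redis.pipelined do
    string_ids.each { |id| redis.hgetall(id.to_s) }
  end
  annotations = []
  entries.each do |term_map|
    term_map.each do |term_id, value|
      flag, acronym = value.split(",")
      next if pref_only && flag != "pref"
      next unless ontologies.empty? || ontologies.include?(acronym)
      annotations << term_id
    end
  end
  annotations
end
```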
The most efficient way to traverse the hierarchy up to n steps is to iteratively run the following SPARQL query:
```sparql
SELECT DISTINCT ?sub ?super ?graph WHERE {
  GRAPH ?graph {
    ?sub rdfs:subClassOf ?super .
  }
  FILTER (isURI(?super))
  FILTER ( $sub_filter )
}
```
where `$sub_filter` is an OR filter over the initial set of annotated IRIs. We iterate n times, each time replacing `$sub_filter` with the values bound to `?super` in the previous query.
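A sketch of that iteration; the query template and the `sparql_client.query` call (e.g., the sparql-client gem) are assumptions:

```ruby
HIERARCHY_QUERY = <<-SPARQL
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?sub ?super ?graph WHERE {
  GRAPH ?graph { ?sub rdfs:subClassOf ?super . }
  FILTER (isURI(?super))
  FILTER ( %{sub_filter} )
}
SPARQL

def ancestors_up_to(sparql_client, iris, n)
  results = []
  frontier = iris
  n.times do
    break if frontier.empty?
    # $sub_filter: an OR filter over the current frontier of IRIs.
    sub_filter = frontier.map { |iri| "?sub = <#{iri}>" }.join(" || ")
    solutions = sparql_client.query(HIERARCHY_QUERY % { sub_filter: sub_filter })
    results.concat(solutions.to_a)
    # The next round starts from the supers bound in this one.
    frontier = solutions.map { |s| s[:super].to_s }.uniq
  end
  results
end
```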
If the request needs to traverse the tree all the way to the roots, it is better to simply enable reasoning.