
## Approach

The new approach is to use the triple store as a backend to build a persistent cache with all the term information that the annotator needs--except parents and mappings. The annotation would run entirely from this cache, never querying the triple store, unless the annotation is expanded with mappings and/or parents. When the annotation needs mappings and/or parents, SPARQL queries are executed against the triple store.

The cache of terms would get rebuilt daily, and we expect this process to be reasonably faster than the current one.

## Label IDs

With mgrep supporting a 64-bit key space, we can simply hash the label value to calculate the string IDs. This simplifies the process, since we do not need to keep track of an incremental counter.
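
For instance, a minimal sketch of such a hashing helper in Ruby, assuming a SHA-1 digest truncated to its first 8 bytes (the name label_to_id is illustrative, not existing code):

    require 'digest'

    # Illustrative helper: derive a stable 64-bit string ID from a label
    # by truncating a SHA-1 digest to its first 8 bytes. Any hash with a
    # 64-bit output would do equally well.
    def label_to_id(label)
      Digest::SHA1.digest(label)[0, 8].unpack('Q>').first
    end

    label_to_id("melanoma")  # same integer on every run, no counter needed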

## Dictionary

### First pass

Iterate over the latest submission of every ontology to generate the dictionary file. Class pagination should be used to traverse all classes. The following Ruby code approximates this:

    size = 500
    Ontology.all.each do |ont|
      # skip if summary only
      last = ont.last_submission
      next if last.nil?
      page = 1
      begin
        class_page = Class.page submission: last, page: page, size: size
        class_page.each do |cls|
          generate_annotator_entry(cls)
        end
        page = class_page.next_page
      end while !page.nil?
    end

The call generate_annotator_entry should record the mapping between labels (pref/syn) and their hashes. Something like the following data structure is an approximation of that:

  9876543210 => { term1 => "syn,ACR1" , term2 => "pref,ACR2" }

This is a hash where the keys are the string IDs and the values are another hash where the keys are term IDs and the values are a pref/syn flag plus the ontology acronym. A simple comma-separated string is a good approach to optimize performance--no JSON here. For mgrep we need another hash that maps each hash ID back to its label:

   9876543210 => "melanoma"
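
As a sketch of how generate_annotator_entry could populate both structures (the accessors prefLabel/synonym/id, the extra arguments, and the in-memory hashes id_mapping and id_hash are assumptions of this example, reusing label_to_id from above):

    # Hypothetical sketch: record one class in both structures.
    # cls.prefLabel, cls.synonym and cls.id are assumed accessors.
    def generate_annotator_entry(cls, acronym, id_mapping, id_hash)
      labels = [[cls.prefLabel, "pref"]]
      (cls.synonym || []).each { |s| labels << [s, "syn"] }
      labels.each do |label, kind|
        id = label_to_id(label)
        id_mapping[id] = label                                    # 9876543210 => "melanoma"
        (id_hash[id] ||= {})[cls.id.to_s] = "#{kind},#{acronym}"  # 9876543210 => { term => "pref,ACR" }
      end
    end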

Redis seems like a suitable backend for this. We can have two tables:

  1. IdMappingTable: 9876543210 => "melanoma"
  2. IdHashTable: 9876543210 => { term1 => "syn,ACR1" , term2 => "pref,ACR2" }

To record the data in the Redis cache one can run the following operations. Note that Redis has a single key space, so the two tables need distinct key prefixes--a single key cannot hold both a string and a hash:

    SET   str:9876543210 "melanoma"
    HMSET term:9876543210 term1 "syn,ACR1" term2 "pref,ACR2"
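
The same writes through the redis-rb client might look like this (a sketch; the key prefixes are the assumption introduced above):

    require 'redis'

    redis = Redis.new
    id = 9876543210
    # "str:"/"term:" prefixes keep the two tables apart in Redis's
    # single key space
    redis.set "str:#{id}", "melanoma"
    redis.hmset "term:#{id}", "term1", "syn,ACR1", "term2", "pref,ACR2"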

For retrieval:

    HKEYS   term:9876543210  # term1, term2: for when the request does not filter on type of label or ontology
    HGETALL term:9876543210  # returns the complete hash; it can be used to filter on type of label and/or ontology
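
For example, filtering one retrieved entry down to preferred labels from a target set of ontologies could look like this (illustrative Ruby, reusing the redis client from above):

    target_acronyms = ["ACR1", "ACR2"]       # assumed request parameter
    entry = redis.hgetall "term:9876543210"  # { "term1" => "syn,ACR1", ... }
    hits = entry.select do |term_id, value|
      kind, acronym = value.split(",")
      kind == "pref" && target_acronyms.include?(acronym)
    end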

mgrep often annotates hundreds of terms, so we must batch these look-ups. The Redis protocol allows pipelining, and the Ruby client supports it:

    redis.pipelined do
      redis.hgetall "term:9876543210"
      redis.hgetall "term:9999999999"
      # (...)
      redis.hgetall "term:0000000000"
    end

In this way the client sends all the commands to the Redis server in a single round trip, instead of paying one network round trip per key.
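
For completeness, collecting the replies with the same block API could look like this, where string_ids is assumed to hold the hash IDs matched by mgrep:

    results = redis.pipelined do
      string_ids.each { |id| redis.hgetall "term:#{id}" }
    end
    # results is an array with one label-info hash per requested ID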

The Redis documentation describes more pipelining tricks.

### MGREP Dictionary

It would be built from the Redis server: we traverse all keys of the IdMappingTable and append a line to the dictionary file for each label:

    9876543210 "melanoma"
    ...
    0000000000 "label 0"

Note: Redis allows traversing all keys with the KEYS * command (or, with a prefix, KEYS str:*).
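
A sketch of that dump, assuming the str: key prefix from the earlier examples and a tab separator (the exact format mgrep expects is not pinned down here):

    # Dump every label from the IdMappingTable into the dictionary file
    File.open("mgrep_dictionary.txt", "w") do |f|
      redis.keys("str:*").each do |key|
        id = key.sub("str:", "")
        f.puts "#{id}\t#{redis.get(key)}"
      end
    end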

### Incremental updates

Every time we parse an ontology we can update the Redis server with the new labels. Once a day we can regenerate the mgrep dictionary and restart the mgrep servers.

Deleting old terms becomes more complicated. If generating the Redis cache from the triple store is fast enough, we can simply repopulate the entire Redis dataset once a week.

## Triple store data for the annotator: mappings and parents

The annotation without the triple store covers the predominant usage of the annotator. The mappings/parents annotation expansion can be addressed as a final step of the implementation.

The initial annotation done by mgrep would be translated into term IRIs using the data in the Redis cache. That process would also filter out alternative (synonym) labels if only pref-label annotations are requested, and drop annotations from ontologies that are not part of the targeted set, if such a set is provided.

The most efficient way to traverse up to n levels of the hierarchy is to iteratively run the following SPARQL query:

    SELECT DISTINCT ?sub ?super ?graph WHERE {
      GRAPH ?graph {
        ?sub rdfs:subClassOf ?super .
      }
      FILTER (isURI(?super))
      FILTER ($sub_filter)
    }

Where $sub_filter is an OR filter over the initial set of annotated IRIs. On each of the n iterations we replace $sub_filter with the values bound to ?super in the previous query.
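
An illustrative version of that loop in Ruby (run_sparql_query and record_parent are hypothetical helpers, not existing API calls; annotated_iris is assumed to hold the IRIs produced by the initial annotation):

    current = annotated_iris
    n.times do
      break if current.empty?
      # build the OR filter over the current frontier of IRIs
      sub_filter = current.map { |iri| "?sub = <#{iri}>" }.join(" || ")
      query = <<-SPARQL
        SELECT DISTINCT ?sub ?super ?graph WHERE {
          GRAPH ?graph { ?sub rdfs:subClassOf ?super . }
          FILTER (isURI(?super))
          FILTER (#{sub_filter})
        }
      SPARQL
      rows = run_sparql_query(query)
      rows.each { |row| record_parent(row[:sub], row[:super], row[:graph]) }
      # the parents found at this level become the next frontier
      current = rows.map { |row| row[:super] }.uniq
    end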

If the request needs to traverse the hierarchy all the way to the roots, it is better to just enable reasoning.