replace addressit with pelias native parser #1287

Merged
55 commits merged into master from pelias_parser on Oct 1, 2019

Conversation

@missinglink (Member) commented May 2, 2019

This PR replaces addressit with the new pelias/parser module.

Ready for testing and review

Functionally, I tried to keep them pretty similar, although the native parser is better at detecting addresses in general, particularly in places like Germany and France 🎉

One major change is that we used to rely on the admin parsing to generate targeted subqueries (i.e. when state='CA' was matched, we would produce a subquery for region='CA').

This had the disadvantage of not generating a scoring subquery for cases where 'CA' referred to the country of Canada.

So the change I made is that we now consider everything from the first matched admin character to the end of the string as 'admin_parts', and generate subqueries for them that match on any admin field (sketched below).
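
For illustration, here is a minimal sketch of how such "match any admin field" subqueries could be built; the field names and helper are assumptions for the example, not the actual PR code:

// Minimal sketch of the 'admin_parts' idea (illustrative, not the PR code).
// Each admin token is scored against several admin fields at once, so an
// ambiguous token like 'CA' can match either region (California) or
// country (Canada) instead of being pinned to a single field.
const ADMIN_FIELDS = ['parent.region_a', 'parent.country_a', 'parent.locality'] // assumed names

function adminSubqueries (adminParts) {
  return adminParts.map(token => ({
    multi_match: { query: token, fields: ADMIN_FIELDS }
  }))
}

// for 'portland CA' the admin_parts would be ['CA'];
// any admin field may now produce a scoring subquery for it
console.log(JSON.stringify(adminSubqueries(['CA']), null, 2))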

@Joxit (Member) commented May 7, 2019

I did some tests (only in France for now) and it seems that this solves #1279 🎉

@Joxit (Member) commented May 21, 2019

Here is the current status of acceptance tests on our planet build.

                     v3.65.0    pelias_parser
Pass                 535        516
Improvements         18         22 (+4 👍)
Fail                 91         87
Placeholders         0          0
Regressions          65         84 (+19 👎)
Test success rate    90.59%     87.77%

What are we targeting? As good as addressit (plus improvements)?

@missinglink (Member, Author) commented May 21, 2019

I'd like to see all the sane tests still passing; if tests are brittle or just bad then they can be changed.

I'm working off the assumption that all the broken tests at this stage are either venue queries or intersection queries?

Could you please post any of your test cases which don't fall in one of those two buckets?

But yeah, pretty good so far considering that it's pretty much a rewrite of autocomplete :)

@Joxit (Member) commented May 22, 2019

Yes, most of them are venue queries.

Here is a diff of test results, master vs the pelias_parser branch:

--- master	2019-05-21 17:14:45.528007350 +0200
+++ pelias_parser	2019-05-21 17:18:56.555871513 +0200
 autocomplete admin areas
-  ✔ [1-2] "{"text":"brook"}"
+  ✘ regression [1-2] "{"text":"brook"}": score 0 out of 1
+  diff:
+    label
+      expected: Brooklyn, New York, NY, USA
+      actual:   Crystal Brook, NY, USA
...
 autocomplete daly city
-  ✔ [1] "{"focus.point.lat":"37.769316","focus.point.lon":"-122.484223","text":"dal"}"
+  ✘ regression [1] "{"focus.point.lat":"37.769316","focus.point.lon":"-122.484223","text":"dal"}": score 0 out of 1
+  diff:
+    name
+      expected: Daly City
+      actual:   Berg en Dal
-  ✔ [5] "{"focus.point.lat":"37.769316","focus.point.lon":"-122.484223","text":"daly ci"}"
+  ✘ regression [5] "{"focus.point.lat":"37.769316","focus.point.lon":"-122.484223","text":"daly ci"}": score 1 out of 2
+  diff:
+    priorityThresh is 1 but found at position 9
...
autocomplete focus
-  ✔ [4] "{"focus.point.lat":52.507054,"focus.point.lon":13.321714,"text":"hard rock cafe"}"
-  ✔ [5] "{"focus.point.lat":40.744243,"focus.point.lon":-73.990342,"text":"hard rock cafe"}"
+  ✘ regression [4] "{"focus.point.lat":52.507054,"focus.point.lon":13.321714,"text":"hard rock cafe"}": score 1 out of 2
+  diff:
+    priorityThresh is 1 but found at position 5
+  ✘ regression [5] "{"focus.point.lat":40.744243,"focus.point.lon":-73.990342,"text":"hard rock cafe"}": score 1 out of 2
+  diff:
+    priorityThresh is 1 but found at position 4
...
 autocomplete street centroids
-  ✔ [1] "{"sources":"osm","layers":"street","text":"rushendon furlong, england"}"
-  ✔ [2] "{"sources":"osm","layers":"street","text":"grolmanstr, berlin"}"
+  ✘ regression [1] "{"sources":"osm","layers":"street","text":"rushendon furlong, england"}": no results returned
+  ✘ regression [2] "{"sources":"osm","layers":"street","text":"grolmanstr, berlin"}": no results returned
...
 autocomplete street fallback
-  ✔ [1] "{"text":"grolmanstr, berlin"}"
+  ✘ regression [1] "{"text":"grolmanstr, berlin"}": no results returned
...
 autocomplete synonyms
-  ✔ [3] "{"text":"ucsf mt zion, san francisco, ca"}"
+  ✘ regression [3] "{"text":"ucsf mt zion, san francisco, ca"}": score 2 out of 3
+  diff:
+    name
+      expected: UCSF Mount Zion Campus
+      actual:   San Francisco
...
labels
-  ✔ [24] "{"text":"national air and space museum, washington dc"}"
+  ✘ regression [24] "{"text":"national air and space museum, washington dc"}": score 0 out of 1
+  diff:
+    label
+      expected: National Air and Space Museum, Washington, DC, USA
+      actual:   Frigid Air, Tacoma, WA, USA
...
landmarks
-  ✔ [15] "{"text":"australian war memorial, canberra, australia"}"
+  ✘ regression [15] "{"text":"australian war memorial, canberra, australia"}": score 2 out of 4
+  diff:
+    name
+      expected: Australian War Memorial
+      actual:   Australian Capital Territory
+    locality
+      expected: Campbell
+      actual:   
...
 locality geodisambiguation
-  ✔ [22] "{"text":"Germa","lang":"ru"}"
+  ✘ regression [22] "{"text":"Germa","lang":"ru"}": score 0 out of 2
+  diff:
+    name
+      expected: Германия
+      actual:   Germa
+    country_a
+      expected: DEU
+      actual:   GRC
...
search
-  ✔ [1426636804303:51] "{"text":"4th and King"}"
+  ✘ regression [1426636804303:51] "{"text":"4th and King"}": score 0 out of 6
+  diff:
+    name
+      expected: 4th & King
+      actual:   ۴/۴
+    locality
+      expected: San Francisco
+      actual:   
+    country_a
+      expected: USA
+      actual:   IRN
+    name
+      expected: San Francisco 4th & King Street
+      actual:   ۴/۴
+    locality
+      expected: San Francisco
+      actual:   
+    country_a
+      expected: USA
+      actual:   IRN
...
search_poi
-  ✔ [9] "{"text":"Ohio State University"}"
-  ✔ [10] "{"text":"Antioch University Seattle"}"
-  ✔ [11] "{"text":"Union college, kentucky"}"
+  ✘ regression [9] "{"text":"Ohio State University"}": score 0 out of 2
+  diff:
+    name
+      expected: Ohio State University Lima
+      actual:   Ohio
+    localadmin
+      expected: Bath Township
+      actual:   
+  ✘ regression [10] "{"text":"Antioch University Seattle"}": score 1 out of 2
+  diff:
+    name
+      expected: Antioch University
+      actual:   Seattle
+  ✘ regression [11] "{"text":"Union college, kentucky"}": score 1 out of 2
+  diff:
+    name
+      expected: Union College
+      actual:   Kentucky

@missinglink missinglink force-pushed the pelias_parser branch 2 times, most recently from 07e0296 to e9b68ba on June 4, 2019 12:18
@orangejulius (Member) commented Jun 7, 2019

This PR is showing really good results, especially for intersection queries, where it's a massive improvement over using addressit, which really can't handle them.

Before merging it would be great to get a handle on the performance implications for request time. Here are some things we can look at:

  • Does the new parser have any impact on memory used by the API? I saw the parser now includes a bunch of data from WOF; is that possibly significant? (Looking at my local API process usage it doesn't appear so, but it would be good to check.)
  • Does the parser itself take a noticeable amount of time during any queries? I notice the CLI version of pelias/parser shows that the complete parsing process often takes several milliseconds (but never more than 10). Ideally, we would add the parse time to the output of our summary events in places such as controller/search.js (see the sketch after this list). Then we can get great info on the overall performance of the parser and find any cases where it's slow.
  • The impact on the pelias/loadtest numbers. These tests have recently been shown to not correlate strongly with production performance because they don't cover a huge number of queries, but they can still be useful for spotting large trends or identifying any queries that are now extremely slow before sending live traffic to the new code.
  • Finally, we would generally test a change like this by directing some small percentage of our geocode.earth production traffic towards it. This is a super rich source of real-world info, although we'll have to figure out a good plan here since the behavior is dramatically changed. Furthermore, since this change affects autocomplete, it could be very confusing for users to see results from the master branch for some keypresses but this branch for others.
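
As a rough sketch of the timing idea from the second point above (the parseFn argument, the req.clean usage, and the parse_time_ms key are illustrative assumptions, not the actual controller/search.js code):

// Hypothetical timing wrapper: measure how long a parse takes and attach
// the duration to the request so summary logging could report it.
function timeParse (parseFn, req) {
  const start = process.hrtime.bigint()
  const result = parseFn(req.clean.text)
  // convert elapsed nanoseconds to milliseconds
  req.clean.parse_time_ms = Number(process.hrtime.bigint() - start) / 1e6
  return result
}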

@Joxit (Member) commented Jun 9, 2019

So many improvements since my last acceptance test 😄

                     v3.65.3    pelias_parser
Pass                 544        539
Improvements         18         23 (+5 👍)
Fail                 114        109
Placeholders         0          0
Regressions          40         45 (+5 👎)
Test success rate    94.27%     93.51%

@missinglink (Member, Author) commented Jun 11, 2019

I tested the heap usage and it's actually a lot lower than I expected!

const mem = function () {
  // log the current V8 heap usage in MB
  let used = process.memoryUsage().heapUsed / 1024 / 1024
  console.log(`The script uses approximately ${used} MB\n`)
}

console.log('baseline')
mem()

let AddressParser = require('./parser/AddressParser')
console.log('require')
mem()

let parser = new AddressParser()
console.log('instantiate')
mem()

Output:

baseline
The script uses approximately 4.654182434082031 MB

require
The script uses approximately 6.072669982910156 MB

instantiate
The script uses approximately 36.584625244140625 MB

@Joxit (Member) commented Sep 9, 2019

Hi there,

Just to let you know that I'm using this (mixed with #1268) at @jawg, and I have some feedback.

The peliasQuery analyzer in the must clause may be a bit too restrictive for autocomplete.
When we do autocomplete, just as with search, we may use abbreviations/synonyms.

If I use the full address 20 boulevard saint germain paris france, it's parsed well by our beautiful parser:

{
        "subject": "20 boulevard saint germain",
        "housenumber": "20",
        "street": "boulevard saint germain",
        "locality": "paris",
        "country": "france",
        "admin": "paris france"
}

And we find the correct answer in first position: 20 Boulevard Saint-Germain, Paris, France from OSM (openstreetmap:address:node/715428517). That's good.

But when we do the same search using some synonyms, like 20 boulevard st germain paris france, the parser is still OK:

{
        "subject": "20 boulevard st germain",
        "housenumber": "20",
        "street": "boulevard st germain",
        "locality": "paris",
        "country": "france",
        "admin": "paris france"
}

But we get no results. That's because the street part uses the peliasStreet analyzer (which contains synonyms) in a should clause, combined with the peliasQuery analyzer (which doesn't) in a must clause; see the sketch below the list.

If I replace peliasQuery with peliasStreet, the API gives me the correct answer, but that may not be the solution for this issue.

  • Should we create a custom analyzer which contains synonyms? Side effect: response latency.
  • Or remove the subject from the must clause entirely? Side effect: possibly irrelevant results?
  • Or move the subject from the must clause to the should clause with some boost? Why not.
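
To make the mismatch concrete, the generated query has roughly this shape (an illustration; the exact field names and clauses are assumptions, not the query the API actually builds):

// Rough shape of the problem (illustrative, not the actual generated query):
// the subject goes through the synonym-less peliasQuery analyzer in the must
// clause, so 'st' never expands to 'saint' there, while the street clause in
// should (peliasStreet) does know the synonym.
const query = {
  bool: {
    must: [{
      match: { 'name.default': { query: '20 boulevard st germain', analyzer: 'peliasQuery' } }
    }],
    should: [{
      match_phrase: { 'address_parts.street': { query: 'boulevard st germain', analyzer: 'peliasStreet' } }
    }]
  }
}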

@missinglink missinglink merged commit e11b0c8 into master Oct 1, 2019
@missinglink missinglink deleted the pelias_parser branch October 1, 2019 15:17
@missinglink (Member, Author) commented:

merged!

thanks everyone for your patience! I think this is a big step forward in improving autocomplete and taking more control of how we parse inputs.

🎉 🍾

@Joxit (Member) commented Oct 1, 2019

👏 🚀 🎉

orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Oct 3, 2019
These all appear to pass now thanks to
pelias/api#1287
orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Oct 3, 2019
Unclear why these fail, but it may be due to
pelias/api#1287
@nilsnolde commented:

Hi guys,
I'm about to do some maintenance on our Pelias servers and I'd like to do a long-needed update. Is any config change required to enable the new parser? I guess the textAnalyzer entry is deprecated along with addressit? Thanks for this milestone btw :)

@missinglink (Member, Author) commented:

Hi @nilsnolde, there is no configuration change required.
That textAnalyzer setting was deprecated; see 0a95dd2
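
For reference, the deprecated entry in pelias.json looked roughly like this (a recollection for illustration, safe to delete after upgrading):

{
  "api": {
    "textAnalyzer": "addressit"
  }
}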

@nilsnolde commented:

Oh damn, blind... Thanks Peter. Good rant btw! Totally agreed. Might steal your wording ;)

orangejulius added a commit to pelias/documentation that referenced this pull request Jan 3, 2020
The issue described has been solved by the Pelias Parser in
pelias/api#1287!