Implement phone number analyzer #15915
base: main
74429fe to d844ea9
❌ Gradle check result for 74429fe: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for d844ea9: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
d844ea9 to 24e60a5
❌ Gradle check result for 24e60a5: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
24e60a5 to f7669e2
❕ Gradle check result for f7669e2: UNSTABLE. Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

Coverage Diff

| Metric | main | #15915 | +/- |
|---|---|---|---|
| Coverage | 71.87% | 71.88% | +0.01% |
| Complexity | 64285 | 64329 | +44 |
| Files | 5278 | 5282 | +4 |
| Lines | 300833 | 300904 | +71 |
| Branches | 43473 | 43482 | +9 |
| Hits | 216224 | 216312 | +88 |
| Misses | 66812 | 66802 | -10 |
| Partials | 17797 | 17790 | -7 |
this is a flaky test: #14304 and the failure of the "mend security check" also seems to be random (but i don't have the rights to re-trigger it)
this should now be ready for review 🚀
Thanks! Probably @msfroh is our best reviewer for this.
could someone please add the backport 2.x label and re-run the changelog verifier? thanks!
libs/core/src/test/java/org/opensearch/core/common/StringsTests.java
Thanks @rursprung, I feel strongly that we should have it under a separate plugin (`analysis-phonenumber`, for example) and keep `analysis-common` clean (not everyone needs this functionality).
@reta: i put it in
@reta -- do you mean a separate plug-in in a separate repo? Or just a separate plug-in in the core repo? In my opinion, the overhead of maintaining a separate plugin build/distribution is not worth it for a single analyzer class of ~100 lines.
inspiration taken from [this SO answer][SO].

note that the stream is not parallelised, to avoid the setup overhead: the method is intended to be called primarily with shorter strings, where the time to set up a parallel stream would be longer than the actual check.

[SO]: https://stackoverflow.com/a/35150400

Signed-off-by: Ralph Ursprung <[email protected]>
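The commit message above does not name the method or say what it checks, so the following is only a minimal sketch of the pattern it describes: a sequential (non-parallel) stream over the characters of a short string. The class name `DigitCheckSketch`, the method name `allDigits`, the digits-only predicate and the handling of `null` and empty input are all assumptions made for illustration, not the code from this PR.

```java
/**
 * Illustration only: a sequential character check in the style the
 * commit message describes. Names and the digit predicate are assumed.
 */
final class DigitCheckSketch {

    private DigitCheckSketch() {}

    /**
     * Returns true if the string is non-empty and every character is a digit.
     * String.chars() yields a sequential IntStream; parallel() is deliberately
     * not called because the inputs are expected to be short.
     */
    static boolean allDigits(String s) {
        if (s == null || s.isEmpty()) {
            return false; // assumption of this sketch: empty input does not count
        }
        return s.chars().allMatch(Character::isDigit);
    }
}
```

`allMatch` short-circuits on the first non-digit, so for short inputs the sequential version does very little work and a parallel stream would mostly add setup cost.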
this is largely based on [elasticsearch-phone] and internally uses [libphonenumber].

this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods].

this allows defining the region to be used when analysing a phone number. so far only the generic "unknown" region (`ZZ`) had been used, which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`).

example configuration for an index:

```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```

this creates four analyzers: `phone` and `phone-search`, which do not explicitly specify a region and thus fall back to `ZZ` (unknown region; the regional version of the international dialing prefix, e.g. `00` instead of `+` in most of europe, will not be recognised), and `phone-ch` and `phone-search-ch`, which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored). it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values.

closes opensearch-project#11326

[elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone
[libphonenumber]: https://github.com/google/libphonenumber
[falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>
Thanks @msfroh, yes, I mean a separate plug-in in the core repo (under
f7669e2 to 2b8f1d1
My apologies @rursprung, I missed that but would have commented there. If folks think
Between analysis-common and a separate plugin in this repo, I don't have strong opinions. I definitely agree that it's not likely to be widely used, so a separate plugin in this repo gives people the choice to install it if they want it.
wouldn't the separate plugin in this repo cause quite a bit of overhead for the release process? (i'm not familiar with that part but would presume that it'd have to be patched in at several places to make sure that it gets published?) having it in
No, we build and release the core as a whole
It would be the same plugin as any other bundled with core and installable with
❕ Gradle check result for 2b8f1d1: UNSTABLE. Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
you say "bundled" - does that mean that it'll be part of the standard (or even
it would be a plugin which is not installed by default, has to be installed and uninstalled explicitly when there is a need to use it.
oh 🙁 so, is it the final consensus of the reviewers that i should move it to a new
+1 from me, thanks @rursprung
Description
this is largely based on elasticsearch-phone and internally uses libphonenumber.

this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list falsehoods programmers believe about phone numbers.

this allows defining the region to be used when analysing a phone number. so far only the generic "unknown" region (`ZZ`) had been used, which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`).

example configuration for an index (the JSON is given in the commit message above):

this creates four analyzers: `phone` and `phone-search`, which do not explicitly specify a region and thus fall back to `ZZ` (unknown region; the regional version of the international dialing prefix, e.g. `00` instead of `+` in most of europe, will not be recognised), and `phone-ch` and `phone-search-ch`, which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored). it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values.
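To make the region behaviour described above concrete, here is a rough, stdlib-only sketch of why `+4158...`, `004158...` and `058...` only line up once a region is known. It is not libphonenumber and not the analyzer from this PR: the class `PhoneRegionSketch`, the one-entry region table and the sample number are hypothetical simplifications, and real-world parsing has many more rules than this.

```java
import java.util.Map;

/**
 * Illustration only: a hand-rolled stand-in for region-aware parsing.
 * The real analyzer delegates to libphonenumber; the tiny region table
 * below is a hypothetical simplification, not its data or its API.
 */
final class PhoneRegionSketch {

    /** Hypothetical per-region dialing rules (assumed values for this demo). */
    private static final class Region {
        final String internationalPrefix;
        final String nationalPrefix;
        final String countryCode;

        Region(String internationalPrefix, String nationalPrefix, String countryCode) {
            this.internationalPrefix = internationalPrefix;
            this.nationalPrefix = nationalPrefix;
            this.countryCode = countryCode;
        }
    }

    // Only "CH" is known here; any other code (including "ZZ") means "no region".
    private static final Map<String, Region> REGIONS = Map.of(
        "CH", new Region("00", "0", "41")
    );

    private PhoneRegionSketch() {}

    /**
     * Normalises a number to "+<country code><national number>", or returns
     * null when the region does not give enough context to do so.
     */
    static String normalize(String raw, String regionCode) {
        // Keep digits and '+', drop spaces, dashes and other separators.
        String s = raw.replaceAll("[^0-9+]", "");
        if (s.startsWith("+")) {
            return s; // already international, resolvable without a region
        }
        Region region = REGIONS.get(regionCode);
        if (region == null) {
            return null; // unknown region: "00..." and local numbers stay ambiguous
        }
        // Check the international prefix first, since "00" also starts with "0".
        if (s.startsWith(region.internationalPrefix)) {
            return "+" + s.substring(region.internationalPrefix.length());
        }
        if (s.startsWith(region.nationalPrefix)) {
            return "+" + region.countryCode + s.substring(region.nationalPrefix.length());
        }
        return "+" + region.countryCode + s;
    }
}
```

With the made-up number `58 123 45 67`, `normalize("+41 58 123 45 67", "ZZ")`, `normalize("0041 58 123 45 67", "CH")` and `normalize("058 123 45 67", "CH")` all give the same string, while `normalize("0041 58 123 45 67", "ZZ")` gives `null`, which mirrors the mismatch the description reports for the `ZZ` fallback.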
closes #11326
Signed-off-by: Ralph Ursprung [email protected]
Related Issues
Resolves #11326
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.