Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement phone number analyzer #15915

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

rursprung
Copy link
Contributor

@rursprung rursprung commented Sep 12, 2024

Description

this is largely based on elasticsearch-phone and internally uses
libphonenumber.
this intentionally only ports a subset of the features: only phone and
phone-search are supported right now, phone-email can be added
if/when there's a clear need for it.

using libphonenumber is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list falsehoods programmers believe about phone
numbers
.

this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (ZZ) had been used
which worked as long as international numbers were prefixed with + but
did not work when using local numbers (e.g. a number stored as
+4158... was not matched against a number entered as 004158... or
058...).

example configuration for an index:

{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}

this creates four analyzers: phone and phone-search which do not
explicitly specify a region and thus fall back to ZZ (unknown region,
regional version of international dialing prefix (e.g. 00 instead of
+ in most of europe) will not be recognised) and phone-ch and
phone-search-ch which will try to parse the phone number as a swiss
phone number (thus e.g. 00 as a prefix is recognised as the
international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a String in memory,
making it unsuitable for large field values.

closes #11326

Signed-off-by: Ralph Ursprung [email protected]

Related Issues

Resolves #11326

Check List

  • Functionality includes testing. TODO: do some (more) manual testing
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable. -- TODO if general PR is ok

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request Search:Relevance labels Sep 12, 2024
@rursprung rursprung force-pushed the implement-phone-number-analyzer branch from 74429fe to d844ea9 Compare September 12, 2024 16:56
Copy link
Contributor

❌ Gradle check result for 74429fe: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for d844ea9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@rursprung rursprung force-pushed the implement-phone-number-analyzer branch from d844ea9 to 24e60a5 Compare September 13, 2024 13:06
Copy link
Contributor

❌ Gradle check result for 24e60a5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@rursprung rursprung force-pushed the implement-phone-number-analyzer branch from 24e60a5 to f7669e2 Compare September 13, 2024 13:30
Copy link
Contributor

❕ Gradle check result for f7669e2: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.gateway.RecoveryFromGatewayIT.testShardStoreFetchMultiNodeMultiIndexesUsingBatchAction

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Copy link

codecov bot commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 97.18310% with 2 lines in your changes missing coverage. Please review.

Project coverage is 71.88%. Comparing base (330b249) to head (2b8f1d1).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
...arch/analysis/common/PhoneNumberTermTokenizer.java 97.87% 0 Missing and 1 partial ⚠️
...alysis/common/PhoneNumberTermTokenizerFactory.java 80.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #15915      +/-   ##
============================================
+ Coverage     71.87%   71.88%   +0.01%     
- Complexity    64285    64329      +44     
============================================
  Files          5278     5282       +4     
  Lines        300833   300904      +71     
  Branches      43473    43482       +9     
============================================
+ Hits         216224   216312      +88     
+ Misses        66812    66802      -10     
+ Partials      17797    17790       -7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rursprung
Copy link
Contributor Author

rursprung commented Sep 13, 2024

testShardStoreFetchMultiNodeMultiIndexesUsingBatchAction

❕ Gradle check result for f7669e2: UNSTABLE

* **TEST FAILURES:**
      1 org.opensearch.gateway.RecoveryFromGatewayIT.testShardStoreFetchMultiNodeMultiIndexesUsingBatchAction

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

this is a flaky test: #14304

and the failure of the "mend security check" also seems to be random (but i don't have the rights to re-trigger it)

@rursprung
Copy link
Contributor Author

this should now be ready for review 🚀
CC a couple of people who were involved in the discussions (in no particular order):
@reta (for the things we discussed on slack about Strings & Streams)
@msfroh (i now ended up not using RemoveDuplicatesTokenFilter as using a Set internally remains much simpler and means that the tokenizer already returns unique tokens plus i ran into the error first position increment must be > 0 (got 0) for field 'phone' while trying to get it to work)
@dblock
@timsmithgenesys (as discussed in #11326 this is based on the old ES plugin from your company - thanks again for licensing it as Apache 2.0! ❤️)
@macohen (originally pointed out the old ES plugin)

@dblock
Copy link
Member

dblock commented Sep 13, 2024

Thanks! Probably @msfroh is our best reviewer for this.

@rursprung
Copy link
Contributor Author

could someone please add the backport 2.x label and re-run the changelog verifier? thanks!

@reta reta added the backport 2.x Backport to 2.x branch label Sep 13, 2024
Copy link
Collaborator

@reta reta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rursprung , I feel strongly that we should have it under separate plugin (analysis-phonenumber fe) and keep the analysis-common clean (not everyone needs this functionality)

@rursprung
Copy link
Contributor Author

Thanks @rursprung , I feel strongly that we should have it under separate plugin (analysis-phonenumber fe) and keep the analysis-common clean (not everyone needs this functionality)

@reta: i put it in analysis-common based on @msfroh's comment here: #11326 (comment) (seconded also by @dblock). i'm happy to move things around if this is the general consensus but would like to avoid doing that a few more times :)

@msfroh
Copy link
Collaborator

msfroh commented Sep 13, 2024

@reta -- do you mean a separate plug-in in a separate repo? Or just a separate plug-in in the core repo?

In my opinion, the overhead of maintaining a separate plugin build/distribution is not worth it for a single analyzer class of ~100 lines.

inspiration taken from [this SO answer][SO].

note that the stream is not parallelised to avoid the overhead of this
as the method is intended to be called primarily with shorter strings
where the time to set up would take longer than the actual check.

[SO]: https://stackoverflow.com/a/35150400

Signed-off-by: Ralph Ursprung <[email protected]>
this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list [falsehoods programmers believe about phone
numbers][falsehoods].

this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```
this creates four analyzers: `phone` and `phone-search` which do not
explicitly specify a region and thus fall back to `ZZ` (unknown region,
regional version of international dialing prefix (e.g. `00` instead of
`+` in most of europe) will not be recognised) and `phone-ch` and
`phone-search-ch` which will try to parse the phone number as a swiss
phone number (thus e.g. `00` as a prefix is recognised as the
international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a `String` in memory,
making it unsuitable for large field values.

closes opensearch-project#11326

[elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone
[libphonenumber]: https://github.com/google/libphonenumber
[falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>
@reta
Copy link
Collaborator

reta commented Sep 13, 2024

separate plug-in in the core repo?

Thanks @msfroh , yes, I mean a separate plug-in in the core repo (under plugins), it is indeed small enough.

@rursprung rursprung force-pushed the implement-phone-number-analyzer branch from f7669e2 to 2b8f1d1 Compare September 13, 2024 15:53
@reta
Copy link
Collaborator

reta commented Sep 13, 2024

@reta: i put it in analysis-common based on @msfroh's comment here: #11326 (comment) (seconded also by @dblock). i'm happy to move things around if this is the general consensus but would like to avoid doing that a few more times :)

My apologies @rursprung , I missed that but would have commented there. If folks think analysis-common is the place - no objection (but in my option, it does not belong here, the phone numbers are relevant to small percentage of the applications and is not common case).

@msfroh
Copy link
Collaborator

msfroh commented Sep 13, 2024

Between analysis-common and a separate plugin in this repo, I don't have strong opinions. I definitely agree that it's not likely to be widely used, so a separate plugin in this repo gives people the choice to install it if they want it.

@rursprung
Copy link
Contributor Author

wouldn't the separate plugin in this repo cause quite a bit of overhead for the release process? (i'm not familiar with that part but would presume that it'd have to be patched in at several places to make sure that it gets published?)

having it in analysis-common would have one major benefit for me: it'd become available on all public cloud offerings (once they upgrade to that release) - and i need it on at least some of them 🙂.
also, i don't think that it's a less common use-case than many other analyzers currently in the analysis-common package and my gut feeling (without having done a market survey) is that it'll be used once it's available (and communicated) since there are many use-cases where you're ingesting phone numbers, e.g. as part of contact data (true, you might not need this for a web shop).

@reta
Copy link
Collaborator

reta commented Sep 13, 2024

wouldn't the separate plugin in this repo cause quite a bit of overhead for the release process? (i'm not familiar with that part but would presume that it'd have to be patched in at several places to make sure that it gets published?)

No, we build and release the core as a whole

having it in analysis-common would have one major benefit for me: it'd become available on all public cloud offerings (once they upgrade to that release) - and i need it on at least some of them 🙂.

It would be the same plugin as any other bundled with core and installable with opensearch-plugin tool. analysis-common is a module which means you cannot uninstall it, want it or not. For example, assume there is CVE published against libphonenumber. The mitigation with plugin is easy - uninstall it till fix available, with module it is impossible.

Copy link
Contributor

❕ Gradle check result for 2b8f1d1: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@rursprung
Copy link
Contributor Author

It would be the same plugin as any other bundled with core and installable with opensearch-plugin tool. analysis-common is a module which means you cannot uninstall it, want it or not. For example, assume there is CVE published against libphonenumber. The mitigation with plugin is easy - uninstall it till fix available, with module it is impossible.

you say "bundled" - does that mean that it'll be part of the standard (or even min?) opensearch distribution by default if it's a plugin in this repo and it can just optionally be uninstalled rather than being something which has to be installed explicitly?

@reta
Copy link
Collaborator

reta commented Sep 13, 2024

it's a plugin in this repo and it can just optionally be uninstalled rather than being something which has to be installed explicitly?

it would be a plugin which is not installed by default, has to be installed and uninstalled explicitly when there is a need to use it.

@rursprung
Copy link
Contributor Author

it would be a plugin which is not installed by default, has to be installed and uninstalled explicitly when there is a need to use it.

oh 🙁
well, i guess it's the same as an external plugin with the advantage that it's in this repo, making maintenance easier (releasing & publishing happens out of the box).

so, is it the final consensus of the reviewers that i should move it to a new analysis-phonenumber plugin within this repo? if so, i'll take care of it next week

@reta
Copy link
Collaborator

reta commented Sep 20, 2024

so, is it the final consensus of the reviewers that i should move it to a new analysis-phonenumber plugin within this repo? if so, i'll take care of it next week

+1 from me, thanks @rursprung

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch enhancement Enhancement or improvement to existing feature or request Search:Relevance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RFC] new plugin with normalizer & analyzer for phone numbers
4 participants