Implement phone number analyzer #15915

Merged

Commits on Oct 3, 2024

  1. add Strings#isDigits API

    inspiration taken from [this SO answer][SO].
    
    note that the stream is deliberately not parallelised: the method is
    intended to be called primarily with short strings, where setting up a
    parallel stream would take longer than the actual check.
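    
    a minimal sketch of the kind of check described above, assuming a
    sequential `chars()` stream as in the linked SO answer. the class name
    `DigitCheck` is hypothetical and the handling of `null` and empty input
    is an assumption; the actual `Strings#isDigits` may differ.
    
    ```java
    // Hypothetical stand-in for the Strings#isDigits helper described above.
    final class DigitCheck {
        private DigitCheck() {}
    
        // Returns true only if the string is non-empty and every char is a digit.
        // Assumption: null and empty strings are treated as "not digits".
        static boolean isDigits(final String s) {
            // Sequential stream on purpose: inputs are expected to be short,
            // so a parallel stream would cost more to set up than it saves.
            return s != null && !s.isEmpty() && s.chars().allMatch(Character::isDigit);
        }
    
        public static void main(String[] args) {
            System.out.println(isDigits("0041581234567"));
            System.out.println(isDigits("+41581234567"));
        }
    }
    ```
    
    note that `Character::isDigit` also accepts non-ASCII unicode digits; a
    stricter ASCII-only check would compare each char against `'0'..'9'`.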
    
    [SO]: https://stackoverflow.com/a/35150400
    
    Signed-off-by: Ralph Ursprung <[email protected]>
    rursprung committed Oct 3, 2024
    c09d348
  2. add phone & phone-search analyzer + tokenizer

    this is largely based on [elasticsearch-phone] and internally uses
    [libphonenumber].
    this intentionally only ports a subset of the features: only `phone` and
    `phone-search` are supported right now, `phone-email` can be added
    if/when there's a clear need for it.
    
    using `libphonenumber` is required since parsing phone numbers is a
    non-trivial task (even though it might seem trivial at first glance!),
    as can be seen in the list [falsehoods programmers believe about phone
    numbers][falsehoods].
    
    this allows defining the region to be used when analysing a phone
    number. so far only the generic "unknown" region (`ZZ`) had been used,
    which worked as long as international numbers were prefixed with `+` but
    did not work for numbers in a local format (e.g. a number stored as
    `+4158...` was not matched against a number entered as `004158...` or
    `058...`).
    
    example configuration for an index:
    ```json
    {
      "index": {
        "analysis": {
          "analyzer": {
            "phone": {
              "type": "phone"
            },
            "phone-search": {
              "type": "phone-search"
            },
            "phone-ch": {
              "type": "phone",
              "phone-region": "CH"
            },
            "phone-search-ch": {
              "type": "phone-search",
              "phone-region": "CH"
            }
          }
        }
      }
    }
    ```
    this creates four analyzers:
    - `phone` and `phone-search` do not explicitly specify a region and thus
      fall back to `ZZ` (unknown region). with `ZZ`, the regional form of the
      international dialing prefix (e.g. `00` instead of `+` in most of
      europe) will not be recognised.
    - `phone-ch` and `phone-search-ch` will try to parse the phone number as
      a swiss phone number (thus e.g. `00` as a prefix is recognised as the
      international dialing prefix).
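    
    a hedged sketch of how such analyzers might be attached to a field: the
    field name `phone_number` is hypothetical, and pairing `phone-ch` at
    index time with `phone-search-ch` at search time is an assumption based
    on the analyzer names, not something stated in this commit.
    ```json
    {
      "mappings": {
        "properties": {
          "phone_number": {
            "type": "text",
            "analyzer": "phone-ch",
            "search_analyzer": "phone-search-ch"
          }
        }
      }
    }
    ```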
    
    note that the analyzer is (currently) not meant to find phone numbers in
    large text documents. instead it should be used on fields which contain
    just the phone number (though extra text will be ignored). it collects
    the whole content of the field into a `String` in memory, which makes it
    unsuitable for large field values.
    
    this has been implemented in a new plugin which is however part of the
    central opensearch repository as it was deemed too big an overhead to
    have it in a separate repository but not important enough to bundle it
    directly in `analysis-common` (see the discussion on the issue and the
    PR for further details).
    
    note that the new plugin has been added to the exclude list of the
    javadoc check as this check is overzealous and also complains in many
    cases where it shouldn't (e.g. on overridden methods - which it should
    theoretically not do - or constructors which don't even exist). the
    check first needs to be improved before this exclusion could be removed.
    
    closes opensearch-project#11326
    
    [elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone
    [libphonenumber]: https://github.com/google/libphonenumber
    [falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md
    
    Signed-off-by: Ralph Ursprung <[email protected]>
    rursprung committed Oct 3, 2024
    a3ac6dc