IPV6 recognizer not working properly #907

Closed
morrissharp opened this issue Aug 16, 2022 · 2 comments · Fixed by #958
Labels
analyzer, bug (Something isn't working), good first issue (Good for newcomers)

Comments

@morrissharp

I was trying to use Presidio to identify and remove IP addresses and ran into the following issue: the string '::' was recognized as an IP address, while '2345:0425:2CA1:0000:0000:0567:5673:23b5' was not. I ran a couple of tests:

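from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

# A bare '::' is reported as an IP address
results = analyzer.analyze(text='::',
                           entities=['IP_ADDRESS'],
                           language='en')
print(results)

# A full, uncompressed IPv6 address is not recognized
results2 = analyzer.analyze(text='2345:0425:2CA1:0000:0000:0567:5673:23b5',
                            entities=['IP_ADDRESS'],
                            language='en')
print(results2)

# A compressed IPv6 address is only partially matched
results3 = analyzer.analyze(text='2345:0425:2CA1::0567:5673:23b5',
                            entities=['IP_ADDRESS'],
                            language='en')
print(results3)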

Output:

[type: IP_ADDRESS, start: 0, end: 2, score: 0.6]
[]
[type: IP_ADDRESS, start: 13, end: 30, score: 0.6]

This suggests the recognizer treats any text containing two consecutive colons as an IPv6 address. I then checked the source code and found this comment in the tests:

# IPv6 tests TODO IPv6 regex needs to be fixed

Can the IPv6 regex be fixed?

@omri374
Contributor

omri374 commented Aug 22, 2022

Thanks for raising this. We'd be happy to review a PR if you're interested in contributing.

omri374 added the bug, good first issue, and analyzer labels on Aug 22, 2022
@SharonHart
Contributor

It seems this was broken in #312.
I opened a PR to fix the regex, although it is still not optimal (see the comments on the test scenarios).
IMO the recognizer should be switched to Python's built-in ipaddress module for a much simpler implementation:
https://docs.python.org/3/library/ipaddress.html?highlight=ipaddress#convenience-factory-functions

@omri374 Your thoughts?
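
As a rough illustration of the standard-library approach suggested above, the sketch below checks the three strings from the original report with `ipaddress.ip_address`. The helper name `is_ip_address` and the extra `not:an:address` sample are made up for this example; this is not Presidio code, and it only validates a candidate string that has already been extracted, it does not locate addresses inside longer text.

```python
import ipaddress


def is_ip_address(candidate: str) -> bool:
    """Return True if candidate parses as an IPv4 or IPv6 address.

    Hypothetical helper for illustration only; not part of Presidio.
    """
    try:
        ipaddress.ip_address(candidate)
    except ValueError:
        return False
    return True


samples = [
    "::",                                        # from the report
    "2345:0425:2CA1:0000:0000:0567:5673:23b5",   # from the report
    "2345:0425:2CA1::0567:5673:23b5",            # from the report
    "not:an:address",                            # extra negative case
]

for sample in samples:
    print(sample, is_ip_address(sample))
```

Both full addresses from the report parse successfully here. Note that `::` also parses, since it is the IPv6 unspecified address; a recognizer that should not flag it would need an additional check, for example on `ipaddress.ip_address(candidate).is_unspecified`.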

SharonHart linked a pull request on Nov 29, 2022 that will close this issue