Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cudf::strings::find_re API #16742

Merged
merged 26 commits into from
Oct 3, 2024

Conversation

davidwendt
Copy link
Contributor

Description

Adds the cudf::strings::find_re and str.find_re API to libcudf/pylibcudf/cudf. This function returns the character position where the pattern first matches in each row of the input column. If a match is not found, -1 is returned for that corresponding row.

Closes #16729

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) non-breaking Non-breaking change labels Sep 4, 2024
@davidwendt davidwendt self-assigned this Sep 4, 2024
@github-actions github-actions bot added the Python Affects Python cuDF API. label Sep 4, 2024
@davidwendt davidwendt changed the base branch from branch-24.10 to branch-24.12 September 23, 2024 23:32
@github-actions github-actions bot added the pylibcudf Issues specific to the pylibcudf package label Sep 23, 2024
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Sep 26, 2024
@davidwendt davidwendt marked this pull request as ready for review September 26, 2024 18:15
@davidwendt davidwendt requested review from a team as code owners September 26, 2024 18:15
Copy link
Contributor

@Matt711 Matt711 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a pylibcudf test for find_re?

python/cudf/cudf/core/column/string.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about stream test for the new API?

Copy link
Contributor

@Matt711 Matt711 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding a test. I'll wait to approve until we decide whether to seperate find_re and find_all or combine them under a common file name as per this comment.

python/cudf/cudf/core/column/string.py Show resolved Hide resolved
python/cudf/cudf/core/column/string.py Outdated Show resolved Hide resolved
python/pylibcudf/pylibcudf/strings/findall.pyx Outdated Show resolved Hide resolved
python/pylibcudf/pylibcudf/strings/findall.pyx Outdated Show resolved Hide resolved
Copy link
Contributor

@Matt711 Matt711 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving python changes

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question about the pylibcudf API / code location. Otherwise LGTM.

cpp/src/strings/search/findall.cu Show resolved Hide resolved
@@ -38,3 +38,35 @@ cpdef Column findall(Column input, RegexProgram pattern):
)

return Column.from_libcudf(move(c_result))


cpdef Column find_re(Column input, RegexProgram pattern):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this go in a new file, or should this file be renamed? It seems incorrect to have find_re defined in the findall file, since they're different algorithms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was told to keep match these with the header files. So find_re and findall are both declared in findall.hpp. They are common only in that both use regex I suppose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize find_re was in findall.hpp. Then yes, this is fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(But it does raise the question of why find_re is in findall.hpp -- it doesn't "find all")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC they will be combined into a common file as per https://github.com/rapidsai/cudf/pull/16742/files#r1781466464

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The find_re has more in common with findall than just find based on the parameters and how they are used just in not what they return.

Perhaps find_re() would be better in contains.hpp which has contains_re()
Since find.hpp has contains() and find() (among other similar functions)
So contains.hpp would have contains_re() and find_re().
I could do that in a separate PR since this one is already approved.

@davidwendt
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 3faa3ee into rapidsai:branch-24.12 Oct 3, 2024
102 checks passed
@davidwendt davidwendt deleted the fea-strings-find-re branch October 3, 2024 12:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API. strings strings issues (C++ and Python)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[FEA] Method to return starting index of regex match
4 participants