Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cudf::strings::findall_record API #9911

Merged
merged 17 commits into from
Jan 27, 2022

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Dec 15, 2021

Reference #9856 specifically #9856 (comment)

Adds cudf::strings::findall_record which was initially implemented in nvstrings but not ported over since LIST column types did not exist at the time and returning a vector of small columns was very inefficient. This API should also allow using the current python function cudf.str.findall() with the expand=False parameter more effectively. A follow-on PR will address these python changes.

This PR reorganizes the libcudf strings find source files into the cpp/src/strings/search subdirectory as well. Also, findall() has only a regex version so the _re suffix is dropped from the name in the libcudf implementation. The python changes in this PR address only the name change and the addition of the new API in the cython interface.

Depends on #9909 -- shares the cudf::strings::detail::count_matches() utility function.

@davidwendt davidwendt added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) non-breaking Non-breaking change labels Dec 15, 2021
@davidwendt davidwendt self-assigned this Dec 15, 2021
@github-actions github-actions bot added CMake CMake build issue Python Affects Python cuDF API. labels Dec 15, 2021
@codecov
Copy link

codecov bot commented Dec 15, 2021

Codecov Report

Merging #9911 (1f02e97) into branch-22.04 (e24fa8f) will increase coverage by 0.05%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.04    #9911      +/-   ##
================================================
+ Coverage         10.37%   10.43%   +0.05%     
================================================
  Files               119      119              
  Lines             20149    20590     +441     
================================================
+ Hits               2091     2148      +57     
- Misses            18058    18442     +384     
Impacted Files Coverage Δ
python/cudf/cudf/errors.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
... and 60 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 85109e6...1f02e97. Read the comment docs.

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jan 6, 2022
@davidwendt davidwendt marked this pull request as ready for review January 6, 2022 18:09
@davidwendt davidwendt requested review from a team as code owners January 6, 2022 18:09
Copy link
Contributor

@robertmaynard robertmaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake changes LGTM

@vyasr
Copy link
Contributor

vyasr commented Jan 20, 2022

@davidwendt this appears to have fallen through the cracks on my part, apologies. Is this a must for 22.02?

@davidwendt
Copy link
Contributor Author

... this appears to have fallen through the cracks on my part, apologies. Is this a must for 22.02?

I don't think so. I'll move it.

@davidwendt davidwendt changed the base branch from branch-22.02 to branch-22.04 January 20, 2022 14:45
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to mark this as breaking since it's changing the name of a public API (findall_re -> findall).
Also, I tried to fix as many copyright years as I could but I probably missed a few (especially those that weren't part of the diff like the copyright line in contains.cpp).

cpp/include/cudf/strings/findall.hpp Show resolved Hide resolved
cpp/src/strings/count_matches.hpp Show resolved Hide resolved
cpp/src/strings/search/findall.cu Show resolved Hide resolved
cpp/src/strings/search/findall_record.cu Outdated Show resolved Hide resolved
cpp/src/strings/search/findall_record.cu Outdated Show resolved Hide resolved
cpp/src/strings/search/findall_record.cu Show resolved Hide resolved
cpp/tests/strings/findall_tests.cpp Show resolved Hide resolved
python/cudf/cudf/_lib/cpp/strings/findall.pxd Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/cpp/strings/findall.pxd Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/strings/findall.pyx Show resolved Hide resolved
@davidwendt davidwendt requested a review from vyasr January 26, 2022 20:12
@davidwendt davidwendt added breaking Breaking change and removed non-breaking Non-breaking change labels Jan 26, 2022
@davidwendt
Copy link
Contributor Author

@gpucibot merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team breaking Breaking change CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants