Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement concat_json to join JSON strings given by strings column #2457

Open
wants to merge 23 commits into
base: branch-24.10
Choose a base branch
from

Conversation

ttnghia
Copy link
Collaborator

@ttnghia ttnghia commented Sep 30, 2024

This adds concat_json to join multiple JSON strings given by strings column into one unified JSON string. Along with the concatenated data buffer, the result also contains a vector that indicates whether each row in the input is null or empty, and the delimiter character used for concatenation.

karthikeyann and others added 21 commits August 29, 2024 23:35
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
# Conflicts:
#	thirdparty/cudf
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
@ttnghia ttnghia self-assigned this Sep 30, 2024
@ttnghia
Copy link
Collaborator Author

ttnghia commented Sep 30, 2024

build

@ttnghia
Copy link
Collaborator Author

ttnghia commented Oct 1, 2024

build

@ttnghia
Copy link
Collaborator Author

ttnghia commented Oct 1, 2024

build

@ttnghia ttnghia mentioned this pull request Oct 1, 2024
1 task

namespace detail {

constexpr bool not_whitespace(cudf::char_utf8 ch) { return ch > ' '; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

white space for JSON is exactly ' ' || '\r' || '\n" || '\t'

See https://www.json.org/json-en.html


while (itr < end) {
cudf::char_utf8 ch = 0;
auto const chr_width = cudf::strings::detail::to_char_utf8(itr, ch);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to parse it as UTF-8. We don't do it anywhere else in the JSON parsing code.

auto constexpr upper_level = std::numeric_limits<char>::max();
auto const num_chars = input.chars_size(stream);

rmm::device_uvector<uint32_t> d_histogram(num_levels, stream);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tested the histogram code with a \n in it? When I did it the delimiter it returned was still \n and I don't know why. I think it would be better to write our own that is just a reduction into a 256-bit value that is a bitmask for which bytes showed up in the data. No need for a count because all we want is a boolean is it there or not value.

return {first_non_existing_char, true};
};

// Firstly, search from the `\n` character to the end of the histogram,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if the issues with not using \n as the delimiter are solved I would rather have us just start at 0 instead of trying really hard to use \n as the delimiter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants