Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

834 support string value type in frame #860

Open
wants to merge 51 commits into
base: main
Choose a base branch
from

Conversation

saminbassiri
Copy link
Contributor

This PR closes issue #834 by adding support for reading CSV files containing String values into Frames.

Changes:

  • General Support for String Values in Frames at Kernel Level:

    • String values can now be used as column value types within Frames, enhancing type support.
  • Support for Reading String Values in Frames at Kernel Level:

    • The system now supports reading String values into Frames at the kernel level, ensuring better handling of mixed data types.
  • Support for Reading String Values from CSV Files into Frames in DSL:

    • When reading a CSV file with String values, the value type for string columns should be set to str in the .meta file to ensure proper type recognition.

Testing:

  • New test cases have been added to validate the reading of CSV files containing String values into Frames and constructing Frames with string columns, ensuring correct behavior and performance.

saminbassiri and others added 30 commits September 8, 2024 19:09
- Renamed test CSV and meta data files.
- Removed unnecessary casts of ValueTypeUtils::default_value to the type it already has.
- Renamed ValueTypeUtils::default_value to defaultValue.
- oneHot-kernel
  - Removed superfluous additional convenience functions.
  - Release recoded intermediate.
  - Made index calculation view-aware.
  - Reduced the code duplication of the specializations for std::string and FixedStr16 by factoring out the code into a separate function template.
- Tidied up BinaryOpCode.h.
- Little formatting corrections.
- And some more minor things.
- The FixedStr16 buffer no longer requires a null-terminator.
- This change optimizes memory usage for FixedStr16 value type.
…enseMatrix_of_Strings"

This reverts commit b700c82, reversing
changes made to c11e6fe.
saminbassiri and others added 21 commits October 3, 2024 21:48
-Add EW_UNARY operation upper and lower on c_strings
-Add support for EW_UNARY operation upper and lower on dense matrix of
strings in DSL.
-Add support for EW_UNARY operation upper and lower on strings in DSL.
- Add String values in `mlirTypeForCode` in `TypeOpInterface.cpp` for proper type mapping
- Fix memory deallocation for String columns in the Frame constructor using a custom deleter
- Add String `ValueTypeCode` and `cppNameFor` for recognizing and naming the String type
- Add a test in `api/cli/io` with String values for reading a CSV file into a frame
@auge
Copy link
Contributor

auge commented Oct 15, 2024

Hi @saminbassiri ,

thanks for providing this PR, reading frames with strings now works 😃 !

However, I was following up on a small example use-case using SQL on dataframes, and this crashes. I'm not sure, but I guess it's still related to string processing.
Problems which are related to the SQL function can IMHO be addressed in a separate issue/PR.

data.csv and the corresponding .meta file can be taken from #834

df = readFrame("data.csv");

// works (even with slicing 😄 )
print(df[0:5,:]);

registerView("data", df);

// crashes
print(sql("SELECT data.temp FROM data WHERE data.temp>30.0; "));

// does not compile
#print(sql("SELECT data.city, data.temp FROM data WHERE data.temp>30; "));
#print(sql("SELECT data.city, MIN(data.temp), AVG(data.temp), MAX(data.temp) FROM data GROUP BY data.city ;"));

attached is the complete output of above script (I compiled your PR on my own using the daphne build container) - it starts with printing the first 5 sliced rows…:

Frame(5x2, [city:std::string, temp:double])
Algiers 3.4
St. John's 4.3
Dodoma 26.3
Toliara 17
Yellowknife 4
free(): invalid pointer
bin/daphne(+0x148b862)[0x561ff7b98862]
/lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x7f513a503320]
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x7f513a55cb1c]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x7f513a50326e]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x7f513a4e68ff]
/lib/x86_64-linux-gnu/libc.so.6(+0x297b6)[0x7f513a4e77b6]
/lib/x86_64-linux-gnu/libc.so.6(+0xa8fe5)[0x7f513a566fe5]
/lib/x86_64-linux-gnu/libc.so.6(+0xab37c)[0x7f513a56937c]
/lib/x86_64-linux-gnu/libc.so.6(__libc_free+0x7e)[0x7f513a56bd9e]
/home/ubuntu/daphne/bin/../lib/libAllKernels.so(+0x7892e5)[0x7f51355212e5]
/home/ubuntu/daphne/bin/../lib/libAllKernels.so(+0x92439b)[0x7f51356bc39b]
/home/ubuntu/daphne/bin/../lib/libAllKernels.so(_decRef__Structure+0x9e)[0x7f513568664e]
[0x7f513ac312ef]
[0x7f513ac313fd]
[0x7f513ac3141d]
bin/daphne(+0x1e05215)[0x561ff8512215]
bin/daphne(+0x14ad133)[0x561ff7bba133]
bin/daphne(+0x14b25ad)[0x561ff7bbf5ad]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7f513a4e81ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7f513a4e828b]
bin/daphne(+0x148a325)[0x561ff7b97325]
[error]: Got an abort signal from the execution engine. Most likely an exception in a shared library. Check logs!
Execution error: Returning from signal 6

Also note that the last 2 lines do not compile, as daphne is not able to properly infer the string type:

[error]: Lowering pipeline error.{}
PassManager failed module lowering, responsible IR written to module_fail.log.
RewriteToCallKernelOpPass failed with the following message [ no kernel for operation `cast` available for the required input types `(!daphne.Frame<?x[?: !daphne.Unknown], ?>)` and output types `(!daphne.Matrix<?x?x!daphne.String>)` for backend `CPP`, registered kernels for this op:

when the dataframe had only numbers (instead of strings in the 1st column), the SQL function works.

KR, Benjamin

@pdamme pdamme self-requested a review October 17, 2024 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants