Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrap GMT's standard data type GMT_DATASET for table inputs #2729

Merged
merged 124 commits into from
Mar 7, 2024

Conversation

seisman
Copy link
Member

@seisman seisman commented Oct 9, 2023

Description of proposed changes

Similar to #2398, this PR contains some experimental codes to support GMT's standard data type GMT_DATASET for table inputs.

Here are some technical background. In GMT, table inputs are stored in a GMT_DATASET object. A GMT_DATASET object can contain several table files (stored in GMT_DATATABLE) and each table can contain several segments (stored in GMT_DATASEGMENT), and each data segment have M rows and N columns.

References: https://docs.generic-mapping-tools.org/latest/devdocs/api.html#struct-dataset

The ultimate goals are:


Currently, I'm doing too much work in this PR, for example:

  1. Wrapping the GMT C API GMT_Read_Virtualfile Merged to main in clib: Wrap the GMT API function GMT_Read_VirtualFile #2993
  2. Provide a new session method virtualfile_to_data to allow GMT modules write output into a virtual file. Merged to main in clib: Add the virtualfile_out method for creating output virtualfile #3057
  3. Wrapping the GMT_DATASET data type and related GMT_DATATABLE and GMT_DATASEGMENT
  4. Refactor many modules (e.g., grd2xyz, grdtrack) to show how this PR will change the way we handle table-like outputs.

Thus, we need to split this PR into a few smaller PRs if this PR is proven to be working.

@seisman seisman marked this pull request as draft October 9, 2023 11:06
@seisman seisman added the enhancement Improving an existing feature label Oct 9, 2023
pygmt/src/grd2xyz.py Outdated Show resolved Hide resolved
@seisman seisman added the needs review This PR has higher priority and needs review. label Oct 9, 2023
pygmt/clib/session.py Outdated Show resolved Hide resolved
pygmt/src/grdtrack.py Outdated Show resolved Hide resolved
@seisman
Copy link
Member Author

seisman commented Feb 28, 2024

In commit a21b2df, I've reverted all changes that I made to the module wrappers so that we can focus on wrapping GMT_DATASET in this PR.

The reverted changes will be included in PR #3083.

@seisman seisman added the needs review This PR has higher priority and needs review. label Feb 28, 2024
@seisman
Copy link
Member Author

seisman commented Feb 28, 2024

CodSpeed Performance Report

Merging #2729 will degrade performances by 90.44%

Comparing datatypes/gmtdataset (76a09f0) with main (ab39616)

Summary

⚡ 1 improvements ❌ 12 regressions ✅ 85 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main datatypes/gmtdataset Change
test_blockmean_input_xyz 412.7 ms 366.8 ms +12.54%
test_contour_matrix[DataFrame] 90.8 ms 281.6 ms -67.76%
test_contour_matrix[Dataset] 103.9 ms 294.7 ms -64.74%
test_contour_matrix[array] 88.6 ms 279.3 ms -68.3%
test_filter1d_format 199.6 ms 518.8 ms -61.52%
test_grd2xyz 43.1 ms 229.9 ms -81.28%
test_grdtrack_input_dataframe_and_dataarray 43.9 ms 233.8 ms -81.22%
test_grdvolume_no_outgrid 41.7 ms 294.9 ms -85.86%
test_project_input_matrix[DataFrame] 41.6 ms 422.7 ms -90.16%
test_project_input_matrix[Dataset] 50.7 ms 431.8 ms -88.25%
test_project_input_matrix[array] 40.3 ms 421.3 ms -90.44%
test_select_input_dataframe 184.5 ms 371.8 ms -50.37%
test_delaunay_triples_input_xyz 169.3 ms 357.8 ms -52.67%

This benchmark report looks scary, but it doesn't match my test in #2730 (comment). I guess it's because codspeed doesn't count the I/O time (via pd.read_csv).

@seisman seisman requested a review from a team March 6, 2024 06:20
@seisman
Copy link
Member Author

seisman commented Mar 6, 2024

Ping @GenericMappingTools/pygmt-maintainers for review on this PR. There are so many pending changes/PRs that rely on the changes in this PR.

pygmt/datatypes/dataset.py Outdated Show resolved Hide resolved
pygmt/datatypes/dataset.py Outdated Show resolved Hide resolved
pygmt/datatypes/dataset.py Outdated Show resolved Hide resolved
pygmt/datatypes/dataset.py Outdated Show resolved Hide resolved
pygmt/datatypes/dataset.py Show resolved Hide resolved
Copy link
Member

@weiji14 weiji14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super big thank you for pushing this through @seisman, especially in cherry-picking many of the changes into separate PRs to make this easier to review! I'm still a little unsure about the triple-nested classes as mentioned at #2729 (comment), but we can always un-nest them later if needed (since they're marked as private classes).

Won't hold this PR up any longer, excited to have PyGMT not store intermediate CSV files in the next release 🚀

@seisman seisman added final review call This PR requires final review and approval from a second reviewer and removed needs review This PR has higher priority and needs review. labels Mar 7, 2024
@seisman seisman removed the final review call This PR requires final review and approval from a second reviewer label Mar 7, 2024
@seisman seisman merged commit a5d8b14 into main Mar 7, 2024
13 of 16 checks passed
@seisman seisman deleted the datatypes/gmtdataset branch March 7, 2024 02:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improving an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants