Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pygmt.x2sys_cross: Refactor to use virtualfiles for output tables [BREAKING CHANGE: Dummy times in 3rd and 4th columns now have np.timedelta64 type] #3182

Merged
merged 43 commits into from
Jun 9, 2024

Conversation

seisman
Copy link
Member

@seisman seisman commented Apr 19, 2024

Description of proposed changes

This PR refactors the pygmt.x2sys_cross function to use virtualfiles for output. Need to note that x2sys_cross still uses temporary files in the tempfile_from_dftrack function.

Partially address #3160.

This PR introduces a breaking change: Previously, the dummy times in 3-4 columns (with column names i_1/i_2) were in np.object type, and now they have np.timedelta64 type.

@seisman seisman added enhancement Improving an existing feature needs review This PR has higher priority and needs review. labels Apr 19, 2024
@seisman seisman added this to the 0.12.0 milestone Apr 19, 2024
@seisman seisman marked this pull request as ready for review April 19, 2024 15:51
@seisman seisman requested a review from weiji14 April 19, 2024 15:51
@seisman seisman force-pushed the refactor/x2sys_cross branch 3 times, most recently from bc341f6 to ff290da Compare April 20, 2024 02:35
@seisman seisman marked this pull request as draft April 20, 2024 03:03
@seisman seisman removed the needs review This PR has higher priority and needs review. label Apr 20, 2024
@seisman seisman marked this pull request as ready for review April 20, 2024 03:36
Comment on lines 232 to 235
# Convert 3rd and 4th columns to datetimes.
# These two columns have names "t_1"/"t_2" or "i_1"/"i_2".
# "t_1"/"t_2" means they are datetimes and should be converted.
# "i_1"/"i_2" means they are dummy times (i.e., floating-point values).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I understanding the output correctly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never used x2sys, but here is my understanding of the C codes and the output:

  1. The 3rd and 4th columns are datetimes. They can be either absolute datetimes (e.g., 2023-01-01T01:23:45.678 or dummy datetimes (i.e., double-precision numbers), depending on whether the input tracks contain datetimes.
  2. Internally, absolute datetimes are also represented as double-precision numbers in GMT. So absolute datetimes and dummy datetimes are the same internally.
  3. When outputting to a file, GMT will convert double-precision numbers into absolute datetimes, since GMT know if the column has dummy datetimes or not.
  4. A GMT_DATASET container can only contain double-precision numbers and text strings. So when outputting to a virtual file, the 3rd and 4th columns always have double-precision numbers. If the column names are t_1/t_2, then we know they're absolute datetimes and should be converted; otherwise, they are just dummy datetimes and should not be converted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little unsure if i_1/i_2 are actually dummy datetimes. This is a sample output from x2sys_cross:

# Tag: X2SYS4ivlhlo4
# Command: x2sys_cross @tut_ship.xyz -Qi -TX2SYS4ivlhlo4 ->/tmp/lala.txt
# x	y	i_1	i_2	dist_1	dist_2	head_1	head_2	vel_1	vel_2	z_X	z_M
> @tut_ship 0 @tut_ship 0 NaN/NaN/1357.17 NaN/NaN/1357.17
251.004840022	20.000079064	18053.5647431	13446.6562433	333.339586673	229.636557499	269.996783034	270.023614846	NaN	NaN	192.232797243	-2957.22757183
251.004840022	20.000079064	18053.5647431	71783.6562433	333.339586673	1148.20975878	269.996783034	270.023614846	NaN	NaN	192.232797243	-2957.22757183
250.534946327	20.0000526811	18053.3762934	66989.0210846	332.869692978	1022.68273972	269.996783034	269.360150109	NaN	NaN	-57.6485957585	-2686.4268008
250.532033147	20.0000525175	18053.3751251	66988.9936489	332.866779797	1022.67977813	269.996783034	22.0133296951	NaN	NaN	-64.5973890802	-2682.04812157
252.068705	20.000075	13447.5	71784.5	230.700422496	1149.27362378	269.995072235	269.995072235	NaN	NaN	0	-3206.5

It seems like the i_1/i_2 values vary between rows, but I can't quite remember what they represent... maybe an index of some sort? I might need to inspect the C code to see what's going on, can you point me to where these i_1/i_2 columns are being output?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dummy times are just double-precision indexes from 0 to n (xref: https://github.com/GenericMappingTools/gmt/blob/b56be20bee0b8de22a682fdcd458f9b9eeb76f64/src/x2sys/x2sys.c#L533).

The column name i_1 or t_1 is controlled by the variable t_or_i in the C code (https://github.com/GenericMappingTools/gmt/blob/b56be20bee0b8de22a682fdcd458f9b9eeb76f64/src/x2sys/x2sys_cross.c#L998). From https://github.com/GenericMappingTools/gmt/blob/b56be20bee0b8de22a682fdcd458f9b9eeb76f64/src/x2sys/x2sys_cross.c#L568, it's clear that, if got_time is True, then the column is absolute time (GMT_IS_ABSTIME), otherwise it's double-precision numbers (GMT_IS_FLOAT).

We can keep the dummy times as double-precision numbers or think them as seconds since unix epoch and then convert them to absolute times.

Copy link
Member

@weiji14 weiji14 Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can keep the dummy times as double-precision numbers or think them as seconds since unix epoch and then convert them to absolute times.

Maybe convert the relative time to pandas.Timedelta or numpy.timedelta64? Xref #2848.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Done in 9d12ae1.

Copy link
Member

@weiji14 weiji14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 main changes happening in this PR:

  1. Adding the output_type="numpy" option
  2. Handling the different dtypes of the i_1/i_2 or t_1/t_2 columns

We can keep this as a single PR since it's hard to separate the two things, but might need to discuss the implementation a bit more.

def x2sys_cross(tracks=None, outfile=None, **kwargs):
def x2sys_cross(
tracks=None,
output_type: Literal["pandas", "numpy", "file"] = "pandas",
Copy link
Member

@weiji14 weiji14 Apr 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I'm not sure if we should support numpy output type for x2sys_cross because all 'columns' will need to be the same dtype in a np.ndarray. If there are datetime values in the columns, they will get converted to floating point (?), which makes it more difficult to use later. Try adding a unit test for numpy output_type and see if it makes sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are datetime values in the columns, they will get converted to floating point (?)

You're right. Datetimes are converted to floating points by df.to_numpy(). Will remove the numpy output type.

Comment on lines 232 to 235
# Convert 3rd and 4th columns to datetimes.
# These two columns have names "t_1"/"t_2" or "i_1"/"i_2".
# "t_1"/"t_2" means they are datetimes and should be converted.
# "i_1"/"i_2" means they are dummy times (i.e., floating-point values).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little unsure if i_1/i_2 are actually dummy datetimes. This is a sample output from x2sys_cross:

# Tag: X2SYS4ivlhlo4
# Command: x2sys_cross @tut_ship.xyz -Qi -TX2SYS4ivlhlo4 ->/tmp/lala.txt
# x	y	i_1	i_2	dist_1	dist_2	head_1	head_2	vel_1	vel_2	z_X	z_M
> @tut_ship 0 @tut_ship 0 NaN/NaN/1357.17 NaN/NaN/1357.17
251.004840022	20.000079064	18053.5647431	13446.6562433	333.339586673	229.636557499	269.996783034	270.023614846	NaN	NaN	192.232797243	-2957.22757183
251.004840022	20.000079064	18053.5647431	71783.6562433	333.339586673	1148.20975878	269.996783034	270.023614846	NaN	NaN	192.232797243	-2957.22757183
250.534946327	20.0000526811	18053.3762934	66989.0210846	332.869692978	1022.68273972	269.996783034	269.360150109	NaN	NaN	-57.6485957585	-2686.4268008
250.532033147	20.0000525175	18053.3751251	66988.9936489	332.866779797	1022.67977813	269.996783034	22.0133296951	NaN	NaN	-64.5973890802	-2682.04812157
252.068705	20.000075	13447.5	71784.5	230.700422496	1149.27362378	269.995072235	269.995072235	NaN	NaN	0	-3206.5

It seems like the i_1/i_2 values vary between rows, but I can't quite remember what they represent... maybe an index of some sort? I might need to inspect the C code to see what's going on, can you point me to where these i_1/i_2 columns are being output?

@seisman seisman added the needs review This PR has higher priority and needs review. label Apr 22, 2024
@seisman seisman removed this from the 0.12.0 milestone Apr 29, 2024
@seisman seisman marked this pull request as ready for review May 28, 2024 05:38
@seisman seisman added this to the 0.13.0 milestone May 28, 2024
pygmt/src/x2sys_cross.py Outdated Show resolved Hide resolved
@seisman seisman force-pushed the refactor/x2sys_cross branch 6 times, most recently from 13d36e4 to 5c7214d Compare May 28, 2024 12:37
@seisman seisman added the needs review This PR has higher priority and needs review. label May 28, 2024
pygmt/src/x2sys_cross.py Outdated Show resolved Hide resolved
pygmt/src/x2sys_cross.py Outdated Show resolved Hide resolved
@seisman seisman changed the title pygmt.x2sys_cross: Refactor to use virtualfiles for output tables pygmt.x2sys_cross: Refactor to use virtualfiles for output tables [BREAKING CHANGE: Dummy times in 3rd and 4th columns now have np.timedelta64 type May 28, 2024
@seisman seisman changed the title pygmt.x2sys_cross: Refactor to use virtualfiles for output tables [BREAKING CHANGE: Dummy times in 3rd and 4th columns now have np.timedelta64 type pygmt.x2sys_cross: Refactor to use virtualfiles for output tables [BREAKING CHANGE: Dummy times in 3rd and 4th columns now have np.timedelta64 type] May 28, 2024
@seisman seisman requested a review from weiji14 June 3, 2024 14:39
@weiji14
Copy link
Member

weiji14 commented Jun 5, 2024

I'll give this a proper review over the weekend, a bit busy this week with some deadlines 🫠

Copy link
Member

@weiji14 weiji14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks also for handling the output differences between macOS and Linux (xref #3194). Pre-approving as the main logic around timedelta conversion checks out ok. Suggestions below are mostly documentation related or minor.

pygmt/src/x2sys_cross.py Outdated Show resolved Hide resolved
pygmt/tests/test_x2sys_cross.py Outdated Show resolved Hide resolved
pygmt/src/x2sys_cross.py Outdated Show resolved Hide resolved
pygmt/tests/test_x2sys_cross.py Outdated Show resolved Hide resolved
@seisman seisman merged commit 844594f into main Jun 9, 2024
18 of 20 checks passed
@seisman seisman deleted the refactor/x2sys_cross branch June 9, 2024 14:03
@seisman seisman removed the needs review This PR has higher priority and needs review. label Jun 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improving an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants