-
Notifications
You must be signed in to change notification settings - Fork 220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
**Breaking**: data_kind: Now 'matrix' represents a 2-D numpy array and unrecognized data types fall back to 'vectors' #3351
Conversation
ea90891
to
c0a5bdb
Compare
c0a5bdb
to
78066a4
Compare
f160dd6
to
1c517e6
Compare
1c517e6
to
21f581a
Compare
52b711b
to
713d0a2
Compare
pygmt/helpers/utils.py
Outdated
return "geojson" | ||
|
||
# A 2-D numpy.ndarray | ||
if isinstance(data, np.ndarray) and data.ndim == 2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably can change this line to:
if isinstance(data, np.ndarray) and data.ndim == 2: | |
if hasattr(data, "__array_interface__") and len(data.__array_interface__.shape) == 2: |
to support more array-like objects that implements the __array_interface__
protocol (https://numpy.org/doc/stable/reference/arrays.interface.html#object.__array_interface__), but it can be done in a future PR.
@@ -195,7 +195,7 @@ def x2sys_cross( | |||
match data_kind(track): | |||
case "file": | |||
file_contexts.append(contextlib.nullcontext(track)) | |||
case "matrix": | |||
case "vectors": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas.DataFrame now is "vectors" kind.
pygmt/helpers/utils.py
Outdated
case np.ndarray() if data.ndim == 2: # A 2-D numpy.ndarray object. | ||
kind = "matrix" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things:
- Should we extend
kind="matrix"
to 3-D numpy.ndarray objects? - I'm hesitant to match based on
np.ndarray()
class only, because while it's not widely advertised, checking fordata is not None
would have caught other array types implementing the__array_function__
protocol (see NEP18), but checking forisinstance(data, np.ndarray)
would exclude those. That said, I'm wondering if the fallback tokind="vectors"
might just work, but need to test this out more extensively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Should we extend
kind="matrix"
to 3-D numpy.ndarray objects?
It's unclear if we can pass a 3-D numpy array yet. Even if we can, it means more work, since in virtualfile_from_vectors
/virtualfile_from_matrix
, we explicitly check if the ndim=1 or 2. So, better to revisit the 3-D numpy array support in a separate PR if necessary.
- I'm hesitant to match based on
np.ndarray()
class only, because while it's not widely advertised, checking fordata is not None
would have caught other array types implementing the__array_function__
protocol (see NEP18), but checking forisinstance(data, np.ndarray)
would exclude those. That said, I'm wondering if the fallback tokind="vectors"
might just work, but need to test this out more extensively.
Yes, I was thinking about checking __array_interface__
in #3351 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 4a4f192, although other array-like objects are not tested.
if data.dtype.kind not in "iuf": | ||
_virtualfile_from = self.virtualfile_from_vectors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing test coverage for these lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add a test for it. Before adding more tests, I'm wondering if we should split the big "test_clib_virtualfiles.py" file (with more than 500 lines) into separate smaller test files, i.e., one test file for each Session method.
test_clib_virtualfile_in
test_clib_virtualfile_from_vectors
test_clib_virtualfile_from_matrix
test_clib_open_virtualfile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've split it before in #2784, so yes, ok to split it again 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a test in cfa32ed to cover this line.
Co-authored-by: Wei Ji <[email protected]>
@@ -101,3 +103,27 @@ def test_virtualfile_in_fail_non_valid_data(data): | |||
z=data[:, 2], | |||
data=data, | |||
) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is added to address #3351 (comment).
This new test passes a string dtype numpy array into GMT C API, which contains longitude/latitude strings. The data kind is "matrix"
, but since the data dtype is not in iuf
, PyGMT will use virtualfile_from_vectors
rather than virtualfile_from_matrix
to pass the data into GMT C API. Ideally, we should check that the virtualfile_from_vectors
is called once and virtualfile_from_matrix
is not called, but I find it technically complicated with unittest.mock
, so I'll leave it untested.
The function is marked as xfail with GMT 6.5 due to a newly found upstream bug at GenericMappingTools/gmt#8600.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, and thanks for adding the test!
6b7a578
to
423e5dc
Compare
Need to wait for #3481.Description of proposed changes
As can be seen in the doctests added in #3480, currently,
matrix
andvectors
kinds are defined like below:vectors
kind meansdata=None
andrequired=True
. It means that the input is given via a series of vectors (e.g., x/y/z).matrix
represents any data types not recognized as arg/file/stringio/grid/image/geojson. The most common data types includepd.DataFrame
,pd.Series
,xr.Dataset
,np.ndarray
, and sequence of sequences.So,
matrix
here is an inclusive concept. However, in both Python and GMT,matrix
usually means a homogenous 2-D array. When passing a "matrix" data to GMT, we usually treat the "matrix" as a series of vectors. The only exception is when the "matrix" data is a strict matrix (a 2-D numpy array) withdata.dtype.kind in "iuf"
.pygmt/pygmt/clib/session.py
Lines 1742 to 1755 in d7560fa
I think it makes more sense to use a more strict definition for
matrix
and let all other data types bevectors
. I propose to change the definitions of data kinds to:matrix
: a 2-D homogenous numpy.ndarrayvectors
: fallbacks to this kind for any unrecognized data types, including a dictionary,pd.DataFrame
,pd.Series
,xr.Dataset
, nested lists/tuples, or any other unrecognized data types.The refactoring is mainly done in 808755d.
Of course,
data=None, required=True
is still recognized as "vectors". I also think we should give this case another special kind like"none"
, which makes more sense and can also simplify the codes. [This is done in a separate PR at #3482, but I'm OK to merge #3482 into this PR first before merging this PR into main].Other data kinds are unchanged.
This is a breaking change since some previous "matrix" data are now recognized "vectors" and previously "vectors" is "none". But I guess very few users are using this function, so the breaking change should have minimal effects.