[BUG-REPORT] vaex (arrow) not accepting '2021-01' as valid string for timestamp #1907

Closed
yohplala opened this issue Feb 11, 2022 · 5 comments

Comments

@yohplala
Contributor

yohplala commented Feb 11, 2022

Description
Converting strings like '2021-01' (no day) to timestamps with astype('datetime64') fails in vaex: the strings are not recognized as timestamps, although numpy accepts them. Is this expected?

import vaex
import numpy as np

dates = ["2021-09"]

np_ar = np.array(dates).astype('datetime64') # ok
vdf = vaex.from_arrays(dates=dates)
vdf["dates"] = vdf["dates"].astype('datetime64') # not ok

vdf
File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status                                   
File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status                  
pyarrow.lib.ArrowInvalid: Failed to parse string: '2021-09' as a  scalar of type timestamp[ns]

# Forcing month resolution does not work either
vdf["dates"] = vdf["dates"].astype('datetime64[M]') # not ok

Software information

  • Vaex version (import vaex; vaex.__version__):
    {'vaex': '4.7.0',
    'vaex-core': '4.7.0.post1',
    'vaex-viz': '0.5.1',
    'vaex-hdf5': '0.11.1',
    'vaex-server': '0.8.0',
    'vaex-astro': '0.9.0',
    'vaex-jupyter': '0.7.0',
    'vaex-ml': '0.16.0',
    'vaex-graphql': '0.2.0'}

  • Vaex was installed via: from source

  • OS: Ubuntu 20.04

@JovanVeljanoski
Member

JovanVeljanoski commented Feb 11, 2022

This may look like a bug, but perhaps it is an unsupported feature (in arrow?).
Here is how it works:

  • strings are stored as arrow string arrays internally in vaex
  • so when you convert them to datetime, since we are starting from strings (arrow) we use arrow datetime support
    • if conversion is successful we cast those to numpy.datetime64/timedelta64 when using dt or td methods (during evaluation, lazily)
  • Arrow does not seem to support timestamps in the format you need; see the example below
import vaex
import numpy as np
import pyarrow as pa

x = np.array(["2021-09-01"], dtype=np.datetime64)
pa.array(x)  # Works as expected i.e. this is Date32Array in arrow

x = np.array(["2021-09"], dtype=np.datetime64)
pa.array(x)  # Does not work - no corresponding pyarrow unit

So we are facing the same problem (from the other direction): when we have a numpy.datetime64[M] array, there is no corresponding arrow dtype to cast to, hence the error.

(from my understanding from looking at the source - i could be wrong)
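The unit mismatch described above can be inspected with NumPy alone. The following is only a sketch: `to_arrow_friendly` and `ARROW_FRIENDLY_UNITS` are hypothetical names, and the set of units treated as having an arrow counterpart is an assumption for illustration, not something taken from the vaex or pyarrow source.

```python
import numpy as np

# Parsing a year-month string lets NumPy infer month resolution.
months = np.array(["2021-09"], dtype="datetime64")
unit, step = np.datetime_data(months.dtype)
print(unit, step)  # the unit is "M" (month)

# ASSUMPTION: units taken here to have an arrow counterpart (date32 / timestamp).
ARROW_FRIENDLY_UNITS = {"D", "s", "ms", "us", "ns"}


def to_arrow_friendly(values):
    """Hypothetical helper: upcast to nanoseconds when the unit is not in the set above."""
    value_unit, _ = np.datetime_data(values.dtype)
    if value_unit in ARROW_FRIENDLY_UNITS:
        return values
    return values.astype("datetime64[ns]")


converted = to_arrow_friendly(months)
print(converted.dtype, converted[0])
```

Upcasting a month value places it at the first instant of that month, so no information is lost as long as the original resolution is remembered separately.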

@JovanVeljanoski
Member

This was partially discussed here: #1704

@yohplala
Contributor Author

(from my understanding from looking at the source - i could be wrong)

Thanks @JovanVeljanoski, yes, this is my understanding too.
OK, I used the following workaround, which applies specifically to this problem.

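import vaex
dates = ["2021-09"]
vdf = vaex.from_arrays(dates=dates)
# One-column dataframe holding the constant suffix "-01", one row per date
vdf_01 = vaex.from_arrays(suff=vaex.vconstant("-01", length=len(vdf)))
# No join key is given, so rows are matched by position
vdf = vdf.join(vdf_01)
# "2021-09" + "-01" gives "2021-09-01", which arrow can parse as a timestamp
vdf["dates"] = vdf["dates"].str.cat(vdf["suff"]).astype("datetime64")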

Thanks again, closing the ticket then,
Bests!

@JovanVeljanoski
Member

If you start from an in-memory dataframe, it is better to handle this outside of vaex.
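As a rough sketch of what handling this outside of vaex could look like, the year-month strings can be padded in plain Python before the dataframe is built. `pad_year_month` is a hypothetical helper name, and it only covers the 'YYYY-MM' case discussed in this issue.

```python
import re
from datetime import date

# Matches strings made of exactly a four-digit year and a two-digit month.
_YEAR_MONTH = re.compile(r"\d{4}-\d{2}")


def pad_year_month(values):
    """Hypothetical helper: append '-01' to 'YYYY-MM' strings, leave others untouched."""
    return [v + "-01" if _YEAR_MONTH.fullmatch(v) else v for v in values]


padded = pad_year_month(["2021-09", "2021-10-15"])
print(padded)  # ['2021-09-01', '2021-10-15']

# Sanity check: every padded string is now a full ISO date.
parsed = [date.fromisoformat(v) for v in padded]
```

The padded list can then be passed to vaex.from_arrays and cast with astype('datetime64') as in the original report, without the join step of the workaround.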

@maartenbreddels
Member

I think the issue here is that we don't support casting arrow strings to datetime.

And we can do this:

x = np.array(["2021-09"], dtype=np.datetime64).astype('datetime64[ns]')
vaex.array_types.to_numpy(pa.array(x)).astype('datetime64[M]')

Again, with the proper unit tests I'm happy to support this!
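The NumPy half of that round trip can be checked on its own. This sketch deliberately leaves out the pa.array and vaex.array_types.to_numpy steps from the snippet above and only illustrates that going from month resolution to nanoseconds and back is lossless; the sample values are made up for illustration.

```python
import numpy as np

# Month-resolution values, as NumPy infers them from 'YYYY-MM' strings.
months = np.array(["2021-09", "2022-01"], dtype="datetime64[M]")

# Upcast to nanoseconds, the unit arrow timestamps use in the error message above.
as_ns = months.astype("datetime64[ns]")

# Downcast back to months; values sit at the start of each month, so nothing is truncated.
back = as_ns.astype("datetime64[M]")

print(as_ns)
print(back)
```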
