[FEATURE-REQUEST] Datetime date object are output by vaex, but cannot be used as input? #1906

Closed
yohplala opened this issue Feb 10, 2022 · 4 comments

@yohplala
Contributor

Description
In this 'exercise', I am filtering a list of data files based on the date available in the file names. The date used for the filtering is actually obtained as the max date from another file list.

I am taking this filtering date from vaex, and vaex outputs it as a datetime date object. Ok, no specific requirement from me.

I would like then to re-use it for the filtering, but this time, vaex does not accept it.
Should this be expected?

import numpy as np
import vaex as vx

# Step 1: Get the filtering date
# Input data.
# File list from which the filtering date is obtained.
vdf_max_date = vx.from_arrays(fnames =
               ['topic1-2021-03-01.zip','topic1-2021-03-02.zip', 'topic1-2021-03-03.zip'])

# Extraction of latest date.
max_date = vdf_max_date["fnames"].str.split('.').str.split('-',1).apply(
           lambda x: np.datetime64(x[0][1])).max().item()
max_date
Out[30]: datetime.date(2021, 3, 3)

So far, so good.

# Step 2: Filter 2nd dataset
# Input data.
# file list to filter with 'max_date'
vdf_to_filter = vx.from_arrays(fnames = ['topic1-2021-03-02.zip','topic1-2021-03-03.zip', 'topic1-2021-03-04.zip'])

# filtering
vdf_to_filter['dates'] = vdf_to_filter["fnames"].str.split('.').str.split('-',1).apply(
        lambda x: np.datetime64(x[0][1]))
trimmed_list = vdf_to_filter[vdf_to_filter['dates'] >= max_date]['fnames'].tolist()

So far... not so good now...

# Tail of error message.

  File "/home/yoh/anaconda3/lib/python3.9/ast.py", line 50, in parse
    return compile(source, filename, mode, flags,

  File "<unknown>", line 1
    (dates >= 2021-03-03)
                    ^
SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers

So, is this to be expected?
I would expect item() to provide me data in a format I can re-use within vaex world.
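
The traceback suggests that the `datetime.date` value ends up in the expression as its plain string form, `2021-03-03`, which Python then tries to parse as arithmetic on integer literals. A minimal sketch reproducing only that parsing step with the standard library (no vaex involved; the expression string mirrors the one in the traceback, and `parses` is a hypothetical helper):

```python
import ast
import datetime

max_date = datetime.date(2021, 3, 3)

# str() of a date gives its ISO form, which is not a valid Python literal.
expression = f"(dates >= {max_date})"


def parses(source):
    """Return True if `source` is a syntactically valid Python expression."""
    try:
        ast.parse(source, mode="eval")
        return True
    except SyntaxError:
        return False


print(expression)                        # (dates >= 2021-03-03)
print(parses(expression))                # False: 03 has a leading zero
print(parses("(dates >= 2021-12-25)"))   # True, but it reads as 2021 - 12 - 25
```

Note that a date without leading zeros would parse without error and silently be treated as a subtraction, which is arguably worse than the `SyntaxError`.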

Additional context
As a workaround, I can turn max_date into a numpy datetime64, which is what I will use for now.

# filtering
vdf_to_filter['dates'] = vdf_to_filter["fnames"].str.split('.').str.split('-',1).apply(
        lambda x: np.datetime64(x[0][1]))
trimmed_list = vdf_to_filter[vdf_to_filter['dates'] >= np.datetime64(max_date)]['fnames'].tolist()
trimmed_list
Out[33]: ['topic1-2021-03-03.zip', 'topic1-2021-03-04.zip']
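
The workaround relies on `np.datetime64` accepting a `datetime.date` directly. A NumPy-only sketch of the same comparison, with a plain array standing in for the `dates` column (an assumption made to keep the example independent of vaex):

```python
import datetime

import numpy as np

max_date = datetime.date(2021, 3, 3)

# Stand-in for the 'dates' column from the issue.
dates = np.array(
    ["2021-03-02", "2021-03-03", "2021-03-04"], dtype="datetime64[D]"
)

# Converting the Python date gives a day-resolution datetime64.
as_numpy = np.datetime64(max_date)
mask = dates >= as_numpy

print(as_numpy.dtype)   # datetime64[D]
print(mask.tolist())    # [False, True, True]
```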

So maybe item() should output data in numpy format?

Thanks in advance for any feedback!
Best

@JovanVeljanoski
Member

Hey!

So this is actually not a vaex problem; that is how numpy works.
Look at this example:

import numpy as np

x = np.array(np.datetime64('2020-11-11'))  # all numpy right?

# but now..
x.item()  # returns datetime.date(2020, 11, 11)
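
What `.item()` returns depends on the unit of the `datetime64` value, which is worth knowing before relying on its output. A short sketch of the three cases in NumPy:

```python
import datetime

import numpy as np

day = np.datetime64("2020-11-11")                # unit: D
second = np.datetime64("2020-11-11T00:00:00")    # unit: s
nano = np.datetime64("2020-11-11", "ns")         # unit: ns

print(type(day.item()))      # <class 'datetime.date'>
print(type(second.item()))   # <class 'datetime.datetime'>
print(type(nano.item()))     # <class 'int'>: nanoseconds since the epoch
```

Nanosecond values come back as plain integers because Python's `datetime` types only resolve down to microseconds.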

Having said that, your way of getting the max is, I think, very inefficient. Keep in mind: avoid apply as much as possible. With apply you are not really using vaex, but some external code.

Consider this example:

import numpy as np
import vaex
df = vaex.from_arrays(fnames=['topic1-2021-03-01.zip','topic1-2021-03-02.zip', 'topic1-2021-03-03.zip'])

date_array = df.fnames.str.lstrip('topic1-').str.rstrip('.zip').astype('datetime64')  # Converting to proper time format
max_date = np.datetime64(date_array.max().item(), 'ns')   # From there get the max, and get it into a numpy format

The above should, I believe, be faster. Does this help?
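
One caveat, assuming vaex's `str.lstrip` and `str.rstrip` follow the semantics of Python's own string methods: they remove any leading or trailing characters found in the given set, not a literal prefix or suffix. That happens to work for the file names in this issue, but it can eat into the date for other names. A plain-Python sketch (the second file name is hypothetical; `str.removeprefix` and `str.removesuffix` need Python 3.9+):

```python
ok = "topic1-2021-03-01.zip"
tricky = "topic1-1999-01-01.zip"   # hypothetical name whose date starts with '1'

# Character-set stripping: fine here, because '2' is not in the set "topic1-".
stripped_ok = ok.lstrip("topic1-").rstrip(".zip")

# Here the leading '1' of the year is in the set, so it is stripped as well.
stripped_tricky = tricky.lstrip("topic1-").rstrip(".zip")

# Literal prefix/suffix removal keeps the date intact.
exact_tricky = tricky.removeprefix("topic1-").removesuffix(".zip")

print(stripped_ok)       # 2021-03-01
print(stripped_tricky)   # 999-01-01
print(exact_tricky)      # 1999-01-01
```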

Notes:

  • For future issues, please be more to the point and try to isolate the problem, so we can get to it faster. A very long text is sometimes difficult to work through.
  • That said, your way of describing issues is much preferred over issues that are too short and lack detail.
  • The convention we prefer is import vaex. The name is not that long and, unlike with other packages, you don't really type it that often once you have opened / initialized your dataframes.

@yohplala
Contributor Author

Dear Jovan,
Thanks very much,
Yes, your feedback helps a lot! Thanks again!
So I understand my ticket is not a feature request.
Ok, closing then.

PS: Yes, I will take your notes into account as well, but points 1 and 2 are hard to reconcile. It is often one or the other, and the right balance in between is difficult to find. I would almost say it is subjective and differs from person to person. But yes, OK, I will try to be more concise next time!

@maartenbreddels
Member

(dates >= 2021-03-03)

The issue here seems to be that we don't support Python datetime objects, which I think we should.

But even without the .item(), the filter doesn't seem to work.

I think we should translate this into unit tests, and fix them!

Does anyone want to take a look at how we can add this to tests/datetime_test.py?

@JovanVeljanoski
Member

Closed via #1921
