read in date-time columns #43

Open
firekg opened this issue Mar 9, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@firekg

firekg commented Mar 9, 2023

A lot of data have date (or date-time) columns. Right now lace treats them as categorical (since they are str). This is not ideal, both in terms of the number of distinct dates/times it can represent and in terms of their actual semantics (they behave more like a continuous variable).

BaxterEaves added the enhancement label Mar 9, 2023
@BaxterEaves
Contributor

I think we should definitely handle dates and date times. Internally, we'd have to represent date times as some sort of collection of columns that breaks down the components. For example, we'd have

  • day of week: categorical?
  • day of month: categorical?
  • year: count?
  • etc

Then there is the cyclic nature of dates and times. Sunday is close to Saturday, but Sunday = 0 and Saturday = 6.

We should think about how we would represent this. We also need to represent it so that it can be exactly converted back into a date(time). Please add suggestions. If we need to add a new model (e.g. cyclic) to make this work, feel free to propose it. We can add another issue for that.
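As a starting point for discussion, here is a minimal sketch of the breakdown described above, written in plain Python rather than against lace itself. The column names, the `decompose`/`recompose` helpers, and `cyclic_distance` are all hypothetical, and the sketch assumes naive (timezone-free) datetimes. It only shows that a component representation can round-trip exactly and that a cyclic distance puts Sunday next to Saturday.

```python
from datetime import datetime


def decompose(dt: datetime) -> dict:
    """Split a naive datetime into component columns (hypothetical layout)."""
    return {
        "year": dt.year,
        "month": dt.month,
        "day_of_month": dt.day,
        # Python's weekday() uses Monday = 0; shift so that Sunday = 0
        # and Saturday = 6, matching the convention used in this thread.
        "day_of_week": (dt.weekday() + 1) % 7,
        "hour": dt.hour,
        "minute": dt.minute,
        "second": dt.second,
        "microsecond": dt.microsecond,
    }


def recompose(parts: dict) -> datetime:
    """Rebuild the exact datetime; day_of_week is redundant and is ignored."""
    return datetime(
        parts["year"],
        parts["month"],
        parts["day_of_month"],
        parts["hour"],
        parts["minute"],
        parts["second"],
        parts["microsecond"],
    )


def cyclic_distance(a: int, b: int, period: int) -> int:
    """Shortest distance between two positions on a cycle of given length."""
    d = abs(a - b) % period
    return min(d, period - d)


opened = datetime(2023, 3, 9, 14, 30, 15)
parts = decompose(opened)
assert recompose(parts) == opened       # exact round trip
print(cyclic_distance(0, 6, period=7))  # Sunday vs. Saturday
```

Keeping the non-redundant components (year, month, day, time of day) is what makes the exact conversion back possible; derived columns such as day of week can be added freely because they are never needed for reconstruction.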

@joshualeond

Hi! I came here to create a similar issue around date-time columns but I think this captures it.

I've been experimenting with the Lace package and I'm really enjoying it. Most of the data I work with is time-series sensor data, so I am looking for a recommendation on how to optimally prepare that data for Lace.

Is it better to leave it as is (categorical, as noted above), convert it to a sequential integer index, or break it out into several features, as augment_timeseries_signature or tsfresh do?

Maybe it all depends on my use-case but wanted to get your thoughts.

Thanks!

@BaxterEaves
Contributor

Hi @joshualeond - glad you're enjoying lace!

The rows of the table are modeled as independent observations, so the way we typically handle time series is by keeping a certain amount of history and lookahead in the columns. For example, for sensor data a row might look like this:

time_at_t0, t_minus_n, ..., t_minus_2, t0, t_plus_1, ..., t_plus_m, ...

Here I've used n to represent the number of timesteps back and m for the number of timesteps forward. You can of course use whatever granularity of data you like.
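To make the row layout concrete, here is a rough sketch in plain Python of turning one sensor series into windowed rows. The helper name `make_window_rows` is made up for illustration and is not part of lace; it assumes evenly spaced readings with no gaps, and it simply drops timesteps that lack a full window.

```python
def make_window_rows(times, values, n, m):
    """Build one row per timestep with n lagged and m lookahead values.

    Hypothetical helper: timesteps without a full window are skipped.
    """
    rows = []
    for i in range(n, len(values) - m):
        row = {"time_at_t0": times[i]}
        # History, oldest first: t_minus_n ... t_minus_1
        for k in range(n, 0, -1):
            row[f"t_minus_{k}"] = values[i - k]
        row["t0"] = values[i]
        # Lookahead: t_plus_1 ... t_plus_m
        for k in range(1, m + 1):
            row[f"t_plus_{k}"] = values[i + k]
        rows.append(row)
    return rows


times = [0, 1, 2, 3, 4, 5]
readings = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
for row in make_window_rows(times, readings, n=2, m=1):
    print(row)
```

Each resulting dict is one table row, so the list can be loaded into whatever dataframe library you use before handing it to lace.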

The best way to represent a datetime depends on your application. You might represent it as the number of hours since an experiment started, or you can break it into several features depending on which components of the datetime share information with the things you're interested in. You could use a categorical day of the week or a float proportion of the week. It all depends. If the cyclic nature of days/weeks/months/years is important, you can apply sin and cos to proportion * 2π.
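A small sketch of the sin/cos idea for the weekly cycle, using only the Python standard library. The function names are hypothetical, the week is assumed to start at Sunday 00:00, and datetimes are assumed to be naive; the same pattern would apply to a day, month, or year cycle with a different proportion.

```python
import math
from datetime import datetime

SECONDS_PER_DAY = 24 * 60 * 60


def week_proportion(dt: datetime) -> float:
    """Fraction of the week elapsed, with Sunday 00:00 mapped to 0.0."""
    day = (dt.weekday() + 1) % 7  # shift Python's Monday = 0 to Sunday = 0
    seconds = day * SECONDS_PER_DAY + dt.hour * 3600 + dt.minute * 60 + dt.second
    return seconds / (7 * SECONDS_PER_DAY)


def cyclic_features(proportion: float) -> tuple:
    """Map a proportion in [0, 1) to a point on the unit circle."""
    angle = 2 * math.pi * proportion
    return math.sin(angle), math.cos(angle)


late_saturday = datetime(2023, 1, 7, 23, 59, 59)
early_sunday = datetime(2023, 1, 8, 0, 0, 1)
a = cyclic_features(week_proportion(late_saturday))
b = cyclic_features(week_proportion(early_sunday))
# The two points are close on the circle even though the raw
# proportions are near 1.0 and near 0.0 respectively.
print(math.dist(a, b))
```

Both columns are needed: sin or cos alone maps two different times in the cycle to the same value, while the pair identifies the position uniquely.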

3 participants