Timeseries Utility Functions for Data Frame Objects #26

Merged: 27 commits, merged on Aug 26, 2024

Commits
19b0212
ZenithClown Oct 18, 2023
014d2a3
ZenithClown Oct 18, 2023
9f487d7
πŸ’£ rename featuring.py to ts_featuring.py
ZenithClown Apr 11, 2023
094a670
ZenithClown Oct 17, 2023
f624487
added rolling window observation
ZenithClown Oct 17, 2023
77154b2
πŸ›  add method selection b/w adfuller and kpss
ZenithClown Oct 17, 2023
fd78e90
ZenithClown Oct 16, 2023
5b26109
ZenithClown Oct 16, 2023
e277949
ZenithClown Oct 16, 2023
9da65eb
ZenithClown Oct 16, 2023
816bc93
πŸ› πŸš§ merge all the time series funcs in ts_utils
ZenithClown Oct 18, 2023
ad9a540
πŸ“ƒπŸš§ add theory for stationarity in time series
ZenithClown Oct 18, 2023
c61329c
Rename stationarity readme file
ZenithClown Oct 18, 2023
9e2e019
add stationary test conclusions
ZenithClown Oct 18, 2023
fdce754
πŸ“ƒπŸš§ created documentation for ma models
ZenithClown Oct 18, 2023
0750027
β³πŸ“ƒ optimize, add document on moving average model
ZenithClown Oct 19, 2023
f1a02a8
🩹 fix _base not found error
ZenithClown Oct 19, 2023
2a0062e
πŸ“ƒπŸš§ keep one readme file per gist
ZenithClown Oct 20, 2023
8c18dde
πŸ“ function documentation for stationarity
ZenithClown Aug 14, 2024
e5e0a40
πŸ’£ moving document code-archived/ds-gringotts#15
ZenithClown Aug 14, 2024
06b7989
✨ create pandaswizard/timeseries module
ZenithClown Aug 26, 2024
201b850
Merge remote-tracking branch 'gist/master' into featuring/timeseries/…
ZenithClown Aug 26, 2024
72e2656
πŸ›  git move gist/master to pdw/timeseries
ZenithClown Aug 26, 2024
6b038b7
πŸ’£ add caution block for code migration preparations #29
ZenithClown Aug 26, 2024
8e4ebb1
✨ initialize timeseries under pandaswizard
ZenithClown Aug 26, 2024
4154492
Merge branch 'featuring/timeseries/init' into featuring/timeseries/me…
ZenithClown Aug 26, 2024
61976e0
πŸ“ create section for time series documentation
ZenithClown Aug 26, 2024
1 change: 1 addition & 0 deletions docs/index.md
@@ -61,6 +61,7 @@ The above function calculates the 50th percentile, i.e., the median of the feature
aggregate.md
wrappers.md
functions.md
timeseries.md
```

---
59 changes: 59 additions & 0 deletions docs/timeseries.md
@@ -0,0 +1,59 @@
# Time Series Utilities

<div align = "justify">

Feature engineering and time series stationarity checks are a few of the use cases compiled here. The individual
module definitions and functionalities are described below.

## Stationarity & Unit Roots

Stationarity is one of the fundamental concepts in time series analysis. Most
**time series models work on the principle that the [_data is stationary_](https://www.analyticsvidhya.com/blog/2021/04/how-to-check-stationarity-of-data-in-python/)
and the [_data has no unit roots_](https://www.analyticsvidhya.com/blog/2018/09/non-stationary-time-series-python/)**. This means:
* the data must have a constant mean (across all periods),
* the data should have a constant variance, and
* auto-covariance should not be dependent on time.

Let's understand the concept using the following example; for more information, check [this link](https://www.analyticsvidhya.com/blog/2018/09/non-stationary-time-series-python/).

![Non-Stationary Time Series](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/09/ns5-e1536673990684.png)

<div align = "center">

| ADF Test | KPSS Test | Series Type | Additional Steps |
| :---: | :---: | :---: | --- |
| βœ… | βœ… | _stationary_ | |
| ❌ | ❌ | _non-stationary_ | |
| βœ… | ❌ | _difference-stationary_ | Use differencing to make the series stationary. |
| ❌ | βœ… | _trend-stationary_ | Remove the trend to make the series _strictly stationary_. |

</div>

```{eval-rst}
.. automodule:: pandaswizard.timeseries.stationarity
:members:
:undoc-members:
:show-inheritance:
```
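
A minimal usage sketch (assuming a hypothetical monthly `demand` series, and that `checkStationarity` is re-exported from `pandaswizard.timeseries`; the column name and values below are illustrative only):

```python
import numpy as np
import pandas as pd
from pandaswizard import timeseries as ts

# hypothetical monthly demand series indexed by date
index = pd.date_range("2020-01-01", periods = 48, freq = "MS")
frame = pd.DataFrame(
    {"demand" : np.random.default_rng(7).normal(100, 5, 48)},
    index = index
)

# run both the ADF and the KPSS tests and fetch the rolling statistics
results, stationary, rolling = ts.checkStationarity(
    frame, feature = "demand", method = "both", verbose = False, window = 12
)

print(stationary)   # e.g. {'ADF': True, 'KPSS': True}
rolling.head()      # original series with rolling mean/std for plotting
```

The `stationary` mapping can then be read against the ADF/KPSS table above to decide whether differencing or detrending is required.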

## Time Series Featuring

Time series analysis is a special segment of AI/ML application development where a feature is dependent on time. The code here
is designed to create a *sequence* of `x` and `y` data needed in a time series problem. The function is defined with two input
parameters, (I) the **Lookback Period (T) `n_lookback`** and (II) the **Forecast Period (H) `n_forecast`**, which are visually
presented below.

<div align = "center">

![prediction-sequence](https://i.stack.imgur.com/YXwMJ.png)

</div>

```{eval-rst}
.. automodule:: pandaswizard.timeseries.ts_featuring
:members:
:undoc-members:
:show-inheritance:
```
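
A rough sketch of sequence creation (assuming `CreateSequence` is re-exported from `pandaswizard.timeseries`; the toy series and printed shapes are illustrative):

```python
import numpy as np
from pandaswizard.timeseries import CreateSequence

series = np.arange(100, dtype = float)   # toy univariate series of 100 observations
sequencer = CreateSequence(series)

# use 12 lookback steps (T) to predict the next 3 steps (H)
x_train, y_train = sequencer.create_series(n_lookback = 12, n_forecast = 3)
print(x_train.shape, y_train.shape)      # (86, 12) (86, 3)
```

For a multivariate frame, pass `univariate = False` and, optionally, `y_feat_` to select the target column.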

</div>
1 change: 1 addition & 0 deletions pandaswizard/__init__.py
@@ -28,3 +28,4 @@
from pandaswizard import window
from pandaswizard import wrappers
from pandaswizard import functions
from pandaswizard import timeseries
29 changes: 29 additions & 0 deletions pandaswizard/timeseries/__init__.py
@@ -0,0 +1,29 @@
# -*- encoding: utf-8 -*-

"""
A Set of Utility Functions for Time Series Data in ``pandas``

Time series analysis is a way of analyzing a sequence of data points
collected over a period of time, typically at specific intervals. The
:mod:`pandas` library provides various functions to manipulate such
time objects through ``pd.Timestamp``, which is a wrapper over the
standard ``datetime`` module.

This module provides utility functions for time series data, along
with tests like "stationarity" checks, which are a must in EDA!

.. caution::
    The time series module consists of functions which were earlier
    developed in `GH/sqlparser <https://gist.github.com/ZenithClown/f99d7e1e3f4b4304dd7d43603cef344d>`_.

.. note::
    More details on why the module was merged are
    available at `GH/#24 <https://github.com/sharkutilities/pandas-wizard/issues/24>`_.

.. note::
    Code migration details are mentioned in
    `GH/#27 <https://github.com/sharkutilities/pandas-wizard/issues/27>`_.
"""

from pandaswizard.timeseries.stationarity import *
from pandaswizard.timeseries.ts_featuring import *
86 changes: 86 additions & 0 deletions pandaswizard/timeseries/stationarity.py
@@ -0,0 +1,86 @@
# -*- encoding: utf-8 -*-

"""
Stationarity Checking for Time Series Data

A functional approach to check stationarity using different models;
the function attributes are defined below.
(`More Information <http://ds-gringotts.readthedocs.io/>`_)
"""

from statsmodels.tsa.stattools import kpss # kpss test
from statsmodels.tsa.stattools import adfuller # adfuller test

def checkStationarity(frame: object, feature: str, method: str = "both", verbose: bool = True, **kwargs) -> tuple:
"""
    Performs ADF and/or KPSS Tests to Determine Data Stationarity

    Given a univariate series formatted as `frame.set_index("date")`,
    the series can be tested for stationarity using the Augmented
    Dickey-Fuller (ADF) and/or Kwiatkowski-Phillips-Schmidt-Shin (KPSS)
    tests. The function also returns a `dataframe` of rolling-window
    statistics for plotting the data using `frame.plot()`.

:type frame: pd.DataFrame
:param frame: The dataframe that (ideally) contains a single
univariate feature (`feature`), else for a
dataframe containing multiple series only the
`feature` series is worked upon.

:type feature: str
:param feature: Name of the feature, i.e. the column name
in the dataframe. The `rolling` dataframe returns
a slice of `frame[[feature]]` along with rolling
mean and standard deviation.

:type method: str
    :param method: Select any of the methods `['ADF', 'KPSS', 'both']`
                   using the `method` parameter; the name is case
                   insensitive. Defaults to `both`.
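
    :type verbose: bool
    :param verbose: If `True` (default), print the test statistic,
                    p-value, and critical values for each selected
                    method along with the stationarity conclusion.

    :rtype:  tuple
    :return: A tuple of `(results, stationary, rolling)` where
             `results` holds the raw test output per method,
             `stationary` maps each method to a boolean flag, and
             `rolling` is a dataframe of the original series with its
             rolling mean and standard deviation (window size taken
             from `kwargs["window"]`, defaulting to 12).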
"""

results = dict() # key is `ADF` and/or `KPSS`
stationary = dict()

if method.upper() in ["ADF", "BOTH"]:
results["ADF"] = adfuller(frame[feature].values) # should be send like `frame.col.values`
stationary["ADF"] = True if (results["ADF"][1] <= 0.05) & (results["ADF"][4]["5%"] > results["ADF"][0]) else False

if verbose:
print(f"Observations of ADF Test ({feature})")
print("===========================" + "=" * len(feature))
print(f"ADF Statistics : {results['ADF'][0]:,.3f}")
print(f"p-value : {results['ADF'][1]:,.3f}")

critical_values = {k : round(v, 3) for k, v in results["ADF"][4].items()}
print(f"Critical Values : {critical_values}")

# always print if data is stationary/not
print(f"[ADF] Data is :", "\u001b[32mStationary\u001b[0m" if stationary else "\x1b[31mNon-stationary\x1b[0m")

if method.upper() in ["KPSS", "BOTH"]:
results["KPSS"] = kpss(frame[feature].values) # should be send like `frame.col.values`
stationary["KPSS"] = False if (results["KPSS"][1] <= 0.05) & (results["KPSS"][3]["5%"] > results["KPSS"][0]) else True

if verbose:
print(f"Observations of KPSS Test ({feature})")
print("============================" + "=" * len(feature))
print(f"KPSS Statistics : {results['KPSS'][0]:,.3f}")
print(f"p-value : {results['KPSS'][1]:,.3f}")

critical_values = {k : round(v, 3) for k, v in results["KPSS"][3].items()}
print(f"Critical Values : {critical_values}")

# always print if data is stationary/not
print(f"[KPSS] Data is :", "\x1b[31mNon-stationary\x1b[0m" if stationary else "\u001b[32mStationary\u001b[0m")

# rolling calculations for plotting
rolling = frame.copy() # enable deep copy
rolling = rolling[[feature]] # only keep single feature, works if multi-feature sent
rolling.rename(columns = {feature : "original"}, inplace = True)

rolling_ = rolling.rolling(window = kwargs.get("window", 12))
rolling["mean"] = rolling_.mean()["original"].values
rolling["std"] = rolling_.std()["original"].values

return results, stationary, rolling
202 changes: 202 additions & 0 deletions pandaswizard/timeseries/ts_featuring.py
@@ -0,0 +1,202 @@
# -*- encoding: utf-8 -*-

"""
A Set of Methodologies involved with Feature Engineering

Feature engineering or feature extraction involves transforming the
data by manipulation (addition, deletion, combination) or mutation of
the data set in hand to improve the machine learning model. The
project mainly deals with, but is not limited to, time series data
that requires special treatment; such treatments are listed here.

Feature engineering of time series data covers the use cases of
both univariate and multivariate series, with additional parameters
like the lookback and forecast periods. Check the documentation of
the function(s) for more information.
"""

import numpy as np
import pandas as pd


class DataObjectModel(object):
"""
Data Object Model (`DOM`) for AI-ML Application Development

    Data is the key to artificial intelligence application
    development, and oftentimes real-world data is gibberish and
    incomprehensible. The DOM is developed to provide basic use cases
    like data formatting, separating `x` and `y` variables, etc., such
    that a feature engineering function or a machine learning model
    can easily get the required information without much extra code.

# Example Use Cases
The following small use cases are possible with the use of the
DOM in feature engineering:

    1. Formatting Data into a NumPy ND-Array - an iterable/pandas
       object can be converted into `np.ndarray`, which is the base
       data type of the DOM.

```python
    np.random.seed(7) # set seed for reproducibility
data = pd.DataFrame(
data = np.random.random(size = (9, 26)),
columns = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
)

dom = DataObjectModel(data)
print(type(dom.data))
>> <class 'numpy.ndarray'>
```

    2. Breaking an Array of `Xy` into Individual Components - for
       instance, a dataframe/tabular dataset has `X` features alongside
       `y` in a column. The function considers the data and breaks it
       into individual components.

```python
X, y = dom.create_xy(y_index = 1)

# or if `y` is group of elements then:
X, y = dom.create_xy(y_index = (1, 4))
```
"""

def __init__(self, data: np.ndarray) -> None:
self.data = self.__to_numpy__(data) # also check integrity

def __to_numpy__(self, data: object) -> np.ndarray:
"""Convert Meaningful Data into a N-Dimensional Array"""

if type(data) == np.ndarray:
pass # data is already in required type
elif type(data) in [list, tuple]:
data = np.array(data)
elif type(data) == pd.DataFrame:
            # oftentimes a full df is passed; its underlying values form an ndarray
            # thus, the df can easily be converted to an np ndarray:
data = data.values
else:
raise TypeError(
f"Data `type == {type(data)}` is not convertible.")

return data


def create_xy(self, y_index : object = -1) -> tuple:
"""
Breaks the Data into Individual `X` and `y` Components

        From a tabular or n-dimensional structure, the code considers
`y` along a given axis (`y_index`) and returns two `ndarray`
which can be treated as `X` and `y` individually.

The function uses `np.delete` command to create `X` feature
from the data. (https://stackoverflow.com/a/5034558/6623589).

        This function is meant for multivariate datasets, typically
        multivariate time series data, but it can also be used for any
        machine learning model consisting of multiple features (even
        if it is not a time series dataset).

:type y_index: object
        :param y_index: Index/axis of the `y` variable. If the type is
                        `int` then the code assumes a single feature
                        and `y_.shape == (-1,)`, while if the type of
                        `y_index` is `tuple` then
                        `y_.shape == (-1, end - start)` since the end
                        index is exclusive as in the `numpy` module.
"""

if type(y_index) in [list, tuple]:
x_ = self.data
y_ = self.data[:, y_index[0]:y_index[1]]
for idx in range(*y_index)[::-1]:
x_ = np.delete(x_, obj = idx, axis = 1)
elif type(y_index) == int:
y_ = self.data[:, y_index]
x_ = np.delete(self.data, obj = y_index, axis = 1)
else:
raise TypeError("`type(y_index)` not in [int, tuple].")

return x_, y_


class CreateSequence(DataObjectModel):
"""
Create a Data Sequence Typically to be used in LSTM Model

    An LSTM model, or rather any time series model, requires a specific
    sequence of data consisting of `n_lookback` values, i.e., the length
    of the input sequence (or lookback values), and `n_forecast` values,
    i.e., the length of the output sequence. The class provides a single
    approach to break data into sequences of `x_train` and `y_train`
    for training a neural network.
"""

def __init__(self, data: np.ndarray) -> None:
super().__init__(data)


def create_series(
self,
n_lookback : int,
n_forecast : int,
univariate : bool = True,
**kwargs
) -> tuple:
"""
Create a Sequence of `x_train` and `y_train` for training a
neural network model with time series data. The basic
approach in building the function is taken from:
https://stackoverflow.com/a/69912334/6623589

        UPDATE [22-02-2023]: The function is now modified such that
        it can also return a sequence for multivariate time series
        analysis. The following changes have been added:
* πŸ’£ refactor function name to generalise between univariate
and multivariate methods.
* πŸ”§ univariate feature can be called directly as this is the
default code behaviour.
* πŸ›  to get multivariate functionality, use `univariate = False`
* πŸ›  by default the last column (-1) of `data` is considered
as `y` feature by slicing `arr[s:e, -1]` but this can be
configured using `kwargs["y_feat_"]`
"""

x_, y_ = [], []
n_record = self.__check_univariate_get_len__(univariate) \
- n_forecast + 1

y_feat_ = kwargs.get("y_feat_", -1)

for idx in range(n_lookback, n_record):
x_.append(self.data[idx - n_lookback : idx])
y_.append(
self.data[idx : idx + n_forecast] if univariate
else self.data[idx : idx + n_forecast, y_feat_]
)

x_, y_ = map(np.array, [x_, y_])

if univariate:
# the for loop is so designed it returns the data like:
# (<records>, n_lookback, <features>) however,
# for univariate the `<features>` dimension is "squeezed"
x_, y_ = map(lambda arr : np.squeeze(arr), [x_, y_])

return [x_, y_]


def __check_univariate_get_len__(self, univariate : bool) -> int:
"""
Check if the data is a univariate one, and if `True` then
return the length, i.e. `.shape[0]`, for further analysis.
"""

if (self.data.ndim != 1) and (univariate):
raise TypeError("Wrong dimension for univariate series.")

return self.data.shape[0]