📦 Dataset Design

⏬ Data Download

To get started with the datasets, download the all_data.zip file from either Google Drive or Baidu Netdisk. After downloading, unzip the files into the datasets/ directory:

cd /path/to/BasicTS # not BasicTS/basicts
unzip /path/to/all_data.zip -d datasets/

These datasets are preprocessed and ready for immediate use.

💿 Data Format

Each dataset contains at least two essential files, data.dat and desc.json:

  • data.dat: This file stores the raw time series data in numpy.memmap format with a shape of [L, N, C].

    • L: Number of time steps. Typically, the training, validation, and test sets are split along this dimension.
    • N: Number of time series, also referred to as the number of nodes.
    • C: Number of features. Usually, this includes [target feature, time of day, day of week, day of month, day of year], with the target feature being mandatory and the others optional.
  • desc.json: This file contains metadata about the dataset, including:

    • Dataset name
    • Domain of the dataset
    • Shape of the data
    • Number of time slices
    • Number of nodes (i.e., the number of time series)
    • Feature descriptions
    • Presence of prior graph structures
    • Regular settings:
      • Input and output lengths
      • Ratios for training, validation, and test sets
      • Whether normalization is applied individually to each channel (i.e., time series)
      • Whether to re-normalize during evaluation
      • Evaluation metrics
      • Handling of outliers
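Because a raw memmap file carries no header, the shape and dtype have to come from somewhere else when the file is opened. The following is a minimal sketch of reading both files together. It assumes that desc.json stores the array shape under a key named "shape" and that the values are stored as float32; both are assumptions to check against an actual desc.json, and load_dataset is a hypothetical helper, not part of BasicTS.

```python
import json
from pathlib import Path

import numpy as np


def load_dataset(dataset_dir, dtype=np.float32):
    """Open data.dat as a read-only memmap using the shape stored in desc.json.

    Hypothetical helper: the "shape" key and the float32 dtype are assumptions.
    """
    dataset_dir = Path(dataset_dir)
    with open(dataset_dir / "desc.json", "r", encoding="utf-8") as f:
        desc = json.load(f)
    # Assumed layout: a list [L, N, C] stored under the "shape" key.
    shape = tuple(desc["shape"])
    # The file is mapped lazily, so only the slices that are accessed get read from disk.
    data = np.memmap(dataset_dir / "data.dat", dtype=dtype, mode="r", shape=shape)
    return data, desc
```

With a dataset unpacked at datasets/YOUR_DATA/, calling load_dataset("datasets/YOUR_DATA") would return the [L, N, C] array and the metadata dictionary; data[..., 0] would then select the target feature.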

🧑‍💻 Dataset Class Design

In time series forecasting, datasets are typically generated from raw time series data using a sliding window approach. As illustrated above, the raw time series is split into training, validation, and test sets along the time dimension, and samples are generated with a sliding window whose length is the input length plus the target length. Most datasets adhere to this structure.
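The split and the sliding window can be sketched in a few lines of NumPy. This is an illustration of the general technique, not the BasicTS implementation: split_bounds and sliding_window_samples are hypothetical names, and the sketch assumes windows are generated inside each split separately, which may differ from how the library treats split boundaries.

```python
import numpy as np


def split_bounds(num_steps, ratios):
    """Return (start, end) time-step bounds for the train/val/test splits."""
    train_end = int(num_steps * ratios[0])
    val_end = train_end + int(num_steps * ratios[1])
    return {
        "train": (0, train_end),
        "val": (train_end, val_end),
        "test": (val_end, num_steps),
    }


def sliding_window_samples(data, input_len, output_len):
    """Yield (inputs, target) pairs from an array of shape [L, N, C]."""
    window = input_len + output_len
    # The last valid start index leaves exactly `window` steps remaining.
    for start in range(len(data) - window + 1):
        inputs = data[start : start + input_len]
        target = data[start + input_len : start + window]
        yield inputs, target


if __name__ == "__main__":
    series = np.random.rand(100, 4, 1).astype(np.float32)  # toy data, [L, N, C]
    start, end = split_bounds(len(series), (0.5, 0.25, 0.25))["train"]
    samples = list(sliding_window_samples(series[start:end], input_len=12, output_len=12))
    print(len(samples), samples[0][0].shape, samples[0][1].shape)
```

A split of length T yields T - (input length + target length) + 1 samples, so short validation or test splits can produce very few samples when the window is long.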

BasicTS provides a built-in Dataset class called TimeSeriesForecastingDataset, designed specifically for time series data. This class generates samples in the form of a dictionary containing two objects: inputs and target. inputs represents the input data, while target represents the target data. Detailed documentation can be found in the class's comments.

🧑‍🍳 How to Add or Customize Datasets

If your dataset follows the structure described above, you can preprocess your data into the data.dat and desc.json format and place it in the datasets/ directory, e.g., datasets/YOUR_DATA/{data.dat, desc.json}. BasicTS will then automatically recognize and utilize your dataset.
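As a rough sketch of that preprocessing step, the snippet below writes an in-memory array of shape [L, N, C] to data.dat and a small desc.json next to it. The float32 dtype and every key name in the JSON are assumptions made for illustration, and save_dataset is a hypothetical helper; the metadata fields BasicTS actually expects are best copied from a desc.json in the downloaded datasets or from the scripts in scripts/data_preparation/.

```python
import json
from pathlib import Path

import numpy as np


def save_dataset(array, out_dir, name):
    """Write an [L, N, C] array as data.dat plus a minimal desc.json.

    Hypothetical helper: the dtype and all JSON key names are assumptions.
    """
    array = np.asarray(array, dtype=np.float32)
    if array.ndim != 3:
        raise ValueError("expected an array of shape [L, N, C]")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # mode="w+" creates (or overwrites) the file with the requested shape.
    mm = np.memmap(out_dir / "data.dat", dtype=np.float32, mode="w+", shape=array.shape)
    mm[:] = array
    mm.flush()
    del mm  # release the mapping before anything else opens the file

    desc = {
        "name": name,                      # assumed key
        "shape": list(array.shape),        # assumed key
        "num_time_steps": array.shape[0],  # assumed key
        "num_nodes": array.shape[1],       # assumed key
    }
    with open(out_dir / "desc.json", "w", encoding="utf-8") as f:
        json.dump(desc, f, indent=2)
    return desc
```

For example, save_dataset(my_array, "datasets/YOUR_DATA", "YOUR_DATA") would produce the two files in the layout described above, with the remaining metadata fields still to be filled in.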

For reference, you can review the scripts in scripts/data_preparation/, which are used to process datasets from raw_data.zip (Google Drive, Baidu Netdisk).

If your dataset does not conform to the standard format or has specific requirements, you can define your own dataset class by inheriting from torch.utils.data.Dataset. In this custom class, the __getitem__ method should return a dictionary containing inputs and target.
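A custom class along these lines might look like the sketch below. The class name, constructor arguments, and the choice to hold the whole array in memory are illustrative assumptions; the only requirement taken from the text above is that __getitem__ returns a dictionary with the keys inputs and target.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class MyForecastingDataset(Dataset):
    """Sliding-window dataset over an array of shape [L, N, C] (illustrative sketch)."""

    def __init__(self, data, input_len, output_len):
        # Assumption: the array fits in memory; a memmap could be sliced lazily instead.
        self.data = torch.as_tensor(np.asarray(data), dtype=torch.float32)
        self.input_len = input_len
        self.output_len = output_len

    def __len__(self):
        # Number of windows of length input_len + output_len that fit in the series.
        return max(0, self.data.shape[0] - self.input_len - self.output_len + 1)

    def __getitem__(self, index):
        if index < 0 or index >= len(self):
            raise IndexError(index)
        mid = index + self.input_len
        return {
            "inputs": self.data[index:mid],                   # [input_len, N, C]
            "target": self.data[mid : mid + self.output_len],  # [output_len, N, C]
        }
```

Wrapping an instance in torch.utils.data.DataLoader then yields batched dictionaries, with inputs of shape [B, input_len, N, C] and target of shape [B, output_len, N, C] under the default collate function.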

🧑‍💻 Explore Further