Skip to content

Commit

Permalink
basic docs
Browse files Browse the repository at this point in the history
  • Loading branch information
lhoestq committed Oct 24, 2024
1 parent b8cef87 commit 980378e
Showing 1 changed file with 109 additions and 0 deletions.
109 changes: 109 additions & 0 deletions docs/source/video_load.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
# Load video data

<Tip warning={true}>

Video support is experimental and is subject to change.

</Tip>

Video datasets have [`Video`] type columns, which contain `decord` objects.

<Tip>

To work with video datasets, you need to have the `vision` dependency installed. Check out the [installation](./installation#vision) guide to learn how to install it.

</Tip>

When you load an video dataset and call the video column, the videos are decoded as `decord` Videos:

```py
>>> from datasets import load_dataset, Video

>>> dataset = load_dataset("path/to/video/folder", split="train")
>>> dataset[0]["video"]
<decord.video_reader.VideoReader at 0x1652284c0>
```

<Tip warning={true}>

Index into an video dataset using the row index first and then the `video` column - `dataset[0]["video"]` - to avoid decoding and resampling all the video objects in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.

</Tip>

For a guide on how to load any type of dataset, take a look at the <a class="underline decoration-sky-400 decoration-2 font-semibold" href="./loading">general loading guide</a>.

## Local files

You can load a dataset from the video path. Use the [`~Dataset.cast_column`] function to accept a column of video file paths, and decode it into a `decord` video with the [`Video`] feature:
```py
>>> from datasets import Dataset, Video

>>> dataset = Dataset.from_dict({"video": ["path/to/video_1", "path/to/video_2", ..., "path/to/video_n"]}).cast_column("video", Video())
>>> dataset[0]["video"]
<decord.video_reader.VideoReader at 0x1657d0280>
```

If you only want to load the underlying path to the video dataset without decoding the video object, set `decode=False` in the [`Video`] feature:

```py
>>> dataset = dataset.cast_column("video", Video(decode=False))
>>> dataset[0]["video"]
{'bytes': None,
'path': 'path/to/video/folder/video0.mp4'}
```

## VideoFolder

You can also load a dataset with an `VideoFolder` dataset builder which does not require writing a custom dataloader. This makes `VideoFolder` ideal for quickly creating and loading video datasets with several thousand videos for different vision tasks. Your video dataset structure should look like this:

```
folder/train/dog/golden_retriever.mp4
folder/train/dog/german_shepherd.mp4
folder/train/dog/chihuahua.mp4
folder/train/cat/maine_coon.mp4
folder/train/cat/bengal.mp4
folder/train/cat/birman.mp4
```

Load your dataset by specifying `videofolder` and the directory of your dataset in `data_dir`:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("videofolder", data_dir="/path/to/folder")
>>> dataset["train"][0]
{"video": <decord.video_reader.VideoReader at 0x161715e50>, "label": 0}

>>> dataset["train"][-1]
{"video": <decord.video_reader.VideoReader at 0x16170bd90>, "label": 1}
```

Load remote datasets from their URLs with the `data_files` parameter:

```py
>>> dataset = load_dataset("videofolder", data_files="https://foo.bar/videos.zip", split="train")
```

Some datasets have a metadata file (`metadata.csv`/`metadata.jsonl`) associated with it, containing other information about the data like bounding boxes, text captions, and labels. The metadata is automatically loaded when you call [`load_dataset`] and specify `videofolder`.

To ignore the information in the metadata file, set `drop_labels=False` in [`load_dataset`], and allow `VideoFolder` to automatically infer the label name from the directory name:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("videofolder", data_dir="/path/to/folder", drop_labels=False)
```

## WebDataset

The [WebDataset](https://github.com/webdataset/webdataset) format is based on a folder of TAR archives and is suitable for big video datasets.
Because of their size, WebDatasets are generally loaded in streaming mode (using `streaming=True`).

You can load a WebDataset like this:

```python
>>> from datasets import load_dataset

>>> dataset = load_dataset("webdataset", data_dir="/path/to/folder", streaming=True)
```

0 comments on commit 980378e

Please sign in to comment.