# GRASP: A novel benchmark for evaluating language GRounding And Situated Physics understanding in multimodal language models [Paper]

## Introduction

GRASP is a novel language grounding and intuitive physics benchmark for evaluating video-based multimodal large language models (LLMs). The benchmark comprises two levels and is modeled in the Unity simulator. Level 1 tests the basic visual understanding of multimodal LLMs; specifically, their understanding of shapes, colors, movement, the ordering of objects, and relational positions. Level 1 lays the groundwork for the higher-order reasoning required in Level 2, which takes inspiration from research on infant cognition regarding intuitive physics. The concepts covered include continuity, solidity, inertia, gravity, collision, object permanence, support, and unchangeableness.


In this repository, we publish all benchmark resources:

  1. Benchmark videos and code for evaluation of models.
  2. Unity builds of benchmark tests for the generation of additional videos.
  3. Unity source code for extension of the benchmark.

## Setup 🔨

1. Create a conda environment:

   ```sh
   conda create --name grasp python=3.9
   ```

2. Install PyTorch (adjust the CUDA version if necessary):

   ```sh
   conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
   ```

3. Install the GRASP Python package:

   ```sh
   git clone https://github.com/i-machine-think/grasp.git
   cd grasp
   pip3 install -e .
   ```
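
After these steps, a quick sanity check can confirm that PyTorch was installed with working CUDA support. A minimal sketch, assuming the `grasp` conda environment is active:

```python
# Sanity check for the setup above: verify that PyTorch is importable and that
# a CUDA device is visible (assumes the "grasp" conda environment is active).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```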

## Evaluate models 📊

We currently provide model configurations (in `configs/models`) and setup instructions for several Video-LLMs.

Further, we provide configurations for different tests in `configs/tests`:

- `level1_binary.yaml`: Binary Level 1 with default prompting.
- `level1_binary_cot.yaml`: Binary Level 1 with chain-of-thought prompting.
- `level1_binary_oneshot.yaml`: Binary Level 1 with one-shot prompting.
- `level1_open.yaml`: Level 1 with open questions.
- `level2.yaml`: Level 2 with default prompting.
- `level2_text.yaml`: Level 2 with scene descriptions instead of videos.
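
These test configurations are plain YAML files, so the exact settings of a test can be inspected before running it. A minimal sketch, assuming PyYAML is available in the environment (if not: `pip install pyyaml`):

```python
# Print the contents of one of the test configurations listed above
# (sketch; assumes only that the file is valid YAML).
import yaml

with open("configs/tests/level1_binary.yaml") as f:
    config = yaml.safe_load(f)

print(yaml.dump(config, default_flow_style=False))
```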

Models can be evaluated on our benchmark as follows:

1. Download our benchmark videos from this Google Drive. Specifically, download `videos.zip` and unzip it in the `data` directory.
2. Create the necessary CSV files for the benchmark:

   ```sh
   python3 tools/prepare_data.py
   ```

3. Evaluate a model on a specific test:

   ```sh
   python3 tools/run_test.py <model_config> <test_config>
   ```

   This will create a folder in `results` and dump the outputs into a CSV file.

4. Compute the accuracy of previously collected outputs:

   ```sh
   python3 tools/compute_accuracy.py <results_dir>
   ```

   This will print a table of results to the terminal and also write it to a text file in the results directory.
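
The raw outputs can also be inspected directly, for example to look at individual model answers. A minimal sketch using pandas (nothing beyond "the outputs are CSV files under `results`" is assumed; the folder names and column layout depend on the model and test configs that were run):

```python
# Preview every results CSV produced by run_test.py.
import glob

import pandas as pd

for path in sorted(glob.glob("results/**/*.csv", recursive=True)):
    df = pd.read_csv(path)
    print(f"{path}: {len(df)} rows, columns = {list(df.columns)}")
    print(df.head())
```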

## Collect additional videos 📹

We provide access to the Unity scene builds in the aforementioned Google Drive. These are compiled Unity scenes that we used to generate the benchmark videos. Videos can be generated as follows:

1. Download and unzip `build.zip` from the Google Drive.
2. Video generation requires Unity's ML-Agents Python package:

   ```sh
   git clone --branch release_20 https://github.com/Unity-Technologies/ml-agents.git
   cd ml-agents
   python -m pip install ./ml-agents-envs
   ```

3. We recommend using `xvfb-run` to generate the videos so that the simulation does not need to be rendered on screen. Install `xvfb` (which provides `xvfb-run`) with:

   ```sh
   sudo apt install xvfb
   ```

4. Generate new videos:

   ```sh
   xvfb-run python3 tools/generate_videos.py --dir <builds_dir> --scenes <scene1,scene2,...,sceneN> --txt-file <labels.txt> --N <number of videos> --out <output_dir>
   ```

The scenes for which videos should be collected can either be specified implicitly with `--dir`, in which case videos are collected for all scenes contained in that directory, or listed explicitly with `--scenes`. The `--txt-file` flag specifies the file from which labels are read during video annotation: some benchmark tests require text labels for evaluation. Currently, these are only needed for the object-ordering tests, and the corresponding files are `2obj_ordering.txt`, `3obj_ordering.txt`, and `4obj_ordering.txt`. Note that whenever `--txt-file` is used, only a single scene can be specified.

The benchmark videos were collected using:

```sh
xvfb-run python3 tools/generate_videos.py --dir build/Level2 # By default 128 videos are generated and saved to data/videos
xvfb-run python3 tools/generate_videos.py --dir build/Level1
xvfb-run python3 tools/generate_videos.py --scenes build/Level1/TwoObjectOrdering --txt-file 2obj_ordering.txt
xvfb-run python3 tools/generate_videos.py --scenes build/Level1/ThreeObjectOrdering --txt-file 3obj_ordering.txt
xvfb-run python3 tools/generate_videos.py --scenes build/Level1/FourObjectOrdering --txt-file 4obj_ordering.txt
```
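
Since video generation goes through Unity's ML-Agents Python package, the compiled builds can presumably also be opened directly with the `mlagents_envs` low-level API, e.g. to verify that a build loads and to inspect its observation specs. The sketch below is illustrative only and is not how `tools/generate_videos.py` is implemented; the build path is taken from the examples above:

```python
# Connect to a compiled GRASP scene with the ML-Agents low-level API
# (mlagents_envs, release_20) and step it for a few frames.
# Illustrative sketch only; run under xvfb-run on headless machines,
# since no_graphics=True would disable camera rendering.
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(
    file_name="build/Level1/TwoObjectOrdering",  # build path from the examples above
    no_graphics=False,
    seed=0,
)
try:
    env.reset()
    behavior_name = list(env.behavior_specs)[0]
    spec = env.behavior_specs[behavior_name]
    print("Behavior:", behavior_name)
    print("Observation shapes:", [obs.shape for obs in spec.observation_specs])

    for _ in range(10):
        decision_steps, terminal_steps = env.get_steps(behavior_name)
        # decision_steps.obs holds the agent's observations (visual ones as numpy arrays)
        env.set_actions(behavior_name, spec.action_spec.empty_action(len(decision_steps)))
        env.step()
finally:
    env.close()
```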

## Create additional scenes 🎮

Our entire Unity source code can be found in the `GRASP` directory, which contains the scenes and scripts for all tests in Levels 1 and 2 of GRASP. We also provide instructions on how to add further tests in Unity, and we encourage you to open pull requests that add new tests!

## Citation 📖

If you use GRASP in your work, please cite using:

```bibtex
@article{jassim2024grasp,
  title={GRASP: A novel benchmark for evaluating language GRounding And Situated Physics understanding in multimodal language models},
  author={Jassim, Serwan and Holubar, Mario and Richter, Annika and Wolff, Cornelius and Ohmer, Xenia and Bruni, Elia},
  journal={arXiv preprint arXiv:2311.09048},
  year={2024}
}
```