[paper] [Citations] [EUFCC-CIR]
The EUFCC-340K dataset was compiled using the REST API of the Europeana portal, which aggregates cultural heritage collections from across Europe. The dataset contains annotated images derived from diverse cultural artifacts, multimedia content, and traditional records from European institutions. The metadata for each item offers rich detail for analysis, supported by a hierarchical labeling structure aligned with the Getty "Art & Architecture Thesaurus" (AAT).
Initial data collection involved keyword searches and filtering for broad categories. Results were then restricted to entries with an available thumbnail and tagged with Reusability: OPEN, ensuring that the dataset comprises images suitable for open research and application. Mapping Europeana concepts to the Getty AAT enabled structured labeling under four facets: Materials, Object Types, Disciplines, and Subjects. Manual curation ensured dataset quality, although some noisy annotations may remain. Each record also includes information about its data provider.
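For reference, this kind of query can be reproduced with the Europeana Search API. The sketch below is only illustrative: `YOUR_API_KEY` is a placeholder for a personal Europeana key, and the exact keywords and facet mappings used to build EUFCC-340K are described in the paper, not here.

```python
import requests

# Minimal sketch of a Europeana Search API call with the filters
# described above (thumbnail available, Reusability: OPEN).
resp = requests.get(
    'https://api.europeana.eu/record/v2/search.json',
    params={
        'wskey': 'YOUR_API_KEY',  # placeholder: personal Europeana API key
        'query': 'ceramics',      # example keyword search
        'reusability': 'open',    # only openly licensed items
        'thumbnail': 'true',      # only entries with a thumbnail
        'rows': 24,               # results per page
    },
)
resp.raise_for_status()
for item in resp.json().get('items', []):
    print(item.get('id'), item.get('title'))
```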
The dataset comprises 346,324 annotated images from 358 data providers. Some providers offer extensive inventories, while others contribute minimally. Statistics show variations in tag frequencies per image and across facets. The dataset is partially annotated, with varying levels of detail. Test subsets, including "Outer" and "Inner" test sets, were designated to challenge models and ensure representation. The split strategy aimed for balance and diversity, considering tag frequencies and minimum thresholds for each category. In the following figure we highlight two examples of the dataset's complexity and diversity in terms of facets and hierarchical structures:
The following figure illustrates the dataset statistics:
Inside the `data/` folder you can find the dataset in the following structure:
```
data/
├── train.csv
├── val.csv
├── test_id.csv
└── test_ood.csv
```
These files contain the annotations for the train, validation, inner (ID) test, and outer (OOD) test splits, respectively.
Each file contains the following fields:

- `idInSource`: unique identifier for each image
- `objectTypes.hierarchy`: hierarchical structure for object types
- `subjects.hierarchy`: hierarchical structure for subjects
- `materials.hierarchy`: hierarchical structure for materials
- `classifications.hierarchy`: hierarchical structure for classifications
- `repository.keeper`: museum or institution that holds the image
- `#portraitMedia.original`: URL to the original image to be downloaded
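As a quick sanity check, a split can be inspected directly with pandas (column names as listed above):

```python
import pandas as pd

# Peek at one split and a few of its annotation columns
df = pd.read_csv('data/val.csv')
print(df.columns.tolist())
print(df[['idInSource', 'objectTypes.hierarchy', 'materials.hierarchy']].head())
```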
To download the images, simply call the `download_images` function from the `downloader.py` file. For instance, to download the training images:
```python
from downloader import download_images
import pandas as pd

# Load the split annotations and download the corresponding images
df = pd.read_csv('data/train.csv')
download_images(df=df, root_dir='data/train')
```
The preceding data adheres to a hierarchical and multi-label structure. This implies that a single image may be associated with multiple tags simultaneously (multi-label), and these tags are organized hierarchically. In our dataset, the multi-label structure is delineated by '$', while the hierarchical structure is delineated by '|'.
To clarify, the annotation for the following image would be:

- `objectTypes.hierarchy`: medal $ textile
- `materials.hierarchy`: Animal material | processed animal material | leather | suede $ paper | cardboard | cardstock $ metal | non-ferrous metal | silver
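A minimal sketch of how such an annotation string can be parsed (the `parse_hierarchy` helper below is illustrative and not part of the repository):

```python
def parse_hierarchy(annotation: str) -> list[list[str]]:
    """Split a multi-label annotation ('$'-separated) into
    hierarchy paths ('|'-separated levels)."""
    return [[level.strip() for level in label.split('|')]
            for label in annotation.split('$')]

# Example with the materials annotation shown above
paths = parse_hierarchy(
    'Animal material | processed animal material | leather | suede '
    '$ paper | cardboard | cardstock $ metal | non-ferrous metal | silver'
)
for path in paths:
    print(' > '.join(path))
# Animal material > processed animal material > leather > suede
# paper > cardboard > cardstock
# metal > non-ferrous metal > silver
```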
Inside the `labels/` folder you can find the specific labels used for each of the facets (Materials, Object Types, etc.). These labels were created using the following code:
```python
import pandas as pd
from hierarchy import build_trees, print_trees

# Merge the annotations from all splits (see the note below)
df = pd.concat(pd.read_csv(f'data/{s}.csv') for s in ('train', 'val', 'test_id', 'test_ood'))

# Build one label tree per facet
trees = build_trees(df)
# Print the trees if desired
print_trees(trees)
```
Note that to do this we merged all the labels a priori (from train.csv, val.csv, etc.) and subsequently applied post-processing to prune labels with minimal representation.
We provide the results of our baselines in the following table:
| Method | R-Prec (ID) | Acc@1 (ID) | Acc@10 (ID) | AvgRP (ID) | R-Prec (OOD) | Acc@1 (OOD) | Acc@10 (OOD) | AvgRP (OOD) |
|---|---|---|---|---|---|---|---|---|
| Multi-label | 0.76 | 0.83 | 0.99 | 4.8 | 0.67 | 0.70 | 0.89 | 27.7 |
| Softmax | 0.77 | 0.88 | 0.99 | 5.4 | 0.67 | 0.69 | 0.87 | 25.8 |
| Softmax (WCE) | 0.63 | 0.74 | 0.96 | 14.6 | 0.53 | 0.53 | 0.80 | 40.7 |
| H-Softmax (levels) | 0.68 | 0.85 | 0.99 | 24.6 | 0.66 | 0.70 | 0.88 | 14.9 |
| H-Softmax (nodes) | 0.73 | 0.85 | 0.99 | 11.3 | 0.64 | 0.66 | 0.86 | 17.6 |
| Ensemble: ML+S | 0.81 | 0.88 | 0.99 | 3.8 | 0.67 | 0.70 | 0.89 | 27.2 |
| CLIP zero-shot | 0.35 | 0.37 | 0.66 | 66.7 | 0.20 | 0.20 | 0.46 | 143.4 |
| CLIP 1 tag | 0.64 | 0.71 | 0.94 | 13.2 | 0.74 | 0.75 | 0.90 | 9.6 |
| CLIP all tags | 0.61 | 0.65 | 0.88 | 26.4 | 0.51 | 0.57 | 0.87 | 48.2 |
| CLIP 1 tag prompt | 0.70 | 0.76 | 0.93 | 14.1 | 0.69 | 0.72 | 0.91 | 10.1 |
| Ens: ML+S+CLIP | 0.80 | 0.87 | 0.99 | 5.0 | 0.68 | 0.70 | 0.92 | 13.1 |
For more information about the baselines, the architectures used, the training process, the loss functions, and the evaluation metrics, please refer to the paper (currently under review).
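For orientation only, R-Precision for a single query follows the standard definition sketched below (this is not the repository's evaluation code; see the paper for the full protocol):

```python
def r_precision(ranked_labels: list[str], relevant: set[str]) -> float:
    """Fraction of relevant labels among the top-R ranked predictions,
    where R is the number of ground-truth labels for the query."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(label in relevant for label in ranked_labels[:r]) / r

# Example: 2 of the top-3 predictions are in the ground-truth set -> 0.67
print(r_precision(['silver', 'leather', 'paper'], {'silver', 'paper', 'metal'}))
```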
Some qualitative samples of our model's predictions are shown below: green values are correct labels, orange values are predictions that are consistent but not in the ground-truth set, and red values are incorrect predictions:
This work has been supported by the Ramon y Cajal research fellowship RYC2020-030777-I / AEI / 10.13039/501100011033, the CERCA Programme / Generalitat de Catalunya, and the ACCIO INNOTEC 2021 project Coeli-IA (ACE034/21/000084).
@article{net2024eufcc,
title={EUFCC-340K: A Faceted Hierarchical Dataset for Metadata Annotation in GLAM Collections},
author={Net, Francesc and Folia, Marc and Casals, Pep and Bagdanov, Andrew D and Gomez, Lluis},
journal={arXiv preprint arXiv:2406.02380},
year={2024}
}
An extension of the EUFCC-340K dataset for Composed Image Retrieval:
EUFCC-CIR: A Composed Image Retrieval Dataset for GLAM Collections