ArtiFact: A Large-Scale Multi-Modal Cultural Heritage Dataset

651,045 museum records pairing artwork images with structured metadata, collected from the Metropolitan Museum of Art, the Art Institute of Chicago, and the Rijksmuseum.

651K Artwork records
3 Institutions
24 Schema columns
130K Error-annotated records
7 Error categories

What is ArtiFact?

Multi-modal data has become a central focus of contemporary database research, driving advances in data integration, retrieval, semantic query processing, and data quality management. Despite this growing interest, the community lacks large-scale real-world datasets with multi-modal records. We present ArtiFact, a large-scale multi-modal cultural heritage dataset containing 651,045 museum records, each pairing an image with textual descriptions and structured data. The records were collected from three major institutions: the Metropolitan Museum of Art (MET), the Art Institute of Chicago (AIC), and the Rijksmuseum in Amsterdam. To construct the dataset, we developed a unified ETL pipeline that combines rule-based normalization with LLM-assisted semantic parsing to standardize heterogeneous museum records across institutions. ArtiFact supports a broad range of downstream tasks, including semantic query processing, multi-modal retrieval, and multi-modal error detection. For error detection, the dataset incorporates a curated taxonomy of realistic errors, allowing evaluation of multi-modal data cleaning methods.

Dataset Construction & Error Taxonomy

ArtiFact is built through a 5-step Extract-Transform-Load (ETL) pipeline that collects, structures, normalizes, and deduplicates records from three institutions, then supports downstream tasks including error detection and semantic querying.

Data Collection & Downstream Tasks

Overview of the dataset generation process: an ETL pipeline collects records from three cultural heritage institutions, structures and prepares them, maps them to a shared schema, and performs semantic data unification. The preprocessed dataset supports downstream tasks, including cross-modal error detection and semantic query processing.

Get the Data

ArtiFact is freely available on HuggingFace Datasets and GitHub, and can be loaded programmatically with the datasets Python library.

HuggingFace Dataset

Download directly from the HuggingFace Hub.

🤗 deem-data/ArtiFact

GitHub Repository

Source code for the ETL pipeline, error injection framework, and downstream tasks.

OlgaOvcharenko/ArtiFact

from datasets import load_dataset

# Load the full dataset
ds = load_dataset("deem-data/ArtiFact")

# Inspect available splits and columns
print(ds)

# Access the first record
print(ds["train"][0])

For licensing inquiries, contact ovcharenko@tu-berlin.de.

Scale & Sources

ArtiFact aggregates open-access records from three of the world's premier cultural institutions, retaining only records with public-domain image URLs.

651,045
Total artwork records
3
Contributing museums
1,700+
Canonical material & technique terms
168K
Records processed with LLM parsing

The Metropolitan Museum of Art

Collected via the MET REST API across all 19 curatorial departments.

Art Institute of Chicago

Collected via open-access JSON-LD files using the International Image Interoperability Framework (IIIF).

Rijksmuseum Amsterdam

Collected via OAI-PMH and recursive JSON-LD linked data parsing.

Unified 24-Column Schema

Records from all three institutions are integrated into a single standardized schema spanning identifiers, temporal attributes, physical properties, cultural/geographic context, and artist metadata.

Identifiers

  • object_ID
  • object_name
  • title
  • description
  • subjects
  • inscriptions
  • image_url

Temporal

  • date_begin
  • date_end
  • date_begin_bce
  • date_end_bce
  • period
  • dynasty
  • reign

Physical

  • materials
  • techniques
  • dimensions_json

Cultural / Geographic

  • culture
  • location

Artist

  • artist_name
  • artist_role
  • artist_nationality
  • artist_date_begin
  • artist_date_end
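
The column groups above can be written down as a small lookup structure, which is handy for checking that a loaded record carries all 24 columns. This is a minimal sketch: the column names are taken from the listing, while the SCHEMA_GROUPS dictionary and the missing_columns helper are hypothetical names introduced here and are not part of the ArtiFact code base.

```python
from typing import Dict, List

# Column names copied from the schema listing; the grouping keys are illustrative.
SCHEMA_GROUPS: Dict[str, List[str]] = {
    "identifiers": [
        "object_ID", "object_name", "title", "description",
        "subjects", "inscriptions", "image_url",
    ],
    "temporal": [
        "date_begin", "date_end", "date_begin_bce", "date_end_bce",
        "period", "dynasty", "reign",
    ],
    "physical": ["materials", "techniques", "dimensions_json"],
    "cultural_geographic": ["culture", "location"],
    "artist": [
        "artist_name", "artist_role", "artist_nationality",
        "artist_date_begin", "artist_date_end",
    ],
}

# Flatten the groups into one ordered list of column names.
ALL_COLUMNS: List[str] = [col for cols in SCHEMA_GROUPS.values() for col in cols]


def missing_columns(record: dict) -> List[str]:
    """Return the schema columns that are absent from a record (hypothetical helper)."""
    return [col for col in ALL_COLUMNS if col not in record]


print(len(ALL_COLUMNS))  # 24
```

Calling missing_columns on a record loaded from the dataset should return an empty list if the record follows the schema as listed.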

Realistic Injected Errors

To support multi-modal error detection benchmarking, ArtiFact includes 130,209 records with synthetically injected errors. The errors follow a curated taxonomy of 7 categories and 19 subcategories, informed by domain experts at the MET, AIC, and Rijksmuseum.

Physical Errors

Injects material or technique anachronisms based on world knowledge, including interchanges between visually similar materials that frequently co-occur with the object type.

20,484 records

Culture Errors

Swaps culture labels between adjacent cultures or within the same continent, so that the injected label is plausible but factually incorrect.

10,780 records

Temporal Errors

Shifts the creation date range by a random historical offset of ±100, ±200, or ±300 years.

8,992 records
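
A date shift of this kind might look like the following sketch. It assumes that date_begin and date_end are integer years and that both ends of the range move by the same offset; it ignores the date_begin_bce and date_end_bce columns. The function name inject_temporal_error is hypothetical and this is not the project's actual error injection code.

```python
import random

# Offsets in years, as listed in the description above.
OFFSETS = [-300, -200, -100, 100, 200, 300]


def inject_temporal_error(record: dict, rng: random.Random) -> dict:
    """Return a copy of the record with its date range shifted by one random offset.

    Assumption: date_begin and date_end are integer years.
    """
    offset = rng.choice(OFFSETS)
    corrupted = dict(record)  # leave the clean record untouched
    corrupted["date_begin"] = record["date_begin"] + offset
    corrupted["date_end"] = record["date_end"] + offset
    return corrupted


clean = {"title": "Example", "date_begin": 1650, "date_end": 1660}
dirty = inject_temporal_error(clean, random.Random(0))
print(dirty["date_begin"], dirty["date_end"])
```

Shifting both ends by the same amount keeps the width of the range intact, so the error cannot be spotted from an implausibly long or inverted date range alone.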

Identity Errors

Introduces artist attribution errors ranging from random swaps to swaps between artists sharing nationality, historical era, and primary specialization.

26,952 records

Geographic Errors

Exchanges locations between neighboring countries, maintaining a lexical overlap constraint to prevent trivially detectable swaps.

13,476 records

Spatial Errors

Scales dimension units by 10× and swaps aspect ratios between height and width.

15,837 records
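
The two spatial manipulations can be illustrated as follows. The layout of dimensions_json is an assumption here (a JSON object with numeric "height" and "width" fields plus a "unit" string), and "scaling by 10×" is read as multiplying the numeric values; the real column may be structured differently. Both function names are hypothetical.

```python
import json


def scale_dimensions(dimensions_json: str, factor: float = 10.0) -> str:
    """Multiply height and width by a factor (assumed JSON layout, see lead-in)."""
    dims = json.loads(dimensions_json)
    for key in ("height", "width"):
        dims[key] = dims[key] * factor
    return json.dumps(dims)


def swap_height_width(dimensions_json: str) -> str:
    """Exchange height and width, which inverts the aspect ratio."""
    dims = json.loads(dimensions_json)
    dims["height"], dims["width"] = dims["width"], dims["height"]
    return json.dumps(dims)


original = json.dumps({"height": 30.0, "width": 20.0, "unit": "cm"})
print(scale_dimensions(original))
print(swap_height_width(original))
```

A swapped height and width is a cross-modal error in the sense that the metadata alone stays plausible, while the image of a portrait-format object would contradict a landscape-format entry.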

Visual Errors

Swaps images between visual twins identified via CLIP embeddings: adversarial pairs from different contexts that are visually similar but contextually distinct.

33,688 records
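
As a quick consistency check, the per-category counts listed above can be summed and compared with the overall figures. The numbers are copied from this page; the variable names are illustrative.

```python
# Per-category record counts as listed above.
ERROR_COUNTS = {
    "physical": 20_484,
    "culture": 10_780,
    "temporal": 8_992,
    "identity": 26_952,
    "geographic": 13_476,
    "spatial": 15_837,
    "visual": 33_688,
}

TOTAL_RECORDS = 651_045  # full dataset size

total_errors = sum(ERROR_COUNTS.values())
share = total_errors / TOTAL_RECORDS

print(total_errors)     # 130209
print(f"{share:.1%}")   # 20.0%
```

The seven categories add up to the stated 130,209 error-annotated records, which is one fifth of the full dataset.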

Error Taxonomy Examples

Seven error categories are injected into 130,209 records to create a realistic evaluation benchmark for cross-modal error detection. Errors span physical (material/technique), cultural, temporal, identity (artist), geographic, spatial (dimensions), and visual (image swap) dimensions.

Downstream Tasks

ArtiFact is designed to support a broad range of multi-modal data management research. We showcase two representative downstream tasks: cross-modal error detection and semantic query processing.

Cross-Modal Error Detection

Detect inconsistencies between images and metadata using the 130K-record annotated benchmark.

Semantic Query Processing

A mini-benchmark of 5 semantic queries combining relational predicates over structured data with semantic operators over text and images.
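
To make the idea of such a query concrete, the sketch below combines a relational predicate on a structured column with a semantic operator over text. It is not one of the five benchmark queries. The semantic operator is a keyword-match placeholder standing in for what would normally be a call to a language or vision model, and all function names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


def run_query(
    records: Iterable[Record],
    relational: Callable[[Record], bool],
    semantic: Callable[[Record], bool],
) -> List[Record]:
    """Keep records that satisfy both predicates.

    The relational predicate is evaluated first, so the (normally expensive)
    semantic operator only runs on records that pass it.
    """
    return [r for r in records if relational(r) and semantic(r)]


def made_after_1600(record: Record) -> bool:
    """Relational predicate over the structured date_begin column."""
    year = record.get("date_begin")
    return isinstance(year, int) and year >= 1600


def depicts_ship(record: Record) -> bool:
    """Placeholder semantic operator: a keyword match instead of a model call."""
    return "ship" in str(record.get("description") or "").lower()


records = [
    {"title": "A", "date_begin": 1650, "description": "A ship at sea"},
    {"title": "B", "date_begin": 1500, "description": "Ship in harbour"},
    {"title": "C", "date_begin": 1700, "description": "Portrait of a man"},
]
print([r["title"] for r in run_query(records, made_after_1600, depicts_ship)])  # ['A']
```

Ordering the predicates this way reflects a common design choice in semantic query processing: cheap relational filters prune the input before any model is invoked.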

Multi-Modal Retrieval

The paired image–metadata structure supports cross-modal retrieval tasks: retrieve artworks by image similarity, by structured attributes, or by combined semantic-visual queries.
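
Retrieval by image similarity typically reduces to a nearest-neighbour search over embedding vectors. The sketch below assumes that image embeddings have already been computed by some model (for example CLIP, which the error taxonomy uses to find visual twins) and stored per object_ID. The dataset itself provides image URLs rather than embeddings, so the index here is toy data and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k object IDs whose embeddings are most similar to the query."""
    scored = [(object_id, cosine_similarity(query, vec)) for object_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional "embeddings" keyed by made-up object IDs.
index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.0], index, k=2))
```

For a collection of this size, a brute-force scan like this would usually be replaced by a vectorized or approximate nearest-neighbour index; the scoring logic stays the same.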

Cite ArtiFact

If you use ArtiFact in your research, please cite our paper:

@article{duarte2026artifact,
  title     = {{ArtiFact}: A Large-Scale Multi-Modal Cultural Heritage Dataset},
  author    = {Duarte, Luciano and Ovcharenko, Olga and Schelter, Sebastian},
  year      = {2026},
  url       = {https://github.com/OlgaOvcharenko/ArtiFact}
}

Authors

Luciano Duarte

BIFOLD & TU Berlin

duarte.castineira@tu-berlin.de

Olga Ovcharenko

BIFOLD & TU Berlin

ovcharenko@tu-berlin.de

Sebastian Schelter

BIFOLD & TU Berlin

schelter@tu-berlin.de