ArtiFact — A Large-Scale Multi-Modal Cultural Heritage Dataset

Overview

What is ArtiFact?

Multi-modal data has become a central focus of contemporary database research, driving advances in data integration, retrieval, semantic query processing, and data quality management. Despite this growing interest, the community lacks large-scale real-world datasets with multi-modal records. We present ArtiFact, a large-scale multi-modal cultural heritage dataset containing 651,045 museum records collected from three major institutions: the Metropolitan Museum of Art (MET), the Art Institute of Chicago (AIC), and the Rijksmuseum in Amsterdam. Each record pairs images with textual descriptions and structured data. To construct the dataset, we developed a unified ETL pipeline that combines rule-based normalization with LLM-assisted semantic parsing to standardize heterogeneous museum records across institutions. ArtiFact supports a broad range of downstream tasks, including semantic query processing, multi-modal retrieval, and multi-modal error detection. For error detection, the dataset incorporates a curated taxonomy of realistic errors, allowing evaluation of multi-modal data cleaning methods.

How It Works

Dataset Construction & Error Taxonomy

ArtiFact is built through a 5-step process. The first four steps form an Extract-Transform-Load (ETL) pipeline that collects, structures, normalizes, and deduplicates records from three institutions; the fifth applies the resulting dataset to downstream tasks, including error detection and semantic querying.

Data Collection & Downstream Tasks

Overview of the dataset generation process: an ETL pipeline ① collects, ② structures & prepares, ③ maps to a shared schema, and ④ performs semantic data unification across three cultural heritage institutions; the preprocessed dataset supports ⑤ downstream tasks including cross-modal error detection and semantic query processing.

Access

Get the Data

ArtiFact is freely available on HuggingFace Datasets and GitHub. Use the datasets library to load it programmatically.

HuggingFace Dataset

Download directly from the HuggingFace Hub.

🤗 deem-data/ArtiFact

GitHub Repository

Source code for the ETL pipeline, error injection framework, and downstream tasks.

OlgaOvcharenko/ArtiFact

from datasets import load_dataset

# Load the full dataset
ds = load_dataset("deem-data/ArtiFact")

# Inspect available splits and columns
print(ds)

# Access the first record
print(ds["train"][0])

For licensing inquiries, contact ovcharenko@tu-berlin.de.

Dataset

Scale & Sources

ArtiFact aggregates open-access records from three of the world's premier cultural institutions, retaining only records with public-domain image URLs.

651,045

Total artwork records

3

Contributing museums

1,700+

Canonical material & technique terms

168K

Records processed with LLM parsing

The Metropolitan Museum of Art

Collected via the MET REST API across all 19 curatorial departments.

Art Institute of Chicago

Collected via open-access JSON-LD files using the International Image Interoperability Framework (IIIF).

Rijksmuseum Amsterdam

Collected via OAI-PMH and recursive JSON-LD linked data parsing.

Schema

Unified 24-Column Schema

Records from all three institutions are integrated into a single standardized schema spanning identifiers, temporal attributes, physical properties, cultural/geographic context, and artist metadata.

Identifiers

object_ID
object_name
title
description
subjects
inscriptions
image_url

Temporal

date_begin
date_end
date_begin_bce
date_end_bce
period
dynasty
reign

Physical

materials
techniques
dimensions_json

Cultural / Geographic

culture
location

Artist

artist_name
artist_role
artist_nationality
artist_date_begin
artist_date_end
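
The column groups above can be written down as a small lookup table, for example to check that a loaded record exposes all 24 fields. This is a minimal sketch: the column names are copied from the listing above, while `SCHEMA_GROUPS`, `ALL_COLUMNS`, and `missing_columns` are hypothetical helper names and not part of the ArtiFact code base.

```python
from __future__ import annotations

# Column groups copied from the schema listing; the dict and helper are illustrative only.
SCHEMA_GROUPS = {
    "identifiers": ["object_ID", "object_name", "title", "description",
                    "subjects", "inscriptions", "image_url"],
    "temporal": ["date_begin", "date_end", "date_begin_bce", "date_end_bce",
                 "period", "dynasty", "reign"],
    "physical": ["materials", "techniques", "dimensions_json"],
    "cultural_geographic": ["culture", "location"],
    "artist": ["artist_name", "artist_role", "artist_nationality",
               "artist_date_begin", "artist_date_end"],
}

# Flattened list of all columns, in listing order.
ALL_COLUMNS = [col for cols in SCHEMA_GROUPS.values() for col in cols]


def missing_columns(record: dict) -> list[str]:
    """Return the schema columns that are absent from a record (order preserved)."""
    return [col for col in ALL_COLUMNS if col not in record]
```

Assuming the published dataset uses exactly these column names, `missing_columns(ds["train"][0])` should return an empty list for a record loaded with the snippet in the Access section.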

Error Taxonomy

Realistic Injected Errors

To support multi-modal error detection benchmarking, ArtiFact includes a curated error taxonomy of 7 categories and 19 subcategories, informed by domain experts at the MET, AIC, and Rijksmuseum. Errors from this taxonomy are synthetically injected into 130,209 records.

Physical Errors

Injects material or technique anachronisms based on world knowledge, including interchanges between visually similar materials that frequently co-occur with the object type.

20,484 records

Culture Errors

Swaps culture labels between adjacent cultures or within the same continent — plausible but factually incorrect.

10,780 records

Temporal Errors

Shifts the creation date range by a random historical offset of ±100, ±200, or ±300 years.

8,992 records
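
As a rough illustration of the temporal shift described above, the sketch below moves both date bounds by the same randomly chosen offset. It assumes that `date_begin` and `date_end` hold integer years and that the offset is drawn uniformly; the function name `inject_temporal_error` is hypothetical, and the actual error injection framework may handle BCE flags and edge cases differently.

```python
import random

# Offsets named in the description above; uniform sampling is an assumption.
OFFSETS = [-300, -200, -100, 100, 200, 300]


def inject_temporal_error(record: dict, rng: random.Random) -> dict:
    """Return a copy of the record with both date bounds shifted by one offset."""
    offset = rng.choice(OFFSETS)
    corrupted = dict(record)  # leave the clean record untouched
    corrupted["date_begin"] = record["date_begin"] + offset
    corrupted["date_end"] = record["date_end"] + offset
    return corrupted
```

Shifting both bounds by the same amount keeps the width of the date range intact, so the corrupted record stays internally consistent and can only be detected against other evidence such as the image or the artist dates.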

Identity Errors

Introduces artist attribution errors ranging from random swaps to swaps between artists sharing nationality, historical era, and primary specialization.

26,952 records

Geographic Errors

Exchanges locations between neighboring countries, maintaining a lexical overlap constraint to prevent trivially detectable swaps.

13,476 records

Spatial Errors

Scales dimension units by 10× and swaps aspect ratios between height and width.

15,837 records
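
The two spatial manipulations can be sketched as separate helpers. The layout of `dimensions_json` is not documented here, so the example assumes an already parsed dict with numeric `height` and `width` entries; both function names are hypothetical, and the source does not state whether the two manipulations are applied together or as separate subcategories.

```python
def scale_dimensions(dims: dict, factor: float = 10.0) -> dict:
    """Multiply every dimension value by `factor` (a unit-scale error)."""
    return {key: value * factor for key, value in dims.items()}


def swap_height_width(dims: dict) -> dict:
    """Exchange height and width, which inverts the aspect ratio."""
    swapped = dict(dims)
    swapped["height"], swapped["width"] = dims["width"], dims["height"]
    return swapped
```

For a 30 × 20 object, `scale_dimensions` yields 300 × 200 and `swap_height_width` yields 20 × 30; either result contradicts the proportions visible in the image.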

Visual Errors

Swaps images identified via CLIP embeddings as visual twins — adversarial pairs from different contexts that are visually similar but contextually distinct.

33,688 records
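
One way to picture the visual-twin search is a nearest-neighbor lookup over precomputed image embeddings that skips candidates from the same context. The sketch below is an assumption-laden illustration, not the authors' method: embeddings are plain Python lists (in practice they would come from a CLIP image encoder), "different context" is approximated by a differing `culture` value, and the `min_sim` threshold of 0.9 is arbitrary.

```python
from __future__ import annotations

import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_visual_twin(query: dict, candidates: list[dict], min_sim: float = 0.9):
    """Return the most similar candidate whose context differs from the query.

    Each item is a dict with the hypothetical keys "embedding" and "culture".
    Returns None when no candidate reaches `min_sim`.
    """
    best, best_sim = None, min_sim
    for cand in candidates:
        if cand["culture"] == query["culture"]:
            continue  # same context: not an adversarial pair
        sim = cosine(query["embedding"], cand["embedding"])
        if sim >= best_sim:
            best, best_sim = cand, sim
    return best
```

Swapping the image URLs of the query and the returned twin would then produce a record whose image looks plausible but belongs to a different object.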

Error Taxonomy Examples

Seven error categories are injected into 130,209 records to create a realistic evaluation benchmark for cross-modal error detection. Errors span physical (material/technique), cultural, temporal, identity (artist), geographic, spatial (dimensions), and visual (image swap) dimensions.
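
As a quick consistency check, the per-category counts listed above can be tallied against the stated totals. The snippet uses only numbers given on this page; the variable names are illustrative.

```python
# Per-category counts as listed in the error taxonomy above.
ERROR_COUNTS = {
    "physical": 20_484,
    "culture": 10_780,
    "temporal": 8_992,
    "identity": 26_952,
    "geographic": 13_476,
    "spatial": 15_837,
    "visual": 33_688,
}
TOTAL_RECORDS = 651_045

total_errors = sum(ERROR_COUNTS.values())
error_share = total_errors / TOTAL_RECORDS

print(total_errors)          # 130209
print(f"{error_share:.1%}")  # 20.0%
```

The category counts add up to exactly 130,209, which suggests that each affected record is counted under a single category, and the affected records make up one fifth of the 651,045 records in the dataset.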

Applications

Downstream Tasks

ArtiFact is designed to support a broad range of multi-modal data management research. We showcase two representative downstream tasks: cross-modal error detection and semantic query processing.

🔍

Cross-Modal Error Detection

Detect inconsistencies between images and metadata using the 130K-record annotated benchmark.

💬

Semantic Query Processing

A mini-benchmark of 5 semantic queries combining relational predicates over structured data with semantic operators over text and images.
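
The general shape of such a query can be sketched as a relational predicate followed by a semantic operator. The benchmark's actual queries and operators are not shown on this page, so everything below is hypothetical: `semantic_filter` and `mentions_ship` are made-up names, the records are toy data, and the keyword match merely stands in for a call to a language or vision-language model.

```python
from __future__ import annotations

from typing import Callable, Iterable


def semantic_filter(
    records: Iterable[dict],
    predicate: Callable[[dict], bool],
    semantic_op: Callable[[dict], bool],
) -> list[dict]:
    """Keep records that pass the relational predicate and then the semantic operator."""
    # `and` short-circuits, so the (expensive) semantic operator only runs
    # on records that already satisfy the cheap relational predicate.
    return [r for r in records if predicate(r) and semantic_op(r)]


def mentions_ship(record: dict) -> bool:
    """Stand-in semantic operator: keyword match on the description."""
    return "ship" in (record.get("description") or "").lower()


records = [
    {"title": "A", "date_begin": 1650, "description": "A ship in a storm"},
    {"title": "B", "date_begin": 1650, "description": "Portrait of a merchant"},
    {"title": "C", "date_begin": 1890, "description": "Ship at anchor"},
]
hits = semantic_filter(records, lambda r: 1600 <= r["date_begin"] < 1700, mentions_ship)
print([r["title"] for r in hits])  # ['A']
```

Evaluating the relational predicate first is a common design choice in this setting, since structured filters are cheap and reduce the number of model calls the semantic operator has to make.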

🖼️

Multi-Modal Retrieval

The paired image–metadata structure supports cross-modal retrieval tasks: retrieve artworks by image similarity, by structured attributes, or by combined semantic-visual queries.
