AToMiC (v0.1) Dataset is available on HuggingFace Hub

See the About page or our white paper for more details about the task.

Purpose

Multimedia retrieval evaluation and tool development

Dataset descriptions

Split          # Texts    # Images     # Qrels
Training     5,030,748   3,723,512   5,030,748
Validation      38,859      30,365      38,859
Test            30,938      20,732      30,938
Total        5,100,545   3,774,609   5,100,545

  • Format:
    • Texts: parquet
    • Images: parquet with embedded images
    • Qrels: space-separated TREC qrels format
  • Source:
    • Image–Text tuples (Qrels) from WIT
    • Images from Wikimedia
  • Language: English
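
The qrels files listed above use the space-separated TREC format, in which each line conventionally holds four fields: query ID, an iteration field (usually ignored), document ID, and a relevance label. A minimal sketch of reading such lines follows. The function name parse_qrels and the sample identifiers are made up for illustration and do not come from the dataset.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC-style qrels lines into {query_id: {doc_id: relevance}}."""
    qrels = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Conventional TREC layout: query_id iteration doc_id relevance
        query_id, _iteration, doc_id, relevance = line.split()
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Hypothetical identifiers, not real AToMiC IDs
sample = [
    "q1 0 d7 1",
    "q1 0 d9 1",
    "q2 0 d7 1",
]
qrels = parse_qrels(sample)
print(qrels["q1"])
```

Nesting the judgments by query ID keeps lookups cheap when scoring a ranked list one query at a time.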

Requirements

The HuggingFace datasets library is required. Code snippet:

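from datasets import load_dataset

# Load the training split of the AToMiC image collection from the HuggingFace Hub
dataset = load_dataset(
    "TREC-AToMiC/AToMiC-Images-v0.1",
    split="train",
)
print(dataset)  # summary of features and row count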
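
Because the image parquet files embed the images themselves, downloading a full split may take a while. A hedged sketch of inspecting a few records without the full download uses the streaming mode of load_dataset. The helper name peek_examples is hypothetical, and the sketch only lists field names because the record schema is not described here.

```python
def peek_examples(name="TREC-AToMiC/AToMiC-Images-v0.1", split="train", n=3):
    """Stream the first n records and return each record's field names."""
    # Imported lazily so that defining this helper needs neither the
    # library nor network access; calling it needs both.
    from datasets import load_dataset

    # streaming=True yields records on the fly instead of downloading the split
    stream = load_dataset(name, split=split, streaming=True)
    return [sorted(example.keys()) for example in stream.take(n)]
```

Calling print(peek_examples(n=1)) would show the field names of the first training record, which is a quick way to check the schema before committing to the full download.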

For other processing options, see the HuggingFace Datasets usage documentation.