AToMiC (v0.1) Dataset Released
The AToMiC (v0.1) dataset is available on the HuggingFace Hub.
See the About page or our white paper for more details about the task.
Purpose
Multimedia retrieval evaluation and tool development
Dataset descriptions
Split | # Texts | # Images | # Qrels |
---|---|---|---|
Training | 5,030,748 | 3,723,512 | 5,030,748 |
Validation | 38,859 | 30,365 | 38,859 |
Test | 30,938 | 20,732 | 30,938 |
Total | 5,100,545 | 3,774,609 | 5,100,545 |
- Format:
  - Texts: parquet
  - Images: parquet with embedded images
  - Qrels: space-separated TREC qrels format
- Source:
  - Image–Text tuples (Qrels) from WIT
  - Images from Wikimedia
- Language: English
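As a rough illustration of the qrels format listed above, the sketch below parses space-separated TREC qrels lines, which conventionally carry four fields: query ID, an iteration field (usually ignored), document ID, and relevance label. The function name `parse_qrels` and the sample IDs are hypothetical and are not taken from the dataset itself.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC qrels lines into {query_id: {doc_id: relevance}}."""
    qrels = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Standard TREC layout: query_id, iteration, doc_id, relevance
        query_id, _iteration, doc_id, relevance = line.split()
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Hypothetical sample lines; real AToMiC IDs will look different.
sample = [
    "q1 0 d42 1",
    "q2 0 d7 1",
]
qrels = parse_qrels(sample)
print(qrels)
```

To read a real file, pass an open file handle in place of `sample`, since the function only needs an iterable of lines.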
Requirements
Code snippets:
from datasets import load_dataset

# Load the training split of the image collection from the HuggingFace Hub.
dataset = load_dataset(
    "TREC-AToMiC/AToMiC-Images-v0.1",
    split="train",
)
print(dataset)
For other processing options, see the HuggingFace Datasets usage documentation.
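Because the image collection holds several million entries, it may be convenient to inspect a few examples before downloading a whole split. The following is a minimal sketch, assuming the `datasets` library is installed and using its `streaming=True` option; the helper name `peek_examples` is hypothetical, and no particular column names are assumed.

```python
from itertools import islice


def peek_examples(repo_id, n=3, split="train"):
    """Return the first n examples of a Hub dataset via streaming."""
    # Imported here so that defining the helper does not require the library.
    from datasets import load_dataset

    # streaming=True yields examples lazily instead of materializing the split.
    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, n))


# Example call (requires network access):
# for example in peek_examples("TREC-AToMiC/AToMiC-Images-v0.1"):
#     print(sorted(example.keys()))
```

Printing the keys of each example is a quick way to discover the available columns before writing any processing code.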