<h1 id="trec-2023-atomic-deadline-extension">TREC 2023 AToMiC - Deadline Extension</h1>
<p><em>2023-08-01</em></p>
<p>Dear Time-Warriors,</p>
<p>📢 Deadline Extension: We’ve heard your requests! Due to busy schedules, the TREC submission deadline has been extended to August 7th, 9:00 am (EST), aligning with the NIST office hours.
Need support? Reach out to us - let’s conquer this challenge together! 👊</p>
<p>Best regards,
TREC-AToMiC Organizers</p>
<h2 id="useful-links">Useful links</h2>
<ul>
<li><a href="https://ir.nist.gov/trecsubmit/atomic.html">Submission Form</a></li>
<li><a href="/trec-2023-guidelines/">Guidelines</a></li>
<li><a href="https://huggingface.co/datasets/TREC-AToMiC/TREC-2023-Text-to-Image">Task-1 Query</a></li>
<li><a href="https://huggingface.co/datasets/TREC-AToMiC/TREC-2023-Image-to-Text">Task-2 Query</a></li>
<li><a href="https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Baselines/tree/main/trec2023/runs">Baseline runfiles</a></li>
<li><a href="https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Baselines/tree/main/indexes">Prebuilt Index</a></li>
<li><a href="https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Baselines/tree/main/topics">Prebuilt Embeddings</a></li>
<li><a href="https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Baselines/tree/main/dev_set">Developement Data</a></li>
</ul>
<h1 id="trec-2023-atomic-test-queries">TREC 2023 AToMiC - Test queries</h1>
<p><em>2023-06-29</em></p>
<p>We are pleased to announce the release of the <strong>test</strong> topics for the TREC-AToMiC task.
These topics have been carefully selected from the AToMiC text collection, and we invite participants to submit their runfiles based on these test topics.</p>
<h2 id="topic-selection-task-1">Topic Selection (Task-1)</h2>
<p>In order to ensure a diverse range of topics, we employed two primary criteria for topic selection:</p>
<ul>
<li>
<p>Unillustrated sections in enwiki:
We identified sections within the English Wikipedia (enwiki) that do not have accompanying images but potentially have matching images on Wikipedia pages in other languages.</p>
</li>
<li>
<p>Vital articles at level 3:
We referred to <a href="https://en.wikipedia.org/wiki/Wikipedia:Vital_articles">Wikipedia:Vital_articles</a> and selected articles classified as level 3 vital articles. These articles serve as quality control monitors for Wikipedia.</p>
</li>
</ul>
<p>Additionally, we considered the following sub-criteria:</p>
<ul>
<li>Article Quality:
We sampled 100 topics from C-class articles, 50 topics from B-class articles, and 50 topics from Featured or Good Articles.</li>
<li>AToMiC Text Collection Coverage:
The coverage of the AToMiC text collection for the test topics is as follows: 180 topics belong to the <em>other</em> set with no sparse labels at all, 19 topics are from the <em>training</em> set, and 1 topic represents the <em>test</em> set. Note that choosing examples from the training set is deliberate, as the sparse labels may not cover the full range of images that could be associated with a topic.</li>
</ul>
<h2 id="topic-selection-task-2">Topic Selection (Task-2)</h2>
<p>For this year’s evaluation of Task-2, no additional annotations are required (see the <a href="/trec-2023-guidelines/">guidelines</a> for more detailed instructions): the judgments collected for Task-1 will be reused.
The image topics are selected using our baseline runfiles, namely fusion-all, fusion-ViTs, and fusion-SPLADE-ViTg. For each text topic, we use these baseline runfiles to identify the top-20 images.
To ensure diversity and avoid duplication, we perform further deduplication on the selected images, resulting in a final set of 200 images.</p>
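<p>For reference, this pooling-and-deduplication step can be sketched as follows. This is a minimal illustration rather than the exact script we used; the runfile names and the ordered-set deduplication are our own assumptions:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import OrderedDict

# Hypothetical paths to the three baseline runfiles (TREC 6-column format).
RUNFILES = ["fusion-all.trec", "fusion-ViTs.trec", "fusion-SPLADE-ViTg.trec"]
TOP_K = 20

pool = OrderedDict()  # image_id -> None; an ordered set for deduplication
for path in RUNFILES:
    per_topic = {}
    with open(path) as f:
        for line in f:
            topic_id, _, item_id, rank, _score, _run = line.split()
            per_topic.setdefault(topic_id, []).append((int(rank), item_id))
    for ranked in per_topic.values():
        ranked.sort()                       # order by rank within each topic
        for _, item_id in ranked[:TOP_K]:   # keep the top-20 images
            pool.setdefault(item_id, None)  # deduplicate across runs and topics

print(len(pool), "unique candidate images")
</code></pre></div></div>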
<p>We believe that this selection process ensures a diverse and representative set of topics for the TREC-AToMiC task.
We look forward to receiving your runfiles based on these test topics.
Please don’t hesitate to reach out if you have any questions or need further clarification.</p>
<p>Thank you for your participation!</p>
<h2 id="download-links">Download Links</h2>
<p>The dataset is available on HuggingFace 🤗:</p>
<h3 id="task-1">Task-1</h3>
<ul>
<li><a href="https://huggingface.co/datasets/TREC-AToMiC/TREC-2023-Text-to-Image">Query dataset</a></li>
<li><a href="https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Baselines/tree/main/trec2023/topics">Query embeddings</a></li>
<li><a href="https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Baselines/tree/main/trec2023/runs">Baseline runfiles</a></li>
</ul>
<h3 id="task-2">Task-2</h3>
<ul>
<li><a href="https://huggingface.co/datasets/TREC-AToMiC/TREC-2023-Image-to-Text">Query dataset</a></li>
</ul>
<h1 id="trec-2023-atomic-development-queries">TREC 2023 AToMiC - Development queries</h1>
<p><em>2023-06-02</em></p>
<p>We are releasing the development topics for the TREC-AToMiC task. These topics are an addition on top of the AToMiC validation set and aim to be closer to what you should expect for the task. The main difference is that they come with a pooled set of annotations, leading to richer annotation than the validation set. To achieve this richer annotation, however, far fewer queries (only 13) were selected, chosen to showcase different attributes of retrievers.</p>
<p>Note that these topics do not represent exactly what the final task will look like (for the test set we will aim for topics that are more important to Wikipedia); rather, they were selected because they were: a) easy to annotate, and b) able to showcase some important factor, such as topics from the AToMiC training set whose linked image is very different from others that may be found in the original AToMiC corpus.</p>
<h2 id="download-links">Download links</h2>
<ul>
<li><a href="https://huggingface.co/datasets/TREC-AToMiC/Development-Set-2023">Query dataset</a></li>
<li><a href="https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Baselines/blob/main/dev_set/devset_qrel.csv">Qrel file</a></li>
<li><a href="https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Baselines/tree/main/dev_set/runs">Run files for baselines (8 models)</a></li>
<li><a href="https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Baselines/blob/main/dev_set/evaluate.py">Template evaluation script</a></li>
</ul>
<h2 id="baseline-results">Baseline results</h2>
<p>To annotate the queries, we ran 5 baselines (4 multi-modal and 1 text-only) and created 3 ensembles (all multi-modal; best multi-modal + text; all 5 baselines). Everything is run in the TREC task scenario (“AToMiC large”). We took the top-10 results of each of the 8 runs and annotated them, leading to 533 annotations (an average of 41 unique results per query). We computed several metrics and present the results below:</p>
<p><img src="https://github.com/TREC-AToMiC/trec-atomic.github.io/assets/1783724/52d3b427-d982-493c-ab3d-f0a83eed00ff" alt="image" /></p>
<p>The first thing we notice is that the models perform better than expected on this set of topics. Indeed, if we compare the RR@10 of ViT-G on the AToMiC validation set (0.074) with the one obtained here, it is clear that the sparse annotations do not suffice for this task (we discuss this in more detail below, looking at each topic individually). Moreover, the gap between text-only models (such as SPLADE) and multi-modal models is greatly reduced, especially on precision-based metrics. Finally, we see improvements when using an ensemble of multi-modal and text-only models.</p>
<h2 id="overall-look-into-the-development-topics">Overall look into the development topics</h2>
<h3 id="tvcinema">TV/Cinema</h3>
<p>We chose the following five topics: Goldfinger (Cast and Plot), Space Jam (Cast), Friends (Premise), and How I Met Your Mother (Premise).</p>
<p>Here we want to see whether models can find simple information (e.g. photos of cast members for whom photos now exist on Wikimedia), but also handle something more complicated (plot points for Goldfinger). We also want to check whether the retriever works at the section level, which is the goal of AToMiC, rather than at the page level (thus Cast and Plot should have very different results for Goldfinger).</p>
<h3 id="soccer">Soccer</h3>
<p>Three topics: Andrea Barzagli (Return to the national team: Euro 2012, 2013 Confederations Cup and 2014 World Cup), Manchester United (2013->Present and Ferguson years (1986–2013)).</p>
<p>Again, we want to make sure that models are looking more at the section level (including the years in which events happened) than at the passage level. All topics here were selected knowing that images for these topics exist in other languages.</p>
<h3 id="transportation">Transportation</h3>
<p>Again, three topics: Emirates Airline (fleet), Flixbus (Europe), and List_of_Cadillac_vehicles.</p>
<p>We chose these topics because they are easy for multi-modal models (salient points in the images, or requiring OCR), but not always easy for text-only models (e.g. some Flixbus images are described only by the bus model, not by the fact that the bus belongs to Flixbus).</p>
<h3 id="geographyhistory">Geography/History</h3>
<p>Finally, we also took two completely different topics: NIST (World Trade Center collapse investigation) and Mutawakkilite_Kingdom_of_Yemen (introduction). The goal here was to pick topics that not only differ from the ones above, but also contain images that are underrepresented in traditional multi-modal evaluation (e.g. country maps and schematics).</p>
<h2 id="detailled-look-on-the-topics">Detailled look on the topics</h2>
<p>Coming soon!</p>
<h1 id="trec-2023-atomic-track-guidelines">TREC 2023 AToMiC Track Guidelines</h1>
<p><em>2023-05-08</em></p>
<p>Welcome to the TREC 2023 AToMiC (Authoring Tools for Multimedia Content) track.
This page provides essential guidelines about the track, including important dates, registration instructions, tasks, datasets, submission requirements, and evaluation methods.</p>
<h2 id="important-dates">Important Dates</h2>
<ul>
<li>Development topics released: June 2nd, 2023</li>
<li>Test topics released: <del>June 26th, 2023</del> June 29th, 2023</li>
<li>Submission deadline: <del>July 24th, 2023</del> August 7th, 2023, 9:00 am (EST)</li>
<li>Assessment period: August 2023</li>
</ul>
<h2 id="registration">Registration</h2>
<p>To participate in TREC, please register at the TREC <a href="https://ir.nist.gov/trecsubmit.open/application.html">website</a>.</p>
<h2 id="submission-form">Submission Form</h2>
<p>Submissions for the AToMiC track are open at the <a href="https://ir.nist.gov/trecsubmit/atomic.html">submission website</a>.
Feel free to reach out if you have any questions.</p>
<h2 id="introduction">Introduction</h2>
<p>Multimedia content creation involves understanding the connections between elements encoded in different modalities.
Visual elements such as photos, graphics, and diagrams are often used to supplement textual information, serving different purposes such as decoration, complementing, or transforming the meaning of the content.
However, searching for and adding a suitable visual element to articles can be a time-consuming task.
Our goal is to build automatic tools that can enrich multimedia content, making information easier and faster to comprehend, and ultimately breaking down the barrier to accessing information.</p>
<p>Image understanding is primarily concerned with concrete descriptions of depicted scenes and entities, their attributes and relations, as well as the events they participate in.
Current search engines often index attribute metadata and textual descriptions of images.
However, attribute metadata often fail to cover the salient aspects of the image content.
While textual descriptions often provide precise information, they are frequently scarce.
In contexts that require a precise understanding of images, search is frequently limited by the availability of text descriptions or the cost of obtaining accurate image captions.</p>
<h2 id="datasets">Datasets</h2>
<h3 id="resources">Resources</h3>
<ul>
<li>🤗 Dataset: <a href="https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Texts-v0.2.1">Text Collection</a></li>
<li>🤗 Dataset: <a href="https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Images-v0.2">Image Collection</a></li>
<li>🤗 Dataset: <a href="https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Qrels-v0.2">Sparse Judgments</a></li>
<li>Looking for more information?
<ul>
<li>Check our <a href="https://arxiv.org/abs/2304.01961">paper</a> (accepted to SIGIR23)</li>
<li>Check our repo: <a href="https://github.com/TREC-AToMiC/AToMiC">AToMiC</a></li>
</ul>
</li>
<li>Development topics/qrels: check our latest <a href="/annoucements/dev-queries/">post</a></li>
<li><strong>TREC 2023 topics are now available <a href="/annoucements/test-queries/">here</a></strong></li>
</ul>
<h2 id="tasks-ad-hoc-retrieval">Tasks: <em>ad hoc</em> retrieval</h2>
<p>TREC 2023 AToMiC Track features an image suggestion task as the primary task, while keeping image promotion as the secondary task.</p>
<h3 id="task-1-image-suggestion-primary">Task 1: Image Suggestion (Primary)</h3>
<p>The goal of this task is to find relevant images from a predefined <a href="https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Images-v0.2">image collection</a>, given a specific section of an article.
The task involves creating high-quality representations of information presented in images, so that appropriate images can be attached to the corresponding article sections.</p>
<p>The participants are expected to use the following fields (a small query-construction sketch follows the list):</p>
<ul>
<li>Topics (texts): the five textual fields <code class="language-plaintext highlighter-rouge">page_title</code>, <code class="language-plaintext highlighter-rouge">section_title</code>, <code class="language-plaintext highlighter-rouge">hierachy</code>, <code class="language-plaintext highlighter-rouge">context_page_description</code>, <code class="language-plaintext highlighter-rouge">context_section_description</code>.
Other fields such as <code class="language-plaintext highlighter-rouge">page_url</code>, <code class="language-plaintext highlighter-rouge">media</code> and <code class="language-plaintext highlighter-rouge">category</code> are free to use, but we encourage participants to provide descriptions when submitting their results.</li>
<li>Items (images): the pixel values in the <code class="language-plaintext highlighter-rouge">image</code> field and the corresponding textual descriptions such as <code class="language-plaintext highlighter-rouge">caption_reference_description</code>, <code class="language-plaintext highlighter-rouge">caption_alt_text_description</code>, <code class="language-plaintext highlighter-rouge">caption_attribution_description</code>.
Other fields such as <code class="language-plaintext highlighter-rouge">language</code> and <code class="language-plaintext highlighter-rouge">image_url</code> are free to use, but we encourage participants to provide descriptions when submitting their results.</li>
</ul>
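<p>For illustration, one simple way to flatten the five suggested topic fields into a single query string for a text retriever is sketched below; the dataset split name and the field separator are our own assumptions, not a track requirement:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from datasets import load_dataset

# Load the Task-1 topics (the split name here is an assumption).
topics = load_dataset("TREC-AToMiC/TREC-2023-Text-to-Image", split="train")

FIELDS = ["page_title", "section_title", "hierachy",
          "context_page_description", "context_section_description"]

def to_query(topic):
    # Concatenate the five suggested textual fields, skipping empty ones.
    return " | ".join(topic[f] for f in FIELDS if topic[f])

print(to_query(topics[0]))
</code></pre></div></div>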
<h3 id="task-2-image-promotion-secondary">Task 2: Image Promotion (Secondary)</h3>
<p>This task is the inverse of the image suggestion task.
Given a specific image represented by its pixel values, the goal is to identify a section of an article where the image can be appropriately attached.
The participants are expected to retrieve relevant items from a predefined <a href="https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Texts-v0.2.1">text collection</a>.
The objective is to prevent attaching the image to an irrelevant article, which could introduce noise when comprehending multimedia articles.
Although not the primary focus of this track, models built for this purpose could support authoring tools for crafting precise captions in multimedia content.</p>
<p>The participants are expected to use the following fields:</p>
<ul>
<li>
<p>Topics (images): the pixel values in the <code class="language-plaintext highlighter-rouge">image</code> field and the corresponding textual descriptions such as <code class="language-plaintext highlighter-rouge">caption_reference_description</code>, <code class="language-plaintext highlighter-rouge">caption_alt_text_description</code>, <code class="language-plaintext highlighter-rouge">caption_attribution_description</code>.
Other fields such as <code class="language-plaintext highlighter-rouge">language</code> and <code class="language-plaintext highlighter-rouge">image_url</code> are free to use, but we encourage participants to provide descriptions when submitting their results.</p>
</li>
<li>
<p>Items (texts): the five textual fields <code class="language-plaintext highlighter-rouge">page_title</code>, <code class="language-plaintext highlighter-rouge">section_title</code>, <code class="language-plaintext highlighter-rouge">hierachy</code>, <code class="language-plaintext highlighter-rouge">context_page_description</code>, <code class="language-plaintext highlighter-rouge">context_section_description</code>.
Other fields such as <code class="language-plaintext highlighter-rouge">page_url</code>, <code class="language-plaintext highlighter-rouge">media</code> and <code class="language-plaintext highlighter-rouge">category</code> are free to use, but we encourage participants to provide descriptions when submitting their results.</p>
</li>
</ul>
<p><strong>Note</strong>:
The evaluation of this task will use the annotations collected from the image suggestion task.
The assumption is that the two tasks are correlated, and the annotations could be transferred to this task.</p>
<h2 id="submission">Submission</h2>
<p>Submissions for the ad-hoc retrieval task should be in standard TREC 6-column format.
The six columns are <code class="language-plaintext highlighter-rouge">TopicID</code>, <code class="language-plaintext highlighter-rouge">Q0</code>, <code class="language-plaintext highlighter-rouge">ItemID</code>, <code class="language-plaintext highlighter-rouge">Rank</code>, <code class="language-plaintext highlighter-rouge">Score</code>, and <code class="language-plaintext highlighter-rouge">RunID</code> separated by spaces.
A submission should include at least one result for every topic in the test set.
Participants may submit up to <strong>5</strong> runs, each a separate 6-column file consisting of ranked results for all topics in the test set.
Participants are asked to return a ranked list of at most <strong>1,000</strong> items for each topic.</p>
<p>The six column format:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TopicID Q0 ItemID Rank Score RunID
</code></pre></div></div>
<ul>
<li><code class="language-plaintext highlighter-rouge">TopicID</code> contains the unique topic (query) identifiers. A submission should include at least one result for every topic in the test set.</li>
<li><code class="language-plaintext highlighter-rouge">Q0</code> contains a static string “Q0”. It’s a placeholder that we’re not using.</li>
<li><code class="language-plaintext highlighter-rouge">ItemID</code> contains the unique item identifiers for the retrieved candidates.</li>
<li><code class="language-plaintext highlighter-rouge">Rank</code> is the rank at which the candidate is retrieved.</li>
<li><code class="language-plaintext highlighter-rouge">Score</code> contains the score (integer or floating point) corresponding to the <code class="language-plaintext highlighter-rouge">TopicID</code> and <code class="language-plaintext highlighter-rouge">ItemID</code> pair. Scores <strong>must be in descending order</strong> within each topic.</li>
<li><code class="language-plaintext highlighter-rouge">RunID</code> is the tag for your submission run.</li>
</ul>
<p>Example submission:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>topic_1 Q0 item_1 1 2.73 runid1
topic_1 Q0 item_5 2 2.71 runid1
topic_1 Q0 item_9 3 2.61 runid1
topic_2 Q0 item_9 1 7.15 runid1
topic_2 Q0 item_15 2 0.89 runid1
</code></pre></div></div>
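<p>Before uploading, it may help to sanity-check a run against these constraints. Below is a minimal sketch (the file name and topic list are placeholders, not part of the track tooling):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def check_run(path, expected_topics, max_depth=1000):
    """Lightweight sanity checks for a TREC 6-column run file."""
    counts, last_score = {}, {}
    with open(path) as f:
        for n, line in enumerate(f, 1):
            cols = line.split()
            if len(cols) != 6:
                raise ValueError(f"line {n}: expected 6 columns")
            topic_id, q0, item_id, rank, score, run_id = cols
            if q0 != "Q0":
                raise ValueError(f"line {n}: second column must be Q0")
            score = float(score)
            # Scores must be in descending order within each topic.
            if topic_id in last_score and score > last_score[topic_id]:
                raise ValueError(f"line {n}: scores increase within {topic_id}")
            last_score[topic_id] = score
            counts[topic_id] = counts.get(topic_id, 0) + 1
            if counts[topic_id] > max_depth:
                raise ValueError(f"{topic_id}: more than {max_depth} items")
    missing = set(expected_topics) - set(counts)
    if missing:
        raise ValueError(f"no results for topics: {sorted(missing)}")

# Usage with placeholder names:
# check_run("myrun.trec", expected_topics=["topic_1", "topic_2"])
</code></pre></div></div>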
<h3 id="submission-types">Submission Types</h3>
<p>There are two types of submissions:</p>
<ol>
<li>
<p><strong>Automatic</strong>:
The main type of submission, where there is <strong>no manual intervention</strong> in running the test topics.
The ideal case is that you only look at the test topics to check that they ran properly (i.e., no bugs), and then you submit your automatic runs.
You should not adjust your runs, rewrite the query, retrain your model, or make any other sorts of manual adjustments after you see the test topics.</p>
</li>
<li>
<p><strong>Manual</strong>:
If you want to have a human in the loop for your run or do anything else that uses the test topics to adjust your model or ranking, you can mark your run as manual.
Manual runs are still welcome, but these are distinct from our main scenario, which is a system that responds to unseen topics automatically.</p>
</li>
</ol>
<h2 id="evaluation">Evaluation</h2>
<p>Unlike other machine learning competitions, the TREC 2023 AToMiC Track will not have a public/private leaderboard.
The annotations will be constructed <em>after</em> collecting all the submissions.
We will use depth pooling to construct the item pools for the queries.
Items in these pools will then be annotated by NIST assessors with graded relevance judgments.
The final evaluation results will be announced at the TREC 2023 workshop.</p>
<p>In addition to the <a href="https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Qrels-v0.2">sparse judgments</a>, a set of development topics with judgments will be released for early evaluation before submitting the test runs.
However, these judgments are not guaranteed to be similar to the final assessments by NIST annotators.</p>
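<p>For early sanity checks, runs can be scored against the development judgments with a standard tool such as <code class="language-plaintext highlighter-rouge">pytrec_eval</code>. A minimal sketch follows; the file names are placeholders, and since the released development qrels are distributed as a CSV, the parsing may need adjusting:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pytrec_eval

def read_qrels(path):
    # Space-separated TREC qrel format: topic_id, iteration, item_id, relevance.
    qrels = {}
    with open(path) as f:
        for line in f:
            topic_id, _, item_id, rel = line.split()
            qrels.setdefault(topic_id, {})[item_id] = int(rel)
    return qrels

def read_run(path):
    # TREC run format: topic_id, Q0, item_id, rank, score, run_id.
    run = {}
    with open(path) as f:
        for line in f:
            topic_id, _, item_id, _, score, _ = line.split()
            run.setdefault(topic_id, {})[item_id] = float(score)
    return run

qrels = read_qrels("dev_qrels.txt")  # placeholder path
run = read_run("my_dev_run.trec")    # placeholder path
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"recip_rank", "ndcg_cut.10"})
results = evaluator.evaluate(run)
for measure in ("recip_rank", "ndcg_cut_10"):
    mean = sum(r[measure] for r in results.values()) / len(results)
    print(measure, round(mean, 4))
</code></pre></div></div>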
<h2 id="additional-resources">Additional Resources</h2>
<p>Contributions of additional resources and suggestions are welcome.</p>
<h3 id="prebuilt-indexes">Prebuilt indexes</h3>
<p>The prebuilt indexes are now available on <a href="https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Baselines">🤗 datasets</a>.
You can use the following command to download the prebuilt indexes:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Baselines/resolve/main/indexes/ViT-L-14.laion2b_s32b_b82k.image.faiss.flat.tar.gz
wget https://huggingface.co/datasets/TREC-AToMiC/AToMiC-Baselines/resolve/main/indexes/ViT-L-14.laion2b_s32b_b82k.text.faiss.flat.tar.gz
</code></pre></div></div>
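<p>Once downloaded, the flat indexes can be searched with FAISS. A rough sketch follows, assuming the archive unpacks to a single flat index file and using <code class="language-plaintext highlighter-rouge">open_clip</code> to encode the query; the unpacked file name and the encoding setup are our own assumptions:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import tarfile

import faiss
import numpy as np
import open_clip
import torch

# Unpack the image index (the unpacked file name is an assumption).
with tarfile.open("ViT-L-14.laion2b_s32b_b82k.image.faiss.flat.tar.gz") as tar:
    tar.extractall(".")
index = faiss.read_index("ViT-L-14.laion2b_s32b_b82k.image.faiss.flat/index")

# Encode a text query with the matching open_clip checkpoint.
model, _, _ = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
with torch.no_grad():
    q = model.encode_text(tokenizer(["Boeing EC-135 command post aircraft"]))
    q = torch.nn.functional.normalize(q, dim=-1).numpy().astype(np.float32)

scores, ids = index.search(q, 10)  # top-10 nearest images for the query
print(scores[0], ids[0])
</code></pre></div></div>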
<h2 id="contact-information">Contact Information</h2>
<p>If you have any questions, comments, or suggestions for the organizers:</p>
<ul>
<li>Email <a href="mailto:jheng-hong.yang@uwaterloo.ca">Jheng-Hong Yang</a> or <a href="mailto:trec-atomic-organizers@googlegroups.com">AToMiC organizers</a></li>
</ul>
<p>Discuss with other participants:</p>
<ul>
<li>Mailing list for further announcements: <a href="https://groups.google.com/g/atomic-participants">Google group</a></li>
<li>Chit-chat & quick discussion: <a href="https://discord.gg/pgDMArnGAH">Discord</a></li>
</ul>
<h2 id="organizers">Organizers</h2>
<ul>
<li>Jheng-Hong Yang, University of Waterloo</li>
<li>Carlos Lassance, Naver Labs Europe</li>
<li>Rafael S. Rezende, Naver Labs Europe</li>
<li>Krishna Srinivasan, Google Research</li>
<li>Miriam Redi, Wikimedia Foundation</li>
<li>Stéphane Clinchant, Naver Labs Europe</li>
<li>Jimmy Lin, University of Waterloo</li>
</ul>
<p>Thank you for your interest in the TREC 2023 AToMiC Track. We look forward to your participation.</p>
<h1 id="atomic-text-collection-update">AToMiC Text Collection Update</h1>
<p><em>2023-05-01</em></p>
<ul>
<li>Text collection update: We have addressed the missing entity issues in our text collection and have released an updated version, <code class="language-plaintext highlighter-rouge">AToMiC-Texts-v0.2.1</code>. For those interested in participating in the TREC 2023 evaluation, please use this updated version. If you wish to reproduce the results presented in our SIGIR paper, please use <code class="language-plaintext highlighter-rouge">AToMiC-Texts-v0.2</code>. We have created a spreadsheet highlighting the differences in retrieval effectiveness between the two versions, which can be found <a href="https://docs.google.com/spreadsheets/d/1wSi_79Qx3GA1WAirwvoapiWJ4m2bPRM_rtUWRZ2qRIo/edit?usp=sharing">here</a>.</li>
</ul>
<h2 id="changes">Changes</h2>
<ul>
<li>Fixed missing-entity issues in <code class="language-plaintext highlighter-rouge">AToMiC-Texts-v0.2</code>.</li>
</ul>
<p>A passage in <code class="language-plaintext highlighter-rouge">AToMiC-Texts-v0.2</code>:</p>
<blockquote>
<p>text_id:
projected-08555460-002
context_page_description:
The Boeing EC-135 is a retired family of aircraft derived from the . During the , the EC-135 was best known for being modified to perform the mission where one EC-135 was always airborne 24 hours a day to serve as flying command post for the in the event of nuclear war. Various other EC-135 aircraft sat on airborne and ground alert throughout the Cold War, with the last EC-135C being retired in 1998. The EC-135N variant served as the tracking aircraft for the .\n\nThe Boeing E-6B “TACAMO” replaced the EC-135C.</p>
</blockquote>
<p>The same passage in <code class="language-plaintext highlighter-rouge">AToMiC-Texts-v0.2.1</code>:</p>
<blockquote>
<p>text_id:
projected-08555460-002
context_page_description:
The Boeing EC-135 is a retired family of command and control aircraft derived from the Boeing C-135 Stratolifter. During the Cold War, the EC-135 was best known for being modified to perform the Looking Glass mission where one EC-135 was always airborne 24 hours a day to serve as flying command post for the Strategic Air Command in the event of nuclear war. Various other EC-135 aircraft sat on airborne and ground alert throughout the Cold War, with the last EC-135C being retired in 1998. The EC-135N variant served as the tracking aircraft for the Apollo program.\n\nThe Boeing E-6B “TACAMO” replaced the EC-135C.</p>
</blockquote>
<h1 id="atomic-v02-dataset-released">AToMiC (v0.2) Dataset Released</h1>
<p><em>2023-04-05</em></p>
<p>The AToMiC dataset for the TREC 2023 evaluation is now available at the following locations:</p>
<ul>
<li><a href="https://github.com/TREC-AToMiC/AToMiC">GitHub</a></li>
<li><a href="https://huggingface.co/TREC-AToMiC">HuggingFace Hub</a></li>
</ul>
<p>To aid exploration of the dataset, we have included notebooks <a href="https://github.com/TREC-AToMiC/AToMiC/tree/main/notebooks">here</a>.
Additionally, the resource paper that accompanies the dataset is now available on <a href="https://arxiv.org/abs/2304.01961">arXiv</a>.</p>
<h2 id="changes">Changes</h2>
<ul>
<li>Expanded the text collection by including text-only samples (English Wikipedia articles) without any associated images. The previous version (v0.1) only contained paired image-text examples.</li>
<li>Expanded the image collection by incorporating images from non-English languages. The previous version only included images attached to English articles.</li>
</ul>
<h1 id="atomic-v01-dataset-released">AToMiC (v0.1) Dataset Released</h1>
<p><em>2022-12-02</em></p>
<p>The AToMiC (v0.1) dataset is available on the <a href="https://huggingface.co/TREC-AToMiC">HuggingFace Hub</a>.</p>
<p>See <a href="/about/">about</a> or our <a href="/assets/pdf/mm_track.pdf">white paper</a> for more details about the task.</p>
<h2 id="purpose">Purpose</h2>
<p>Multimedia retrieval evaluation and tool development.</p>
<h2 id="dataset-descriptions">Dataset descriptions</h2>
<table>
<thead>
<tr>
<th>split</th>
<th style="text-align: right"># Texts</th>
<th style="text-align: right"># Images</th>
<th style="text-align: right"># Qrels</th>
</tr>
</thead>
<tbody>
<tr>
<td>Training</td>
<td style="text-align: right">5,030,748</td>
<td style="text-align: right">3,723,512</td>
<td style="text-align: right">5,030,748</td>
</tr>
<tr>
<td>Validation</td>
<td style="text-align: right">38,859</td>
<td style="text-align: right">30,365</td>
<td style="text-align: right">38,859</td>
</tr>
<tr>
<td>Test</td>
<td style="text-align: right">30,938</td>
<td style="text-align: right">20,732</td>
<td style="text-align: right">30,938</td>
</tr>
<tr>
<td>Total</td>
<td style="text-align: right">5,100,545</td>
<td style="text-align: right">3,774,609</td>
<td style="text-align: right">5,100,545</td>
</tr>
</tbody>
</table>
<ul>
<li>Format:
<ul>
<li>Texts: parquet</li>
<li>Images: parquet with embedded images</li>
<li>Qrels: space-separated TREC Qrel format (see the example after this list)</li>
</ul>
</li>
<li>Source:
<ul>
<li>Image–Text tuples (Qrels) from <a href="https://github.com/google-research-datasets/wit">WIT</a></li>
<li>Images from Wikimedia</li>
</ul>
</li>
<li>Language: English</li>
</ul>
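<p>For illustration, a qrel file in this format contains one judgment per line: topic identifier, an unused iteration column, item identifier, and relevance grade. The identifiers below are hypothetical:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>topic_1 0 item_3 1
topic_1 0 item_7 1
topic_2 0 item_9 1
</code></pre></div></div>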
<h2 id="requirements">Requirements</h2>
<ul>
<li><a href="https://github.com/huggingface/datasets">HuggingFace Datasets >= 2.6.0</a></li>
</ul>
<h2 id="code-snippets">Code snippets:</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">load_dataset</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">load_dataset</span><span class="p">(</span>
<span class="s">"TREC-AToMiC/AToMiC-Images-v0.1"</span><span class="p">,</span>
<span class="n">split</span><span class="o">=</span><span class="s">'train'</span>
<span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span>
</code></pre></div></div>
<p>For other processing usage, see the <a href="https://huggingface.co/docs/datasets/main/en/process">HuggingFace Datasets documentation</a>.</p>