The R2VQ task is structured as a set of question-answer pairs that probe how well a system understands the semantics of recipes, derived from a collection of cooking recipes and videos. Each question belongs to a “question family” reflecting a specific reasoning competence. The associated R2VQ dataset is designed for testing competence-based comprehension of machines over a multimodal recipe collection.

Data components

For the purposes of our work, we will build the R2VQ dataset, a collection of recipes sourced from https://recipes.fandom.com/wiki/Recipes_Wiki and foodista.com and labeled according to three distinct annotation layers: (i) Cooking Role Labeling (CRL), (ii) Semantic Role Labeling (SRL), and (iii) aligned image frames taken from cooking videos in the YouCook2 dataset (Zhou et al., 2018). The R2VQ corpus will consist of 1,000 recipes, with an estimated average of 10 ingredients and 8 sentences per recipe, and an average step length of 35 tokens. Of these, 800 recipes will be used for training and 100 each for validation and testing.

Cooking Role Labeling

Each recipe is annotated at the span level for cooking-related actions and their associated ingredients and props (tools, containers, habitats). Ingredients are labeled as either explicit (those listed in the ingredients section of the recipe) or implicit (the intermediate outputs of applying a cooking action to a set of explicit ingredients). Additionally, we include a variety of attributes that allow for iterative state tracking over the recipe text: each cooking event is directionally associated with the cooking action preceding it, allowing for a trace of ingredients and props as they are modified by each action, as well as coreference grounding for implicit ingredients (e.g., the implicit ingredient marinade is associated with the cooking event combine(vinegar,soy_sauce,oil)). Props are also annotated for orientation, which provides additional contextual information for downstream visualization and semantic reasoning tasks. Finally, cooking events that implicate props not explicitly mentioned in the text are marked to reflect that these additional props are necessary to complete the action.
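The event trace and the coreference grounding of implicit ingredients can be illustrated with a minimal sketch. The class and field names below (CookingEvent, hidden_props, resolve) and the example props and ingredients other than the marinade event are hypothetical; they are not the R2VQ annotation schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CookingEvent:
    """One annotated cooking action (hypothetical schema, not the R2VQ format)."""
    action: str
    ingredients: List[str]
    props: List[str] = field(default_factory=list)
    # Props needed for the action but not mentioned in the recipe text.
    hidden_props: List[str] = field(default_factory=list)
    # Name of the implicit ingredient this event produces, if any.
    result: Optional[str] = None
    # Directional link to the cooking event that precedes this one.
    previous: Optional["CookingEvent"] = None


def resolve(ingredient: str, producers: Dict[str, CookingEvent]) -> List[str]:
    """Expand an implicit ingredient into the explicit ingredients behind it."""
    event = producers.get(ingredient)
    if event is None:  # explicit ingredient: nothing to expand
        return [ingredient]
    explicit: List[str] = []
    for item in event.ingredients:
        explicit.extend(resolve(item, producers))
    return explicit


# The marinade example from the text, followed by an invented second step.
combine = CookingEvent("combine", ["vinegar", "soy_sauce", "oil"],
                       props=["bowl"], result="marinade")
coat = CookingEvent("coat", ["chicken", "marinade"],
                    hidden_props=["brush"], previous=combine)

# Map each implicit ingredient to the event that produced it.
producers = {e.result: e for e in (combine, coat) if e.result}

print(resolve("marinade", producers))  # ['vinegar', 'soy_sauce', 'oil']
expanded = [x for item in coat.ingredients for x in resolve(item, producers)]
print(expanded)  # ['chicken', 'vinegar', 'soy_sauce', 'oil']
```

Following the previous links from any event back to the start of the recipe gives the trace of how ingredients and props were modified up to that point.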

Semantic Role Labeling

One of the three layers with which steps in R2VQ are annotated is the Semantic Role Labeling (SRL) layer. In the context of our evaluation exercise, we treat SRL, i.e., the task of automatically identifying and labeling argument structures, as a span-based task, tagging the whole span of each argument in a given sentence. We chose VerbAtlas (Di Fabio et al., 2019; http://verbatlas.org/) as our reference inventory of semantic roles and first labeled the 5 recipes that make up the trial data automatically, by means of a state-of-the-art system (Conia and Navigli, 2020). Subsequently, we had one expert annotator validate both predicate and argument labels to ensure data quality.
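To make the span-based formulation concrete, the sketch below converts labeled argument spans into per-token BIO tags. The sentence, the span boundaries, and the choice of role labels are illustrative assumptions, not taken from the trial data, and the BIO encoding is one common representation rather than the format used in R2VQ.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, label)
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping labeled spans as one BIO tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span for label {label!r}")
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Illustrative sentence: the predicate plus two argument spans.
tokens = ["Roll", "the", "dough", "with", "a", "rolling", "pin"]
spans = [(0, 1, "V"), (1, 3, "Patient"), (3, 7, "Instrument")]

bio = spans_to_bio(len(tokens), spans)
for token, tag in zip(tokens, bio):
    print(f"{token}\t{tag}")
```

Each argument is tagged over its whole span ("with a rolling pin"), rather than only on its syntactic head ("pin").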

Video Component

The video data draws on 1,100 videos from the YouCook2 training and validation splits. A related dataset released in 2018, YouCook2-BoundingBox (YouCook2-BB; Zhou et al., 2018), is a step toward visual-semantic grounding in a multimodal dataset, but it still does not capture the full notion of competency outlined in our paper. The dataset contains 15,000 video-description pairs, annotated with the bounding boxes of the 67 most frequent objects. In the 60,663 seconds (about 17 hours) of video annotated with bounding boxes in the validation set of YouCook2-BB, 150,647 objects were annotated, of which 26,094 are occluded from view.
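The validation-set figures above can be put in proportion with a short calculation. The three input numbers come from the text; the derived rates assume each annotated object is counted once.

```python
# Figures reported for the YouCook2-BB validation set.
annotated_seconds = 60_663
annotated_objects = 150_647
occluded_objects = 26_094

hours = annotated_seconds / 3600
occluded_share = occluded_objects / annotated_objects
objects_per_second = annotated_objects / annotated_seconds

print(f"annotated video: {hours:.2f} hours")                       # 16.85 hours
print(f"occluded objects: {occluded_share:.1%}")                   # 17.3%
print(f"annotated objects per second: {objects_per_second:.2f}")   # 2.48
```

Roughly one in six annotated objects is occluded, which illustrates why object presence cannot always be read directly off a single frame.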

It is important to note that objects are only annotated when explicitly mentioned in a given text description. As a result, the competency-based inference that the action “Beat the eggs” requires a “fork” is not accounted for in the YouCook2-BoundingBox dataset. Our alignment of image trigrams with contextualized cooking actions and our annotation of all props in the frames account for these sorts of competency-based inferences.

For each cooking action in the R2VQ dataset for which a visual counterpart can be found, we include a video ID and a timestamp pointing to a YouTube video. Video data was sourced from both the YouCook2 dataset and ad hoc videos found by querying the YouTube API with a given recipe’s title. We use the S3D MIL-NCE model for text-to-video retrieval, using spans of text containing a cooking action as input.
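The retrieval step can be pictured as ranking candidate clips by the similarity between a text embedding and clip embeddings in a shared space. The sketch below does not call the S3D MIL-NCE model: the three-dimensional vectors are made-up stand-ins for its outputs, the video IDs and timestamps are placeholders, and cosine similarity is an assumed scoring function.

```python
import math
from typing import Dict, List, Sequence, Tuple

ClipKey = Tuple[str, float]  # (video id, timestamp in seconds)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_clips(text_vec: Sequence[float],
               clip_vecs: Dict[ClipKey, Sequence[float]]) -> List[Tuple[ClipKey, float]]:
    """Return clips sorted from most to least similar to the text span."""
    scored = [(key, cosine(text_vec, vec)) for key, vec in clip_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings: a text span containing a cooking action and three clips.
text_vec = [0.9, 0.1, 0.0]
clip_vecs = {
    ("video_a", 12.0): [0.8, 0.2, 0.1],
    ("video_a", 47.5): [0.0, 1.0, 0.0],
    ("video_b", 3.0): [0.1, 0.1, 0.9],
}

best_clip, best_score = rank_clips(text_vec, clip_vecs)[0]
print(best_clip)  # ('video_a', 12.0)
```

The top-ranked key is what would be stored with the cooking action as its video ID and timestamp.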

Question Families

We adopt the concept of “question families” as outlined in the CLEVR dataset (Johnson et al., 2017). While some question families naturally transfer over from the VQA domain (e.g., integer comparison, counting), other concepts such as ellipsis and object lifespan must be employed to cover the full extent of competency within procedural texts.

We start by creating text templates for each question family we identified. Actual questions are then created by combining the templates with random entities and relations from the annotation. Word inflection is applied to ensure the grammaticality of the questions. Each template is also associated with a functional program, which contains a set of functions that query and filter the annotated recipe to obtain the answer to that template-based question.
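A minimal sketch of the template-plus-program idea for a cardinality question follows. The recipe records, the template wording, and the function names (filter_by_prop, count, cardinality_program, generate_question) are all hypothetical, and word inflection is not covered.

```python
import random
from typing import Any, Dict, List, Tuple

Event = Dict[str, Any]

# Hypothetical annotated recipe: one record per cooking event.
recipe: List[Event] = [
    {"step": 1, "action": "melt", "ingredients": ["butter"], "props": ["saucepan", "spatula"]},
    {"step": 2, "action": "stir", "ingredients": ["butter", "flour"], "props": ["spatula"]},
    {"step": 3, "action": "pour", "ingredients": ["milk"], "props": ["saucepan"]},
    {"step": 4, "action": "stir", "ingredients": ["sauce"], "props": ["spatula"]},
]

# Hypothetical text template for the cardinality family.
TEMPLATE = "How many times is the {prop} used?"


def filter_by_prop(events: List[Event], prop: str) -> List[Event]:
    """Keep only the events whose annotation mentions the given prop."""
    return [e for e in events if prop in e["props"]]


def count(events: List[Event]) -> int:
    return len(events)


def cardinality_program(events: List[Event], prop: str) -> int:
    """Functional program paired with TEMPLATE: filter, then count."""
    return count(filter_by_prop(events, prop))


def generate_question(events: List[Event], rng: random.Random) -> Tuple[str, int]:
    """Fill the template with a randomly chosen annotated prop and answer it."""
    props = sorted({p for e in events for p in e["props"]})
    prop = rng.choice(props)
    return TEMPLATE.format(prop=prop), cardinality_program(events, prop)


question, answer = generate_question(recipe, random.Random(0))
print(question, "->", answer)
print(cardinality_program(recipe, "spatula"))  # 3
```

Because the answer is computed by the program from the same annotation that filled the template, every generated question is answerable and its answer is consistent with the annotation.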

Sample question-answer pairs from the trial data are as follows:

  • Cardinality
    # question cq5 = How many times the spatula is used?
    # answer cq5 = 4
  • Ellipsis
    # question eq12 = What should be tossed in the saute pan?
    # answer eq12 = pancetta, asparagus, parmesan cheese and pasta
  • Implicit Argument Identification
    # question iq12 = What is used to roll the dough?
    # answer iq12 = rolling pin
  • Object Lifespan
    # question oq1 = Are the dough in step 3 and step 7 identical?
    # answer oq1 = False
  • Event Ordering
    # question eoq1 = Creaming butter with the mixer and beating butter, which comes first?
    # answer eoq1 = the first event
  • Attribute
    # question attrq01 = How would you beat the flour?
    # answer attrq01 = gradually
  • Temporal
    # question tempq20 = How long should you simmer dumplings?
    # answer tempq20 = for about 15 minutes
  • Result
    # question resq00 = To what extent should you knead the dough?
    # answer resq00 = until it becomes smooth