Visual Question Answering – VizWiz (original)

Answer Visual Questions from People Who Are Blind

Figure: two rows of six visual questions, each shown with its corresponding answer.

Overview

We propose an artificial intelligence challenge to design algorithms that answer visual questions asked by people who are blind. For this purpose, we introduce a visual question answering (VQA) dataset originating from this population, which we call VizWiz-VQA. It arises from a natural visual question answering setting in which blind people each took an image and recorded a spoken question about it, and each visual question was paired with 10 crowdsourced answers. Our proposed challenge addresses the following two tasks for this dataset: (1) predict the answer to a visual question and (2) predict whether a visual question cannot be answered. Ultimately, we hope this work will educate more people about the technological needs of blind people while providing an exciting new opportunity for researchers to develop assistive technologies that eliminate accessibility barriers for blind people.

Dataset

Updated version as of January 10, 2023:

New, larger version as of January 1, 2020:

Deprecated version as of December 31, 2019:

New versus deprecated: the deprecated version uses 12-digit filenames (e.g., “VizWiz_val_000000028000.jpg”), while the new version uses 8-digit filenames (e.g., “VizWiz_val_00028000.jpg”).
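If filenames from the deprecated release need to be mapped to the new convention, a minimal sketch is shown below. It assumes the numeric ID is unchanged between versions and only the zero-padding differs, which the examples above suggest but which should be verified against the dataset.

import re

def to_new_filename(old_name):
    # Re-pad the 12-digit ID of a deprecated filename to the new 8-digit format.
    # Assumption: the numeric ID itself is identical across versions.
    match = re.fullmatch(r"(VizWiz_(?:train|val|test)_)(\d{12})(\.jpg)", old_name)
    if match is None:
        raise ValueError("unexpected filename: " + old_name)
    prefix, digits, suffix = match.groups()
    return "{}{:08d}{}".format(prefix, int(digits), suffix)

print(to_new_filename("VizWiz_val_000000028000.jpg"))  # VizWiz_val_00028000.jpg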

New dataset only: VizWiz_train_00023431.jpg through VizWiz_train_00023958.jpg are images that include private information, which is obfuscated using the ImageNet mean. The questions do not ask about the obfuscated private information in these images.

Dataset files to download are the training, validation, and test images together with the corresponding annotation files. In the annotation files, each visual question is represented as a JSON record, for example:

"answerable": 0,
"image": "VizWiz_val_00028000.jpg",
"question": "What is this?"
"answer_type": "unanswerable",
"answers": [
    {"answer": "unanswerable", "answer_confidence": "yes"},
    {"answer": "chair", "answer_confidence": "yes"},
    {"answer": "unanswerable", "answer_confidence": "yes"},
    {"answer": "unanswerable", "answer_confidence": "no"},
    {"answer": "unanswerable", "answer_confidence": "yes"},
    {"answer": "text", "answer_confidence": "maybe"},
    {"answer": "unanswerable", "answer_confidence": "yes"},
    {"answer": "bottle", "answer_confidence": "yes"},
    {"answer": "unanswerable", "answer_confidence": "yes"},
    {"answer": "unanswerable", "answer_confidence": "yes"}
]

The annotation files train.json and val.json show two ways to assign the answer type: in the deprecated version, “answer_type” is the answer type of the most popular answer, while in the new version it is the most popular answer type across all 10 answers.
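The following is a minimal sketch of the new-version convention, assuming each annotation file is a JSON list of records like the example above. The rules used here to map an individual answer to an answer type are an illustrative assumption, not the official preprocessing.

import json
from collections import Counter

def answer_type(answer):
    # Illustrative rules only; the official mapping from answers to answer types may differ.
    answer = answer.strip().lower()
    if answer == "unanswerable":
        return "unanswerable"
    if answer in ("yes", "no"):
        return "yes/no"
    if answer.replace(".", "", 1).isdigit():
        return "number"
    return "other"

with open("val.json") as f:   # one of the annotation files described above
    annotations = json.load(f)

record = annotations[0]
# New version: "answer_type" is the most popular answer type across all answers.
types = [answer_type(a["answer"]) for a in record["answers"]]
print(record["image"], Counter(types).most_common(1)[0][0])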

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Challenge

Our proposed challenge is designed around the VizWiz-VQA dataset and addresses the following two tasks:

Tasks

Task 1: Predict Answer to a Visual Question

Given an image and a question about it, the task is to predict an accurate answer. Inspired by the VQA challenge, we use the following accuracy evaluation metric:

accuracy = min(1, (number of humans that provided that answer) / 3)

That is, a predicted answer counts as fully correct when at least 3 humans provided that answer, and receives partial credit otherwise.

Following the VQA challenge, we average this accuracy over all 10-choose-9 subsets of the 10 human annotators. The team that achieves the highest average accuracy over all test visual questions wins this challenge.
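A minimal sketch of this metric is given below. It only lowercases and strips answers, whereas the official VQA evaluation code applies fuller answer normalization (punctuation, articles, number words), so it should be treated as an approximation.

def vqa_accuracy(predicted, human_answers):
    # Accuracy of one predicted answer against the 10 crowdsourced answers,
    # averaged over all 10-choose-9 subsets (i.e., leaving out one annotator at a time).
    predicted = predicted.strip().lower()
    answers = [a.strip().lower() for a in human_answers]
    subset_scores = []
    for leave_out in range(len(answers)):
        subset = answers[:leave_out] + answers[leave_out + 1:]
        matches = sum(a == predicted for a in subset)
        subset_scores.append(min(1.0, matches / 3.0))
    return sum(subset_scores) / len(subset_scores)

# Example with the 10 answers from the annotation record shown earlier.
answers = ["unanswerable"] * 7 + ["chair", "text", "bottle"]
print(vqa_accuracy("unanswerable", answers))  # 1.0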

Task 2: Predict Answerability of a Visual Question

Given an image and a question about it, the task is to predict whether the visual question cannot be answered, together with a confidence score for that prediction. The confidence score produced by a prediction model is for ‘answerable’ and should lie in [0, 1]. We use Python’s average precision evaluation metric, which computes the weighted mean of precisions along the precision-recall curve. The team that achieves the highest average precision over all test visual questions wins this challenge.
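A minimal sketch of this evaluation using scikit-learn's average_precision_score is shown below; the labels and scores are made up, and it assumes ‘answerable’ is treated as the positive class, which may differ from the official evaluation code.

from sklearn.metrics import average_precision_score

# Hypothetical ground truth (1 = answerable, 0 = unanswerable) and model confidence
# scores for "answerable"; replace both with real data.
y_true = [1, 0, 1, 1, 0, 1]
y_score = [0.92, 0.20, 0.75, 0.88, 0.41, 0.64]

print("average precision:", average_precision_score(y_true, y_score))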

Submission Instructions

Evaluation Servers

Teams participating in the challenge must submit results for the full 2018 VizWiz-VQA test dataset (i.e., 8,000 visual questions) to our evaluation servers, which are hosted on EvalAI (challenges 2020, 2021, 2022, 2023, and 2024 are deprecated). As done for prior challenges (e.g., VQA, COCO), we created different partitions of the test dataset to support different evaluation purposes:

Uploading Submissions to Evaluation Servers

To submit results, each team will first need to create a single account on EvalAI. Then, on the platform, click the “Submit” tab, select the submission phase (“test-dev”, “test-challenge”, or “test-standard”), select the results file (i.e., the JSON file) to upload, fill in the required metadata about the method, and click “Submit”. The evaluation server may take several minutes to process the results. To have the submission results appear on the public leaderboard when submitting to “test-standard”, check the box under “Show on Leaderboard”.

To view the status of a submission, navigate on the EvalAI platform to the “My Submissions” tab and choose the phase to which the results file was uploaded (i.e., “test-dev”, “test-challenge”, or “test-standard”). One of the following statuses should be shown: “Failed” or “Finished”. If the status is “Failed”, please check the “Stderr File” for the submission to troubleshoot. If the status is “Finished”, the evaluation successfully completed and the evaluation results can be downloaded. To do so, select “Result File” to retrieve the aggregated accuracy score for the submission phase used (i.e., “test-dev”, “test-challenge”, or “test-standard”).

The submission process is identical when submitting results to the “test-dev”, “test-challenge”, and “test-standard” evaluation servers. Therefore, we strongly recommend submitting your results first to “test-dev” to verify you understand the submission process.

Submission Results Formats

Use the following JSON formats to submit results for both challenge tasks:

Task 1: Predict Answer to a Visual Question

results = [result]

result = { "image": string, # e.g., 'VizWiz_test_00020000.jpg' "answer": string }

Task 2: Predict Answerability of a Visual Question

results = [result]

result = { "image": string, # e.g., 'VizWiz_test_00020000.jpg' "answerable": float #confidence score, 1: answerable, 0: unanswerable }

Leaderboards

New dataset (as of January 1, 2020): the Leaderboard pages for both tasks can be found here (Leaderboard 2020 can be found here).

Deprecated dataset (as of December 31, 2019): the Leaderboard pages for both tasks can be found here.

Rules

Code

The code for algorithm results reported in our CVPR 2018 publication is located here.

Publications

The new dataset is described in the following publications, while the deprecated version is based only on the first publication:

Contact Us

For questions about the VizWiz-VQA 2024 challenge that is part of the Visual Question Answering and Dialog workshop, please contact Everley Tseng at Yu-Yun.Tseng@colorado.edu or Chongyan Chen at chongyanchen_hci@utexas.edu.

For general questions, please review our FAQs page for answered questions and to post unanswered questions.

For questions about code, please send them to Qing Li at liqing@ucla.edu.

For other questions, comments, or feedback, please send them to Danna Gurari at danna.gurari@colorado.edu.