A Dataset for Web-Based Structural Reading Comprehension

What is WebSRC?

WebSRC is a novel Web-based Structural Reading Comprehension dataset. It consists of 0.44M question-answer pairs collected from 6.5K web pages, together with the corresponding HTML source code, screenshots, and metadata. Answering a question in WebSRC requires some understanding of the structure of the web page, and each answer is either a text span on the web page or yes/no.
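To make the description above concrete, here is a minimal sketch of how one question-answer pair might be held in memory. The class name `WebSRCExample`, its field names, and the convention that yes/no answers are stored as the strings "yes" and "no" are all assumptions made for illustration; they are not the released file format.

```python
from dataclasses import dataclass


@dataclass
class WebSRCExample:
    """Hypothetical in-memory view of one question-answer pair.

    Field names are illustrative only and do not mirror the released files.
    """
    question: str
    answer: str           # a text span from the page, or "yes" / "no" (assumed encoding)
    html: str             # HTML source code of the web page
    screenshot_path: str  # location of the page screenshot

    def is_yes_no(self) -> bool:
        """Return True when the answer is a yes/no answer rather than a text span."""
        return self.answer.strip().lower() in {"yes", "no"}


span_example = WebSRCExample(
    question="What is the price of the first item?",
    answer="$19.99",
    html="<div><span>Item A</span><span>$19.99</span></div>",
    screenshot_path="page_0001.png",
)
yes_no_example = WebSRCExample(
    question="Is the first item cheaper than the second?",
    answer="Yes",
    html="<div><span>$19.99</span><span>$24.99</span></div>",
    screenshot_path="page_0001.png",
)
```

Keeping the two answer types apart is useful because a span answer can be located in the page, whereas a yes/no answer has no position in the HTML.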

News

Apr 2, 2025: The train, dev, and test sets of our dataset can now be accessed through Huggingface.
Aug 27, 2022: To access the test set, see here.
Sept 1, 2021: The full dataset and baseline are available. Extraction code: 2mys. Download from Amazon.com.
Aug 25, 2021: Our paper has been accepted by EMNLP 2021; the updated version of the dataset will be available soon. The baseline is available on WebSRC-Baseline.

If you have any questions about this dataset, please contact chenlusz@sjtu.edu.cn or galaxychen@sjtu.edu.cn.

Leaderboard

Three metrics, namely Exact Match (EM), F1 score, and Path Overlap Score (POS), are used to evaluate models on the test set of WebSRC. Please refer to the paper for more details about the evaluation metrics.
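As a rough illustration of the three metrics, the sketch below computes them for a single prediction. It assumes SQuAD-style answer normalization for EM and F1, and it assumes POS is the overlap (intersection over union) between the sets of nodes on the root-to-answer paths of the predicted and reference answers. The function names and the integer node identifiers are hypothetical; the paper and the official baseline remain the authoritative definitions.

```python
import re
import string
from collections import Counter
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def path_overlap(pred_path: Sequence[int], ref_path: Sequence[int]) -> float:
    """Intersection over union of two root-to-answer node paths (assumed form of POS)."""
    pred, ref = set(pred_path), set(ref_path)
    if not pred and not ref:
        return 1.0
    return len(pred & ref) / len(pred | ref)


em = exact_match("The Apple iPhone", "apple iphone")
f1 = f1_score("red apple", "apple")
pos = path_overlap([0, 1, 2, 5], [0, 1, 2, 7])
```

The leaderboard values look like percentages, so under these assumptions the per-example scores would be averaged over the test set and multiplied by 100.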

Rank	Date	Model	Affiliation	EM	F1	POS
1	Sep 4, 2023	SageGPT-small-v0.2	4paradigm.Inc	89.11	92.15	N/A
2	Jan 12, 2024	ScreenAI 5B	Google Research (Baechler et al.)	84.02	87.24	N/A
3	Oct 12, 2022	DocPrompt (ErnieLayout-Large)	BAIDU-Document Intelligence (Wu et al.)	77.35	85.04	N/A
4	Mar 01, 2022	TIE (MarkupLM-Large)	Shanghai Jiao Tong University (Zhao et al., NAACL'22)	76.25	80.51	89.50
5	Mar 01, 2022	TIE (MarkupLM-Large)-3_seeds_ave	Shanghai Jiao Tong University (Zhao et al., NAACL'22)	75.87	80.19	89.73
6	Nov 30, 2021	MarkupLM-Large	MSRA+SJTU (Li et al., ACL'22)	69.87	77.94	88.09
7	Nov 30, 2021	MarkupLM-Large (3_seeds_ave)	MSRA+SJTU (Li et al., ACL'22)	69.09	76.45	87.24
8	Sept 01, 2021	V-PLM (ELECTRA)	Shanghai Jiao Tong University (Chen et al., EMNLP'21)	68.07	75.25	84.96
9	Sept 01, 2021	V-PLM (BERT)	Shanghai Jiao Tong University (Chen et al., EMNLP'21)	54.84	62.80	76.39