
# ScreenSpot

## GUI Grounding Benchmark: ScreenSpot

ScreenSpot is an evaluation benchmark for GUI grounding, comprising over 1,200 instructions from iOS, Android, macOS, Windows, and Web environments, along with annotated element types (Text or Icon/Widget).

This evaluation allows for both:
- `screenspot_rec_test`: the original evaluation of `{img} {instruction} --> {bounding box}`, called grounding or Referring Expression Comprehension (REC);
- `screenspot_reg_test`: the new evaluation of `{img} {bounding box} --> {instruction}`, called instruction generation or Referring Expression Generation (REG).

### REC Metrics

REC/Grounding requires that a model outputs a bounding box for the target element in the image. The evaluation metrics are as follows (a minimal computation sketch appears after the list):
- `IoU`: Intersection over Union (IoU) between the predicted bounding box and the ground-truth bounding box.
- `ACC@IoU`: We use `IoU` to create `ACC@IoU` metrics at different IoU thresholds, where an output with an IoU above the threshold is considered correct.
- `CENTER ACC`: The predicted bounding box is considered correct if its center lies within the ground-truth bounding box. This is the metric reported in the paper.
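
Below is a minimal, illustrative sketch of these REC metrics in plain Python, assuming boxes are given as `(x1, y1, x2, y2)` tuples in the same coordinate space; it is not the exact implementation used by the task code.

```python
# Illustrative REC metric helpers (not the task's exact implementation).
# Boxes are assumed to be (x1, y1, x2, y2) tuples in the same coordinate space.

def iou(pred, gt):
    """Intersection over Union between a predicted and a ground-truth box."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    pred_area = (pred[2] - pred[0]) * (pred[3] - pred[1])
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = pred_area + gt_area - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(pred, gt, threshold=0.5):
    """ACC@IoU: 1.0 if the IoU clears the threshold, else 0.0."""
    return float(iou(pred, gt) >= threshold)

def center_acc(pred, gt):
    """CENTER ACC: 1.0 if the center of the predicted box lies inside the ground-truth box."""
    cx, cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    return float(gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3])
```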

### REG Metrics

REG/Generation requires that a model outputs the instruction that describes the target element in the image. Currently, this element is highlighted in red in the image. The evaluation metric is as follows (a scoring sketch appears after the list):
- `CIDEr`: The CIDEr metric is used to evaluate the quality of the generated instruction. As the paper does not consider this task, we selected CIDEr as the standard for judging the quality of the generated instruction; this matches what other works, such as ScreenAI, have done for instruction generation on the RICO datasets.
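
The sketch below scores generated instructions with the `pycocoevalcap` CIDEr scorer. The dictionary format, sample keys, and pre-tokenized strings are assumptions for illustration, and in practice the score is computed over the whole split rather than a couple of samples.

```python
# Illustrative CIDEr scoring for generated instructions (not necessarily the
# exact scorer wiring used by the lmms-eval task).
from pycocoevalcap.cider.cider import Cider

# One reference instruction and one generated instruction per sample id;
# strings are assumed to be already tokenized and lower-cased.
references = {"0": ["open the settings menu"], "1": ["tap the search icon"]}
candidates = {"0": ["open settings"], "1": ["tap the search icon"]}

corpus_score, per_sample_scores = Cider().compute_score(references, candidates)
print(f"CIDEr: {corpus_score:.3f}")
```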

## Baseline Scores

As a baseline, here is how LLaVA-v1.5-7b performs on the ScreenSpot dataset:
- `IoU`: 0.051
- `ACC@0.1`: 0.195
- `ACC@0.3`: 0.042
- `ACC@0.5`: 0.006
- `ACC@0.7`: 0.000
- `ACC@0.9`: 0.000
- `CENTER ACC`: 0.097
- `CIDEr`: 0.097

## References

- ArXiv: [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935)
- GitHub: [njucckevin/SeeClick](https://github.com/njucckevin/SeeClick)

```bibtex
@misc{cheng2024seeclick,
  title={SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents},
  author={Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu},
  year={2024},
  eprint={2401.10935},
  archivePrefix={arXiv},
  primaryClass={cs.HC}
}
```