# ScreenSpot

## GUI Grounding Benchmark: ScreenSpot

ScreenSpot is an evaluation benchmark for GUI grounding, comprising over 1200 instructions from iOS, Android, macOS, Windows, and Web environments, along with annotated element types (Text or Icon/Widget).

This evaluation supports both:
- `screenspot_rec_test`: the original evaluation of `{img} {instruction} --> {bounding box}`, called grounding or Referring Expression Comprehension (REC);
- `screenspot_reg_test`: the new evaluation of `{img} {bounding box} --> {instruction}`, called instruction generation or Referring Expression Generation (REG).
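
To make the two directions concrete, here is a purely illustrative sketch; the field names below are hypothetical and do not reflect the dataset's actual schema.

```python
# Illustrative only: hypothetical field names, not the dataset's real schema.

# REC / grounding: given a screenshot and an instruction, predict the box.
rec_example = {
    "image": "screenshot.png",                # {img}
    "instruction": "open the settings menu",  # {instruction}
    "target_bbox": (0.62, 0.05, 0.71, 0.11),  # {bounding box} the model must predict
}

# REG / instruction generation: given a screenshot with the target element
# highlighted (red box), generate the instruction that refers to it.
reg_example = {
    "image": "screenshot_with_red_box.png",   # {img}
    "bbox": (0.62, 0.05, 0.71, 0.11),         # {bounding box}
    "target_instruction": "open the settings menu",  # {instruction} the model must generate
}
```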
|
### REC Metrics

REC/Grounding requires the model to output a bounding box for the target element in the image. The evaluation metrics are:
- `IoU`: Intersection over Union (IoU) between the predicted bounding box and the ground-truth bounding box.
- `ACC@IoU`: we use `IoU` to compute `ACC@IoU` at different IoU thresholds, where an output with an IoU above the threshold is considered correct.
- `CENTER ACC`: the predicted bounding box is considered correct if its center lies within the ground-truth bounding box. This is the metric reported in the paper.
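
A minimal sketch of these metrics, assuming both boxes are axis-aligned `(x1, y1, x2, y2)` tuples in the same coordinate system; this is not the task's actual evaluation code.

```python
# Sketch of the REC metrics, assuming (x1, y1, x2, y2) boxes in a shared
# coordinate system (pixels or normalized). Not the benchmark's exact code.

def iou(pred, gt):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(pred, gt, threshold):
    """1.0 if the prediction's IoU with the ground truth is above the threshold."""
    return float(iou(pred, gt) > threshold)

def center_acc(pred, gt):
    """1.0 if the center of the predicted box falls inside the ground-truth box."""
    cx, cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    return float(gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3])

# Example: a prediction that overlaps the target but is shifted to the right.
pred_box, gt_box = (0.40, 0.10, 0.60, 0.20), (0.35, 0.10, 0.55, 0.20)
print(iou(pred_box, gt_box))              # 0.6
print(acc_at_iou(pred_box, gt_box, 0.5))  # 1.0
print(center_acc(pred_box, gt_box))       # 1.0
```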
|
### REG Metrics

REG/Generation requires the model to output the instruction that describes the target element in the image. Currently, this element is highlighted in red in the image. The evaluation metric is:
- `CIDEr`: used to evaluate the quality of the generated instruction. As the paper does not consider this task, we have selected CIDEr as the standard metric; this matches what other works such as ScreenAI have done for instruction generation on RICO datasets.
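
A hedged sketch of CIDEr scoring using the third-party `pycocoevalcap` package; the task's actual scoring pipeline (and any tokenization it applies) may differ.

```python
# Sketch only: assumes `pip install pycocoevalcap`; the benchmark's own
# pipeline may tokenize references/candidates (e.g. PTBTokenizer) first.
from pycocoevalcap.cider.cider import Cider

# Keys identify evaluation instances; gts holds reference instruction(s),
# res holds a single generated instruction per instance.
gts = {"sample_0": ["click the search icon in the top right corner"],
       "sample_1": ["open the settings menu"]}
res = {"sample_0": ["press the search button at the top right"],
       "sample_1": ["tap the gear icon to open settings"]}

corpus_cider, per_sample_scores = Cider().compute_score(gts, res)
print(corpus_cider, per_sample_scores)
```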
|
## Baseline Scores

As a baseline, here is how LLaVA-v1.5-7b performs on the ScreenSpot dataset:
- `IoU`: 0.051
- `ACC@0.1`: 0.195
- `ACC@0.3`: 0.042
- `ACC@0.5`: 0.006
- `ACC@0.7`: 0.000
- `ACC@0.9`: 0.000
- `CENTER ACC`: 0.097
- `CIDEr`: 0.097
|
## References

- ArXiv: [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935)
- GitHub: [njucckevin/SeeClick](https://github.com/njucckevin/SeeClick)

```bibtex
@misc{cheng2024seeclick,
      title={SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents},
      author={Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu},
      year={2024},
      eprint={2401.10935},
      archivePrefix={arXiv},
      primaryClass={cs.HC}
}
```