# ScreenSpot

## GUI Grounding Benchmark: ScreenSpot

ScreenSpot is an evaluation benchmark for GUI grounding, comprising over 1200 instructions from iOS, Android, macOS, Windows, and Web environments, along with annotated element types (Text or Icon/Widget).

This evaluation supports both:
- `screenspot_rec_test`: the original evaluation of `{img} {instruction} --> {bounding box}`, called grounding or Referring Expression Comprehension (REC);
- `screenspot_reg_test`: the new evaluation of `{img} {bounding box} --> {instruction}`, called instruction generation or Referring Expression Generation (REG).
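
To make the two directions concrete, here is a purely illustrative sketch; the field names below are hypothetical and do not reflect the dataset's actual schema.

```python
# Illustrative only: hypothetical field names, not the dataset's real schema.

# REC / grounding: given a screenshot and an instruction, predict the box.
rec_example = {
    "image": "screenshot.png",                # {img}
    "instruction": "open the settings menu",  # {instruction}
    "target_bbox": (0.62, 0.05, 0.71, 0.11),  # {bounding box} the model must predict
}

# REG / instruction generation: given a screenshot with the target element
# highlighted (red box), generate the instruction that refers to it.
reg_example = {
    "image": "screenshot_with_red_box.png",   # {img}
    "bbox": (0.62, 0.05, 0.71, 0.11),         # {bounding box}
    "target_instruction": "open the settings menu",  # {instruction} the model must generate
}
```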
|
### REC Metrics

REC/Grounding requires the model to output a bounding box for the target element in the image. The evaluation metrics are:
- `IoU`: Intersection over Union (IoU) between the predicted bounding box and the ground-truth bounding box.
- `ACC@IoU`: we use `IoU` to compute `ACC@IoU` at different IoU thresholds, where an output with an IoU above the threshold is considered correct.
- `CENTER ACC`: the predicted bounding box is considered correct if its center lies within the ground-truth bounding box. This is the metric reported in the paper.
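
A minimal sketch of these metrics, assuming both boxes are axis-aligned `(x1, y1, x2, y2)` tuples in the same coordinate system; this is not the task's actual evaluation code.

```python
# Sketch of the REC metrics, assuming (x1, y1, x2, y2) boxes in a shared
# coordinate system (pixels or normalized). Not the benchmark's exact code.

def iou(pred, gt):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(pred, gt, threshold):
    """1.0 if the prediction's IoU with the ground truth is above the threshold."""
    return float(iou(pred, gt) > threshold)

def center_acc(pred, gt):
    """1.0 if the center of the predicted box falls inside the ground-truth box."""
    cx, cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    return float(gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3])

# Example: a prediction that overlaps the target but is shifted to the right.
pred_box, gt_box = (0.40, 0.10, 0.60, 0.20), (0.35, 0.10, 0.55, 0.20)
print(iou(pred_box, gt_box))              # 0.6
print(acc_at_iou(pred_box, gt_box, 0.5))  # 1.0
print(center_acc(pred_box, gt_box))       # 1.0
```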
|
### REG Metrics

REG/Generation requires the model to output the instruction that describes the target element in the image. Currently, this element is highlighted in red in the image. The evaluation metric is:
- `CIDEr`: used to evaluate the quality of the generated instruction. As the paper does not consider this task, we have selected CIDEr as the standard metric; this matches what other works such as ScreenAI have done for instruction generation on RICO datasets.
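
A hedged sketch of CIDEr scoring using the third-party `pycocoevalcap` package; the task's actual scoring pipeline (and any tokenization it applies) may differ.

```python
# Sketch only: assumes `pip install pycocoevalcap`; the benchmark's own
# pipeline may tokenize references/candidates (e.g. PTBTokenizer) first.
from pycocoevalcap.cider.cider import Cider

# Keys identify evaluation instances; gts holds reference instruction(s),
# res holds a single generated instruction per instance.
gts = {"sample_0": ["click the search icon in the top right corner"],
       "sample_1": ["open the settings menu"]}
res = {"sample_0": ["press the search button at the top right"],
       "sample_1": ["tap the gear icon to open settings"]}

corpus_cider, per_sample_scores = Cider().compute_score(gts, res)
print(corpus_cider, per_sample_scores)
```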
|
## Baseline Scores

As a baseline, here is how LLaVA-v1.5-7b performs on the ScreenSpot dataset:
- `IoU`: 0.051
- `ACC@0.1`: 0.195
- `ACC@0.3`: 0.042
- `ACC@0.5`: 0.006
- `ACC@0.7`: 0.000
- `ACC@0.9`: 0.000
- `CENTER ACC`: 0.097
- `CIDEr`: 0.097
|
## References

- ArXiv: [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935)
- GitHub: [njucckevin/SeeClick](https://github.com/njucckevin/SeeClick)

```bibtex
@misc{cheng2024seeclick,
      title={SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents},
      author={Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu},
      year={2024},
      eprint={2401.10935},
      archivePrefix={arXiv},
      primaryClass={cs.HC}
}
```