Benchmarking AI for Earth Observation
Towards LLM Agents for Earth Observation
Cornell University, Columbia University
UnivEARTH is a benchmark dataset designed to evaluate the capabilities of AI systems for Earth Observation.
140 High-Quality Questions
17 Satellites and Sources
Abstract
Earth Observation (EO) provides critical planetary data for environmental monitoring, disaster management, climate science, and other scientific domains. Here we ask: Are AI systems ready for reliable Earth Observation?
We introduce UnivEARTH, a benchmark of 140 yes/no questions drawn from NASA Earth Observatory articles, spanning 13 topics and 17 satellite sensors. Using the Google Earth Engine API as a tool, LLM agents achieve an accuracy of only 33%, largely because the code they generate fails to run more than 58% of the time. We reduce this failure rate for open models by fine-tuning on synthetic data, allowing a much smaller model (Llama-3.1-8B) to achieve accuracy comparable to much larger ones (e.g., DeepSeek-R1).
Taken together, our findings identify significant challenges that must be solved before AI agents can automate Earth observation, and they suggest paths forward.
Dataset Description
UnivEARTH (pronounced "universe") is a benchmark dataset designed to evaluate the capabilities of AI systems for Earth Observation. It consists of 140 high-quality yes/no questions spanning 13 diverse topics and 17 different satellite sensors and datasets. The questions are derived from NASA Earth Observatory articles and focus on comparative relationships in Earth observation data.
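To make the task concrete, the following is a minimal sketch of the kind of Google Earth Engine (Python API) program an agent might write for a comparative yes/no question. The question, location, date ranges, and dataset (MODIS/061/MOD11A1 land surface temperature) are illustrative assumptions, not items from the benchmark.

import ee

ee.Initialize()  # assumes Earth Engine credentials are already configured

# Hypothetical question: "Was the mean daytime land surface temperature
# around Phoenix, AZ higher in July 2023 than in July 2022?"
region = ee.Geometry.Point([-112.07, 33.45]).buffer(20000)  # ~20 km buffer

def mean_day_lst(start, end):
    """Mean MODIS daytime LST (Kelvin) over the region for a date range."""
    image = (ee.ImageCollection('MODIS/061/MOD11A1')
             .filterDate(start, end)
             .select('LST_Day_1km')
             .mean()
             .multiply(0.02))  # MOD11A1 stores LST scaled by 0.02 (Kelvin)
    stats = image.reduceRegion(reducer=ee.Reducer.mean(),
                               geometry=region, scale=1000)
    return stats.get('LST_Day_1km').getInfo()

lst_2022 = mean_day_lst('2022-07-01', '2022-08-01')
lst_2023 = mean_day_lst('2023-07-01', '2023-08-01')
print('yes' if lst_2023 > lst_2022 else 'no')

As the abstract notes, the main bottleneck is that generated programs like this often fail to execute at all, so the answer cannot be grounded in the satellite data.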
Dataset Curation
Task Description
Dataset Gallery
Intended Uses
- Benchmarking language models for Earth observation tasks
- Evaluating AI systems' ability to ground answers in satellite imagery
- Assessing models' capability to generate code for accessing and analyzing Earth observation data (see the scoring sketch after this list)
- Supporting research in scientific AI assistants for environmental monitoring, disaster management, and climate science
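As a rough illustration of that code-generation use case, the following is a minimal scoring sketch. It assumes each model-generated program is executed in a subprocess and must print a final "yes" or "no"; the function names, timeout, and scoring details are hypothetical and do not describe the benchmark's official harness.

import subprocess
import tempfile

def run_generated_code(code: str, timeout_s: int = 300):
    """Execute model-generated Python in a subprocess and return its stdout,
    or None if it crashes or times out (an execution failure)."""
    with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(['python', path], capture_output=True,
                                text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip().lower()

def score(predictions, labels):
    """Overall accuracy (failed or unparsable runs count as wrong) and the
    fraction of runs that did not produce a usable yes/no answer."""
    n = len(labels)
    failures = sum(p not in ('yes', 'no') for p in predictions)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {'accuracy': correct / n, 'failure_rate': failures / n}

Treating crashed or unparsable runs as failures mirrors the execution-failure rate discussed in the abstract.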
Limitations
- The current benchmark comprises 140 questions, which could be expanded in future versions
- Questions are in yes/no format only
- The benchmark currently does not explicitly include questions where the ground truth answer is "inconclusive"
Citation
@article{kao2025univearth,
  title   = {Towards LLM Agents for Earth Observation: The UnivEARTH Dataset},
  author  = {Kao, Chia Hsiang and Zhao, Wenting and Revankar, Shreelekha and Speas, Samuel and Bhagat, Snehal and Datta, Rajeev and Phoo, Cheng Perng and Mall, Utkarsh and Vondrick, Carl and Bala, Kavita and Hariharan, Bharath},
  journal = {arXiv preprint arXiv:2504.12110},
  year    = {2025},
}