Benchmarking AI for Earth Observation
Towards LLM Agents for Earth Observation
Cornell University, Columbia University
UnivEARTH is a benchmark dataset designed to evaluate the capabilities of AI systems for Earth Observation.
140 High-Quality Questions
17 Satellites and Sources
Abstract
Earth Observation (EO) provides critical planetary data for environmental monitoring, disaster management, climate science, and other scientific domains. Here we ask: Are AI systems ready for reliable Earth Observation?
We introduce UnivEARTH, a benchmark of 140 yes/no questions drawn from NASA Earth Observatory articles, spanning 13 topics and 17 satellite sensors. Using the Google Earth Engine API as a tool, LLM agents achieve an accuracy of only 33%, largely because the code they generate fails to run more than 58% of the time. We reduce this failure rate for open models by fine-tuning on synthetic data, allowing a much smaller model (Llama-3.1-8B) to achieve accuracy comparable to much larger ones (e.g., DeepSeek-R1).
Taken together, our findings identify significant challenges that must be solved before AI agents can automate Earth observation, and they suggest paths forward.
Dataset Description
UnivEARTH (pronounced "universe") is a benchmark dataset designed to evaluate the capabilities of AI systems for Earth Observation. It consists of 140 high-quality yes/no questions spanning 13 diverse topics and 17 different satellite sensors and datasets. The questions are derived from NASA Earth Observatory articles and focus on comparative relationships in Earth observation data.
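To make the task concrete, the following is a minimal sketch of the kind of Google Earth Engine (Python API) program an agent might write for a comparative yes/no question. The question, location, date ranges, and dataset (MODIS/061/MOD11A1 land surface temperature) are illustrative assumptions, not items from the benchmark.

import ee

ee.Initialize()  # assumes Earth Engine credentials are already configured

# Hypothetical question: "Was the mean daytime land surface temperature
# around Phoenix, AZ higher in July 2023 than in July 2022?"
region = ee.Geometry.Point([-112.07, 33.45]).buffer(20000)  # ~20 km buffer

def mean_day_lst(start, end):
    """Mean MODIS daytime LST (Kelvin) over the region for a date range."""
    image = (ee.ImageCollection('MODIS/061/MOD11A1')
             .filterDate(start, end)
             .select('LST_Day_1km')
             .mean()
             .multiply(0.02))  # MOD11A1 stores LST scaled by 0.02 (Kelvin)
    stats = image.reduceRegion(reducer=ee.Reducer.mean(),
                               geometry=region, scale=1000)
    return stats.get('LST_Day_1km').getInfo()

lst_2022 = mean_day_lst('2022-07-01', '2022-08-01')
lst_2023 = mean_day_lst('2023-07-01', '2023-08-01')
print('yes' if lst_2023 > lst_2022 else 'no')

As the abstract notes, the main bottleneck is that generated programs like this often fail to execute at all, so the answer cannot be grounded in the satellite data.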
Dataset Curation
Task Description
Dataset Gallery
Intended Uses
- Benchmarking language models for Earth observation tasks
- Evaluating AI systems' ability to ground answers in satellite imagery
- Assessing models' capability to generate code for accessing and analyzing Earth observation data (see the scoring sketch after this list)
- Supporting research in scientific AI assistants for environmental monitoring, disaster management, and climate science
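As a rough illustration of that code-generation use case, the following is a minimal scoring sketch. It assumes each model-generated program is executed in a subprocess and must print a final "yes" or "no"; the function names, timeout, and scoring details are hypothetical and do not describe the benchmark's official harness.

import subprocess
import tempfile

def run_generated_code(code: str, timeout_s: int = 300):
    """Execute model-generated Python in a subprocess and return its stdout,
    or None if it crashes or times out (an execution failure)."""
    with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(['python', path], capture_output=True,
                                text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip().lower()

def score(predictions, labels):
    """Overall accuracy (failed or unparsable runs count as wrong) and the
    fraction of runs that did not produce a usable yes/no answer."""
    n = len(labels)
    failures = sum(p not in ('yes', 'no') for p in predictions)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {'accuracy': correct / n, 'failure_rate': failures / n}

Treating crashed or unparsable runs as failures mirrors the execution-failure rate discussed in the abstract.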
Limitations
- The current benchmark comprises 140 questions, which could be expanded in future versions
- Questions are in yes/no format only
- The benchmark currently does not explicitly include questions where the ground truth answer is "inconclusive"
Citation
@article{kao2025univearth,
  title   = {Towards LLM Agents for Earth Observation: The UnivEARTH Dataset},
  author  = {Kao, Chia Hsiang and Zhao, Wenting and Revankar, Shreelekha and Speas, Samuel and Bhagat, Snehal and Datta, Rajeev and Phoo, Cheng Perng and Mall, Utkarsh and Vondrick, Carl and Bala, Kavita and Hariharan, Bharath},
  journal = {arXiv preprint arXiv:2504.12110},
  year    = {2025},
}