Lightweight, Pre-trained Transformers for Remote Sensing Timeseries

Gabriel Tseng$^{1,2}$ Ruben Cartuyvels$^{1,3}$ Ivan Zvonkov$^{4}$ Mirali Purohit$^{5}$
David Rolnick$^{1,2}$ Hannah Kerner$^{5}$
$^{1}$ Mila – Quebec AI Institute
$^{2}$ McGill University
$^{3}$ KU Leuven
$^{4}$ University of Maryland, College Park
$^{5}$ Arizona State University

Abstract

Machine learning methods for satellite data have a range of societally relevant applications, but labels used to train models can be difficult or impossible to acquire. Self-supervision is a natural solution in settings with limited labeled data, but current self-supervised models for satellite data fail to take advantage of the characteristics of that data, including the temporal dimension (which is critical for many applications, such as monitoring crop growth) and availability of data from many complementary sensors (which can significantly improve a model’s predictive performance). We present Presto (the Pretrained Remote Sensing Transformer), a model pre-trained on remote sensing pixel-timeseries data. By designing Presto specifically for remote sensing data, we can create a significantly smaller but performant model. Presto excels at a wide variety of globally distributed remote sensing tasks and performs competitively with much larger models while requiring far less compute. Presto can be used for transfer learning or as a feature extractor for simple models, enabling efficient deployment at scale.

1 Introduction

Machine learning is increasingly being applied to the remote sensing domain, in particular to understand the evolution of the Earth’s surface over time [Brown et al., 2022, Voosen, 2020, Abys et al., 2024, Wang et al., 2020b]. These applications can have important societally beneficial outcomes, ranging from tracking progress on sustainable development goals [Ferreira et al., 2020] to improved weather forecasting [English et al., 2013, Voosen, 2020] to disaster management [Kansakar and Hossain, 2016]. However, labeled datasets often contain labels that are few, sparse, and unreliable [Bressan et al., 2022], especially for under-resourced geographies, leading to poor global generalization [Yifang et al., 2015, Kerner et al., 2020, Nakalembe et al., 2021]. This has spurred the investigation of self-supervised learning algorithms for remote sensing data.

Current self-supervised approaches for remote sensing data have drawn from methods in computer vision, yielding models that treat remote sensing data as single-timestep images [Jean et al., 2019, Manas et al., 2021, Ayush et al., 2021]. Such models (i) cannot benefit from patterns that emerge when an area is monitored over time, which is especially important for agriculture and other seasonal landcover, (ii) typically only consider a single satellite product (such as Sentinel-2 multispectral data), despite there being hundreds of publicly available satellite data products [GEE, ], (iii) are typically large and computationally expensive [Reed et al., 2022, Cong et al., 2022, Fuller et al., 2023], making the deployment of these models at scale challenging, and (iv) cannot natively handle the labels for many remote sensing datasets, which are points or irregularly shaped polygons [Rao et al., 2020, Batjes et al., 2017], requiring additional methods to handle these labels [Wang et al., 2020a].

We introduce the Pretrained Remote Sensing Transformer (Presto), a lightweight model designed to ingest pixel-timeseries inputs from a variety of Earth observation sensors and data products. Presto operates on individual pixels, using the temporal and multimodal structure of the data instead of the image structure. To learn powerful representations of remote sensing data that can be adapted to a wide range of tasks, Presto leverages a self-supervised masked autoencoding approach, reconstructing unobserved timepoints and sensory modalities. This allows Presto to be robust to missing data and to flexibly accommodate diverse input formats. We find Presto excels even in image-based tasks where the temporal dimension is completely absent.

Presto addresses the following requirements, which are critical to the useful deployment of pre-trained models in the remote sensing context:

Figure 1: Presto learns from structurally-masked remote sensing pixel-timeseries. We construct a multi-sensor remote sensing pixel-timeseries, and randomly select one of the four masking strategies described in Section 3.3. The encoder-decoder model is trained to reconstruct the original timeseries. At fine-tuning time, we discard the decoder and only use the encoder’s output. The downstream task may have incomplete inputs (missing timesteps or sensors) since the encoder is specifically trained on such inputs. Presto receives both static-in-time and dynamic-in-time inputs and the location metadata of each pixel timeseries.

•
Computational efficiency: When deployed, models built for remote sensing data are typically used to make contiguous geospatial predictions over millions (or billions) of samples to form a predicted map. The computational performance of models is therefore one of the primary considerations at deployment time. Van Tricht [2021], Hengl et al. [2017] and Robinson et al. [2019] are all global- or large-scale map-making efforts that prioritized efficiency over accuracy when deploying remote sensing models at scale. Presto is competitive with ViT or ResNet based models, despite having up to $1000\times$ fewer trainable parameters and requiring orders of magnitude fewer FLOPs at inference time.
•
Ability to process inputs of varying shapes: Different downstream tasks may require very different remote sensing inputs. For example, for crop mapping and yield estimation, Sainte Fare Garnot et al. [2020] and You et al. [2017] discarded all spatial information in the inputs in favor of emphasizing temporal patterns. We test Presto on a wide range of downstream inputs (for example, with spatial information present or absent, and with single or multiple timesteps of data), and find it is competitive with models designed specifically for those inputs.
•
Ability to process a range of remote sensing datasets: For fuel moisture estimation, Rao et al. [2020] found that the inclusion of derived products in addition to raw inputs significantly improved performance. Presto can ingest a range of static-in-time and dynamic-in-time raw input data as well as derived product inputs widely used in Earth observation (such as NDVI [Rouse et al., 1974]).
•
Ability to handle missing data: The coverage of remote sensing products is often spatially and temporally incomplete. For example, certain regions experience very high ($>90\%$) cloud coverage, reducing the utility of optical measurements such as Sentinel-2 imagery [Sudmanns et al., 2019]. Because Presto ingests a variety of remote sensing inputs, it can leverage alternative data sources if one is missing (for instance, relying on Sentinel-1, which sees through clouds, if Sentinel-2 images are cloudy).

Our results support the surprising conclusion that a pixel-based approach can in some cases match or outperform sophisticated computer vision-based approaches. We hypothesize that this is possible because (i) Presto learns from many semantically dense data sources, allowing it to extract informative patterns from pixel-timeseries, and (ii) many remote sensing tasks require significantly smaller receptive fields than those provided by computer vision-based models. Brown et al. [2022] leveraged such properties to train a model $100\times$ smaller than standard models while achieving state-of-the-art land-cover segmentation results.

2 Related Work

Architectures for Remote Sensing

When processing remote sensing timeseries, transformers have been extensively investigated either as unmodified architectures [Rußwurm and Körner, 2020] or as architectures designed for specific tasks [Sainte Fare Garnot et al., 2020, Tarasiou et al., 2023]. Recurrent networks have also been investigated [Kerner et al., 2020, Rußwurm and Körner, 2020]. When treating remote sensing data as single or few (up to 3) timestep images, architectures from computer vision are commonly used, ranging from ResNets [Manas et al., 2021, Ayush et al., 2021, Rußwurm et al., 2020] to Vision Transformers [Cong et al., 2022, Reed et al., 2022, Fuller et al., 2023].

Self-supervised learning for Remote Sensing

While contrastive learning has been investigated for remote sensing [Manas et al., 2021], recent self-supervised learning research has focused on masked autoencoders [Yuan et al., 2022, Cong et al., 2022, Reed et al., 2022, Fuller et al., 2023]. However, these approaches (i) focus on learning from raw satellite data products (ignoring derived products such as elevation) and typically only ingest data from a single sensor (the exception being the CROMA model of Fuller et al. [2023], which ingests both Sentinel-1 and Sentinel-2 data), (ii) ingest very few or no timesteps (Reed et al. [2022] and Fuller et al. [2023] ingest only one timestep while Cong et al. [2022] ingest up to three timesteps), (iii) expect data in a certain size (for instance, ViT based models require spatial dimensions to be present), so that missing data is not handled natively, and (iv) generally yield larger models ranging from 2.5 million parameters [Yuan and Lin, 2020] to over 300 million parameters for ViT-based methods, making their deployment in compute-constrained settings challenging.

3 Method

We aim to learn a model, $f$, which can learn useful representations in a self-supervised manner given unlabelled remote sensing pixel-timeseries data while meeting the usability requirements outlined in Section 1. This model can then be applied to a wide variety of downstream remote sensing tasks. These downstream tasks may contain input data from a range of sensors with differing numbers of timesteps.

Our approach is based on the masked autoencoding framework [He et al., 2022], in which the network architecture includes both an encoder ($f$) and a decoder ($g$). During pre-training, part of the input is masked out and the encoder embeds the remaining (non-masked) part of the input. The decoder aims to reconstruct the masked-out part of the input, given the encoder’s output. At fine-tuning time, we discard $g$ and only use $f$ (either as a feature extractor or a fine-tuneable model) for downstream tasks. In the sections below, we discuss how Presto customizes this general framework for multi-sensor remote sensing timeseries data. An overview of the Presto pre-training methodology is shown in Figure 1, and full pre-training details are in Section A.1.

3.1 Pre-training Data

Self-supervised models for remote sensing must generalize to a wide range of geographies and tasks [Lacoste et al., 2023]. We therefore aimed to collect a globally representative pre-training dataset. We followed the sampling strategy of Brown et al. [2022] to construct a dataset of 21.5M pixel samples, each with a resolution of 10m per pixel. Appendix A.1.1 describes the pre-training dataset construction process in detail. Presto was trained on pixel-timeseries of 12-month contiguous intervals, sampled from a 2-year period from the beginning of 2020 until the end of 2021, with each month represented by one timestep (similar to the approach adopted by Tseng et al. [2021]). Derived data products that result from the analysis of lower level data (e.g., Parkinson et al. [2006]) can significantly improve model performance [Rao et al., 2020, Hengl et al., 2017]. We therefore pre-trained Presto on a diverse set of directly-sensed and derived Earth observation products which we pre-processed and exported using Google Earth Engine [Gorelick et al., 2017].

A pre-training batch contained several pixel-timeseries samples, each of which is a concatenation of dynamic-in-time datapoints with each timestep representing a month (yielding $T=12$ timesteps in total). The following dynamic-in-time data products were used, yielding $15$ real-valued channels and one categorical channel: (i) Sentinel-2 (S2) multispectral data, (ii) Sentinel-1 (S1) radar data, (iii) ERA5 climate reanalysis data, (iv) NDVI [Rouse et al., 1974] derived from Sentinel-2 data and (v) land cover classes $\mathcal{V}$ from Dynamic World. To every pixel-timeseries we appended two static-in-time products: (i) topography data from the SRTM digital elevation model [90m Digital Elevation Data, 2003] and (ii) location coordinates of each pixel. Hence, one pre-training sample $x$, comprising a pixel-timeseries $t\in[\mathbb{R}^{T\times 15};\mathcal{V}^{T\times 1}]$ and static variables $s\in\mathbb{R}^{1\times 5}$, is summarized as follows:

$$x=\Big[\big\{t_{i}^{\text{S1}};\ t_{i}^{\text{S2}};\ t_{i}^{\text{ERA5}};\ t_{i}^{\text{NDVI}};\ t_{i}^{\text{DW}}\ \big|\ i=1,\dots,12\big\};\ s^{\text{TG}};\ s^{\text{Loc}}\Big] \tag{1}$$

From now on, we use “pixel-timeseries” to refer to both the dynamic and the static variables.
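
To make the layout of Equation (1) concrete, the following minimal sketch assembles one sample from random placeholder values. Only the totals (12 timesteps, 15 real-valued dynamic channels, one categorical Dynamic World channel, 5 static values) come from the text; the per-product channel split, the number of land cover classes, and all names are assumptions made for illustration.

```python
import numpy as np

T = 12  # one timestep per month, as in the text

# ASSUMPTION: per-product channel counts, chosen so that the real-valued
# dynamic channels sum to 15 and the static values sum to 5.
DYNAMIC_CHANNELS = {"S1": 2, "S2": 10, "ERA5": 2, "NDVI": 1}
STATIC_CHANNELS = {"TG": 2, "Loc": 3}
NUM_DW_CLASSES = 9  # ASSUMPTION: number of Dynamic World land cover classes


def make_sample(rng):
    """Build one placeholder pixel-timeseries sample (t_cont, t_dw, s)."""
    dynamic = [rng.normal(size=(T, c)) for c in DYNAMIC_CHANNELS.values()]
    t_cont = np.concatenate(dynamic, axis=1)             # (T, 15), real-valued
    t_dw = rng.integers(0, NUM_DW_CLASSES, size=(T, 1))  # (T, 1), categorical
    static = [rng.normal(size=(1, c)) for c in STATIC_CHANNELS.values()]
    s = np.concatenate(static, axis=1)                   # (1, 5), static-in-time
    return t_cont, t_dw, s


t_cont, t_dw, s = make_sample(np.random.default_rng(0))
print(t_cont.shape, t_dw.shape, s.shape)  # (12, 15) (12, 1) (1, 5)
```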

3.2 Encoding and tokenization

Figure 2: Presto learns to reconstruct channels that are completely masked in a spatially cohesive manner. In this experiment, we masked only the Sentinel-2 RGB channels; Presto was able to reconstruct these channels even when they were absent from the input. The reconstructions are spatially consistent even though Presto only receives single pixel inputs.

We transformed the pixel-timeseries $x$ into a number of tokens (each represented by an embedding $e$) to be processed by the Presto transformer. Per timestep $0\leq i<T$, we split the input variables into channel groups $\mathcal{C}$ according to their type of sensor or source: e.g., the S1 bands form one channel group. We describe these groups in more detail in Appendix A.1.3. Each real-valued channel group represents a different sensor, native spatial resolution or (in the case of Sentinel-2 channel-groups) region of the electromagnetic spectrum. We projected each channel group to a common latent space of dimension $d_{e}$ by separate learned linear projections $h^{\mathcal{C}}$: e.g., $e_{i}^{\text{S1}}=h^{\text{S1}}(t_{i}^{\text{S1}})$. The Dynamic World classes are categorical, so we embedded them by indexing them into an embedding matrix.
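
The tokenization step can be sketched as follows. This is an illustrative NumPy version, not the released implementation: the channel grouping, the class count, and the random initialisation are assumptions (the actual groups are described in Appendix A.1.3), and in practice the projections and the embedding matrix would be learned.

```python
import numpy as np

T, D_E = 12, 128    # timesteps; embedding size reported in Section 4
NUM_DW_CLASSES = 9  # ASSUMPTION: number of Dynamic World classes

# ASSUMPTION: hypothetical channel groups, given as column indices into a
# (T, 15) array of real-valued dynamic channels.
CHANNEL_GROUPS = {
    "S1": [0, 1],
    "S2_RGB": [2, 3, 4],
    "S2_other": [5, 6, 7, 8, 9, 10, 11],
    "ERA5": [12, 13],
    "NDVI": [14],
}

rng = np.random.default_rng(0)
# One linear projection h^C (weight, bias) per real-valued channel group.
projections = {
    name: (rng.normal(scale=0.02, size=(len(idx), D_E)), np.zeros(D_E))
    for name, idx in CHANNEL_GROUPS.items()
}
# One embedding row per categorical Dynamic World class.
dw_embedding = rng.normal(scale=0.02, size=(NUM_DW_CLASSES, D_E))


def tokenize(t_cont, t_dw):
    """Return embeddings of shape (T, num_groups + 1, D_E)."""
    tokens = []
    for name, idx in CHANNEL_GROUPS.items():
        weight, bias = projections[name]
        tokens.append(t_cont[:, idx] @ weight + bias)  # e_i^C = h^C(t_i^C)
    tokens.append(dw_embedding[t_dw[:, 0]])            # categorical lookup
    return np.stack(tokens, axis=1)


t_cont = rng.normal(size=(T, 15))
t_dw = rng.integers(0, NUM_DW_CLASSES, size=(T, 1))
tokens = tokenize(t_cont, t_dw)
print(tokens.shape)  # (12, 6, 128)
```

Each timestep contributes one token per channel group, so shortening the timeseries or dropping a sensor only changes the number of tokens, not the shape of any individual token.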

Table 1: We evaluated Presto on a wide variety of downstream tasks, including segmentation (seg.), multi-label (ml) scene classification (class.) and regression (reg.) tasks. There is diversity in terms of data composition, geographic area and training set size. Input shape describes the shape of a single sample, in terms of [Height, Width, Timesteps, Channels]. We bold the temporal dimension, to highlight time-series versus single-timestep inputs.

| Dataset | Task | Region | Input shape | Train samples |
|---|---|---|---|---|
| CropHarvest | Seg. | Kenya | [1, 1, **12**, 18] | 1,345 |
| | | Brazil | | 203 |
| | | Togo | | 1,319 |
| S2-Agri$_{100}$ | Class. | France | [5, 5, **24**, 10] | 1,500 |
| TreeSat | ML Class. | Germany | [6, 6, **1**, 2] | 45,337 |
| | | | [6, 6, **1**, 11] | |
| EuroSat | Class. | Europe | [64, 64, **1**, 3] | 21,600 |
| | | | [64, 64, **1**, 11] | |
| Fuel Moisture | Reg. | USA | [1, 1, **3**, 19] | 1,578 |
| Algae Blooms | Reg. | USA | [1, 1, **12**, 19] | 777 |

Unlike natural images in which the data and its label are self-contained, remote sensing labels are inherently associated with a place and time on Earth (i.e., a latitude/longitude and timestamp). In addition, while natural images contain RGB channels from the same camera sensor, Presto’s pixel-timeseries input contains channels from multiple remote sensing instruments and data products. We therefore wanted to communicate to the model: (i) the location of the datapoint (already present in the input as a static variable through the coordinates $s^{\text{Loc}}$) and a variable’s (ii) timestamp and (iii) channel group. We did this by adding encodings to the previously described embeddings $e$. The complete encoding has dimension $d_{e}$ and contains a concatenation of positional, month, and learned channel encodings described below.

The transformer input $E\in\mathbb{R}^{(T\cdot|\mathcal{C}_{\text{dynamic}}|+|\mathcal{C}_{\text{static}}|)\times d_{e}}$ (for encoder dimension $d_{e}$) is a concatenation of:

Table 2: Mean F1 score across all CropHarvest tasks. Presto outperforms TIML [Tseng et al., 2022] and MOSAIKS-1D while requiring the adaptation of far fewer parameters. The TIML and MOSAIKS-1D models did not receive Dynamic World as input, so we measured Presto’s performance both with and without it.

| Model | Total parameters | Adapted parameters | Mean F1 |
|---|---|---|---|
| Random Forest | | | 0.441 |
| MOSAIKS-1D$_{R}$ | 418K | 8193 | 0.738 |
| TIML | 91K | 91K | 0.802 |
| Presto$_{R}$ | 402K | 129 | 0.835 |
| Presto$_{R}$ (no DW) | | | **0.836** |

3.3 Pre-training via Structured Masking

A key requirement for Presto was to perform well even with incomplete inputs (i.e., when there are missing timesteps, channels, or both). When masking out part of the input $x$, we therefore tailored the masking strategies to encourage the model to learn representations that perform well when given a subset of bands or timesteps for downstream tasks. For a $T\times D$ input of $T$ timesteps and $D$ total input channels, we used the following masking techniques (illustrated in Figure 1), where Presto considers a token to be a $1\times d$ input (a single timestep of $d$ grouped channels). The coordinates were never masked but the static topographical tokens can be.

1. Random: $(t\times d)$ masked values, with $t<T$ and $d<D$
2. Channel-groups: $(T\times d)$ masked values, with $d<D$
3. Contiguous timesteps: $(t\times D)$ masked values, with $t<T$
4. Timesteps: $(t\times D)$ masked values, with $t<T$

For each training instance, we randomly sampled from the above strategies to construct a mask.
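
The four strategies can be sketched as boolean masks over a grid of tokens (timesteps by channel groups). This is a minimal illustration under stated assumptions: the `ratio` parameter, the rounding rules, and the function and variable names are hypothetical and are not taken from the paper, and static tokens are omitted for brevity.

```python
import numpy as np

STRATEGIES = ("random", "channel_groups", "contiguous_timesteps", "timesteps")


def make_mask(num_timesteps, num_groups, strategy, rng, ratio=0.5):
    """Return a boolean (num_timesteps, num_groups) array; True = masked token.

    ASSUMPTION: `ratio` (the approximate fraction of tokens, groups or
    timesteps to mask) is an illustrative choice, not a value from the paper.
    """
    mask = np.zeros((num_timesteps, num_groups), dtype=bool)
    if strategy == "random":
        # Mask individual tokens anywhere in the grid.
        flat = np.zeros(mask.size, dtype=bool)
        n = int(round(ratio * mask.size))
        flat[rng.choice(mask.size, size=n, replace=False)] = True
        mask = flat.reshape(mask.shape)
    elif strategy == "channel_groups":
        # Mask whole channel groups across all timesteps (never all groups).
        n = min(num_groups - 1, max(1, int(round(ratio * num_groups))))
        mask[:, rng.choice(num_groups, size=n, replace=False)] = True
    elif strategy in ("contiguous_timesteps", "timesteps"):
        # Mask whole timesteps across all groups (never all timesteps).
        n = min(num_timesteps - 1, max(1, int(round(ratio * num_timesteps))))
        if strategy == "contiguous_timesteps":
            start = int(rng.integers(0, num_timesteps - n + 1))
            rows = np.arange(start, start + n)
        else:
            rows = rng.choice(num_timesteps, size=n, replace=False)
        mask[rows, :] = True
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return mask


rng = np.random.default_rng(0)
strategy = str(rng.choice(STRATEGIES))  # one strategy per training instance
mask = make_mask(12, 6, strategy, rng)
```

Masking whole channel groups mimics a missing sensor, while masking whole timesteps mimics a shorter or gappy timeseries, which matches the kinds of incomplete inputs described above.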

To handle both the categorical and continuous inputs we used the following loss function, which balances the continuous and categorical losses for every batch so that each reconstructed value receives the same weighting in the final loss: $\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{MSE}}+\lambda\frac{N_{\text{cat}}}{N_{\text{cont}}}\mathcal{L}_{\text{CE}}$. $\mathcal{L}_{\text{MSE}}$ is the mean squared error reconstruction loss used for the continuous values, $\mathcal{L}_{\text{CE}}$ is the cross entropy loss used for the categorical values, $N_{\text{cont}}$ is the number of masked continuous values and $N_{\text{cat}}$ is the number of masked categorical values in the batch. $\lambda$ is a hyperparameter, which we set to $2$.
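
A minimal NumPy sketch of this loss is given below. It assumes that both $\mathcal{L}_{\text{MSE}}$ and $\mathcal{L}_{\text{CE}}$ are means over their respective masked values and that at least one value of each kind is masked; the function name and argument layout are hypothetical.

```python
import numpy as np


def balanced_reconstruction_loss(pred_cont, target_cont, logits_cat, target_cat, lam=2.0):
    """Compute L_total = L_MSE + lam * (N_cat / N_cont) * L_CE.

    pred_cont, target_cont: (N_cont,) masked continuous values
    logits_cat: (N_cat, num_classes) scores for the masked categorical values
    target_cat: (N_cat,) integer class labels
    ASSUMPTION: both inputs are non-empty and both losses are means.
    """
    n_cont, n_cat = target_cont.size, target_cat.size
    mse = np.mean((pred_cont - target_cont) ** 2)
    # Numerically stable log-softmax, then mean negative log-likelihood.
    shifted = logits_cat - logits_cat.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(n_cat), target_cat])
    return mse + lam * (n_cat / n_cont) * ce


loss = balanced_reconstruction_loss(
    pred_cont=np.array([1.0, 2.0, 3.0, 4.0]),
    target_cont=np.array([0.0, 2.0, 3.0, 4.0]),
    logits_cat=np.zeros((2, 3)),  # uniform prediction over 3 classes
    target_cat=np.array([0, 1]),
)
```

Under the assumption that both terms are means, multiplying the total by $N_{\text{cont}}$ gives the sum of squared errors plus $\lambda$ times the sum of per-value cross entropy terms, which is one way to read the statement that each reconstructed value receives the same weighting.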

Figure 3: Presto is robust to incomplete inputs. We measured the AUC ROC score of Presto with linear probing (Presto$_{R}$) on the CropHarvest dataset when no Dynamic World input is passed, and with a subset of input months (the x-axis). We plot the performance of MOSAIKS-1D and TIML when they receive the full 12 months of input (dashed horizontal lines); Presto$_{R}$ recovered the performance of these models given only a subset of input months.

4 Experiments

In all experiments described below, we use a Presto model with identical encoder and decoder configurations (2 attention layers with 8 heads, an embedding size of 128 and an MLP ratio of 4). We investigated the effect of different encoder configurations in Table 6.
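
As a rough sanity check on model size, the sketch below counts the parameters in the transformer blocks of such an encoder. It assumes a standard block layout (biased query, key, value and output projections, a two-layer MLP, and two layer norms per block), which is an assumption rather than a detail stated here; input projections, encodings, and any final norm are not counted.

```python
def transformer_block_params(d_model, mlp_ratio):
    """Parameters in one standard transformer block (weights and biases)."""
    # Query, key, value and output projections, each d_model x d_model + bias.
    attention = 4 * (d_model * d_model + d_model)
    hidden = mlp_ratio * d_model
    # Two linear layers: d_model -> hidden -> d_model, each with a bias.
    mlp = (d_model * hidden + hidden) + (hidden * d_model + d_model)
    # Two layer norms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    return attention + mlp + layer_norms


encoder_blocks = 2 * transformer_block_params(d_model=128, mlp_ratio=4)
print(encoder_blocks)  # 396544
```

Under these assumptions the two blocks hold roughly 0.4M parameters, which is in line with the 402K total reported for the encoder. The number of attention heads does not affect the count, since the heads partition the same projection matrices.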

For downstream evaluation, we took the encoder-decoder model learned during pre-training and discarded the decoder. As in He et al. [2022], we passed a global pool of all the encoder’s output tokens to a downstream classifier. We evaluated the performance of three different models: Presto$_{R}$, Presto$_{RF}$, and Presto$_{FT}$, defined below.

•
Feature extraction. Rolf et al. [2021] demonstrated the utility of neural networks as feature-extractors on top of which computationally efficient classifiers could be trained. Presto$_{R}$ and Presto$_{RF}$ consist respectively of linear or logistic regressions and random forests trained on Presto’s embeddings. Since only the regression/random forest is trained, this is a computationally efficient method for adapting Presto to a wide range of tasks.
•
Fine-tuning. Presto$_{FT}$ consists of the Presto encoder, followed by a linear transformation of the pooled tokens to the desired outputs. This entire model (the encoder and the linear transformation) is fine-tuned on the training data from each evaluation task. We used a subset of the (downstream) training data for validation.

During pre-training, we used a validation task consisting of classifying all points in the CropHarvest dataset [Tseng et al., 2021] according to their FAO indicative crop classifications. For this validation task, we excluded points used for evaluation (Section 5.1).

For evaluation, we compared Presto to state-of-the-art task-specific baselines (Section 5). Because there are no other global self-supervised models for pixel-timeseries, we adapted MOSAIKS [Rolf et al., 2021] for timeseries data by performing convolutions over the temporal rather than spatial dimension (MOSAIKS-1D). We used the output features with random forests (MOSAIKS-1D$_{RF}$) and regressions (MOSAIKS-1D$_{R}$).

Figure 4: We obtained per-image predictions using Presto by computing a mean and standard deviation of Presto’s per-pixel outputs, and passing this concatenated vector to a downstream classifier. We illustrate this for the EuroSat task.

5 Evaluation Tasks & Results

We evaluated Presto using six evaluation tasks spanning diverse task types, geographic locations (4 continents and 38 countries), input data modalities, and fine-tuning dataset sizes (Table 1). Whenever possible, we benchmarked Presto against the state-of-the-art model for that task.

Applying Presto to downstream tasks is computationally efficient. While other methods require a cluster of GPUs for fine-tuning [Cong et al., 2022], we fine-tuned Presto on a single GPU or CPU. For the fuel moisture task described in Section 5.1, fine-tuning Presto took under 6 minutes on a 2017 MacBook Pro’s CPU. When Presto is used as a feature extractor, simple models can be trained which require few parameters to be learned, as we show in Table 2. Even when fully fine-tuned, Presto’s small size meant that relatively few parameters needed to be trained (Tables 5.2.1 and 5.3.1). This makes Presto accessible to practitioners, especially those lacking significant computational resources.

Below, we describe the tasks used to evaluate Presto and discuss Presto’s performance on these tasks.

Table 3: RMSE results on the regression tasks. The literature baselines are not directly comparable, since they use different input datasets or private test data (or both). Rao et al. [2020] reported an RMSE of 25 on the fuel moisture dataset with a physics-assisted neural network and the algae bloom competition winner reported an RMSE of 0.761, indicating our results are within the scope of utility. Best results are highlighted blue, with second best results in bold. Models have a high variance in performance across tasks, so we calculated the mean difference in RMSE from the linear regression baseline across both tasks. Presto performed most consistently, both when used as a feature-extractor and when fine-tuned.

| Model | Fuel Moisture | Algae Blooms | Mean difference |
|---|---|---|---|
| Linear Regression | 28.20 | 0.850 | 0% |
| Random Forest | 23.84 | 1.249 | 15.7% |
| MOSAIKS-1D$_{RF}$ | 28.75 | 0.972 | 8.15% |
| Presto$_{FT}$ (random init.) | 26.07 | 0.955 | 2.40% |
| Presto$_{FT}$ | 25.28 | 0.815 | **−7.24%** |
| Presto$_{RF}$ | 25.98 | 0.884 | −1.94% |
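
The "Mean difference" column can be reproduced from the RMSE values in Table 3. The sketch below assumes the column is the mean, across the two tasks, of each model's percentage change in RMSE relative to the linear regression baseline; this reading is an interpretation that matches the reported numbers, and the dictionary keys are shorthand labels.

```python
# RMSE values copied from Table 3: (fuel moisture, algae blooms).
RMSE = {
    "Linear Regression": (28.20, 0.850),
    "Random Forest": (23.84, 1.249),
    "MOSAIKS-1D_RF": (28.75, 0.972),
    "Presto_FT (random init.)": (26.07, 0.955),
    "Presto_FT": (25.28, 0.815),
    "Presto_RF": (25.98, 0.884),
}


def mean_relative_difference(model, baseline="Linear Regression"):
    """Mean over tasks of the percentage change in RMSE versus the baseline."""
    diffs = [100.0 * (m - b) / b for m, b in zip(RMSE[model], RMSE[baseline])]
    return sum(diffs) / len(diffs)


for name in RMSE:
    print(f"{name}: {mean_relative_difference(name):+.2f}%")
```

Negative values indicate a lower (better) RMSE than the baseline on average. The random forest illustrates why the average is informative: it improves on the baseline for fuel moisture but is much worse on algae blooms.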

5.1 Timeseries Tasks

•
Crop type Segmentation: The CropHarvest [Tseng et al., 2021] evaluation datasets consist of binary pixel classification of (i) maize in Kenya, (ii) coffee in Brazil and (iii) cropland in Togo. We compared Presto to the baselines provided by CropHarvest and to Task-Informed Meta-Learning [TIML, Tseng et al., 2022], which achieved state-of-the-art results on these datasets.
•
Fuel Moisture: The live fuel moisture dataset [Rao et al., 2020] measures live fuel moisture content in the Western U.S. Rao et al. [2020]’s baseline used 5-fold cross validation to evaluate model performance; for future comparability, we used a single geographically separated test set (a test set covering a different geographic area than the training set).
•
Algae Blooms: The algae blooms dataset [alg, 2023] measures the severity of cyanobacterial algal blooms in different parts of the U.S. We used the subset of the dataset in the Midwestern U.S. The dataset was originally released as part of a competition, so the test data is not available. In addition, competitors could download many Earth observation datasets to train their models, making direct comparisons to competition results difficult. Since the competition’s winning solution used a tree-based method, we benchmarked against a regression and a random forest using a geographically separated test set.

Table 4: Results on the TreeSatAI dataset. We compared Presto to the dataset’s benchmark models. The MLPs contain 3 layers (with 563K-723K parameters respectively) and are tuned for this task. We froze the Presto encoder’s 402k parameters and trained a random forest on its outputs with default scikit-learn hyperparameters.

| Model | Data | Weighted $F_{1}$ | Weighted mAP | Micro $F_{1}$ | Micro mAP |
|---|---|---|---|---|---|
| MLP | S1 | 10.09 | 29.42 | 12.82 | 33.09 |
| LightGBM | | 11.86 | 32.79 | 14.07 | 35.11 |
| Presto$_{RF}$ | | **38.34** | **35.45** | **40.79** | **38.64** |
| MLP | S2 | 51.97 | 64.19 | 54.59 | 65.83 |
| LightGBM | | 48.17 | 61.99 | 52.52 | 61.66 |
| Presto$_{RF}$ | | **55.29** | 61.53 | **58.29** | 63.31 |

5.1.1 Timeseries Results

Presto excels at timeseries tasks, significantly outperforming the state-of-the-art for CropHarvest (Table 2) and outperforming all baselines for the regression tasks (Table 3).

We found that Presto is performant when passed only a subset of timesteps compared to the 12 timesteps used for pre-training. Presto remained performant when receiving only 3 input timesteps for the fuel moisture task (Table 3). We also evaluated Presto when a subset of input months are passed for the CropHarvest dataset (Figure 3). Using a subset of the 12 months, Presto surpassed the performance of TIML and MOSAIKS-1D which used all input months.

Presto is also robust to the removal of input channels. On the CropHarvest dataset (Table 2), Presto remained performant without the Dynamic World input, showing a negligible difference in mean F1 score compared to the full input.

5.2 Image-based Tasks

Presto is designed to ingest single pixel-timeseries. When one prediction is required for a set of pixels (as for image-based tasks and the Image-Timeseries tasks in Section 5.3), we used the following approach to obtain per-image predictions from Presto’s pixel outputs (Figure 4): (i) we encoded the pixels in an image individually, yielding $N$ output tokens, (ii) we calculated the mean and standard deviation of these $N$ output tokens per dimension and concatenated the result, yielding a $2d_e$-dimensional vector (where $d_e$ is Presto’s output token size, or 128), and (iii) we passed this mean and standard deviation vector to a downstream classifier.
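The pooling in steps (ii) and (iii) can be illustrated with a minimal sketch. The function name `pool_pixel_embeddings` is hypothetical, and the use of the population standard deviation (dividing by $N$) is an assumption, since the text does not specify the estimator. The toy example uses $d_e = 3$ instead of 128 for readability.

```python
import math


def pool_pixel_embeddings(embeddings):
    """Concatenate the per-dimension mean and standard deviation of N pixel embeddings.

    embeddings: list of N lists, each of length d_e.
    Returns a list of length 2 * d_e (means first, then standard deviations).
    """
    n = len(embeddings)
    if n == 0:
        raise ValueError("need at least one pixel embedding")
    d_e = len(embeddings[0])
    means = [sum(e[j] for e in embeddings) / n for j in range(d_e)]
    # Population standard deviation (divide by N): an assumption, not stated in the text.
    stds = [
        math.sqrt(sum((e[j] - means[j]) ** 2 for e in embeddings) / n)
        for j in range(d_e)
    ]
    return means + stds


# Toy example: N = 4 pixels, d_e = 3.
toy = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0], [1.0, 0.0, 4.0], [3.0, 0.0, 4.0]]
image_vector = pool_pixel_embeddings(toy)
```

The resulting vector would then be handed to any downstream classifier, such as a random forest or a kNN classifier.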

•
TreeSatAI: The TreeSatAI dataset consists of detecting the presence of one or more tree species (out of 20 possible species) in forestry images in Germany [Ahlswede et al., 2023]. We used the train and test splits provided by Ahlswede et al. [2023] and compared Presto to the deep learning and tree-based baselines provided with the dataset. As done for the baselines, we evaluated models using only Sentinel-2 (S2) or Sentinel-1 (S1) data.
•
EuroSAT: The EuroSAT dataset classifies Sentinel-2 multispectral images in Europe with one of 10 landcover classes [Helber et al., 2019]. We used the train and test splits provided by Neumann et al. [2019]. We compared Presto to SatMAE, ConvMAE and ScaleMAE using a k Nearest Neighbors (kNN) classifier at a variety of input resolutions, as was done by Reed et al. [2022]. We also compared fine-tuned Presto against Seasonal Contrast (SeCo) [Manas et al., 2021] and Geography-Aware Self-Supervised Learning (GASSL) [Ayush et al., 2021]. EuroSAT provides all multispectral Sentinel-2 bands, but most other models ingest only RGB images. We evaluated Presto both when it received all multispectral bands as input (MS) and when it only received the RGB bands.


Figure 5: EuroSAT accuracy of a kNN@5 classifier given pre-trained model embeddings at a variety of input resolutions (following Reed et al. [2022]) as a function of FLOPs required to encode an image (note the log scale on the x-axes). All image-based models resized images to $224 \times 224$, so the FLOPs required to encode an image do not change with image resolution. Presto achieved competitive results with image-based models while requiring up to four orders of magnitude fewer FLOPs to encode an image.

5.2.1 Image-based Results

Despite being pre-trained on pixel-timeseries data, Presto is competitive on single-timestep image datasets against much larger models. We followed the setup of Reed et al. [2022] in measuring the performance of a kNN classifier on Presto’s output embeddings for the EuroSAT dataset at varying resolutions. Presto achieved comparable average accuracy (over all image resolutions) to larger ViT-based models with RGB data and significantly outperformed these models with multispectral (MS) data (Figure 5), while requiring orders of magnitude less compute to encode the images in both cases and for any resolution.

Presto is performant even when only a small subset of the pre-training input channels is available. For the EuroSAT task (Table 5), Presto received either the full Sentinel-2 input or only RGB bands (which represent only a single token, since only one timestep is available). Similarly, we evaluated Presto when it receives either Sentinel-2 or Sentinel-1 data for the TreeSatAI task (Table 4). In both cases, Presto was competitive with methods designed to ingest single-timestep, single-sensor data.

Table 5: EuroSAT finetuning accuracy. Presto is the only backbone that can handle both MS and RGB inputs (separate SatMAE models are trained for RGB and MS inputs). We reported Presto results for full resolution; results at reduced resolutions are in Table 11.

Model	Backbone	Inputs	Params (M)	Accuracy
GASSL	ResNet-18	RGB	11.69	0.895
SeCo	ResNet-18	RGB	11.69	0.931
SatMAE	ViT-Large	RGB	303.10	0.955
SatMAE	ViT-Large	MS	305.96	0.990
Random init.	Presto	RGB	0.40	0.745
Random init.	Presto	MS	0.40	0.924
Presto	Presto	RGB	0.40	0.849
Presto	Presto	MS	0.40	0.953

5.3 Image-Timeseries Tasks

•
S2-Agri$_{100}$: The S2-Agri dataset [Sainte Fare Garnot et al., 2020] classifies crop types in agricultural parcels. We used a variant of S2-Agri (S2-Agri$_{100}$) developed by Yuan et al. [2022] for the SITS-Former model in which 100 parcels for each crop type are used for training and validation respectively (all other parcels are used for testing), and a $5 \times 5$ pixel patch from each parcel is used for input. We benchmarked Presto against both the pre-trained and randomly initialized SITS-Former model.

5.3.1 Image-Timeseries Results

The S2-Agri$_{100}$ dataset consists of 24 timesteps at 10 to 30 day intervals (compared to Presto’s pre-training data, which consists of 12-month timeseries). Presto remained performant on this dataset, achieving results comparable to SITS-Former despite having $6\times$ fewer parameters (shown in Table 6). This shows that Presto can ingest timeseries at different temporal resolutions and at varying intervals.

In addition, the S2-Agri dataset is missing pixel location metadata, which is always passed to Presto during pre-training. S2-Agri was sampled from a single S2-tile, so we used the location of the central pixel of this tile for all pixels in the dataset. Even with this much less accurate location metadata, Presto remained performant.

Table 6: Results on the S2-Agri$_{100}$ dataset. We followed Yuan et al. [2022] in reporting overall accuracy (OA), Cohen’s kappa ($\kappa$) and macro-$F_1$ score. All results are an average of 3 runs; standard errors are reported in Table A.5.

Model	Params (M)	Pre-trained?	OA	$\kappa$	$F_1$
SITS-Former	2.5	No	65.13	0.55	42.12
SITS-Former	2.5	Yes	67.03	0.56	$\mathbf{42.83}$
Presto	0.4	No	45.98	0.35	27.45
Presto	0.4	Yes	$\mathbf{68.89}$	$\mathbf{0.58}$	40.41

5.4 Ablations

We conducted three ablations to better understand Presto’s performance:

Table 7: Structured masking strategies yield the best downstream performance. We measured Presto$_{R}$’s F1 score on the CropHarvest validation task. Combining structured strategies outperformed the “Random” masking employed by He et al. [2022].

Channel Groups	Random	Timesteps	Contiguous Timesteps	F1 Score
✓	-	-	-	0.646
-	✓	-	-	0.653
-	-	✓	-	0.664
-	-	-	✓	0.649
✓	✓	✓	✓	$\mathbf{0.665}$

•
Structured masking strategies perform best: Table 7 shows results from ablating the masking strategies. In contrast to other masked autoencoder methods [He et al., 2022], we found that combining structured masking with random masking outperforms random masking alone.
•
Pre-training Presto is critical to achieve strong performance: In Tables 5, 5.2.1 and Table 5.3.1, we compared the performance of a randomly-initialized Presto architecture with the pre-trained model. Pre-training yielded a significant increase in performance (a 50% increase in accuracy on the S2-Agri$_{100}$ dataset). Even when the downstream training dataset size was large (EuroSAT has 21,600 training samples), pre-training yielded a 14% increase in accuracy given RGB inputs and up to 22% increase in accuracy at lower resolutions (Table 11). For TreeSatAI with S1 data (Table 15), a randomly initialized model slightly outperformed the pre-trained model. We hypothesize that this is due to the difference in input relative to the pre-training data, since the TreeSatAI input consists of a single image from only one timestep and one channel group.
•
Presto’s performance scales with model size: To measure how different model sizes affect Presto’s performance, we pre-trained two larger Presto variants: a deeper variant with 4 encoder layers instead of 2, and a wider variant with a doubled encoder size (Table 8). Performance improved as model size increased, suggesting that practitioners who can afford greater computational costs could obtain better results by training a larger Presto model.
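The percentage gains quoted for pre-training appear to be relative (not percentage-point) increases. A minimal sketch of that arithmetic follows, using the overall accuracy values from the S2-Agri results table (Table 6) and the full-resolution EuroSAT RGB values from the EuroSAT finetuning table (Table 5). The helper name `relative_increase` is ours.

```python
def relative_increase(baseline, improved):
    """Relative change of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Randomly initialized vs. pre-trained Presto, values copied from the tables.
s2_agri = relative_increase(45.98, 68.89)      # overall accuracy, S2-Agri_100
eurosat_rgb = relative_increase(0.745, 0.849)  # accuracy, EuroSAT with RGB inputs

print(f"S2-Agri100: {s2_agri:.1f}%  EuroSAT RGB: {eurosat_rgb:.1f}%")
# prints "S2-Agri100: 49.8%  EuroSAT RGB: 14.0%"
```

Under this reading, the roughly 23-point absolute gain on S2-Agri$_{100}$ corresponds to the reported 50% increase.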

6 Discussion & Conclusion

Limitations

Presto is designed to ingest 10m/px resolution imagery and is pre-trained on products at this scale. This decision is motivated by the free, global availability over time of products at this scale (such as Sentinel-1 and Sentinel-2). Presto does not natively process very-high resolution imagery such as $<1$ m/px imagery from commercial satellites or drones, which can be costly and often lack complete coverage globally and temporally. In addition, Presto is a pixel-timeseries model. While we demonstrated Presto’s flexibility on single-timestep image datasets, image-based models may be preferred if a user’s goal is to process entire images to make a prediction. We observed that Presto’s performance on the EuroSAT dataset plateaued as the input resolution increased (Table 5), due to images from classes where the relevant pixels for the class are a minority of the pixels in the image (e.g., highways). In such scene classification challenges, image-based models which can learn the shape of the relevant pixels may be better suited. We discuss this further in Section A.6.

Conclusion

Table 8: Effect of model size on validation performance. To understand the effect of model size on performance, we pre-train two larger variants of Presto. As in Table 7, we measure Presto$_{R}$’s performance on the CropHarvest validation task. The number of parameters includes both the encoder and decoder parameters. The FLOPs are computed for a “full” input (12 timesteps, with no missing channels), when passed through the encoder and decoder.

Depth	Width	# params (M)	FLOPs (M)	F1 score
2	128	0.81	88.94	0.665
2	256	2.02	220.81	0.687
4	128	1.21	132.42	0.669

We present Presto: a lightweight, pre-trained timeseries transformer for remote sensing. By leveraging structure unique to remote sensing data, specifically (i) an important temporal dimension, (ii) associated metadata and (iii) a diversity of sensors, we are able to train an extremely lightweight model which achieves state-of-the-art results in a wide variety of globally distributed evaluation tasks. Computational efficiency is of paramount importance in remote sensing settings and often determines which models ultimately get selected for deployment. We demonstrated that strong performance can be achieved while meeting this constraint, and that self-supervised learning can provide significant benefits even for small models.

Impact statement

Machine learning applications to remote sensing have a wide range of societally beneficial outcomes, ranging from tracking progress on sustainable development goals [Ferreira et al., 2020] to improved weather forecasting [English et al., 2013, Voosen, 2020] to disaster management [Kansakar and Hossain, 2016].

Presto is designed to be accessible to a wide range of practitioners; we achieve this by only training Presto on publicly available data and by keeping the model size small enough so it can be leveraged in compute-constrained environments. In addition to increasing Presto’s accessibility, its small size also lowers its carbon footprint [Strubell et al., 2019].

As described by Tuia et al. [2023], a natural concern when applying machine learning algorithms to remote sensing data is their use to collect information about individuals who are unaware that data is being collected, and who therefore cannot consent to this practice. We therefore encourage deployment of Presto in collaboration with local communities and stakeholders [Krafft, n.d., Kshirsagar et al., 2021, Nakalembe and Kerner, 2023].

Acknowledgements

This work was supported by NASA under the NASA Harvest Consortium on Food Security and Agriculture (Award #80NSSC18M0039). This research was enabled in part by compute resources provided by Mila (mila.quebec); in addition, we acknowledge material support from NVIDIA Corporation in the form of computational resources. We thank Esther Rolf and Caleb Robinson for reviewing drafts of this manuscript.

References

[1] Earth engine data catalogue. https://developers.google.com/earth-engine/datasets/catalog. Accessed: 2023-01-31.
alg [2023] Tick tick bloom: Harmful algal bloom detection challenge. https://www.drivendata.org/competitions/143/tick-tick-bloom/page/649/, 2023. Accessed: 2023-03-10.
90m Digital Elevation Data [2003] S. 90m Digital Elevation Data. The CGIAR consortium for spatial information, 2003.
Abys et al. [2024] C. Abys, S. Skakun, and I. Becker-Reshef. Two decades of winter wheat expansion and intensification in russia. Remote Sensing Applications: Society and Environment, 2024.
Ahlswede et al. [2023] S. Ahlswede, C. Schulz, C. Gava, P. Helber, B. Bischke, M. Förster, F. Arias, J. Hees, B. Demir, and B. Kleinschmit. Treesatai benchmark archive: A multi-sensor, multi-label dataset for tree species classification in remote sensing. Earth System Science Data, 2023.
Ayush et al. [2021] K. Ayush, B. Uzkent, C. Meng, K. Tanmay, M. Burke, D. Lobell, and S. Ermon. Geography-aware self-supervised learning. In CVPR, 2021.
Batjes et al. [2017] N. H. Batjes, E. Ribeiro, A. Van Oostrum, J. Leenaars, T. Hengl, and J. Mendes de Jesus. Wosis: providing standardised soil profile data for the world. Earth System Science Data, 2017.
Böhm et al. [2022] V. Böhm, W. J. Leong, R. B. Mahesh, I. Prapas, E. Nemni, F. Kalaitzis, S. Ganju, and R. Ramos-Pollan. Sar-based landslide classification pretraining leads to better segmentation. In Artificial Intelligence for Humanitarian Assistance and Disaster Response Workshop at NeurIPS, 2022.
Bressan et al. [2022] P. O. Bressan, J. M. Junior, J. A. C. Martins, M. J. de Melo, D. N. Gonçalves, D. M. Freitas, A. P. M. Ramos, M. T. G. Furuya, L. P. Osco, J. de Andrade Silva, et al. Semantic segmentation with labeling uncertainty and class imbalance applied to vegetation mapping. International Journal of Applied Earth Observation and Geoinformation, 2022.
Brown et al. [2022] C. F. Brown, S. P. Brumby, B. Guzder-Williams, T. Birch, S. B. Hyde, J. Mazzariello, W. Czerwinski, V. J. Pasquarella, R. Haertel, S. Ilyushchenko, K. Schwehr, M. Weisse, F. Stolle, C. Hanson, O. Guinan, R. Moore, and A. M. Tait. Dynamic world, near real-time global 10 m land use land cover mapping. Scientific Data, Jun 2022.
Cong et al. [2022] Y. Cong, S. Khanna, C. Meng, P. Liu, E. Rozi, Y. He, M. Burke, D. B. Lobell, and S. Ermon. SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, NeurIPS, 2022. URL https://openreview.net/forum?id=WBhqzpF6KYH.
Di Tommaso et al. [2022] S. Di Tommaso, S. Wang, V. Vajipey, N. Gorelick, R. Strey, and D. B. Lobell. Annual field-scale maps of tall and short crops at the global scale using gedi and sentinel-2. arXiv preprint arXiv:2212.09681, 2022.
English et al. [2013] S. English, T. McNally, N. Bormann, K. Salonen, M. Matricardi, A. Moranyi, M. Rennie, M. Janisková, S. Di Michele, A. Geer, et al. Impact of satellite data, 2013.
Ferreira et al. [2020] B. Ferreira, M. Iten, and R. G. Silva. Monitoring sustainable development by means of earth observation data and machine learning: A review. Environmental Sciences Europe, 2020.
Fuller et al. [2023] A. Fuller, K. Millard, and J. R. Green. CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=ezqI5WgGvY.
Gao et al. [2022] P. Gao, T. Ma, H. Li, Z. Lin, J. Dai, and Y. Qiao. Convmae: Masked convolution meets masked autoencoders. arXiv preprint arXiv:2205.03892, 2022.
Gorelick et al. [2017] N. Gorelick, M. Hancher, M. Dixon, S. Ilyushchenko, D. Thau, and R. Moore. Google earth engine: Planetary-scale geospatial analysis for everyone. Remote sensing of Environment, 2017.
Hansen et al. [2013] M. C. Hansen, P. V. Potapov, R. Moore, M. Hancher, S. A. Turubanova, A. Tyukavina, D. Thau, S. V. Stehman, S. J. Goetz, T. R. Loveland, et al. High-resolution global maps of 21st-century forest cover change. Science, 2013.
He et al. [2022] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
Helber et al. [2019] P. Helber, B. Bischke, A. Dengel, and D. Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.
Hengl et al. [2017] T. Hengl, J. Mendes de Jesus, G. B. Heuvelink, M. Ruiperez Gonzalez, M. Kilibarda, A. Blagotić, W. Shangguan, M. N. Wright, X. Geng, B. Bauer-Marschallinger, et al. Soilgrids250m: Global gridded soil information based on machine learning. PLoS one, 2017.
Jean et al. [2019] N. Jean, S. Wang, A. Samar, G. Azzari, D. Lobell, and S. Ermon. Tile2vec: Unsupervised representation learning for spatially distributed data. In AAAI, 2019.
Kansakar and Hossain [2016] P. Kansakar and F. Hossain. A review of applications of satellite earth observation data for global societal benefit and stewardship of planet earth. Space Policy, 2016.
Kerner et al. [2020] H. Kerner, G. Tseng, I. Becker-Reshef, C. Nakalembe, B. Barker, B. Munshell, M. Paliyam, and M. Hosseini. Rapid response crop maps in data sparse regions. In ACM SIGKDD Conference on Data Mining and Knowledge Discovery Workshops, 2020.
[25] A. Krafft. ASU researcher combats food insecurity with AI. https://news.asu.edu/20230303-solutions-asu-researcher-combats-food-insecurity-ai. Accessed: 2023-09-21.
Kshirsagar et al. [2021] M. Kshirsagar, C. Robinson, S. Yang, S. Gholami, I. Klyuzhin, S. Mukherjee, M. Nasir, A. Ortiz, F. Oviedo, D. Tanner, et al. Becoming good at ai for good. In AAAI/ACM Conference on AI, Ethics, and Society, 2021.
Lacoste et al. [2023] A. Lacoste, N. Lehmann, P. Rodriguez, E. D. Sherwin, H. Kerner, B. Lütjens, J. A. Irvin, D. Dao, H. Alemohammad, A. Drouin, et al. Geo-bench: Toward foundation models for earth monitoring. arXiv preprint arXiv:2306.03831, 2023.
Manas et al. [2021] O. Manas, A. Lacoste, X. Giró-i Nieto, D. Vazquez, and P. Rodriguez. Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In CVPR, 2021.
Nakalembe and Kerner [2023] C. Nakalembe and H. Kerner. Considerations for ai-eo for agriculture in sub-saharan africa. Environmental Research Letters, 2023.
Nakalembe et al. [2021] C. Nakalembe, C. Justice, H. Kerner, C. Justice, and I. Becker-Reshef. Sowing seeds of food security in africa. Eos (Washington. DC), 102, 2021.
Neumann et al. [2019] M. Neumann, A. S. Pinto, X. Zhai, and N. Houlsby. In-domain representation learning for remote sensing. arXiv preprint arXiv:1911.06721, 2019.
Parkinson et al. [2006] C. Parkinson, A. Ward, and M. King. Earth science reference handbook. National Aeronautics and Space Administration: Washington, DC, USA, 2006.
Pelletier et al. [2019] C. Pelletier, G. I. Webb, and F. Petitjean. Temporal convolutional neural network for the classification of satellite image time series. Remote Sensing, 2019.
Rao et al. [2020] K. Rao, A. P. Williams, J. F. Flefil, and A. G. Konings. Sar-enhanced mapping of live fuel moisture content. Remote Sensing of Environment, 2020.
Reed et al. [2022] C. J. Reed, R. Gupta, S. Li, S. Brockman, C. Funk, B. Clipp, S. Candido, M. Uyttendaele, and T. Darrell. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. arXiv preprint arXiv:2212.14532, 2022.
Robinson et al. [2019] C. Robinson, L. Hou, K. Malkin, R. Soobitsky, J. Czawlytko, B. Dilkina, and N. Jojic. Large scale high-resolution land cover mapping with multi-resolution data. In CVPR, 2019.
Rolf et al. [2021] E. Rolf, J. Proctor, T. Carleton, I. Bolliger, V. Shankar, M. Ishihara, B. Recht, and S. Hsiang. A generalizable and accessible approach to machine learning with global satellite imagery. Nature communications, 2021.
Rouse et al. [1974] J. W. Rouse, R. H. Haas, J. A. Schell, D. W. Deering, et al. Monitoring vegetation systems in the great plains with erts. NASA Spec. Publ, 351(1):309, 1974.
Rußwurm and Körner [2020] M. Rußwurm and M. Körner. Self-attention for raw optical satellite time series classification. ISPRS journal of photogrammetry and remote sensing, 2020.
Rußwurm et al. [2020] M. Rußwurm, S. Wang, M. Korner, and D. Lobell. Meta-learning for few-shot land cover classification. In CVPR Workshops, pages 200–201, 2020.
Sainte Fare Garnot et al. [2020] V. Sainte Fare Garnot, L. Landrieu, S. Giordano, and N. Chehata. Satellite image time series classification with pixel-set encoders and temporal self-attention. In CVPR, 2020.
Strubell et al. [2019] E. Strubell, A. Ganesh, and A. McCallum. Energy and policy considerations for deep learning in nlp. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019.
Sudmanns et al. [2019] M. Sudmanns, D. Tiede, H. Augustin, and S. Lang. Assessing global sentinel-2 coverage dynamics and data availability for operational earth observation (eo) applications using the eo-compass. International journal of digital earth, 2019.
Tarasiou et al. [2023] M. Tarasiou, E. Chavez, and S. Zafeiriou. ViTs for SITS: Vision Transformers for Satellite Image Time Series. In CVPR, 2023.
Tseng et al. [2021] G. Tseng, I. Zvonkov, C. L. Nakalembe, and H. Kerner. Cropharvest: A global dataset for crop-type classification. In NeurIPS, Datasets and Benchmarks Track, 2021. URL https://openreview.net/forum?id=JtjzUXPEaCu.
Tseng et al. [2022] G. Tseng, H. Kerner, and D. Rolnick. TIML: Task-informed meta-learning for crop type mapping. In AI for Agriculture and Food Systems at AAAI, 2022.
Tuia et al. [2023] D. Tuia, K. Schindler, B. Demir, G. Camps-Valls, X. X. Zhu, M. Kochupillai, S. Džeroski, J. N. van Rijn, H. H. Hoos, F. Del Frate, et al. Artificial intelligence to advance earth observation: a perspective. arXiv preprint arXiv:2305.08413, 2023.
Van Tricht [2021] K. Van Tricht. Mapping crops at global scale! what works and what doesn’t? https://blog.vito.be/remotesensing/worldcereal-benchmarking, 2021. Accessed: 2023-07-31.
Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, 2017.
Voosen [2020] P. Voosen. Europe builds ‘digital twin’ of earth to hone climate forecasts, 2020.
Wang et al. [2020a] S. Wang, W. Chen, S. M. Xie, G. Azzari, and D. B. Lobell. Weakly supervised deep learning for segmentation of remote sensing imagery. Remote Sensing, 2020a.
Wang et al. [2020b] S. Wang, S. Di Tommaso, J. M. Deines, and D. B. Lobell. Mapping twenty years of corn and soybean across the us midwest using the landsat archive. Scientific Data, 2020b.
Yifang et al. [2015] B. Yifang, P. Gong, and C. Gini. Global land cover mapping using earth observation satellite data: Recent progresses and challenges. ISPRS journal of photogrammetry and remote sensing, 2015.
You et al. [2017] J. You, X. Li, M. Low, D. Lobell, and S. Ermon. Deep gaussian process for crop yield prediction based on remote sensing data. Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
Yuan and Lin [2020] Y. Yuan and L. Lin. Self-supervised pretraining of transformers for satellite image time series classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:474–487, 2020.
Yuan et al. [2022] Y. Yuan, L. Lin, Q. Liu, R. Hang, and Z.-G. Zhou. Sits-former: A pre-trained spatio-spectral-temporal representation model for sentinel-2 time series classification. International Journal of Applied Earth Observation and Geoinformation, 106:102651, 2022.

Appendix A Appendix

Reproducibility

All code and data used to train and evaluate Presto will be made available upon publication, and the code is currently available at https://github.com/nasaharvest/presto. In addition, we discuss specific implementation details in Appendices A.1 and A.4. We have strived to make the Presto codebase accessible to other practitioners; to this end, we include a demo Jupyter notebook demonstrating how Presto can be applied to a new downstream task, which is available at https://github.com/nasaharvest/presto/blob/main/downstream_task_demo.ipynb.

A.1 Pre-training details

We outline training hyperparameters below:

•
Training length: We train the model for 20 epochs, with a batch size of 4096 (resulting in 5950 batches per epoch). On a single NVIDIA V100 GPU, this takes $43\frac{1}{4}$ hours.
•
Optimizer and learning rate: We train the model with an AdamW optimizer. We use a cosine annealing schedule for our learning rate, with a maximum learning rate of 0.001 at the $2^{\text{nd}}$ epoch. We apply a weight decay of 0.05, and $\beta$s of (0.9, 0.95).
•
Masking: We use a masking ratio of 0.75, randomly selecting (for each instance) a masking strategy from the ones described in Section 3.3. If the masking strategy cannot mask the right number of tokens, we randomly mask additional tokens to achieve the correct masking ratio.
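As an illustration of the learning rate schedule described above, the following is a minimal sketch in plain Python. It assumes a linear warmup from 0 up to the maximum at epoch 2 and a cosine decay to 0 at epoch 20; neither the warmup shape nor the final learning rate is stated in the text, so both are assumptions, and the function name is hypothetical. The AdamW optimizer itself (weight decay 0.05, $\beta$s of (0.9, 0.95)) is not shown.

```python
import math


def learning_rate(epoch, max_lr=0.001, peak_epoch=2, total_epochs=20):
    """Learning rate at a (possibly fractional) epoch.

    Assumptions not taken from the text: linear warmup from 0 and decay to 0.
    """
    if epoch < peak_epoch:
        # Warmup phase: ramp linearly up to max_lr.
        return max_lr * epoch / peak_epoch
    # Cosine annealing from max_lr (at peak_epoch) down to 0 (at total_epochs).
    progress = (epoch - peak_epoch) / (total_epochs - peak_epoch)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))


# One value per epoch boundary, from epoch 0 to epoch 20.
schedule = [learning_rate(e) for e in range(21)]
```

In practice the schedule would typically be evaluated per batch (with a fractional epoch) and not once per epoch.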

A.1.1 Pre-training data


Figure 6: The distribution of the pre-training dataset described in Section 3.1.

Remote sensing models can be deployed in a wide range of geographies, with few labelled datapoints available at fine-tuning time [Kerner et al., 2020, Böhm et al., 2022]. We therefore aim to collect a globally representative pre-training dataset. We achieve this by following the sampling strategy used by Dynamic World [Brown et al., 2022]. We divide the Earth into three regions: the Western Hemisphere and two regions in the Eastern Hemisphere. These regions are further divided into ecoregions, and stratified samples are gathered from each region using land cover classes as sampling strata. Figure 6 shows the resulting geographical distribution. Each sample represents a $510 \times 510$ pixel tile with a spatial resolution of 10 meters per pixel. To obtain pixel-timeseries we grid-sample 2,500 pixels from each sample, yielding a total of 21,535,000 pixel samples (each with 24 one-month timesteps).
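The grid-sampling step can be sketched as follows. The text does not specify the grid layout, so this sketch assumes a regular $50 \times 50$ grid (2,500 points) with each point placed at the centre of its grid cell; the function name is hypothetical. The last lines check the arithmetic relating the total number of pixel samples to the number of tiles.

```python
def grid_sample_indices(tile_size=510, points_per_side=50):
    """Return (row, col) pairs on a regular grid covering a square tile."""
    step = tile_size / points_per_side  # 10.2 pixels for the defaults
    # Place each point at the centre of its grid cell (an assumed layout).
    coords = [int((i + 0.5) * step) for i in range(points_per_side)]
    return [(r, c) for r in coords for c in coords]


pixels = grid_sample_indices()

# 21,535,000 pixel samples at 2,500 pixels per tile implies 8,614 tiles.
n_tiles, remainder = divmod(21_535_000, 2_500)
```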

A.1.2 Input data

We leverage the following data products when pre-training Presto:

•
Sentinel-1 Synthetic Aperture Radar observations (S1): The VV (emit and receive at vertical polarization) and VH (emit at vertical and receive at horizontal polarization) bands: 2 real-valued dynamic values per monthly timestep.
•
Sentinel-2 Multispectral images (S2): We removed the 60m resolution bands, yielding bands with 10m and 20m resolution with channels in the visible, near-infrared and short-wave infrared range: 10 real-valued dynamic values per timestep.
•
ERA5 Climate Reanalysis Meteorological data (ERA5): Monthly total precipitation and temperature at 2 metres above the ground: 2 real-valued dynamic values per timestep.
•
NDVI [Rouse et al., 1974]: Computed from the red (B4) and near-infrared (B8) Sentinel-2 bands: 1 real-valued dynamic value per timestep.
•
Dynamic World Land Cover classes [DW, Brown et al., 2022]: Land cover classes produced for every non-cloudy Sentinel-2 image: 1 dynamic categorical value from the set of possible classes $\mathcal{V}$ per timestep. We took the mode of classes for all timesteps within a month.
•
Topography data (TG), from the Shuttle Radar Topography Mission’s Digital Elevation Model: The elevation and slope of each pixel, real-valued and static in time.
•
Coordinates (Loc): 3D Cartesian coordinates, static in time, computed from the latitude and longitude of the pixel’s geographical location: $s_{\text{Loc}} = [\cos(\text{lat}) \times \cos(\text{lon}), \cos(\text{lat}) \times \sin(\text{lon}), \sin(\text{lat})]$.
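The coordinate transform above maps each location to a point on the unit sphere. A minimal sketch follows, assuming latitude and longitude are given in degrees and converted to radians before applying the formula (the text does not state the units); the function name is hypothetical.

```python
import math


def location_features(lat_deg, lon_deg):
    """Map latitude/longitude (assumed to be in degrees) to 3D Cartesian coordinates."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


equator_origin = location_features(0.0, 0.0)  # (1.0, 0.0, 0.0)
montreal = location_features(45.5, -73.6)     # an arbitrary example location
```

One property of this encoding is that longitudes of -180 and 180 degrees map to the same point, so there is no discontinuity at the antimeridian.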

A.1.3 Channel Groups

As described in Section 3.2, we transform the pixel timeseries $x$ into a number of tokens, where each token is a linear transformation of a subset of the input channels. We group together channels which (i) come from the same sensor or product, (ii) have equivalent native spatial resolutions and (iii) represent similar parts of the electromagnetic spectrum (for Sentinel-2 channel groups). We group the input data into the following channel groups:

•
Sentinel-1: The VV and VH bands from the Sentinel-1 sensor
•
Sentinel-2 RGB: The B2, B3 and B4 bands from the Sentinel-2 sensor
•
Sentinel-2 Red Edge: The B5, B6 and B7 bands from the Sentinel-2 sensor
•
Sentinel-2 Near Infra Red (10m): The B8 band from the Sentinel-2 sensor
•
Sentinel-2 Near Infra Red (20m): The B8A band from the Sentinel-2 sensor
•
Sentinel-2 Short Wave Infra Red: The B11 and B12 bands from the Sentinel-2 sensor
•
NDVI: The normalized difference vegetation index, calculated from the Sentinel-2 B4 and B8 bands.
•
ERA5 Climatology: Precipitation and temperature at 2m from the ERA5 Climate Reanalysis product
•
Topography: The elevation and slope of a pixel, derived from the SRTM DEM
•
Location: The Cartesian coordinates of a pixel, computed from the pixel’s latitude and longitude
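The grouping above can be written down as a simple mapping, shown here as a sketch. The dictionary keys and the ERA5, topography and location channel names are hypothetical labels chosen for illustration; only the Sentinel band identifiers come from the list. The NDVI helper uses the standard definition $(\text{B8} - \text{B4}) / (\text{B8} + \text{B4})$; returning 0 when both bands are 0 is an assumption.

```python
# Channel groups as listed above; keys and non-Sentinel channel names are illustrative.
CHANNEL_GROUPS = {
    "S1": ["VV", "VH"],
    "S2_RGB": ["B2", "B3", "B4"],
    "S2_Red_Edge": ["B5", "B6", "B7"],
    "S2_NIR_10m": ["B8"],
    "S2_NIR_20m": ["B8A"],
    "S2_SWIR": ["B11", "B12"],
    "NDVI": ["NDVI"],
    "ERA5": ["total_precipitation", "temperature_2m"],
    "TG": ["elevation", "slope"],
    "Loc": ["x", "y", "z"],
}


def ndvi(red_b4, nir_b8):
    """Normalized difference vegetation index from the red and near-infrared bands."""
    total = nir_b8 + red_b4
    # Guard against division by zero (assumed behaviour, not from the text).
    return 0.0 if total == 0 else (nir_b8 - red_b4) / total


# The Sentinel-2 groups together cover the 10 bands kept after dropping the 60m bands.
num_s2_bands = sum(len(v) for k, v in CHANNEL_GROUPS.items() if k.startswith("S2_"))
```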

A.2 FLOP calculations

Table 9: Model sizes and FLOPs required to encode a single EuroSAT image (or pixel, for Presto), as measured by the thop library. When plotting results in Figure 5, we multiply the FLOPs for Presto by the number of pixels encoded for an image. At their highest resolution, EuroSAT images are $64 \times 64$, so Presto FLOPs for a full resolution image can be obtained by multiplying the per-pixel FLOPs by 4,096. We include this value in brackets for completeness.

Model	Backbone	Params (M)	MegaFlops
SatMAE (RGB) [Cong et al., 2022]	ViT-Large	303.10	59,685.69
SatMAE (MS) [Cong et al., 2022]	ViT-Large	305.96	535,515.25
ScaleMAE [Reed et al., 2022]	ViT-Large	303.10	59,685.69
ConvMAE [Gao et al., 2022]	ConvMAE-Large	88.78	23,315.58
SeCo [Manas et al., 2021]	ResNet-18	11.69	149.37
GASSL [Ayush et al., 2021]	ResNet-18	11.69	149.37
Presto RGB pixel (image)	Presto	0.40	0.79 (3,235.84)
Presto MS pixel (image)	Presto	0.40	2.37 (9,707.52)

We use the thop library (https://github.com/Lyken17/pytorch-OpCounter) to calculate the FLOPs required to encode a EuroSAT image (as plotted in Figure 5(b)). For the SatMAE, ScaleMAE and ConvMAE models, all images were resized to $224 \times 224$, so the FLOPs required to encode an image are independent of resolution. For Presto, we computed the FLOPs required to encode a single pixel and multiplied this by the number of pixels in an image at each resolution (e.g. the “64” resolution has $64 \times 64$ pixels, so we multiply the FLOPs required to encode a single pixel by $64 \times 64 = 4096$). The FLOPs calculated by the thop library are recorded in Table 9.
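The per-image scaling for Presto is simple arithmetic and can be reproduced from the per-pixel values in Table 9. The sketch below hard-codes those values instead of calling thop; the function name is hypothetical, and the resolutions other than 64 are illustrative and not necessarily the ones evaluated.

```python
# Per-pixel MegaFLOPs for Presto, copied from Table 9.
PER_PIXEL_MFLOPS = {"RGB": 0.79, "MS": 2.37}


def presto_image_mflops(inputs, resolution):
    """MegaFLOPs to encode every pixel of a square image with `resolution` pixels per side."""
    return PER_PIXEL_MFLOPS[inputs] * resolution * resolution


full_res_rgb = round(presto_image_mflops("RGB", 64), 2)  # 3235.84, as in Table 9
full_res_ms = round(presto_image_mflops("MS", 64), 2)    # 9707.52, as in Table 9

# Illustrative lower resolutions: cost falls quadratically with the side length.
for res in (16, 32, 64):
    print(res, round(presto_image_mflops("RGB", res), 2))
```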

A.3 Baselines

In addition to task-specific baselines, we benchmark Presto against:

•
Random Forests: Random forests are powerful baselines in remote sensing as they remain competitive with state-of-the-art methods [Pelletier et al., 2019, Kerner et al., 2020]. Tree-based methods, especially random forests, are commonly deployed in large-scale machine learning for remote sensing applications [Hansen et al., 2013, Van Tricht, 2021, Di Tommaso et al., 2022].
•
MOSAIKS-1D: We adapt MOSAIKS [Rolf et al., 2021] for timeseries data. MOSAIKS-1D uses patches from the pre-training dataset and convolves over the temporal dimension instead of the spatial dimension. We benchmark MOSAIKS-1D on all timeseries evaluation tasks. Because this does not work for categorical inputs, we exclude Dynamic World. As with Presto, we use the output features with random forests (MOSAIKS-1D$_{RF}$) and with regressions (MOSAIKS-1D$_{R}$).
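The idea behind MOSAIKS-1D can be illustrated with a short NumPy sketch: temporal patches sampled from pre-training timeseries act as fixed filters, which are convolved over the time axis of each input. This is not the implementation used for the experiments. The function names, the filter width, the number of filters, the bias, and the ReLU followed by mean pooling are all assumptions chosen to mirror the general random-convolutional-features recipe.

```python
import numpy as np


def sample_temporal_patches(pretrain_x, num_filters, width, rng):
    """Sample random temporal patches from pre-training timeseries to use as filters.

    pretrain_x: array of shape (num_series, timesteps, channels).
    Returns an array of shape (num_filters, width, channels).
    """
    n, t, _ = pretrain_x.shape
    series_idx = rng.integers(0, n, size=num_filters)
    start_idx = rng.integers(0, t - width + 1, size=num_filters)
    return np.stack([pretrain_x[i, s:s + width] for i, s in zip(series_idx, start_idx)])


def mosaiks_1d_features(x, filters, bias=0.0):
    """Convolve each filter over the temporal dimension, apply a ReLU, average over time.

    x: array of shape (batch, timesteps, channels).
    Returns an array of shape (batch, num_filters).
    """
    width = filters.shape[1]
    # windows has shape (batch, timesteps - width + 1, channels, width)
    windows = np.lib.stride_tricks.sliding_window_view(x, width, axis=1)
    # Contract over the window and channel axes to get one response per position and filter.
    responses = np.einsum("bpcw,fwc->bpf", windows, filters)
    return np.maximum(responses + bias, 0.0).mean(axis=1)


# Usage with synthetic data (shapes are placeholders, not the paper's settings).
rng = np.random.default_rng(0)
pretrain = rng.normal(size=(100, 12, 4))
filters = sample_temporal_patches(pretrain, num_filters=16, width=3, rng=rng)
features = mosaiks_1d_features(rng.normal(size=(5, 12, 4)), filters)
print(features.shape)  # (5, 16)
```

The resulting feature matrix plays the same role as Presto's embeddings: it is passed to a random forest or a regression for the downstream task.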

Table 10: Full results for regression tasks from Table 5, including standard error computed from three runs.

Model	Fuel Moisture	Algae Blooms	Mean difference
Linear Regression	28.20	0.850	0%
Random Forest	$23.84 \pm 0.42$	$1.249 \pm 0.02$	15.7%
MOSAIKS-1D$_{RF}$	$28.75 \pm 0.15$	$0.972 \pm 0.01$	8.15%
Presto$_{FT}$ (random init.)	$26.07 \pm 0.52$	$0.955 \pm 0.05$	2.40%
Presto$_{FT}$	$25.28 \pm 0.30$	$0.815 \pm 0.03$	$\mathbf{-7.24\%}$
Presto$_{RF}$	$25.98 \pm 0.66$	$0.884 \pm 0.01$	$-1.94\%$
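The "Mean difference" column is not defined in the caption. One reading that matches every row of Table 10 is the per-task relative change in error with respect to linear regression, averaged over the two tasks. The sketch below checks that reading; the formula and the function name `mean_relative_difference` are assumptions inferred from the numbers, not a definition taken from the paper.

```python
# Errors copied from Table 10, in the order (fuel moisture, algae blooms).
BASELINE = (28.20, 0.850)  # linear regression
RESULTS = {
    "Random Forest": (23.84, 1.249),
    "MOSAIKS-1D_RF": (28.75, 0.972),
    "Presto_FT (random init.)": (26.07, 0.955),
    "Presto_FT": (25.28, 0.815),
    "Presto_RF": (25.98, 0.884),
}


def mean_relative_difference(errors, baseline=BASELINE):
    """Mean over tasks of the relative error change versus the baseline, in percent."""
    changes = [(e - b) / b for e, b in zip(errors, baseline)]
    return 100.0 * sum(changes) / len(changes)


for name, errors in RESULTS.items():
    print(f"{name}: {mean_relative_difference(errors):+.2f}%")
```

Under this reading, a negative value means the model has lower error than linear regression on average across the two tasks.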

A.4 Downstream Results

We include complete results for the evaluation tasks. These include error bars, as well as additional results reported for the CropHarvest (Table 12 and Figure 3), regression tasks (Table 10), EuroSAT (Tables 11, 13 and 14), TreeSatAI (Table 15) and Sen2-Agri$_{100}$ (Table A.5) datasets.

We run all downstream classifiers with 3 seeds ($0, 42, 84$), with the exception of the kNN classifiers and the linear regression (which are deterministic). In the tables in the main paper (Tables 3.2, 5.1, 5.3.1 and 5) we report the average of these runs; the standard error is reported in Tables 12, 15, A.5 and 10.

•
Presto as a feature extractor: When used as a feature extractor, a random forest, regression or K-nearest-neighbours classifier is trained on Presto’s output embeddings. In this case, we use scikit-learn models with the default hyperparameters. For the CropHarvest tasks, the class labels are extremely imbalanced; we therefore set class_weight equal to balanced for those tasks, for both Presto and MOSAIKS-1D.
•
Fine-tuning Presto: When fine-tuning Presto, we use the same hyperparameters across all tasks: an AdamW optimizer with a learning rate of 3e-4 and weight decay of $0.05$.
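As a rough illustration of the feature-extractor setup, the sketch below trains scikit-learn random forests with default hyperparameters and `class_weight="balanced"` on stand-in embeddings, once per seed. The embeddings are random placeholders rather than real Presto outputs, and the embedding size, sample count and class ratio are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Placeholder for Presto's output embeddings: 200 samples, 128 dimensions (assumed).
embeddings = rng.normal(size=(200, 128))
# Imbalanced binary labels, loosely mimicking the CropHarvest setting (10% positives).
labels = np.zeros(200, dtype=int)
labels[:20] = 1

SEEDS = (0, 42, 84)  # the three seeds listed above
classifiers = []
for seed in SEEDS:
    # Default hyperparameters, except for the balanced class weights and the seed.
    clf = RandomForestClassifier(class_weight="balanced", random_state=seed)
    clf.fit(embeddings, labels)
    classifiers.append(clf)

# One row of predictions per seed; metrics would be averaged over these runs.
predictions = np.stack([clf.predict(embeddings) for clf in classifiers])
print(predictions.shape)
```

For fine-tuning, the stated optimizer settings would correspond in PyTorch to `torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)`, where `model` is a placeholder for the network being fine-tuned.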

As discussed in Section 5.2, we obtain per-image predictions using Presto by computing a mean and standard deviation of Presto’s output pixels, and passing a concatenation of these two vectors to a downstream classifier. This is illustrated in Figure 4.
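The aggregation step can be sketched as follows. This is a minimal example assuming the per-pixel outputs are stacked in a `(num_pixels, dim)` array; the function name, the 128-dimensional embedding size and the use of the population standard deviation are assumptions.

```python
import numpy as np


def aggregate_pixel_embeddings(pixel_embeddings):
    """Concatenate the element-wise mean and standard deviation over pixels.

    pixel_embeddings: array of shape (num_pixels, dim).
    Returns a single image-level vector of shape (2 * dim,).
    """
    mean = pixel_embeddings.mean(axis=0)
    std = pixel_embeddings.std(axis=0)
    return np.concatenate([mean, std])


# Placeholder embeddings for one full-resolution 64 x 64 image.
rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(64 * 64, 128))
image_vector = aggregate_pixel_embeddings(image_embeddings)
print(image_vector.shape)  # (256,)
```

The resulting vector has a fixed length regardless of the number of pixels, so the same downstream classifier can be used at every input resolution.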

A.5 Disentangling the effect of pre-training

To understand the effect of pre-training Presto, we fine-tune Presto and train it from scratch on EuroSat (Table 5.2.1), the regression tasks (Table 5 in the main paper) and TreeSatAI (Table 15). We omit the CropHarvest dataset because it was expressly designed as a few-shot-learning dataset. Its small size makes the construction of validation sets with which to control the finetuning (e.g. with early stopping) challenging.

Overall, we find a consistent and significant improvement from the use of pre-trained Presto compared to a randomly initialized version of the model. For the EuroSat task, pre-training consistently delivers an increase in accuracy score $> 0.1$ (representing increases in accuracy of up to 25%). This effect is consistent with what we observe on the TreeSatAI dataset for S2 data and on the regression tasks (where pre-training reduces RMSE by up to 15% on the algae blooms task). For the TreeSatAI dataset with S1 data, pre-training penalizes the model compared to random initialization; we hypothesize that this is due to the difference in input (a single timestep and single channel group image) relative to the pre-training data. The benefit of pre-training is especially pronounced on the S2-Agri$_{100}$ dataset; we hypothesize this is due to the small training set size.

Table 11: Accuracy results for pre-trained and from-scratch Presto when fine-tuned on EuroSat, at varying resolutions. We hypothesize that the drop in performance for the full resolution (64) RGB input is due to the model construction; the model outputs for all pixels in the image (4,096 pixels for the full resolution) are aggregated and passed to a linear layer for classification, yielding a noisy gradient signal.

Initialization	Input	Res. 2	Res. 4	Res. 8	Res. 16	Res. 32	Res. 64
random init.	RGB	$0.703 \pm 0.005$	$0.684 \pm 0.032$	$0.694 \pm 0.013$	$0.739 \pm 0.004$	$0.750 \pm 0.018$	$0.745 \pm 0.009$
pre-trained	RGB	$0.792 \pm 0.010$	$0.837 \pm 0.006$	$0.847 \pm 0.016$	$0.865 \pm 0.006$	$0.872 \pm 0.002$	$0.849 \pm 0.004$
random init.	MS	$0.837 \pm 0.014$	$0.884 \pm 0.010$	$0.895 \pm 0.006$	$0.907 \pm 0.13$	$0.924 \pm 0.005$	$0.924 \pm 0.003$
pre-trained	MS	$0.898 \pm 0.005$	$0.925 \pm 0.004$	$0.939 \pm 0.000$	$0.950 \pm 0.002$	$0.958 \pm 0.001$	$0.953 \pm 0.004$

Table 12: Additional results for the CropHarvest task. In addition to the F1 scores reported in the main paper, we report AUC ROC scores, with standard error bars computed with three runs.

Metric	Model	Kenya	Brazil	Togo	Mean
F1	Random Forest	$0.559 \pm 0.003$	$0.000 \pm 0.000$	$0.756 \pm 0.002$	0.441
F1	MOSAIKS-1D$_{R}$	$0.790 \pm 0.027$	$0.746 \pm 0.084$	$0.679 \pm 0.024$	0.738
F1	TIML	$0.838 \pm 0.000$	$0.835 \pm 0.012$	$0.732 \pm 0.002$	0.802
F1	Presto$_{R}$	$0.816 \pm 0.000$	$0.891 \pm 0.000$	$0.798 \pm 0.000$	0.835
F1	Presto$_{R}$ (no DW)	$\mathbf{0.861 \pm 0.000}$	$0.888 \pm 0.000$	$0.760 \pm 0.000$	0.836
AUC ROC	Random Forest	$0.578 \pm 0.006$	$0.941 \pm 0.004$	$0.892 \pm 0.001$	0.803
AUC ROC	MOSAIKS-1D$_{R}$	$0.693 \pm 0.036$	$0.890 \pm 0.038$	$0.836 \pm 0.005$	0.806
AUC ROC	TIML	$0.794 \pm 0.003$	$0.988 \pm 0.001$	$0.890 \pm 0.000$	0.890
AUC ROC	Presto$_{R}$	$0.834 \pm 0.000$	$0.997 \pm 0.000$	$0.921 \pm 0.000$	0.917
AUC ROC	Presto$_{R}$ (no DW)	$\mathbf{0.863 \pm 0.000}$	$0.989 \pm 0.000$	$0.912 \pm 0.000$	0.921

Table 13: Additional results for the EuroSat task. Results for the ScaleMAE, SatMAE and ConvMAE models are from [Reed et al., 2022]. We report kNN classifier results for different values of $k$, and at varying input resolutions.

Resolution	16	16	16	32	32	32	64	64	64
$k$	5	20	100	5	20	100	5	20	100
SatMAE	0.729	0.727	0.695	0.871	0.876	0.854	0.934	0.931	0.913
ScaleMAE	0.751	0.744	0.699	0.912	0.901	0.869	0.960	0.956	0.935
ConvMAE	0.835	0.826	0.788	0.909	0.898	0.863	0.947	0.940	0.914
Presto (RGB)	0.869	0.828	0.713	0.869	0.829	0.712	0.869	0.829	0.713
Presto (MS)	0.916	0.892	0.844	0.920	0.892	0.846	0.921	0.893	0.846

Table 14: Additional results for the EuroSat task for Presto when run with reduced resolutions (compared to those used by [Reed et al., 2022] and reported in Table 13). We report kNN classifier results for different values of $k$, and at varying input resolutions.

Resolution	2	2	2	4	4	4	8	8	8
$k$	5	20	100	5	20	100	5	20	100
Presto (RGB)	0.843	0.811	0.699	0.860	0.820	0.706	0.869	0.826	0.710
Presto (MS)	0.873	0.852	0.799	0.895	0.874	0.824	0.911	0.886	0.838

Table 15: Additional results for the TreeSatAI (as in [Ahlswede et al., 2023], we report precision and recall in addition to $F_1$ score and mAP). In addition, we report the results of finetuning Presto (Presto$_{FT}$) from the pre-trained weights and from a random initialization.

Model	Data	Aggregation	$F_1$	mAP	Precision	Recall
MLP	S1	Weighted	10.09	29.42	33.29	7.13
LightGBM	S1	Weighted	11.86	32.79	37.96	8.06
Presto$_{FT}$ (random init.)	S1	Weighted	$40.36 \pm 0.77$	$39.77 \pm 0.79$	$30.69 \pm 0.82$	$64.69 \pm 1.09$
Presto$_{FT}$	S1	Weighted	$38.69 \pm 0.78$	$37.41 \pm 0.58$	$30.09 \pm 0.74$	$61.20 \pm 0.85$
Presto$_{RF}$	S1	Weighted	$38.34 \pm 0.07$	$35.45 \pm 0.03$	$29.67 \pm 0.07$	$57.23 \pm 0.06$
MLP	S1	Micro	12.82	33.09	63.01	7.13
LightGBM	S1	Micro	14.07	35.11	55.49	8.06
Presto$_{FT}$ (random init.)	S1	Micro	$42.04 \pm 0.73$	$43.00 \pm 0.80$	$31.20 \pm 1.00$	$64.69 \pm 1.09$
Presto$_{FT}$	S1	Micro	$41.65 \pm 0.46$	$40.75 \pm 0.69$	$31.58 \pm 0.47$	$61.20 \pm 0.85$
Presto$_{RF}$	S1	Micro	$40.79 \pm 0.04$	$38.64 \pm 0.02$	$31.69 \pm 0.03$	$57.23 \pm 0.06$
MLP	S2	Weighted	51.97	64.19	74.59	42.23
LightGBM	S2	Weighted	48.17	61.99	74.27	40.04
Presto$_{FT}$ (random init.)	S2	Weighted	$52.74 \pm 0.50$	$57.24 \pm 0.64$	$45.87 \pm 1.17$	$64.29 \pm 1.51$
Presto$_{FT}$	S2	Weighted	$53.63 \pm 0.42$	$59.16 \pm 1.24$	$47.15 \pm 1.40$	$65.11 \pm 3.21$
Presto$_{RF}$	S2	Weighted	$55.29 \pm 0.08$	$61.53 \pm 0.09$	$56.93 \pm 0.07$	$58.56 \pm 0.09$
MLP	S2	Micro	54.49	65.83	77.18	42.23
LightGBM	S2	Micro	52.52	61.66	76.27	40.04
Presto$_{FT}$ (random init.)	S2	Micro	$52.56 \pm 0.41$	$58.08 \pm 0.66$	$44.56 \pm 1.03$	$64.29 \pm 1.51$
Presto$_{FT}$	S2	Micro	$53.31 \pm 0.18$	$59.77 \pm 1.13$	$45.51 \pm 1.46$	$65.11 \pm 3.21$
Presto$_{RF}$	S2	Micro	$58.29 \pm 0.06$	$63.31 \pm 0.06$	$58.04 \pm 0.05$	$58.56 \pm 0.09$

Table 16: Full results on the S2-Agri$_{100}$ dataset, including standard errors obtained from 3 runs. To obtain standard errors for the SITS-Former, we run the official code (https://github.com/linlei1214/SITS-Former) with 3 seeds. Best results are highlighted.

Model	Params (M)	Pre-trained?	OA	$\kappa$	$F_1$
SITS-Former	2.5	✗	$65.13 \pm 3.01$	$0.55 \pm 0.03$	$42.12 \pm 0.52$
SITS-Former	2.5	✓	$67.03 \pm 2.24$	$0.56 \pm 0.02$	$\mathbf{42.83} \pm 0.30$
Presto	0.4	✗	$45.98 \pm 2.74$	$0.35 \pm 0.02$	$27.45 \pm 0.64$
Presto	0.4	✓	$\mathbf{68.89} \pm 1.05$	$\mathbf{0.58} \pm 0.01$	$40.41 \pm 0.25$

A.6 Presto’s failure modes

Figure 7: Accuracy of kNN@5 classifier with Presto RGB representations on the EuroSat dataset vs. the input resolution, for different categories. Some categories have been left out for clarity.

Figure 8: The RGB bands of example images from EuroSat classes. Panels: (a) Forest, (b) Annual Crop, (d) River.

Presto processes pixel-timeseries independently, without spatial context from other pixels or locations. This means that when we make image-based predictions (such as for scene classification), Presto’s independent pixel representations must be aggregated into a single prediction. We opt for a simple concatenation of the element-wise mean and standard deviation of the representations, from which a classifier makes a prediction. Information gets lost in such a simple aggregation, which impacts Presto’s performance on such tasks.

For example, Presto’s performance on the EuroSat dataset reaches a plateau when increasing the input resolution. As Figure 7 shows, this is mainly caused by a failure to accurately predict specific classes (for example, the Highway and River classes). Figure 8 shows example images for these classes, as well as for the Forest and AnnualCrop classes, on which Presto achieves higher accuracies. While in the Forest and AnnualCrop images, most pixels of the image actually represent the labelled class, in the Highway and River images only a relatively small part of the image actually contains the label (a highway or river). We hypothesize that since many pixels in the Highway and River images do not actually represent that class, the crude token-aggregation method we use to represent images is insufficiently discriminative to accurately classify these images.

Other pre-trained remote sensing models use much more powerful mechanisms for aggregating spatial information. For example, ViT models convolve over patches and then apply an attention mechanism between spatial patches. If image-based predictions are needed and these predictions are highly dependent on the occurrence of objects in subregions of the image, models which natively process this important spatial information may be better suited.

We plan on exploring techniques to mitigate this difficulty with Presto in future work.
