Lightweight, Pre-trained Transformers for Remote Sensing Timeseries

Gabriel Tseng$^{1,2}$, Ruben Cartuyvels$^{1,3}$, Ivan Zvonkov$^{4}$, Mirali Purohit$^{5}$, David Rolnick$^{1,2}$, Hannah Kerner$^{5}$
$^{1}$ Mila – Quebec AI Institute
$^{2}$ McGill University
$^{3}$ KU Leuven
$^{4}$ University of Maryland, College Park
$^{5}$ Arizona State University

Abstract

Machine learning methods for satellite data have a range of societally relevant applications, but labels used to train models can be difficult or impossible to acquire. Self-supervision is a natural solution in settings with limited labeled data, but current self-supervised models for satellite data fail to take advantage of the characteristics of that data, including the temporal dimension (which is critical for many applications, such as monitoring crop growth) and availability of data from many complementary sensors (which can significantly improve a model’s predictive performance). We present Presto (the Pretrained Remote Sensing Transformer), a model pre-trained on remote sensing pixel-timeseries data. By designing Presto specifically for remote sensing data, we can create a significantly smaller but performant model. Presto excels at a wide variety of globally distributed remote sensing tasks and performs competitively with much larger models while requiring far less compute. Presto can be used for transfer learning or as a feature extractor for simple models, enabling efficient deployment at scale.

1 Introduction

Machine learning is increasingly being applied to the remote sensing domain, in particular to understand the evolution of the Earth’s surface over time [Brown et al., 2022, Voosen, 2020, Abys et al., 2024, Wang et al., 2020b]. These applications can have important societally beneficial outcomes, ranging from tracking progress on sustainable development goals [Ferreira et al., 2020] to improved weather forecasting [English et al., 2013, Voosen, 2020] to disaster management [Kansakar and Hossain, 2016]. However, labeled datasets often contain labels that are few, sparse, and unreliable [Bressan et al., 2022], especially for under-resourced geographies, leading to poor global generalization [Yifang et al., 2015, Kerner et al., 2020, Nakalembe et al., 2021]. This has spurred the investigation of self-supervised learning algorithms for remote sensing data.

Current self-supervised approaches for remote sensing data have drawn from methods in computer vision, yielding models that treat remote sensing data as single-timestep images [Jean et al., 2019, Manas et al., 2021, Ayush et al., 2021]. Such models (i) cannot benefit from patterns that emerge when an area is monitored over time, which is especially important for agriculture and other seasonal landcover, (ii) typically only consider a single satellite product (such as Sentinel-2 multispectral data), despite there being hundreds of publicly available satellite data products [GEE, ], (iii) are typically large and computationally expensive [Reed et al., 2022, Cong et al., 2022, Fuller et al., 2023], making the deployment of these models at scale challenging, and (iv) cannot natively handle the labels for many remote sensing datasets, which are points or irregularly shaped polygons [Rao et al., 2020, Batjes et al., 2017], requiring additional methods to handle these labels [Wang et al., 2020a].

We introduce the Pretrained Remote Sensing Transformer (Presto), a lightweight model designed to ingest pixel-timeseries inputs from a variety of Earth observation sensors and data products. Presto operates on individual pixels, using the temporal and multimodal structure of the data instead of the image structure. To learn powerful representations of remote sensing data that can be adapted to a wide range of tasks, Presto leverages a self-supervised masked autoencoding approach, reconstructing unobserved timepoints and sensory modalities. This allows Presto to be robust to missing data and to flexibly accommodate diverse input formats. We find Presto excels even in image-based tasks where the temporal dimension is completely absent.

Presto addresses the following requirements, which are critical to the useful deployment of pre-trained models in the remote sensing context:


Figure 1: Presto learns from structurally-masked remote sensing pixel-timeseries. We construct a multi-sensor remote sensing pixel-timeseries, and randomly select one of the four masking strategies described in Section 3.3. The encoder-decoder model is trained to reconstruct the original timeseries. At fine-tuning time, we discard the decoder and only use the encoder’s output. The downstream task may have incomplete inputs (missing timesteps or sensors) since the encoder is specifically trained on such inputs. Presto receives both static-in-time and dynamic-in-time inputs and the location metadata of each pixel timeseries.

Our results support the surprising conclusion that a pixel-based approach can in some cases match or outperform sophisticated computer vision-based approaches. We hypothesize that this is possible because (i) Presto learns from many semantically dense data sources, allowing it to extract informative patterns from pixel-timeseries, and (ii) many remote sensing tasks require significantly smaller receptive fields than those provided by computer vision-based models. Brown et al. [2022] leveraged such properties to train a model $100\times$ smaller than standard models while achieving state-of-the-art land-cover segmentation results.

Architectures for Remote Sensing

When processing remote sensing timeseries, transformers have been extensively investigated either as unmodified architectures [Rußwurm and Körner, 2020] or as architectures designed for specific tasks [Sainte Fare Garnot et al., 2020, Tarasiou et al., 2023]. Recurrent networks have also been investigated [Kerner et al., 2020, Rußwurm and Körner, 2020]. When treating remote sensing data as single or few (up to 3) timestep images, architectures from computer vision are commonly used, ranging from ResNets [Manas et al., 2021, Ayush et al., 2021, Rußwurm et al., 2020] to Vision Transformers [Cong et al., 2022, Reed et al., 2022, Fuller et al., 2023].

Self-supervised learning for Remote Sensing

While contrastive learning has been investigated for remote sensing [Manas et al., 2021], recent self-supervised learning research has focused on masked autoencoders [Yuan et al., 2022, Cong et al., 2022, Reed et al., 2022, Fuller et al., 2023]. However, these approaches (i) focus on learning from raw satellite data products (ignoring derived products such as elevation) and typically only ingest data from a single sensor (the exception being the CROMA model of Fuller et al. [2023], which ingests both Sentinel-1 and Sentinel-2 data), (ii) ingest very few or no timesteps (Reed et al. [2022] and Fuller et al. [2023] ingest only one timestep while Cong et al. [2022] ingest up to three timesteps), (iii) expect data in a certain size (for instance, ViT based models require spatial dimensions to be present), so that missing data is not handled natively, and (iv) generally yield larger models ranging from 2.5 million parameters [Yuan and Lin, 2020] to over 300 million parameters for ViT-based methods, making their deployment in compute-constrained settings challenging.

3 Method

We aim to learn a model, $f$, which can learn useful representations in a self-supervised manner given unlabelled remote sensing pixel-timeseries data while meeting the usability requirements outlined in Section 1. This model can then be applied to a wide variety of downstream remote sensing tasks. These downstream tasks may contain input data from a range of sensors with differing numbers of timesteps.

Our approach is based on the masked autoencoding framework [He et al., 2022], in which the network architecture includes both an encoder ($f$) and a decoder ($g$). During pre-training, part of the input is masked out and the encoder embeds the remaining (non-masked) part of the input. The decoder aims to reconstruct the masked-out part of the input, given the encoder’s output. At fine-tuning time, we discard $g$ and only use $f$ (either as a feature extractor or a fine-tuneable model) for downstream tasks. In the sections below, we discuss how Presto customizes this general framework for multi-sensor remote sensing timeseries data. An overview of the Presto pre-training methodology is shown in Figure 1, and full pre-training details are in Section A.1.

3.1 Pre-training Data

Self-supervised models for remote sensing must generalize to a wide range of geographies and tasks [Lacoste et al., 2023]. We therefore aimed to collect a globally representative pre-training dataset. We followed the sampling strategy of Brown et al. [2022] to construct a dataset of 21.5M pixel samples, each with a resolution of 10m per pixel. Appendix A.1.1 describes the pre-training dataset construction process in detail. Presto was trained on pixel-timeseries of 12-month contiguous intervals, sampled from a 2-year period from the beginning of 2020 until the end of 2021, with each month represented by one timestep (similar to the approach adopted by Tseng et al. [2021]). Derived data products that result from the analysis of lower level data (e.g., Parkinson et al. [2006]) can significantly improve model performance [Rao et al., 2020, Hengl et al., 2017]. We therefore pre-trained Presto on a diverse set of directly-sensed and derived Earth observation products which we pre-processed and exported using Google Earth Engine [Gorelick et al., 2017].

A pre-training batch contained several pixel-timeseries samples, each of which is a concatenation of dynamic-in-time datapoints with each timestep representing a month (yielding $T=12$ timesteps in total). The following dynamic-in-time data products were used, yielding $15$ channels: (i) Sentinel-2 (S2) multispectral data, (ii) Sentinel-1 (S1) radar data, (iii) ERA5 climate reanalysis data, (iv) NDVI [Rouse et al., 1974] derived from Sentinel-2 data and (v) land cover classes $\mathcal{V}$ from Dynamic World. To every pixel-timeseries we appended two static-in-time products: (i) topography data from the SRTM digital elevation model [90m Digital Elevation Data, 2003] and (ii) location coordinates of each pixel. Hence, one pre-training sample $x$, comprising a pixel-timeseries $t \in [\mathbb{R}^{T \times 15}; \mathcal{V}^{T \times 1}]$ and static variables $s \in \mathbb{R}^{1 \times 5}$, is summarized as follows:

$$x = \Big[\big\{t_i^{\text{S1}};\ t_i^{\text{S2}};\ t_i^{\text{ERA5}};\ t_i^{\text{NDVI}};\ t_i^{\text{DW}} \,\big|\, i = 1, \ldots, 12\big\};\ s^{\text{TG}};\ s^{\text{Loc}}\Big] \qquad (1)$$

From now on, we use “pixel-timeseries” to refer to both the dynamic and the static variables.
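As an illustration of the shapes in Equation (1), the following minimal sketch builds one synthetic sample with NumPy. The dictionary layout, the function name `make_sample` and the use of random values are assumptions made for illustration; the number of Dynamic World classes (9) is not stated in this section and is likewise assumed.

```python
import numpy as np

T = 12             # monthly timesteps per sample
N_DYNAMIC = 15     # real-valued dynamic channels (S1, S2, ERA5, NDVI)
N_STATIC = 5       # static values (topography and location)
N_DW_CLASSES = 9   # assumed size of the Dynamic World class set V


def make_sample(rng):
    """Build one synthetic pixel-timeseries with the shapes of Equation (1)."""
    dynamic = rng.normal(size=(T, N_DYNAMIC))          # t in R^{T x 15}
    dw = rng.integers(0, N_DW_CLASSES, size=(T, 1))    # t^DW in V^{T x 1}
    static = rng.normal(size=(1, N_STATIC))            # s in R^{1 x 5}
    return {"dynamic": dynamic, "dw": dw, "static": static}


sample = make_sample(np.random.default_rng(0))
for name, value in sample.items():
    print(name, value.shape)
```

Keeping the categorical land cover channel separate from the real-valued channels mirrors the split $[\mathbb{R}^{T \times 15}; \mathcal{V}^{T \times 1}]$ and makes it straightforward to embed the two kinds of input differently.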

3.2 Encoding and tokenization


Figure 2: Presto learns to reconstruct channels that are completely masked in a spatially cohesive manner. In this experiment, we masked only the Sentinel-2 RGB channels; Presto was able to reconstruct these channels even when they were absent from the input. The reconstructions are spatially consistent even though Presto only receives single pixel inputs.

We transformed the pixel-timeseries $x$ into a number of tokens (each represented by an embedding $e$) to be processed by the Presto transformer. Per timestep $0 \leq i < T$, we split the input variables into channel groups $\mathcal{C}$ according to their type of sensor or source: e.g., the S1 bands form one channel group. We describe these groups in more detail in Appendix A.1.3. Each real-valued channel group represents a different sensor, native spatial resolution or (in the case of Sentinel-2 channel-groups) region of the electromagnetic spectrum. We projected each channel group to a common latent space of dimension $d_e$ by separate learned linear projections $h^{\mathcal{C}}$: e.g., $e_i^{\text{S1}} = h^{\text{S1}}(t_i^{\text{S1}})$. The Dynamic World classes are categorical, so we embedded them by indexing them into an embedding matrix.
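The tokenization described above can be sketched as follows. This is a minimal NumPy illustration, not the released implementation: the channel-group names and column indices in `GROUPS` are hypothetical (the actual grouping is given in Appendix A.1.3), the weights are random rather than learned, and 9 Dynamic World classes are assumed.

```python
import numpy as np

T, D_E, N_DW_CLASSES = 12, 128, 9   # D_E follows Section 4; class count is assumed

# Hypothetical channel groups: name -> column indices into the 15 dynamic channels.
GROUPS = {
    "S1": [0, 1],
    "S2_RGB": [2, 3, 4],
    "S2_RedEdge": [5, 6, 7],
    "S2_NIR": [8, 9],
    "S2_SWIR": [10, 11],
    "ERA5": [12, 13],
    "NDVI": [14],
}

rng = np.random.default_rng(0)
# One separate linear projection h^C (weights, bias) per real-valued channel group.
projections = {
    name: (rng.normal(scale=0.02, size=(len(idx), D_E)), np.zeros(D_E))
    for name, idx in GROUPS.items()
}
# Categorical Dynamic World classes are embedded by table lookup.
dw_embedding = rng.normal(scale=0.02, size=(N_DW_CLASSES, D_E))


def tokenize(dynamic, dw):
    """dynamic: (T, 15) floats, dw: (T,) ints -> tokens of shape (T, n_groups + 1, D_E)."""
    tokens = []
    for name, idx in GROUPS.items():
        weights, bias = projections[name]
        tokens.append(dynamic[:, idx] @ weights + bias)   # (T, D_E) per group
    tokens.append(dw_embedding[dw])                       # (T, D_E) lookup
    return np.stack(tokens, axis=1)


tokens = tokenize(rng.normal(size=(T, 15)), rng.integers(0, N_DW_CLASSES, size=T))
print(tokens.shape)
```

Under these assumptions each timestep yields one token per channel group, so a group that is unavailable downstream can simply be omitted from the sequence.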

Table 1: We evaluated Presto on a wide variety of downstream tasks, including segmentation (seg.), multi-label (ml) scene classification (class.) and regression (reg.) tasks. There is diversity in terms of data composition, geographic area and training set size. Input shape describes the shape of a single sample, in terms of [Height, Width, Timesteps, Channels]. We bold the temporal dimension, to highlight time-series versus single-timestep inputs.

| Dataset | Task | Region | Input shape | Train samples |
|---|---|---|---|---|
| CropHarvest | Seg. | Kenya | [1, 1, **12**, 18] | 1,345 |
| | | Brazil | | 203 |
| | | Togo | | 1,319 |
| S2-Agri$_{100}$ | Class. | France | [5, 5, **24**, 10] | 1,500 |
| TreeSat | ML Class. | Germany | [6, 6, **1**, 2] | 45,337 |
| | | | [6, 6, **1**, 11] | |
| EuroSat | Class. | Europe | [64, 64, **1**, 3] | 21,600 |
| | | | [64, 64, **1**, 11] | |
| Fuel Moisture | Reg. | USA | [1, 1, **3**, 19] | 1,578 |
| Algae Blooms | Reg. | USA | [1, 1, **12**, 19] | 777 |

Unlike natural images in which the data and its label are self-contained, remote sensing labels are inherently associated with a place and time on Earth (i.e., a latitude/longitude and timestamp). In addition, while natural images contain RGB channels from the same camera sensor, Presto’s pixel-timeseries input contains channels from multiple remote sensing instruments and data products. We therefore wanted to communicate to the model: (i) the location of the datapoint (already present in the input as a static variable through the coordinates $s^{\text{Loc}}$) and a variable’s (ii) timestamp and (iii) channel group. We did this by adding encodings to the previously described embeddings $e$. The complete encoding has dimension $d_e$ and contains a concatenation of positional, month, and learned channel encodings described below.

The transformer input $E \in \mathbb{R}^{(T \cdot |\mathcal{C}_{\text{dynamic}}| + |\mathcal{C}_{\text{static}}|) \times d_e}$ (for encoder dimension $d_e$) is a concatenation of:

Table 2: Mean F1 score across all CropHarvest tasks. Presto outperforms TIML [Tseng et al., 2022] and MOSAIKS-1D while requiring the adaptation of far fewer parameters. The TIML and MOSAIKS-1D models did not receive Dynamic World as input, so we measured Presto’s performance both with and without it.

| Model | Total parameters | Adapted parameters | Mean F1 |
|---|---|---|---|
| Random Forest | | | 0.441 |
| MOSAIKS-1D$_{R}$ | 418K | 8193 | 0.738 |
| TIML | 91K | 91K | 0.802 |
| Presto$_{R}$ | 402K | 129 | 0.835 |
| Presto$_{R}$ (no DW) | | | **0.836** |

3.3 Pre-training via Structured Masking

A key requirement for Presto was to perform well even with incomplete inputs (i.e., when there are missing timesteps, channels, or both). When masking out part of the input $x$, we therefore tailored the masking strategies to encourage the model to learn representations that perform well when given a subset of bands or timesteps for downstream tasks. For a $T \times D$ input of $T$ timesteps and $D$ total input channels, we used the following masking techniques (illustrated in Figure 1), where Presto considers a token to be a $1 \times d$ input (a single timestep of $d$ grouped channels). The coordinates were never masked but the static topological tokens can be.

    1. Random: $(t \times d)$ masked values, with $t < T$ and $d < D$
    2. Channel-groups: $(T \times d)$ masked values, with $d < D$
    3. Contiguous timesteps: $(t \times D)$ masked values, with $t < T$
    4. Timesteps: $(t \times D)$ masked values, with $t < T$

For each training instance, we randomly sampled from the above strategies to construct a mask.
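The four strategies can be sketched as boolean masks over a grid of tokens (timesteps by channel groups). This is an illustrative sketch under stated assumptions rather than the pre-training code: the function name `make_mask`, the masking ratio of 0.75 and the grid of 8 channel groups are hypothetical, and only dynamic tokens are covered (static tokens and coordinates are not handled here).

```python
import numpy as np

STRATEGIES = ("random", "channel_groups", "contiguous_timesteps", "timesteps")


def make_mask(strategy, T, G, ratio, rng):
    """Return a boolean (T, G) token mask; True marks a masked-out token."""
    mask = np.zeros((T, G), dtype=bool)
    if strategy == "random":
        n = int(round(ratio * T * G))
        flat = np.zeros(T * G, dtype=bool)
        flat[rng.choice(T * G, size=n, replace=False)] = True
        mask = flat.reshape(T, G)
    elif strategy == "channel_groups":
        n = min(G - 1, max(1, int(round(ratio * G))))   # keep at least one group visible
        mask[:, rng.choice(G, size=n, replace=False)] = True
    elif strategy == "contiguous_timesteps":
        n = min(T - 1, max(1, int(round(ratio * T))))   # keep at least one timestep visible
        start = int(rng.integers(0, T - n + 1))
        mask[start:start + n, :] = True
    elif strategy == "timesteps":
        n = min(T - 1, max(1, int(round(ratio * T))))
        mask[rng.choice(T, size=n, replace=False), :] = True
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return mask


rng = np.random.default_rng(0)
strategy = str(rng.choice(STRATEGIES))   # one strategy drawn per training instance
mask = make_mask(strategy, T=12, G=8, ratio=0.75, rng=rng)
print(strategy, int(mask.sum()), "of", mask.size, "tokens masked")
```

Masking whole columns or whole rows of the token grid imitates the downstream situations the section targets: an entire sensor being unavailable, or entire months being missing.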

To handle both the categorical and continuous inputs we used the following loss function, which balances the continuous and categorical losses for every batch so that each reconstructed value receives the same weighting in the final loss: $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MSE}} + \lambda \frac{N_{\text{cat}}}{N_{\text{cont}}} \mathcal{L}_{\text{CE}}$. $\mathcal{L}_{\text{MSE}}$ is the mean squared error reconstruction loss used for the continuous values, $\mathcal{L}_{\text{CE}}$ is the cross entropy loss used for the categorical values, $N_{\text{cont}}$ is the number of masked continuous values and $N_{\text{cat}}$ is the number of masked categorical values in the batch. $\lambda$ is a hyperparameter, which we set to $2$.
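A minimal NumPy sketch of this loss is given below. The function name and argument layout are assumptions for illustration; the inputs are taken to be only the masked values of a batch, and both the continuous and categorical sets are assumed to be non-empty.

```python
import numpy as np


def presto_style_loss(pred_cont, target_cont, logits_cat, target_cat, lam=2.0):
    """L_total = L_MSE + lam * (N_cat / N_cont) * L_CE over the masked values.

    pred_cont, target_cont: (N_cont,) reconstructed and true continuous values
    logits_cat: (N_cat, K) class scores; target_cat: (N_cat,) integer labels
    """
    n_cont = target_cont.size
    n_cat = target_cat.size
    mse = np.mean((pred_cont - target_cont) ** 2)
    # Numerically stable log-softmax followed by the mean negative log-likelihood.
    shifted = logits_cat - logits_cat.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(n_cat), target_cat])
    return mse + lam * (n_cat / n_cont) * ce


# Perfect continuous reconstruction and uniform class scores over 9 classes:
# the loss reduces to lam * (N_cat / N_cont) * log(9).
loss = presto_style_loss(
    np.ones(10), np.ones(10), np.zeros((5, 9)), np.array([0, 1, 2, 3, 4])
)
print(loss)
```

Because both terms are means, the factor $N_{\text{cat}} / N_{\text{cont}}$ rescales the cross entropy term by the relative number of categorical values in the batch.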


Figure 3: Presto is robust to incomplete inputs. We measured the AUC ROC score of Presto with linear probing (Presto$_{R}$) on the CropHarvest dataset when no Dynamic World input is passed, and with a subset of input months (the x-axis). We plot the performance of MOSAIKS-1D and TIML when they receive the full 12 months of input (dashed horizontal lines); Presto$_{R}$ recovered the performance of these models given only a subset of input months.

4 Experiments

In all experiments described below, we use a Presto model with identical encoder and decoder configurations (2 attention layers with 8 heads, an embedding size of 128 and an MLP ratio of 4). We investigated the effect of different encoder configurations in Table 8.

For downstream evaluation, we took the encoder-decoder model learned during pre-training and discarded the decoder. As in He et al. [2022], we passed a global pool of all the encoder’s output tokens to a downstream classifier. We evaluated the performance of three different models: Presto$_{R}$, Presto$_{RF}$, and Presto$_{FT}$, defined below.

During pre-training, we used a validation task consisting of classifying all points in the CropHarvest dataset [Tseng et al., 2021] according to their FAO indicative crop classifications. For this validation task, we excluded points used for evaluation (Section 5.1).

For evaluation, we compared Presto to state-of-the-art task-specific baselines (Section 5). Because there are no other global self-supervised models for pixel-timeseries, we adapted MOSAIKS [Rolf et al., 2021] for timeseries data by performing convolutions over the temporal rather than spatial dimension (MOSAIKS-1D). We used the output features with random forests (MOSAIKS-1D$_{RF}$) and regressions (MOSAIKS-1D$_{R}$).
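The idea of convolving over the temporal dimension can be illustrated with the sketch below, written in the spirit of random convolutional features. It is not the baseline's actual implementation: the Gaussian filters, the filter width of 3, the number of filters and the function name are all assumptions made for illustration.

```python
import numpy as np


def random_temporal_features(x, filters, bias):
    """Featurize one pixel-timeseries with fixed (untrained) temporal filters.

    x: (T, C) timeseries; filters: (K, width, C); bias: (K,)
    Returns K features: ReLU responses averaged over all temporal windows.
    """
    T = x.shape[0]
    width = filters.shape[1]
    # Sliding windows along time only: (T - width + 1, width, C).
    windows = np.stack([x[i:i + width] for i in range(T - width + 1)])
    responses = np.einsum("twc,kwc->tk", windows, filters) + bias
    return np.maximum(responses, 0.0).mean(axis=0)


rng = np.random.default_rng(0)
x = rng.normal(size=(12, 15))             # 12 months, 15 channels
filters = rng.normal(size=(32, 3, 15))    # 32 hypothetical filters spanning 3 months
bias = rng.normal(size=32)
features = random_temporal_features(x, filters, bias)
print(features.shape)
```

The resulting fixed-length feature vector can then be passed to a random forest or a regression, which is the only part that is fitted to the downstream task.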


Figure 4: We obtained per-image predictions using Presto by computing a mean and standard deviation of Presto’s per-pixel outputs, and passing this concatenated vector to a downstream classifier. We illustrate this for the EuroSat task.

5 Evaluation Tasks & Results

We evaluated Presto using six evaluation tasks spanning diverse task types, geographic locations (4 continents and 38 countries), input data modalities, and fine-tuning dataset sizes (Table 1). Whenever possible, we benchmarked Presto against the state-of-the-art model for that task.

Applying Presto to downstream tasks is computationally efficient. While other methods require a cluster of GPUs for fine-tuning [Cong et al., 2022], we fine-tuned Presto on a single GPU or CPU. For the fuel moisture task described in Section 5.1, fine-tuning Presto took under 6 minutes on a 2017 MacBook Pro’s CPU. When Presto is used as a feature extractor, simple models can be trained which require few parameters to be learned, as we show in Table 2. Even when fully fine-tuned, Presto’s small size meant that relatively few parameters needed to be trained (Tables 5 and 6). This makes Presto accessible to practitioners, especially those lacking significant computational resources.

Below, we describe the tasks used to evaluate Presto and discuss Presto’s performance on these tasks.

Table 3: RMSE results on the regression tasks. The literature baselines are not directly comparable, since they use different input datasets or private test data (or both). Rao et al. [2020] reported an RMSE of 25 on the fuel moisture dataset with a physics-assisted neural network and the algae bloom competition winner reported an RMSE of 0.761, indicating our results are within the scope of utility. Best results are highlighted blue, with second best results in bold. Models have a high variance in performance across tasks, so we calculated the mean difference in RMSE from the linear regression baseline across both tasks. Presto performed most consistently, both when used as a feature-extractor and when fine-tuned.

| Model | Fuel Moisture | Algae Blooms | Mean difference |
|---|---|---|---|
| Linear Regression | 28.20 | 0.850 | 0% |
| Random Forest | 23.84 | 1.249 | 15.7% |
| MOSAIKS-1D$_{RF}$ | 28.75 | 0.972 | 8.15% |
| Presto$_{FT}$ (random init.) | 26.07 | 0.955 | 2.40% |
| Presto$_{FT}$ | 25.28 | 0.815 | **−7.24%** |
| Presto$_{RF}$ | 25.98 | 0.884 | −1.94% |
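The "Mean difference" column can be reproduced from the two RMSE columns, reading it as the relative change in RMSE from the linear regression baseline averaged over both tasks. The short sketch below uses the values reported in Table 3; the variable and function names are illustrative.

```python
BASELINE = {"fuel_moisture": 28.20, "algae_blooms": 0.850}   # linear regression

RMSE = {
    "Random Forest": {"fuel_moisture": 23.84, "algae_blooms": 1.249},
    "MOSAIKS-1D_RF": {"fuel_moisture": 28.75, "algae_blooms": 0.972},
    "Presto_FT (random init.)": {"fuel_moisture": 26.07, "algae_blooms": 0.955},
    "Presto_FT": {"fuel_moisture": 25.28, "algae_blooms": 0.815},
    "Presto_RF": {"fuel_moisture": 25.98, "algae_blooms": 0.884},
}


def mean_relative_difference(rmse, baseline=BASELINE):
    """Average percentage change in RMSE relative to the baseline (negative is better)."""
    diffs = [(rmse[task] - baseline[task]) / baseline[task] for task in baseline]
    return 100.0 * sum(diffs) / len(diffs)


for model, rmse in RMSE.items():
    print(f"{model}: {mean_relative_difference(rmse):+.2f}%")
```

Averaging relative rather than absolute differences keeps the fuel moisture task, whose RMSE values are roughly 30 times larger, from dominating the summary.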

5.1 Timeseries Tasks

Table 4: Results on the TreeSatAI dataset. We compared Presto to the dataset’s benchmark models. The MLPs contain 3 layers (with 563K-723K parameters respectively) and are tuned for this task. We froze the Presto encoder’s 402k parameters and trained a random forest on its outputs with default scikit-learn hyperparameters.

| Model | Data | Weighted $F_1$ | Weighted mAP | Micro $F_1$ | Micro mAP |
|---|---|---|---|---|---|
| MLP | S1 | 10.09 | 29.42 | 12.82 | 33.09 |
| LightGBM | S1 | 11.86 | 32.79 | 14.07 | 35.11 |
| Presto$_{RF}$ | S1 | **38.34** | **35.45** | **40.79** | **38.64** |
| MLP | S2 | 51.97 | 64.19 | 54.59 | 65.83 |
| LightGBM | S2 | 48.17 | 61.99 | 52.52 | 61.66 |
| Presto$_{RF}$ | S2 | **55.29** | 61.53 | **58.29** | 63.31 |

5.1.1 Timeseries Results

Presto excels at timeseries tasks, significantly outperforming the state-of-the-art for CropHarvest (Table 2) and outperforming all baselines for the regression tasks (Table 3).

We found that Presto is performant when passed only a subset of timesteps compared to the 12 timesteps used for pre-training. Presto remained performant when receiving only 3 input timesteps for the fuel moisture task (Table 3). We also evaluated Presto when a subset of input months are passed for the CropHarvest dataset (Figure 3). Using a subset of the 12 months, Presto surpassed the performance of TIML and MOSAIKS-1D which used all input months.

Presto is also robust to the removal of input channels. On the CropHarvest dataset (Table 2), Presto remained performant without the Dynamic World input, showing a negligible difference in mean F1 score compared to the full input.

5.2 Image-based Tasks

Presto is designed to ingest single pixel-timeseries. When one prediction is required for a set of pixels (as for image-based tasks and the Image-Timeseries tasks in Section 5.3), we used the following approach to obtain per-image predictions from Presto’s pixel outputs (Figure 4): (i) we encoded the pixels in an image individually, yielding N output tokens, (ii) we calculated the mean and standard deviation of these N output tokens per dimension and concatenated the result, yielding a $2d_e$-dimensional vector (where $d_e$ is Presto’s output token size, or 128), and (iii) we passed this mean and standard deviation vector to a downstream classifier.
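Steps (i) to (iii) can be sketched as follows. The random linear map standing in for the frozen Presto encoder is purely a placeholder assumption, as are the function and variable names; only the mean and standard deviation pooling reflects the procedure described above.

```python
import numpy as np

D_E = 128   # Presto's output token size


def per_image_features(pixel_embeddings):
    """Pool (N, D_E) per-pixel embeddings into one (2 * D_E,) image vector."""
    mean = pixel_embeddings.mean(axis=0)
    std = pixel_embeddings.std(axis=0)
    return np.concatenate([mean, std])


rng = np.random.default_rng(0)
image = rng.normal(size=(64, 64, 11))      # EuroSat-like patch: height x width x channels
stand_in_encoder = rng.normal(size=(11, D_E))   # placeholder for the frozen encoder

# (i) encode every pixel independently, (ii) pool, (iii) hand off to a classifier.
pixel_embeddings = image.reshape(-1, 11) @ stand_in_encoder   # (4096, 128)
features = per_image_features(pixel_embeddings)               # (256,)
print(features.shape)
```

Because the pooled vector has a fixed length regardless of the number of pixels, the same downstream classifier can be used for images of different sizes.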


Figure 5: EuroSat accuracy of a kNN@5 classifier given pre-trained model embeddings at a variety of input resolutions (following Reed et al. [2022]) as a function of FLOPs required to encode an image (note the log scale on the x-axes). All image-based models resized images to $224 \times 224$, so the FLOPs required to encode an image do not change with image resolution. Presto achieved competitive results with image-based models while requiring up to four orders of magnitude fewer FLOPs to encode an image.

5.2.1 Image-based Results

Despite being pre-trained on pixel-timeseries data, Presto is competitive on single-timestep image datasets against much larger models. We followed the setup of Reed et al. [2022] in measuring the performance of a kNN-classifier on Presto’s output embeddings for the EuroSat dataset at varying resolutions. Presto achieved comparable average accuracy (over all image resolutions) to larger ViT-based models with RGB data and significantly outperformed these models with multispectral (MS) data (Figure 5), while requiring orders of magnitude less compute to encode the images in both cases and for any resolution.

Presto is performant even when only a small subset of input channels are available compared to the pre-training channels. For the EuroSAT task (Table 5), Presto received either the full Sentinel-2 input or only RGB bands (which represent only a single token, since only one timestep is available). Similarly, we evaluated Presto when it receives either Sentinel-2 or Sentinel-1 data for the TreeSatAI task (Table 4). In both cases, Presto was competitive with methods designed to ingest single-timestep, single-sensor data.

Table 5: EuroSAT finetuning accuracy. Presto is the only backbone that can handle both MS and RGB inputs (separate SatMAE models are trained for RGB and MS inputs). We reported Presto results for full resolution; results at reduced resolutions are in Table 11.

| Model | Backbone | Inputs | Params (M) | Accuracy |
|---|---|---|---|---|
| GASSL | ResNet-18 | RGB | 11.69 | 0.895 |
| SeCo | ResNet-18 | RGB | 11.69 | 0.931 |
| SatMAE | ViT-Large | RGB | 303.10 | 0.955 |
| SatMAE | ViT-Large | MS | 305.96 | 0.990 |
| Random init. | Presto | RGB | 0.40 | 0.745 |
| | | MS | | 0.924 |
| Presto | Presto | RGB | 0.40 | 0.849 |
| | | MS | | 0.953 |

5.3 Image-Timeseries Tasks

5.3.1 Image-Timeseries Results

The S2-Agri$_{100}$ dataset consists of 24 timesteps at 10 to 30 day intervals (compared to Presto’s pre-training data, which consists of 12-month timeseries). Presto remained performant on this dataset, achieving comparable results with SITS-Former despite having $6\times$ fewer parameters (shown in Table 6). This shows that Presto can ingest timeseries at different temporal resolutions and at varying intervals.

In addition, the S2-Agri dataset is missing pixel location metadata, which is always passed to Presto during pre-training. S2-Agri was sampled from a single S2-tile, so we used the location of the central pixel of this tile for all pixels in the dataset. Even with this much less accurate location metadata, Presto remained performant.

Table 6: Results on the S2-Agri100100{}_{100}start_FLOATSUBSCRIPT 100 end_FLOATSUBSCRIPT dataset. We followd [Yuan et al., 2022] in reporting overall accuracy (OA), Kappa Cohen score (κ𝜅\kappaitalic_κ) and macro-F11{}_{1}start_FLOATSUBSCRIPT 1 end_FLOATSUBSCRIPT score. All results are an average of 3 runs - standard errors are reported in Table A.5.

Model Params (M) Pre-trained? OA $\kappa$ $F_1$
SITS-Former 2.5 ✗ 65.13 0.55 42.12
SITS-Former 2.5 ✓ 67.03 0.56 $\mathbf{42.83}$
Presto 0.4 ✗ 45.98 0.35 27.45
Presto 0.4 ✓ $\mathbf{68.89}$ $\mathbf{0.58}$ 40.41
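The three metrics reported for S2-Agri$_{100}$ can all be derived from a confusion matrix. The following is a minimal sketch of the standard definitions, not the evaluation code used for the table; the function name and the toy confusion matrix are illustrative assumptions.

```python
def classification_metrics(confusion):
    """Compute OA, Cohen's kappa and macro-F1.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j (toy input, not data from the paper).
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]  # true-class counts
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]      # predicted-class counts
    correct = sum(confusion[i][i] for i in range(n_classes))

    # Overall accuracy: fraction of correctly classified samples.
    oa = correct / total

    # Cohen's kappa: agreement corrected for chance agreement.
    # (Undefined if chance agreement equals 1; not handled in this sketch.)
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (oa - expected) / (1 - expected)

    # Macro-F1: unweighted mean of per-class F1 = 2TP / (2TP + FP + FN).
    f1_scores = []
    for i in range(n_classes):
        denom = row_sums[i] + col_sums[i]
        f1_scores.append(2 * confusion[i][i] / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / n_classes

    return oa, kappa, macro_f1


# Toy two-class example.
oa, kappa, macro_f1 = classification_metrics([[8, 2], [1, 9]])
```

Because macro-$F_1$ weights every class equally, it can rank models differently from OA on class-imbalanced data, which is one possible reading of why the best OA and best $F_1$ in Table 6 belong to different models.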

5.4 Ablations

We conducted three ablations to better understand Presto’s performance:

Table 7: Structured masking strategies yield the best downstream performance. We measured Presto$_R$’s F1 score on the CropHarvest validation task. Combining structured strategies outperformed the “Random” masking employed by [He et al., 2022].

Channel Groups Random Timesteps Contiguous Timesteps F1 Score
✓ 0.646
✓ 0.653
✓ 0.664
✓ 0.649
✓ ✓ ✓ ✓ $\mathbf{0.665}$

6 Discussion & Conclusion

Limitations

Presto is designed to ingest 10m/px resolution imagery and is pre-trained on products at this scale. This decision is motivated by the free, global availability over time of products at this scale (such as Sentinel-1 and Sentinel-2). Presto does not natively process very-high-resolution imagery such as $<1$ m/px imagery from commercial satellites or drones, which can be costly and often lacks complete coverage globally and temporally. In addition, Presto is a pixel-timeseries model. While we demonstrated Presto’s flexibility on single-timestep image datasets, image-based models may be preferred if a user’s goal is to process entire images to make a prediction. We observed that Presto’s performance on the EuroSAT dataset plateaued as the input resolution increased (Table 5), due to images from classes where the relevant pixels for the class are a minority of the pixels in the image (e.g., highways). In such scene classification challenges, image-based models which can learn the shape of the relevant pixels may be better suited. We discuss this further in Section A.6.

Table 8: Effect of model size on validation performance. To understand the effect of model size on performance, we pre-train two larger variants of Presto. As in Table 7, we measure Presto$_R$’s performance on the CropHarvest validation task. The number of parameters includes both the encoder and decoder parameters. The FLOPs are computed for a “full” input (12 timesteps, with no missing channels), when passed through the encoder and decoder.

Depth Width # params (M) FLOPs (M) F1 score
2 128 0.81 88.94 0.665
2 256 2.02 220.81 0.687
4 128 1.21 132.42 0.669

Conclusion

We present Presto: a lightweight, pre-trained timeseries transformer for remote sensing. By leveraging structure unique to remote sensing data, specifically (i) an important temporal dimension, (ii) associated metadata and (iii) a diversity of sensors, we are able to train an extremely lightweight model which achieves state-of-the-art results in a wide variety of globally distributed evaluation tasks. Computational efficiency is of paramount importance in remote sensing settings and often determines which models ultimately get selected for deployment. We demonstrated that strong performance can be achieved while meeting this constraint, and that self-supervised learning can provide significant benefits even for small models.

Impact statement

Machine learning applications to remote sensing have a wide range of societally beneficial outcomes, ranging from tracking progress on sustainable development goals [Ferreira et al., 2020] to improved weather forecasting [English et al., 2013, Voosen, 2020] to disaster management [Kansakar and Hossain, 2016].

Presto is designed to be accessible to a wide range of practitioners; we achieve this by only training Presto on publicly available data and by keeping the model size small enough so it can be leveraged in compute-constrained environments. In addition to increasing Presto’s accessibility, its small size also lowers its carbon footprint [Strubell et al., 2019].

As described by Tuia et al. [2023], a natural concern when applying machine learning algorithms to remote sensing data is its use to collect information about individuals who are unaware that data is being collected, and therefore cannot consent to this practice. We therefore encourage deployment of Presto in collaboration with local communities and stakeholders [Krafft, , Kshirsagar et al., 2021, Nakalembe and Kerner, 2023].

Acknowledgements

This work was supported by NASA under the NASA Harvest Consortium on Food Security and Agriculture (Award #80NSSC18M0039). This research was enabled in part by compute resources provided by Mila (mila.quebec); in addition, we acknowledge material support from NVIDIA Corporation in the form of computational resources. We thank Esther Rolf and Caleb Robinson for reviewing drafts of this manuscript.

References

Appendix A Appendix

Reproducibility

All code and data used to train and evaluate Presto will be made available upon publication, and the code is currently available at https://github.com/nasaharvest/presto. In addition, we discuss specific implementation details in Appendices A.1 and A.4. We have strived to make the Presto codebase accessible to other practitioners; to this end, we include a demo Jupyter notebook demonstrating how Presto can be applied to a new downstream task, which is available at https://github.com/nasaharvest/presto/blob/main/downstream_task_demo.ipynb.

A.1 Pre-training details

We outline training hyperparameters below:

A.1.1 Pre-training data

Figure 6: The distribution of the pre-training dataset described in Section 3.1.

Remote sensing models can be deployed in a wide range of geographies, with few labelled datapoints available at fine-tuning time [Kerner et al., 2020, Böhm et al., 2022]. We therefore aim to collect a globally representative pre-training dataset. We achieve this by following the sampling strategy used by Dynamic World [Brown et al., 2022]. We divide the Earth into three regions: the Western Hemisphere and two regions in the Eastern Hemisphere. These regions are further divided into ecoregions, and stratified samples are gathered from each region using land cover classes as sampling strata. Figure 6 shows the resulting geographical distribution. Each sample represents a $510\times 510$ pixel tile with a spatial resolution of 10 meters per pixel. To obtain pixel-timeseries, we grid-sample 2,500 pixels from each sample, yielding a total of 21,535,000 pixel samples (each with 24 one-month timesteps).
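The numbers above imply a fixed number of tiles, and the grid-sampling step can be pictured with a short sketch. The code below is illustrative only: the text states that 2,500 pixels are grid-sampled per $510\times 510$ tile, but the regular $50\times 50$ layout, the cell-centre offsets and the function name are assumptions made here, not details confirmed by the paper.

```python
TILE_SIZE = 510            # pixels per tile side (from the text)
PIXELS_PER_TILE = 2_500    # grid-sampled pixels per tile (from the text)
TOTAL_PIXELS = 21_535_000  # total pixel samples (from the text)


def grid_sample_indices(tile_size=TILE_SIZE, n_per_side=50):
    """Return (row, col) indices on an evenly spaced n_per_side x n_per_side grid.

    Assumption: a regular 50 x 50 grid, sampled at the centre of each grid cell.
    """
    step = tile_size / n_per_side
    coords = [int(step * (i + 0.5)) for i in range(n_per_side)]
    return [(row, col) for row in coords for col in coords]


indices = grid_sample_indices()

# Number of tiles implied by the totals quoted in the text.
n_tiles = TOTAL_PIXELS // PIXELS_PER_TILE
```

Under these assumptions each tile contributes 2,500 distinct pixel locations, and the quoted total corresponds to 8,614 tiles.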

A.1.2 Input data

We leverage the following data products when pre-training Presto:

A.1.3 Channel Groups

As described in Section 3.2, we transform the pixel timeseries $x$ into a number of tokens, where each token is a linear transformation of a subset of the input channels. We group together channels which (i) come from the same sensor or product, (ii) have equivalent native spatial resolutions and (iii) represent similar parts of the electromagnetic spectrum (for Sentinel-2 channel groups). We group the input data into the following channel groups:

A.2 FLOP calculations

Table 9: Model sizes and FLOPs required to encode a single EuroSAT image (or pixel, for Presto), as measured by the thop library. When plotting results in Table 5, we multiply the FLOPs for Presto by the number of pixels encoded for an image. At full resolution, EuroSAT images are $64\times 64$ pixels, so Presto FLOPs for a full resolution image can be obtained by multiplying the per-pixel FLOPs by 4,096. We include this value in brackets for completeness.

Model Backbone Params (M) MegaFlops
SatMAE (RGB) [Cong et al., 2022] ViT-Large 303.10 59,685.69
SatMAE (MS) [Cong et al., 2022] ViT-Large 305.96 535,515.25
ScaleMAE [Reed et al., 2022] ViT-Large 303.10 59,685.69
ConvMAE [Gao et al., 2022] ConvMAE-Large 88.78 23,315.58
SeCo [Manas et al., 2021] ResNet-18 11.69 149.37
GASSL [Ayush et al., 2021] ResNet-18 11.69 149.37
Presto RGB pixel (image) Presto 0.40 0.79 (3,235.84)
Presto MS pixel (image) Presto 0.40 2.37 (9,707.52)

We use the thop library (https://github.com/Lyken17/pytorch-OpCounter) to calculate the FLOPs required to encode a EuroSAT image (as plotted in Table 5(b)). For the SatMAE, ScaleMAE and ConvMAE models, all images were resized to $224\times 224$, so the FLOPs required to encode an image are independent of resolution. For Presto, we computed the FLOPs required to encode a single pixel and multiplied this by the number of pixels in an image at each resolution (e.g. the “64” resolution has $64\times 64$ pixels, so we multiply the FLOPs required to encode a single pixel by $64\times 64=4096$). The FLOPs calculated by the thop library are recorded in Table 9.
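The per-image scaling described above is simple arithmetic, sketched below. The per-pixel values are taken from Table 9; the function and variable names are hypothetical, and the list of resolutions is the one used in the EuroSAT tables.

```python
# Per-pixel MegaFLOPs for Presto, as listed in Table 9.
PER_PIXEL_MFLOPS = {"RGB": 0.79, "MS": 2.37}


def presto_image_mflops(inputs, resolution):
    """MegaFLOPs to encode a resolution x resolution image pixel by pixel."""
    n_pixels = resolution * resolution
    return PER_PIXEL_MFLOPS[inputs] * n_pixels


# Cost of encoding an RGB image at each EuroSAT input resolution.
rgb_costs = {res: presto_image_mflops("RGB", res) for res in (2, 4, 8, 16, 32, 64)}
full_res_rgb = presto_image_mflops("RGB", 64)
full_res_ms = presto_image_mflops("MS", 64)
```

At resolution 64 this reproduces the bracketed values in Table 9 (3,235.84 and 9,707.52 MegaFLOPs). Unlike the ViT-based models, whose inputs are always resized to $224\times 224$, the cost grows quadratically with the input resolution.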

A.3 Baselines

In addition to task-specific baselines, we benchmark Presto against:

Table 10: Full results for regression tasks from Table 5, including standard error computed from three runs.

Model Fuel Moisture Algae Blooms Mean difference
Linear Regression 28.20 0.850 0%
Random Forest $23.84 \pm 0.42$ $1.249 \pm 0.02$ 15.7%
MOSAIKS-1D$_{RF}$ $28.75 \pm 0.15$ $0.972 \pm 0.01$ 8.15%
Presto$_{FT}$ (random init.) $26.07 \pm 0.52$ $0.955 \pm 0.05$ 2.40%
Presto$_{FT}$ $25.28 \pm 0.30$ $0.815 \pm 0.03$ $\mathbf{-7.24\%}$
Presto$_{RF}$ $25.98 \pm 0.66$ $0.884 \pm 0.01$ $-1.94\%$
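The “Mean difference” column is not defined in this section. The sketch below shows one reading that reproduces the tabulated values: the relative change of each task's error with respect to the linear regression baseline, averaged over the two tasks. This interpretation and the names used are assumptions, checked only against the numbers in Table 10.

```python
# Linear regression baseline errors from Table 10.
BASELINE = {"fuel_moisture": 28.20, "algae_blooms": 0.850}


def mean_relative_difference(errors):
    """Average percentage change in error relative to the baseline.

    Negative values mean the model has lower error than linear regression.
    """
    diffs = [(errors[task] - BASELINE[task]) / BASELINE[task] for task in BASELINE]
    return 100 * sum(diffs) / len(diffs)


presto_ft = mean_relative_difference({"fuel_moisture": 25.28, "algae_blooms": 0.815})
random_forest = mean_relative_difference({"fuel_moisture": 23.84, "algae_blooms": 1.249})
```

Under this reading, the random forest's large improvement on fuel moisture is outweighed by its much higher error on algae blooms, which is why its mean difference is positive.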

A.4 Downstream Results

We include complete results for the evaluation tasks. These include error bars, as well as additional results reported for the CropHarvest (Table 12 and Figure 3), regression (Table 10), EuroSAT (Tables 11, 13 and 14), TreeSatAI (Table 15) and S2-Agri$_{100}$ (Table 16) datasets.

We run all downstream classifiers with 3 seeds ($0, 42, 84$), with the exception of the kNN classifiers and the linear regression (which are deterministic). In the tables in the main paper (Tables 3.2, 5.1, 5.3.1 and 5) we report the average of these runs; the standard error is reported in Tables 12, 15, 16 and 10.
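As a reminder of how a mean and standard error over three seeded runs are typically obtained, here is a minimal sketch. It assumes the sample standard deviation (with $n-1$ in the denominator) divided by $\sqrt{n}$, which the paper does not specify, and the three scores are made-up placeholders rather than results from any table.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return the mean and the standard error of the mean.

    Assumption: standard error = sample standard deviation / sqrt(n).
    """
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error


# Hypothetical scores from three seeds (placeholders, not reported results).
scores = [0.84, 0.85, 0.86]
mean_score, se_score = mean_and_standard_error(scores)
```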

As discussed in Section 5.2, we obtain per-image predictions using Presto by computing a mean and standard deviation of Presto’s output pixels, and passing a concatenation of these two vectors to a downstream classifier. This is illustrated in Figure 4.
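The aggregation step can be sketched in a few lines. This is an illustration under stated assumptions, not the released implementation: the function name is hypothetical, the population standard deviation is used (the paper does not say which variant is applied), and the tiny two-pixel input stands in for real Presto embeddings.

```python
import statistics


def aggregate_pixel_embeddings(embeddings):
    """Turn per-pixel embeddings into one per-image feature vector.

    embeddings: list of equal-length vectors, one per pixel.
    Returns the element-wise mean concatenated with the element-wise
    standard deviation, so the output is twice the embedding size.
    """
    per_dimension = list(zip(*embeddings))  # transpose: one tuple per dimension
    means = [statistics.fmean(dim) for dim in per_dimension]
    stds = [statistics.pstdev(dim) for dim in per_dimension]
    return means + stds


# Toy example: two pixels with 2-dimensional embeddings.
image_vector = aggregate_pixel_embeddings([[1.0, 2.0], [3.0, 2.0]])
```

The resulting vector discards where in the image each pixel was located, which is the loss of spatial information discussed in Section A.6.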

A.5 Disentangling the effect of pre-training

To understand the effect of pre-training Presto, we fine-tune Presto and train it from scratch on EuroSat (Table 5.2.1), the regression tasks (Table 5 in the main paper) and TreeSatAI (Table 15). We omit the CropHarvest dataset because it was expressly designed as a few-shot-learning dataset. Its small size makes the construction of validation sets with which to control the finetuning (e.g. with early stopping) challenging.

Overall, we find a consistent and significant improvement from the use of pre-trained Presto compared to a randomly initialized version of the model. For the EuroSat task, pre-training consistently delivers an increase in accuracy score $>0.1$ (representing increases in accuracy of up to 25%). This effect is consistent with what we observe on the TreeSatAI dataset for S2 data and on the regression tasks (where pre-training reduces RMSE by up to 15% on the algae blooms task). For the TreeSatAI dataset with S1 data, pre-training penalizes the model compared to random initialization; we hypothesize that this is due to the difference in input (a single timestep and single channel group image) relative to the pre-training data. The benefit of pre-training is especially pronounced on the S2-Agri$_{100}$ dataset; we hypothesize this is due to the small training set size.

Table 11: Accuracy results for pre-trained and from-scratch Presto when fine-tuned on EuroSat, at varying resolutions. We hypothesize that the drop in performance for the full resolution (64) RGB input is due to the model construction; the model outputs for all pixels in the image (4,096 pixels for the full resolution) are aggregated and passed to a linear layer for classification, yielding a noisy gradient signal.

Resolution 2 4 8 16 32 64
random init. RGB $0.703 \pm 0.005$ $0.684 \pm 0.032$ $0.694 \pm 0.013$ $0.739 \pm 0.004$ $0.750 \pm 0.018$ $0.745 \pm 0.009$
pre-trained RGB $0.792 \pm 0.010$ $0.837 \pm 0.006$ $0.847 \pm 0.016$ $0.865 \pm 0.006$ $0.872 \pm 0.002$ $0.849 \pm 0.004$
random init. MS $0.837 \pm 0.014$ $0.884 \pm 0.010$ $0.895 \pm 0.006$ $0.907 \pm 0.13$ $0.924 \pm 0.005$ $0.924 \pm 0.003$
pre-trained MS $0.898 \pm 0.005$ $0.925 \pm 0.004$ $0.939 \pm 0.000$ $0.950 \pm 0.002$ $0.958 \pm 0.001$ $0.953 \pm 0.004$

Table 12: Additional results for the CropHarvest task. In addition to the F1 scores reported in the main paper, we report AUC ROC scores, with standard error bars computed with three runs.

Metric Model Kenya Brazil Togo Mean
F1 Random Forest $0.559 \pm 0.003$ $0.000 \pm 0.000$ $0.756 \pm 0.002$ 0.441
F1 MOSAIKS-1D$_R$ $0.790 \pm 0.027$ $0.746 \pm 0.084$ $0.679 \pm 0.024$ 0.738
F1 TIML $0.838 \pm 0.000$ $0.835 \pm 0.012$ $0.732 \pm 0.002$ 0.802
F1 Presto$_R$ $0.816 \pm 0.000$ $0.891 \pm 0.000$ $0.798 \pm 0.000$ 0.835
F1 Presto$_R$ (no DW) $\mathbf{0.861 \pm 0.000}$ $0.888 \pm 0.000$ $0.760 \pm 0.000$ 0.836
AUC ROC Random Forest $0.578 \pm 0.006$ $0.941 \pm 0.004$ $0.892 \pm 0.001$ 0.803
AUC ROC MOSAIKS-1D$_R$ $0.693 \pm 0.036$ $0.890 \pm 0.038$ $0.836 \pm 0.005$ 0.806
AUC ROC TIML $0.794 \pm 0.003$ $0.988 \pm 0.001$ $0.890 \pm 0.000$ 0.890
AUC ROC Presto$_R$ $0.834 \pm 0.000$ $0.997 \pm 0.000$ $0.921 \pm 0.000$ 0.917
AUC ROC Presto$_R$ (no DW) $\mathbf{0.863 \pm 0.000}$ $0.989 \pm 0.000$ $0.912 \pm 0.000$ 0.921

Table 13: Additional results for the EuroSat task; results for the ScaleMAE, SatMAE and ConvMAE models are from [Reed et al., 2022]. We report kNN classifier results for different values of $k$, and at varying input resolutions.

Resolution 16 32 64
$k$ 5 20 100 5 20 100 5 20 100
SatMAE 0.729 0.727 0.695 0.871 0.876 0.854 0.934 0.931 0.913
ScaleMAE 0.751 0.744 0.699 0.912 0.901 0.869 0.960 0.956 0.935
ConvMAE 0.835 0.826 0.788 0.909 0.898 0.863 0.947 0.940 0.914
Presto (RGB) 0.869 0.828 0.713 0.869 0.829 0.712 0.869 0.829 0.713
Presto (MS) 0.916 0.892 0.844 0.920 0.892 0.846 0.921 0.893 0.846

Table 14: Additional results for the EuroSat task for Presto when run with reduced resolutions (compared to those used by [Reed et al., 2022] and reported in Table 13). We report kNN classifier results for different values of $k$, and at varying input resolutions.

Resolution 2 4 8
$k$ 5 20 100 5 20 100 5 20 100
Presto (RGB) 0.843 0.811 0.699 0.860 0.820 0.706 0.869 0.826 0.710
Presto (MS) 0.873 0.852 0.799 0.895 0.874 0.824 0.911 0.886 0.838

Table 15: Additional results for the TreeSatAI dataset (as in [Ahlswede et al., 2023], we report precision and recall in addition to $F_1$ score and mAP). In addition, we report the results of finetuning Presto (Presto$_{FT}$) from the pre-trained weights and from a random initialization.

Model Data Aggregation $F_1$ mAP Precision Recall
MLP S1 Weighted 10.09 29.42 33.29 7.13
LightGBM S1 Weighted 11.86 32.79 37.96 8.06
Presto$_{FT}$ (random init.) S1 Weighted $40.36 \pm 0.77$ $39.77 \pm 0.79$ $30.69 \pm 0.82$ $64.69 \pm 1.09$
Presto$_{FT}$ S1 Weighted $38.69 \pm 0.78$ $37.41 \pm 0.58$ $30.09 \pm 0.74$ $61.20 \pm 0.85$
Presto$_{RF}$ S1 Weighted $38.34 \pm 0.07$ $35.45 \pm 0.03$ $29.67 \pm 0.07$ $57.23 \pm 0.06$
MLP S1 Micro 12.82 33.09 63.01 7.13
LightGBM S1 Micro 14.07 35.11 55.49 8.06
Presto$_{FT}$ (random init.) S1 Micro $42.04 \pm 0.73$ $43.00 \pm 0.80$ $31.20 \pm 1.00$ $64.69 \pm 1.09$
Presto$_{FT}$ S1 Micro $41.65 \pm 0.46$ $40.75 \pm 0.69$ $31.58 \pm 0.47$ $61.20 \pm 0.85$
Presto$_{RF}$ S1 Micro $40.79 \pm 0.04$ $38.64 \pm 0.02$ $31.69 \pm 0.03$ $57.23 \pm 0.06$
MLP S2 Weighted 51.97 64.19 74.59 42.23
LightGBM S2 Weighted 48.17 61.99 74.27 40.04
Presto$_{FT}$ (random init.) S2 Weighted $52.74 \pm 0.50$ $57.24 \pm 0.64$ $45.87 \pm 1.17$ $64.29 \pm 1.51$
Presto$_{FT}$ S2 Weighted $53.63 \pm 0.42$ $59.16 \pm 1.24$ $47.15 \pm 1.40$ $65.11 \pm 3.21$
Presto$_{RF}$ S2 Weighted $55.29 \pm 0.08$ $61.53 \pm 0.09$ $56.93 \pm 0.07$ $58.56 \pm 0.09$
MLP S2 Micro 54.49 65.83 77.18 42.23
LightGBM S2 Micro 52.52 61.66 76.27 40.04
Presto$_{FT}$ (random init.) S2 Micro $52.56 \pm 0.41$ $58.08 \pm 0.66$ $44.56 \pm 1.03$ $64.29 \pm 1.51$
Presto$_{FT}$ S2 Micro $53.31 \pm 0.18$ $59.77 \pm 1.13$ $45.51 \pm 1.46$ $65.11 \pm 3.21$
Presto$_{RF}$ S2 Micro $58.29 \pm 0.06$ $63.31 \pm 0.06$ $58.04 \pm 0.05$ $58.56 \pm 0.09$
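As a reading aid for Table 15, recall that the $F_1$ score is the harmonic mean of precision and recall. For micro-averaged metrics this identity holds on the aggregated counts, so the micro rows can be approximately cross-checked from their precision and recall columns (small deviations are expected from rounding and from averaging over runs). It does not hold for the weighted rows, where $F_1$ is averaged per class. The helper below is a generic sketch, not code from the paper.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Micro-averaged rows from Table 15 (values in percent).
mlp_s1_micro = f1_from_precision_recall(63.01, 7.13)       # table reports 12.82
presto_rf_s2_micro = f1_from_precision_recall(58.04, 58.56)  # table reports 58.29
```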

Table 16: Full results on the S2-Agri$_{100}$ dataset, including standard errors obtained from 3 runs. To obtain standard errors for the SITS-Former, we run the official code (https://github.com/linlei1214/SITS-Former) with 3 seeds. Best results are highlighted.

Model Params (M) Pre-trained? OA $\kappa$ $F_1$
SITS-Former 2.5 ✗ $65.13 \pm 3.01$ $0.55 \pm 0.03$ $42.12 \pm 0.52$
SITS-Former 2.5 ✓ $67.03 \pm 2.24$ $0.56 \pm 0.02$ $\mathbf{42.83} \pm 0.30$
Presto 0.4 ✗ $45.98 \pm 2.74$ $0.35 \pm 0.02$ $27.45 \pm 0.64$
Presto 0.4 ✓ $\mathbf{68.89} \pm 1.05$ $\mathbf{0.58} \pm 0.01$ $40.41 \pm 0.25$

A.6 Presto’s failure modes

Figure 7: Accuracy of kNN@5 classifier with Presto RGB representations on the EuroSat dataset vs. the input resolution, for different categories. Some categories have been left out for clarity.

Figure 8: The RGB bands of example images from EuroSat classes: (a) Forest, (b) Annual Crop, (c) Highway, (d) River.

Presto processes pixel-timeseries independently, without spatial context from other pixels or locations. This means that when we make image-based predictions (such as for scene classification), Presto’s independent pixel representations must be aggregated into a single prediction. We opt for a simple concatenation of the element-wise mean and standard deviation of the representations, from which a classifier makes a prediction. Information gets lost in such a simple aggregation, which impacts Presto’s performance on such tasks.

For example, Presto’s performance on the EuroSat dataset reaches a plateau when increasing the input resolution. As Figure 7 shows, this is mainly caused by a failure to accurately predict specific classes (for example, the Highway and River classes). Figure 8 shows example images for these classes, as well as for the Forest and AnnualCrop classes, on which Presto achieves higher accuracies. While in the Forest and AnnualCrop images, most pixels of the image actually represent the labelled class, in the Highway and River images only a relatively small part of the image actually contains the label (a highway or river). We hypothesize that since many pixels in the Highway and River images do not actually represent that class, the crude token-aggregation method we use to represent images is insufficiently discriminative to accurately classify these images.

Other pre-trained remote sensing models use much more powerful mechanisms for aggregating spatial information. For example, ViT models convolve over patches and then apply an attention mechanism between spatial patches. If image-based predictions are needed and these predictions are highly dependent on the occurrence of objects in subregions of the image, models which natively process this important spatial information may be better suited.

We plan on exploring techniques to mitigate this difficulty with Presto in future work.