Using ncdfCF

What is netCDF

“NetCDF (Network Common Data Form) is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. It is also a community standard for sharing scientific data.”

NetCDF is developed by UCAR/Unidata and is widely used for climate and weather data as well as for other environmental data sets. The netcdf library is ported to a wide variety of operating systems and platforms, from laptop computers to large mainframes. Data sets are typically large arrays with axes for longitude, latitude and time, with other axes, such as depth, added according to the nature of the data. Other types of data are also commonly found.

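Importantly, “a netCDF file includes information about the data it contains”. This comes in two flavours: structural metadata, which describe the dimensions and variables and how they fit together, and descriptive metadata, which are stored as attributes of the dimensions, the variables and the data set itself.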

Both types of metadata are necessary to “understand” the netCDF resource.

Conventions

The descriptive metadata are not defined by the netcdf library. To ensure interoperability, several “conventions” have been developed over the years such that users of netCDF data can correctly interpret what data developers have put in the resource. The most important of these is the CF Metadata Conventions, which define a large number of standards that help interpret netCDF resources.

Other common conventions are related to climate prediction data, such as CMIP-5 and CMIP-6.

Using netCDF resources in R

Basic access

The RNetCDF package is developed and maintained by the same team that developed and maintains the netcdf library. It provides an interface to the netcdf library that stays very close to the API of the C library. As a result, it lacks an intuitive user experience and workflow that R users would be familiar with.

Package ncdf4, the most widely used package to access netCDF resources, does one better by performing the tedious task of reading the structural metadata from the resource that is needed for a basic understanding of the contents, such as dimension and variable details, but the library API concept remains with functions that fairly directly map to the netcdf library functions.

One would really need to understand the netCDF data model and implementation details to effectively use these packages. For instance, most data describing a dimension is stored as a variable. So to read the dimnames() of a dimension you’d have to call var.get.nc() or ncvar_get(). Neither package loads the attributes of the dimensions, variables and the data set (“global” attributes), which are essential to understand what the dimensions and variables represent.

While both packages are very good at what they do, it is clearly not enough.
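To make the low-level workflow concrete, here is a minimal sketch using RNetCDF. It is an illustration only: the tiny file, its single longitude dimension and the values in it are made up for this example so that it is self-contained. The point is that dimension details, coordinate values and attributes each need a separate, explicit call.

```r
library(RNetCDF)

# Build a tiny throw-away netCDF file (made-up content, for illustration only)
tmp <- tempfile(fileext = ".nc")
nc <- create.nc(tmp)
dim.def.nc(nc, "longitude", 3)
var.def.nc(nc, "longitude", "NC_DOUBLE", "longitude")
att.put.nc(nc, "longitude", "units", "NC_CHAR", "degrees_east")
var.put.nc(nc, "longitude", c(28, 28.1, 28.2))
close.nc(nc)

# Low-level access: every piece of information is a separate call
nc <- open.nc(tmp)
dim_info  <- dim.inq.nc(nc, "longitude")           # structural metadata
lon       <- var.get.nc(nc, "longitude")           # coordinate values live in a variable
lon_units <- att.get.nc(nc, "longitude", "units")  # attributes must be requested by name
close.nc(nc)
```

With ncdf4 the equivalent calls would be nc_open(), ncvar_get() and ncatt_get(); the pattern of stitching the pieces together by hand is the same.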

Extending the base packages

Several packages have been developed to address some of these issues and make access to the data easier. Unfortunately, none of these packages provide a comprehensive R-style solution to accessing and interpreting netCDF resources in an intuitive way.

ncdfCF

Package ncdfCF provides a high-level interface using functions and methods that are familiar to the R user. It reads the structural metadata and also the attributes upon opening the resource. In the process, the ncdfCF package also applies CF Metadata Conventions to interpret the data. This currently applies to:

Basic usage

Opening and inspecting the contents of a netCDF resource is very straightforward:

library(ncdfCF)
  
# Get a netCDF file, here hourly data for 2016-01-01 over Rwanda
fn <- system.file("extdata", "ERA5land_Rwanda_20160101.nc", package = "ncdfCF")
  
# Open the file, all metadata is read
ds <- open_ncdf(fn)
  
# Easy access in understandable format to all the details
ds
#> <Dataset> ERA5land_Rwanda_20160101 
#> Resource   : /private/var/folders/gs/s0mmlczn4l7bjbmwfrrhjlt80000gn/T/Rtmp1HGA78/Rinst14bca5d8b0ccb/ncdfCF/extdata/ERA5land_Rwanda_20160101.nc 
#> Format     : offset64 
#> Conventions: CF-1.6 
#> Keep open  : FALSE 
#> 
#> Variables:
#>  name long_name             units data_type axes                     
#>  t2m  2 metre temperature   K     NC_SHORT  longitude, latitude, time
#>  pev  Potential evaporation m     NC_SHORT  longitude, latitude, time
#>  tp   Total precipitation   m     NC_SHORT  longitude, latitude, time
#> 
#> Axes:
#>  id axis name      length unlim values                                       
#>  0  T    time      24     U     [2016-01-01 00:00:00 ... 2016-01-01 23:00:00]
#>  1  X    longitude 31           [28 ... 31]                                  
#>  2  Y    latitude  21           [-1 ... -3]                                  
#>  unit                             
#>  hours since 1900-01-01 00:00:00.0
#>  degrees_east                     
#>  degrees_north                    
#> 
#> Attributes:
#>  id name        type    length
#>  0  CDI         NC_CHAR  64   
#>  1  Conventions NC_CHAR   6   
#>  2  history     NC_CHAR 482   
#>  3  CDO         NC_CHAR  64   
#>  value                                             
#>  Climate Data Interface version 2.4.1 (https://m...
#>  CF-1.6                                            
#>  Tue May 28 18:39:12 2024: cdo seldate,2016-01-0...
#>  Climate Data Operators version 2.4.1 (https://m...
  
# Variables can be accessed through standard list-type extraction syntax
t2m <- ds[["t2m"]]
t2m
#> <Variable> t2m 
#> Long name: 2 metre temperature 
#> 
#> Axes:
#>  id axis name      length unlim values                                       
#>  1  X    longitude 31           [28 ... 31]                                  
#>  2  Y    latitude  21           [-1 ... -3]                                  
#>  0  T    time      24     U     [2016-01-01 00:00:00 ... 2016-01-01 23:00:00]
#>  unit                             
#>  degrees_east                     
#>  degrees_north                    
#>  hours since 1900-01-01 00:00:00.0
#> 
#> Attributes:
#>  id name          type      length value              
#>  0  long_name     NC_CHAR   19     2 metre temperature
#>  1  units         NC_CHAR    1     K                  
#>  2  add_offset    NC_DOUBLE  1     292.664569285614   
#>  3  scale_factor  NC_DOUBLE  1     0.00045127252204996
#>  4  _FillValue    NC_SHORT   1     -32767             
#>  5  missing_value NC_SHORT   1     -32767
  
# Same with dimensions, but now without first assigning the object to a symbol
ds[["longitude"]]
#> <Longitude axis> [1] longitude
#> Length   : 31
#> Axis     : X 
#> Values   : 28, 28.1, 28.2 ... 30.8, 30.9, 31 degrees_east
#> Bounds   : (not set)
#> 
#> Attributes:
#>  id name          type    length value       
#>  0  standard_name NC_CHAR  9     longitude   
#>  1  long_name     NC_CHAR  9     longitude   
#>  2  units         NC_CHAR 12     degrees_east
#>  3  axis          NC_CHAR  1     X
  
# Regular base R operations simplify life further
dimnames(ds[["pev"]]) # A variable: list of dimension names
#> [1] "longitude" "latitude"  "time"
  
dimnames(ds[["longitude"]]) # A dimension: vector of dimension element values
#>  [1] 28.0 28.1 28.2 28.3 28.4 28.5 28.6 28.7 28.8 28.9 29.0 29.1 29.2 29.3 29.4
#> [16] 29.5 29.6 29.7 29.8 29.9 30.0 30.1 30.2 30.3 30.4 30.5 30.6 30.7 30.8 30.9
#> [31] 31.0
  
# Access attributes
ds[["pev"]]$attribute("long_name")
#> [1] "Potential evaporation"

In the last command you may have noticed the list-like syntax with the $ operator. The base objects in the package are based on the R6 object-oriented model. R6 is a light-weight but powerful and efficient framework to build object models. Access to the public fields and functions is provided through the $ operator. Common base R operators and functions, such as shown above, are supported to facilitate integration of ncdfCF in frameworks built on base R or S3.
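For readers new to R6, the following sketch shows the general pattern of combining `$` access with a base R function. The ToyAxis class, its fields and its range_text() method are hypothetical names invented for this illustration; this is not how ncdfCF itself is implemented.

```r
library(R6)

# Hypothetical toy class, only to illustrate the R6 pattern
ToyAxis <- R6Class("ToyAxis",
  public = list(
    name = NULL,
    values = NULL,
    initialize = function(name, values) {
      self$name <- name
      self$values <- values
    },
    # A public method, called as obj$range_text()
    range_text = function() {
      paste0("[", min(self$values), " ... ", max(self$values), "]")
    }
  )
)

# An S3 method on the R6 class lets a base R function work on the object
dimnames.ToyAxis <- function(x) x$values

lon <- ToyAxis$new("longitude", c(28, 29.5, 31))
lon$name           # public field through `$`
lon$range_text()   # public method through `$`
dimnames(lon)      # base R function dispatching on the R6 class
```

Because R6 objects carry a regular S3 class attribute, S3 dispatch works on them, which is what makes this kind of base R integration possible.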

Working with the data

The data() and subset() functions return data from a variable in a CFData instance. The CFData instance holds the actual data, as well as important metadata of the data, including its axes, the coordinate reference system, and the attributes, among others. The CFData instance also lets you manipulate the data in a way that is informed by the metadata. This overcomes a typical issue when working with netCDF data that adheres to the CF Metadata Conventions.

The ordering of the axes in a typical netCDF resource is different from the way that R orders its data. That leads to surprising results if you are not aware of this issue:

# Open a file and read the data from a variable into a CFData instance
fn <- system.file("extdata", "tasmax_NAM-44_day_20410701-vncdfCF.nc", package = "ncdfCF")
ds <- open_ncdf(fn)
tx <- ds[["tasmax"]]$data()
tx
#> <Data> tasmax 
#> Long name: Daily Maximum Near-Surface Air Temperature 
#> 
#> Values: [263.4697 ... 313.2861] K
#>     NA: 0 (0.0%)
#> 
#> Axes:
#>  id axis name   long_name                        length unlim
#>  2  X    x      x-coordinate in Cartesian system 148         
#>  3  Y    y      y-coordinate in Cartesian system 140         
#>  0  T    time                                      1    U    
#>     Z    height                                    1         
#>  values                unit                         
#>  [0 ... 7350000]       m                            
#>  [0 ... 6950000]       m                            
#>  [2041-07-01 12:00:00] days since 1949-12-1 00:00:00
#>  [2]                   m                            
#> 
#> Attributes:
#>  id name          type     length value                                     
#>   0 standard_name NC_CHAR  15     air_temperature                           
#>   1 long_name     NC_CHAR  42     Daily Maximum Near-Surface Air Temperature
#>   2 units         NC_CHAR   1     K                                         
#>   3 grid_mapping  NC_CHAR  17     Lambert_Conformal                         
#>   5 _FillValue    NC_FLOAT  1     1.00000002004088e+20                      
#>   6 missing_value NC_FLOAT  1     1.00000002004088e+20                      
#>   7 original_name NC_CHAR  11     TEMP at 2 M                               
#>   8 cell_methods  NC_CHAR  13     time: maximum                             
#>   9 FieldType     NC_INT    1     104                                       
#>  10 MemoryOrder   NC_CHAR   3     XY

# Use the terra package for plotting
# install.packages("terra")
library(terra)
#> terra 1.7.78

# Get the data in exactly the way it is stored in the file, using `raw()`
tx_raw <- tx$raw()
str(tx_raw)
#>  num [1:148, 1:140, 1] 301 301 301 301 301 ...
#>  - attr(*, "dimnames")=List of 3
#>   ..$ x   : chr [1:148] "0" "50000" "1e+05" "150000" ...
#>   ..$ y   : chr [1:140] "0" "50000" "1e+05" "150000" ...
#>   ..$ time: chr "2041-07-01 12:00:00"

# Plot the data
r <- terra::rast(tx_raw)
r
#> class       : SpatRaster 
#> dimensions  : 148, 140, 1  (nrow, ncol, nlyr)
#> resolution  : 1, 1  (x, y)
#> extent      : 0, 140, 0, 148  (xmin, xmax, ymin, ymax)
#> coord. ref. :  
#> source(s)   : memory
#> name        :    lyr.1 
#> min value   : 263.4697 
#> max value   : 313.2861
plot(r)

North America is lying on its side. This is because the data is stored differently in the netCDF resource than R expects. There is, in fact, no single way of storing data in a netCDF resource: the dimensions may be stored in any order. The CF Metadata Conventions add metadata to interpret the file storage. The array() method uses that to produce an array in the familiar R storage arrangement:

tx_array <- tx$array()
str(tx_array)
#>  num [1:140, 1:148, 1] 277 277 277 277 277 ...
#>  - attr(*, "dimnames")=List of 3
#>   ..$ y   : chr [1:140] "6950000" "6900000" "6850000" "6800000" ...
#>   ..$ x   : chr [1:148] "0" "50000" "1e+05" "150000" ...
#>   ..$ time: chr "2041-07-01 12:00:00"
r <- terra::rast(tx_array)
terra::plot(r)

Ok, so now we got North America looking pretty ok again. The data has been oriented in the right way. Behind the scenes that may have involved transposing and flipping the data, depending on the data storage arrangement in the netCDF resource.
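The transposing and flipping can be sketched in base R on a toy array. This is only an illustration of the idea, assuming data stored as [x, y, time] with y coordinates increasing; it is not the code that array() runs internally, and the toy values are made up.

```r
# Toy data stored as [x, y, time], y increasing from south to north
x <- c(0, 50000, 100000)
y <- c(0, 50000)
stored <- array(1:6, dim = c(3, 2, 1),
                dimnames = list(x = as.character(x),
                                y = as.character(y),
                                time = "2041-07-01 12:00:00"))

# Transpose the first two dimensions so rows are y and columns are x
transposed <- aperm(stored, c(2, 1, 3))

# Flip the rows so the northernmost row comes first, as R matrices
# are conventionally drawn from the top down
oriented <- transposed[rev(seq_along(y)), , , drop = FALSE]

dim(oriented)
oriented[, , 1]
```

The first row of the result now holds the values for the largest y coordinate, which is the arrangement a plotting function expects for a north-up map.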

But the coordinate system is still not right. These are just ordinal values along both axes. The terra::SpatRaster object also does not show a CRS. All of the above issues can be fixed by simply calling the terra() method on the data object. This will return a terra::SpatRaster for a data object with three axes and a terra::SpatRasterDataset for a data object with four axes, including scalar axes if present:

r <- tx$terra()
r
#> class       : SpatRaster 
#> dimensions  : 140, 148, 1  (nrow, ncol, nlyr)
#> resolution  : 50000, 50000  (x, y)
#> extent      : -25000, 7375000, -25000, 6975000  (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=lcc +lat_0=46.0000038146973 +lon_0=-97 +lat_1=35 +lat_2=60 +x_0=3675000 +y_0=3475000 +datum=WGS84 +units=m +no_defs 
#> source(s)   : memory
#> name        : 2041-07-01 12:00:00 
#> min value   :            263.4697 
#> max value   :            313.2861
terra::plot(r)

So that’s a fully specified terra::SpatRaster from netCDF data.
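The extent reported above can be reproduced from the axis values with a little arithmetic. The sketch below assumes that the x and y coordinates are regularly spaced cell centres, so the raster edges lie half a cell beyond the first and last coordinate; the coordinate ranges and lengths are taken from the axis listing of the tasmax data shown earlier.

```r
# Cell-centre coordinates as listed for the x and y axes
x <- seq(0, 7350000, by = 50000)   # 148 values
y <- seq(0, 6950000, by = 50000)   # 140 values

# Half the cell size in each direction (assumes regular spacing)
half_x <- diff(x)[1] / 2
half_y <- diff(y)[1] / 2

# Raster edges lie half a cell outside the outermost cell centres
ext <- c(xmin = min(x) - half_x, xmax = max(x) + half_x,
         ymin = min(y) - half_y, ymax = max(y) + half_y)
ext
```

This gives -25000, 7375000, -25000 and 6975000, matching the extent of the SpatRaster returned by terra().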

(Disclaimer: Package terra can do this too with simply terra::rast(fn) and then selecting a layer to plot (which is not always trivial if you are looking for a specific layer; e.g. what does “lyr.1” represent?). The whole point of the above examples is to demonstrate the different steps in processing netCDF data. There are also some subtle differences such as the names of the layers. Furthermore, ncdfCF doesn’t insert the attributes of the variable into the SpatRaster. terra can only handle netCDF resources that “behave” properly (especially the axis order) and it has no particular consideration for the different calendars that can be used with CF data.)