GitHub - BigelowLab/threddscrawler: A generic THREDDS crawler for R (original) (raw)

THREDDS Crawler for R

Requirements

R >= 3.0

httr

XML

Installation

It is easy to install with devtools

library(devtools) install_github("btupper/threddscrawler")

Classes

TopCatalogRefClass for catalogs that are containers of CatalogRefClass pointers. This is like a listing of files and subdirectories in a directory, but here the files and subdirectories are all CatalogRefClass pointers.

CatalogRefClass is a pointer to TopCatalogRefClass

THREDDS dataset comes in two flavors: collections of datasets and direct datasets. I split these into DatasetsRefClass (collections) and DatasetRefClass (direct); the latter has an 'access' child node the former does not. A collection is a listing of one or more datasets (either direct or catalogs). A direct dataset is a pointer to an actual resource like a NetCDF file.

Example from NERACOOS

NERACOOS exposes data using a THREDDS server. This is an example that draws upon the MUR SST data subset prepared in 2015.

We start by examining the catalog Note that programmatically we access the companion XML file

We'll crawl these pages in succession...

Top

NorthEastShelf

DailyFiles

2010

library(threddscrawler)

start by getting the TopCatalog - picture TopCatalog as web page that list one or more catalogs.

Top <- get_catalog('http://www.neracoos.org/thredds/catalog/GMRI/SST/TESTS/NASA_MUR_SST/catalog.xml') Top

Reference Class: "TopCatalogRef"

verbose_mode: FALSE

url: http://www.neracoos.org/thredds/catalog/GMRI/SST/TESTS/NASA_MUR_SST/catalog.xml

children: service dataset

catalogs: NorthEastShelf GulfOfMaine

now get the catalogs embedded in the page. Note that these point to other TopCatalogs.

A <- Top$get_catalogs() A

$NorthEastShelf

Reference Class: "CatalogRefClass"

verbose_mode: FALSE

url: http://www.neracoos.org/thredds/catalog/GMRI/SST/TESTS/NASA_MUR_SST/NorthEastShelf/catalog.xml

children:

name:NorthEastShelf

href:NorthEastShelf/catalog.xml

title:NorthEastShelf

type:

ID:GMRI_TESTS/NASA_MUR_SST/NorthEastShelf

$GulfOfMaine

Reference Class: "CatalogRefClass"

verbose_mode: FALSE

url: http://www.neracoos.org/thredds/catalog/GMRI/SST/TESTS/NASA_MUR_SST/GulfOfMaine/catalog.xml

children:

name:GulfOfMaine

href:GulfOfMaine/catalog.xml

title:GulfOfMaine

type:

ID:GMRI_TESTS/NASA_MUR_SST/GulfOfMaine

now we get the catalogs in the NorthEastShelf

NES <- A[['NorthEastShelf']]$get_catalog() NES

Reference Class: "TopCatalogRef"

verbose_mode: FALSE

url: http://www.neracoos.org/thredds/catalog/GMRI/SST/TESTS/NASA_MUR_SST/NorthEastShelf/catalog.xml

children: service dataset

catalogs: MonthlyMeans MonthlyFiles DailyFiles AggregatedMeans

let's get the catalogs. I won't show them, but we'll get the TopCatalog for the 'DailyFiles'

B <- NES$get_catalogs() DAYS <- B[['DailyFiles']]$get_catalog()

now get 2010

C <- DAYS$get_catalogs() Y2010 <- C[['2010']]$get_catalog()

Now we are at "the bottom" of the search path and we find only a collection of datasets. Instead of requesting subsequent catalogs we can now request datasets.

days <- Y2010$get_datasets() head(days, n = 2)

$20101231-JPL-L4UHfnd-GLOB-v01-fv04-MUR_subset.nc

Reference Class: "DatasetsRefClass"

verbose_mode: FALSE

url: http://www.neracoos.org/thredds/catalog/GMRI/SST/TESTS/NASA_MUR_SST/NorthEastShelf/DailyFiles/2010/20101231-JPL-L4UHfnd-GLOB-v01-fv04-MUR_subset.nc

children: dataSize date

datasets: NA

$20101230-JPL-L4UHfnd-GLOB-v01-fv04-MUR_subset.nc

Reference Class: "DatasetsRefClass"

verbose_mode: FALSE

url: http://www.neracoos.org/thredds/catalog/GMRI/SST/TESTS/NASA_MUR_SST/NorthEastShelf/DailyFiles/2010/20101230-JPL-L4UHfnd-GLOB-v01-fv04-MUR_subset.nc

children: dataSize date

datasets: NA

Note that the 'datasets' attribute is NA - that tells us that have a real data source, not a catalog of data sources.