Creating a new connector. (original) (raw)

Introduction

biodb is a framework designed to help you implement new connectors for databases. To illustrate this, we will show you a practical example where we create a connector for the ChEBI database. In this example, we will present you a small implementation of a _ChEBI_connector, and show you how to declare it to your biodb instance.

A more complete and functional connector for accessing ChEBI database is implemented in biodbChEBI library. See 1 for a list of the capabilities of this official biodb connector.

Generating a new extension package

When creating a new extension package, biodb can help you generate all the necessary files.

A call to genNewExtPkg() will generate the skeletons for the _biodb_connector class and the biodb entry class, along with the testthat files, the DESCRIPTION file, etc. A simplified call might look like this:

biodb::genNewExtPkg(path='biodbChebiEx', dbName='chebi.ex', connType='compound',
                    dbTitle='ChEBI connector example', entryType='xml', remote=TRUE)

See 2 for a brief description of the parameters. Other parameters exist for the author’s email, the author’s name, for generating a Makefile, or configuring for writing C++ code with Rcpp.

The files generated by the genNewExtPkg() function are the following ones:

list.files('biodbChebiEx', all.files=TRUE, recursive=TRUE)
##  [1] ".BBSoptions"                                    
##  [2] ".Rbuildignore"                                  
##  [3] ".gitignore"                                     
##  [4] "DESCRIPTION"                                    
##  [5] "LICENSE"                                        
##  [6] "R/ChebiExConn.R"                                
##  [7] "R/ChebiExEntry.R"                               
##  [8] "R/package.R"                                    
##  [9] "README.md"                                      
## [10] "biodb_ext.yml"                                  
## [11] "inst/definitions.yml"                           
## [12] "inst/testref/entry-chebi.ex-0001.json"          
## [13] "longtests/testthat.R"                           
## [14] "longtests/testthat/test_long_001_init_logging.R"
## [15] "longtests/testthat/test_long_100_generic.R"     
## [16] "longtests/testthat/test_long_200_example.R"     
## [17] "tests/testthat.R"                               
## [18] "tests/testthat/test_001_init_logging.R"         
## [19] "tests/testthat/test_050_fcts.R"                 
## [20] "tests/testthat/test_100_generic.R"              
## [21] "tests/testthat/test_200_example.R"              
## [22] "vignettes/biodbChebiEx.Rmd"

Inside the biodb_ext.yml file are stored the values of the parameters used with biodb::genNewExtPkg(). This is in case you want to upgrade some the generated files (.gitignore,.travis.yml, Makefile, etc) with newer versions from biodb package. You would then only need to call biodb::upgradeExtPkg(path='biodbChebiEx')and the biodb_ext.yml file would be read for parameter values.

The inst/definitions.yml file defines the new connector, we will fill in some values inside it. Then we need to write implementations for the methods in the connector classR/ChebiExConn.R. On the other side, R/ChebiExEntry.R, the entry class, needs no modification for our basic usage.

The test files in tests/testthat will be executed when running R CMD check, they need to be edited first though. Generic tests need to enabled inside tests/testthat/test_100_generic.R. The files tests/testthat/test_050_fcts.R andtests/testthat/test_200_example.R contain only examples, thus they need to be modified or removed.

The test files in tests/long will not be executed when running R CMD check. They can be run manually after installing the package locally, by callingR -e "testthat::test_dir('tests/long')".

A skeleton vignette has also been generated (vignettes/intro.Rmd), and should be completed with specific examples for this package.

Editing the generated skeleton

Starting from the skeleton files generated by genNewExtPkg(), we need now to fill in the blanks.

The first file to take care of is inst/definitions.yml, which contains the definition of the new connector.

Then we will look quickly at R/ChebiExEntry.R, which is rather empty in our case, and R/ChebiExConn.R, which requires much more attention, having several methods that need implementation.

The naming of the classes inside the R files is important. They must be named ChebiExEntry and ChebiExConn, in order to match the name defined inside inst/definitions.yml (chebi.ex). Hopefully the generator has taken care of this, and no special action is required on this aspect, except not modifying the names.

Editing the YAML definition of the new connector

The content of the generated YAML file inst/definitions.yml is as follow:

# biodb example definitions file for extensions packages, version 1.0.0

databases:
  chebi.ex:
    name: ChEBI connector example
    description: Write here the description of this database.
    compound.db: true
    entry.content.type: xml
    parsing.expr:
      accession: substring-after(//dbns:return/dbns:accessionId,'ACCESSION:')
      name:
      - //dbns:name
      - //dbns:synonyms/dbns:data
      mass: //dbns:mass
      monoisotopic.mass: //dbns:monoisotopicMass
      smiles: //dbns:return/dbns:smiles
      inchi: //dbns:return/dbns:inchi
      inchikey: //dbns:return/dbns:inchiKey
      formula:
      - //dbns:Formulae/dbns:source[text()='MyDatabase']/../dbns:data
      - (//dbns:Formulae/dbns:data)[1]
    xml.ns:
      dbns: https://my.database.org/webservices/v1
      xsd: http://www.w3.org/2001/XMLSchema
    searchable.fields:
      - name
      - monoisotopic.mass
      - molecular.mass
      - average.mass
      - nominal.mass
    remote: true
    # Length in seconds of the connection sliding window
    scheduler.t: 1
    # Number of connections allowed inside the connection sliding window
    scheduler.n: 3
    urls:
      # Base URL of the database server, where to find entry pages
      base.url: https://my.database.org/mydb/
      # Webservice URL to use to contact web services
      ws.url: https://my.database.org/webservices/mydb/3.2/
      # Add any other URL you need for the development of your connector
      # Inside your code, you can get each of these URLs with a call like the following one:
      #     .self$getPropValSlot('urls', 'ws.url')

fields:
  chebi.ex.id:
    description: ChEBI connector example ID
    case.insensitive: true
    forbids.duplicates: true
    type: id
    card: many

It is mainly filled with examples.

This YAML file contains two main parts: databases and fields. The databases part is where you list the new connectors you’ve created, and the fields part is where you define the new entry fields your new connectors need.

Fields definition

We just have one new field to define: chebi.ex.id. This is the accession field for our new connector. All connector accession fields are in the form <connector_class_id>.id. This accession field is mainly used inside other databases, when they make references to other databases. The field accession, which is used in all entries of biodb connectors, contains the same value as the connector accession field (chebi.ex.id in our case) and is preferable when accessing an entry. The definition of the new field is quite simple, See 3 for explanations of the different parameters.

Database definition

The main part is the declaration of the new connector. This is done in the databases section, under the key chebi.id, which is the database identifier. See 4 for explanations of the different parameters.

Final version of the YAML file

After setting some parsing expressions, the URLs and the searchable fields, we get a complete definition file, that you can find at:

defFile <- system.file("extdata", "chebi_ex.yml", package='biodb')

Its content is as follow:

databases:

  chebi.ex:
    name: ChEBI example connector
    description: An example connector for ChEBI.
    compound.db: true
    entry.content.encoding: UTF-8
    entry.content.type: xml
    parsing.expr:
      accession: substring-after(//chebi:return/chebi:chebiId,'CHEBI:')
      formula:
        - //chebi:Formulae/chebi:source[text()='ChEBI']/../chebi:data
        - (//chebi:Formulae/chebi:data)[1]
      inchi: //chebi:return/chebi:inchi
      inchikey: //chebi:return/chebi:inchiKey
      mass: //chebi:mass
      monoisotopic.mass: //chebi:monoisotopicMass
      name:
        - //chebi:chebiAsciiName
      smiles: //chebi:return/chebi:smiles
    searchable.fields:
      - name
      - monoisotopic.mass
      - molecular.mass
    remote: true
    scheduler.t: 1
    scheduler.n: 3
    urls:
      base.url: https://www.ebi.ac.uk/chebi/
      ws.url: https://www.ebi.ac.uk/webservices/chebi/2.0/
    xml.ns:
      chebi: https://www.ebi.ac.uk/webservices/chebi
      xsd: http://www.w3.org/2001/XMLSchema

fields:

  chebi.ex.id:
    description: ChEBI ID
    type: id
    card: many
    forbids.duplicates: true
    case.insensitive: true

The entry class

The entry class represents an entry from the database. Each instance of an entry contains the values parsed from the database downloaded content.

The entry class of our example extension package has been generated insideR/ChebiExEntry.R. Here is its content:

#' ChEBI connector example entry class.
#'
#' Entry class for ChEBI connector example. 
#'
#' @seealso
#' \code{\link{BiodbXmlEntry}}.
#'
#' @examples
#' # Create an instance with default settings:
#' mybiodb <- biodb::newInst()
#'
#' # Get a connector that inherits from ChebiExConn:
#' conn <- mybiodb$getFactory()$createConn('chebi.ex')
#'
#' # Get the first entry
#' e <- conn$getEntry(conn$getEntryIds(1L))
#'
#' # Terminate instance.
#' mybiodb$terminate()
#'
#' @import biodb
#' @import R6
#' @export
ChebiExEntry <- R6::R6Class("ChebiExEntry",
    inherit=
        biodb::BiodbXmlEntry
    ,

public=list(

initialize=function(...) {
    super$initialize(...)
}

,doCheckContent=function(content) {
    
    # You can do some more checks of the content here.
    
    return(TRUE)
}

,doParseFieldsStep2=function(parsed.content) {
    
    # TODO Implement your custom parsing processing here.
}
))

The class inherits from BiodbXmlEntry since we have set the entryTypeparameter to "xml". An entry class must inherit from the BiodbEntry class and define some methods. To simplify this step, several generic entry classes have been defined in_biodb_ (see 5), depending on the type of content downloaded from the database. To use one of these classes for your entry class, you only have to make your class inherit from the desired generic class.

Two methods are defined that can be used to enhance our implementation. The method doCheckContent() can be used to further check the parsed content of an entry, for instance for some incoherence between fields. The method doParseFieldsStep2() allows to run some custom code for complex parsing of the entry’s content. This method is run after doParseFieldsStep1(), which is defined inside the mother class (here BiodbXmlEntry) and executes the parsing expression defined inside inst/definitions.yml.

Note: biodb uses R6 as OOP (Object Oriented Programming) model. Please see vignetteDetails on biodb, for more explanations.

The connector class

The generator has generated the full class, and thus has taken care of the inheritance part, as well as the declaration of the required methods. See 6 for a description of these methods. What is left to us is the implementation of those methods.

Here is the generated skeleton:

#' ChEBI connector example connector class.
#'
#' Connector class for ChEBI connector example.
#'
#' @seealso \code{\link{BiodbConn}}.
#'
#' @examples
#' # Create an instance with default settings:
#' mybiodb <- biodb::newInst()
#'
#' # Get a connector:
#' conn <- mybiodb$getFactory()$createConn('chebi.ex')
#'
#' # Get the first entry
#' e <- conn$getEntry(conn$getEntryIds(1L))
#'
#' # Terminate instance.
#' mybiodb$terminate()
#'
#' @import biodb
#' @import R6
#' @export
ChebiExConn <- R6::R6Class("ChebiExConn",
inherit=biodb::BiodbConn,

public=list(

initialize=function(...) {
    super$initialize(...)
}

,wsFind=function(name="", retfmt=c('plain', 'parsed', 'ids', 'request')) {
    # This is the implementation of a fictive web service called "find" that
    # search for entries by name.
    # Use it as an example for implementing your own web services.

    retfmt <- match.arg(retfmt)

    # Build request
    params <- list(name=name)
    url <- BiodbUrl$new(url=c(self$getPropValSlot('urls', 'ws.url'), 'find'),
                    params=params)
    request <- self$makeRequest(method='get', url=url)

    # Return request
    if (retfmt == 'request')
        return(request)

    # Send request
    # This the line that should be run for sending the request and getting the
    # results:
    #results <- self$getBiodb()$getRequestScheduler()$sendRequest(request)
    # Instead, for this example, we just generate the results of this fictive
    # web service:
    results <- paste('{"0001": {"name": "name1"},',
                     ' "0198": {"name": "name2"},',
                     ' "9834": {"name": "name3"}}')

    # Parse
    if (retfmt != 'plain') {
        
        # Parse JSON
        results <- jsonlite::fromJSON(results, simplifyDataFrame=FALSE)

        # Get IDs
        if (retfmt == 'ids')
            results <- names(results)
    }

    return(results)
}
),

private=list(

doGetNbEntries=function(count=FALSE) {

    # Replace the call below if you have a direct way (specific web service for
    # a remote database, provided method or information for a local database)
    # to count entries for your database.
    return(callSuper(count=count))
}


,doGetEntryIds=function(max.results=NA_integer_) {
    # Overrides super class' method.

    ids <- NA_character_
 
    # TODO Implement retrieval of accession numbers.
    
    return(ids)
}

,doSearchForEntries=function(fields=NULL, max.results=NA_integer_) {
    # Overrides super class' method.

    ids <- character()

    # TODO Implement search of entries by filtering on values of fields.
    
    return(ids)
}

,doGetEntryContentRequest=function(id, concatenate=TRUE) {

    # TODO Modify the code below to build the URLs to get the contents of the
    # entries.
    # Depending on the database, you may have to build one URL for each
    # individual entry or may be able to write just one or a few URL for all
    # entries to retrieve.
    u <- c(self$getPropValSlot('urls', 'base.url'), 'entries',
           paste(id, 'xml', sep='.'))
    url <- BiodbUrl$new(url=u)$toString()

    return(url)
}

,doGetEntryPageUrl=function(id) {

    # TODO Modify this code to build the individual URLs to the entry web pages
    fct <- function(x) {
        u <- c(self$getPropValSlot('urls', 'base.url'), 'entries', x)
        BiodbUrl$new(url=u)$toString()
    }

    return(vapply(id, fct, FUN.VALUE=''))
}

,doGetEntryImageUrl=function(id) {

    # TODO Modify this code to build the individual URLs to the entry images 
    fct <- function(x) {
        u <- c(self$getPropValSlot('urls', 'base.url'), 'images', x,
               'image.png')
        BiodbUrl$new(url=u)$toString()
    }

    return(vapply(id, fct, FUN.VALUE=''))
}
))

Inheritance

The connector class is responsible for the connection to the database. In our case, the database is a compound database.

Methods to implement

See the help inside R about BiodbConn for details about the parameters of those functions.

Remote connection methods

The remote methods are used for three different goals. First to build URLs that access the web site, to get the URL of an entry page (doGetEntryPageUrl()) or to get the URL of an entry picture (doGetEntryImageUrl()) like a molecule representation. Second to get a list of database entry identifiers (doGetEntryIds()). Third to Get the content of an entry (doGetEntryContentRequest()).

In our implementations of doGetEntryPageUrl(), doGetEntryImageUrl() anddoGetEntryContentRequest() (see below), you may notice the use of thegetPropValSlot() method to get some base URLs ("base.url", "ws.url"). These values are defined inside the connector YAML definition file that we will detail below. Also, in those methods, we use the BiobdUrl class to build the URLs.BiodbUrl handles the building of the URL parameters, as well as the encoding of special characters.

Method for searching for entries

The implemented method (doSearchForEntries()) is a generic method used to search for entries inside the database by name, mass, or any other field. For our example we have decided to implement only the search by name in order to keep the code as simple and short as possible. To see a full implementation of this method, look at the official biodb ChEBI connector at biodbChebi. Inside the method’s code you will see that the implementation of the call to the ChEBI web service API has been left to the dedicated methodwsGetLiteEntity().

Prototype to respect for web service methods

In biodb official implementations of remote connectors, the implementations of calls to web services are done in separate dedicated methods having in common some principles.

These principles are important, because they assure a uniformity between_biodb_ extension packages, allowing users to identify immediately a web service method and recognize the biodb generic parameters inside it.

Example of a web service method, taken from official biodb ChEBI extension package:

wsGetLiteEntity=function(search=NULL, search.category='ALL', stars='ALL',
                         max.results=10,
                         retfmt=c('plain', 'parsed', 'request', 'ids')) {
}

A web service method name must start with the prefix ws, which stands for_web service_, and be followed by the database API name of the web service written in Java style (i.e.: an uppercase letter for the start of each word and lowercase letters for the rest).

The first parameters of the method are the database web service parameters.

The last parameters (max.results and retfmt) are biodb specific.

max.results controls the maximum number of results wanted, and must have a default value (usually 10).

retfmt, which stands for return format, controls the format of the method’s returned value. The default value of retfmt is set to a vector and then processed inside the method with the match.arg() method. Thus the “real” default value is the first value of the vector, which must always be "plain". The set of possible values for retfmt is variable from one web service method to another. However some of the values are compulsory. See 7 for a full list of retfmt possible values officially accepted by biodb.

You may want to look into some of biodb implementations of connectors to official remote databases, and see how the calls to web services have been implemented in dedicated web service methods. See 8.

Implementation

Here is our implementation of the connector class:

ChebiExConn <- R6::R6Class("ChebiExConn",
inherit=biodb::BiodbConn,

public=list(

initialize=function(...) {
    super$initialize(...)
},

wsGetLiteEntity=function(search=NULL, search.category='ALL', stars='ALL',
                         max.results=10,
                         retfmt=c('plain', 'parsed', 'request', 'ids')) {

    # Check parameters
    chk::chk_string(search)
    chk::chk_in(search.category, self$getSearchCategories())
    chk::chk_number(max.results)
    chk::chk_gte(max.results, 0)
    chk::chk_in(stars, self$getStarsCategories())
    retfmt <- match.arg(retfmt)

    # Build request
    params <- c(search=search,
                searchCategory=search.category,
                maximumResults=max.results,
                starsCategory=stars)
    url <- c(self$getPropValSlot('urls', 'ws.url'), 'test/getLiteEntity')
    request <- self$makeRequest(method='get', url=BiodbUrl$new(url=url,
                                                                params=params),
                                 encoding='UTF-8')
    if (retfmt == 'request')
        return(request)

    # Send request
    results <- self$getBiodb()$getRequestScheduler()$sendRequest(request)

    # Parse
    if (retfmt != 'plain') {

        # Parse XML
        results <-  XML::xmlInternalTreeParse(results, asText=TRUE)

        if (retfmt == 'ids') {
            ns <- self$getPropertyValue('xml.ns')
            results <- XML::xpathSApply(results, "//chebi:chebiId",
                                        XML::xmlValue, namespaces=ns)
            results <- sub('CHEBI:', '', results)
            if (length(grep("^[0-9]+$", results)) != length(results))
                self$error("Impossible to parse XML to get entry IDs.")
        }
    }

    return(results)
}
),

private=list(
doSearchForEntries=function(fields=NULL, max.results=0) {

    ids <- character()

    if ( ! is.null(fields)) {

        # Search by name
        if ('name' %in% names(fields))
            ids <- self$wsGetLiteEntity(search=fields$name,
                                         search.category="ALL NAMES",
                                         max.results=0, retfmt='ids')
    }

    # Cut
    if (max.results > 0 && max.results < length(ids))
        ids <- ids[seq_len(max.results)]

    return(ids)
},

doGetEntryContentRequest=function(id, concatenate=TRUE) {

    url <- c(self$getPropValSlot('urls', 'ws.url'), 'test',
             'getCompleteEntity')

    urls <- vapply(id, function(x) BiodbUrl$new(url=url,
                                            params=list(chebiId=x))$toString(),
                   FUN.VALUE='')

    return(urls)
},

doGetEntryIds=function(max.results=NA_integer_) {
    return(NULL)
},

doGetEntryPageUrl=function(id) {
    # Overrides super class' method

    url <- c(self$getPropValSlot('urls', 'base.url'), 'searchId.do')

    fct <- function(x) {
        BiodbUrl$new(url=url, params=list(chebiId=x))$toString()
    }
    
    urls <- vapply(id, fct, FUN.VALUE='')

    return(urls)
},

doGetEntryImageUrl=function(id) {
    # Overrides super class' method

    url <- c(self$getPropValSlot('urls', 'base.url'), 'displayImage.do')

    fct <- function(x) {
        BiodbUrl$new(url=url, params=list(defaultImage='true', imageIndex=0,
                                      chebiId=x, dimensions=400))$toString()
    }
    
    urls <- vapply(id, fct, FUN.VALUE='')

    return(urls)
}
))

Here is our implementation of the entry class:

ChebiExEntry <- R6::R6Class("ChebiExEntry",
inherit=BiodbXmlEntry,

public=list(

initialize=function(...) {
    super$initialize(...)
}
),

private=list(
doCheck=function(content) {
    
    # You can do some more checks of the content here.
    
    return(TRUE)
}

,doParseFieldsStep2=function(parsed.content) {
    
    # TODO Implement your custom parsing processing here.
}

))

Using the new connector

To use the new connector, we first need to load the YAML definition file inside our biodb instance.

To start we create an instance of the BiodbMain class:

mybiodb <- biodb::newInst()
## INFO  [17:32:35.889] Loading definitions from package biodb version 1.16.0.

The loading of the definitions is done with a call to loadDefinitions():

mybiodb$loadDefinitions(defFile)

Now our biodb instance is aware of our new connector, and is ready to create instances of it.

To create an instance of our new connector class, we proceeds as usual in_biodb_, by calling createConn() on the factory instance, using our connector identifier:

conn <- mybiodb$getFactory()$createConn('chebi.ex')

Now we can retrieve a ChEBI entry from the remote database:

entry <- conn$getEntry('17001')
## INFO  [17:32:36.049] Create cache folder "/home/biocbuild/.cache/R/biodb/chebi.ex-0c5076ac2a43d16dbce503a44b09f649" for "chebi.ex-0c5076ac2a43d16dbce503a44b09f649".
entry$getFieldsAsDataframe()
##   accession   formula
## 1     17001 C9H13N5O4
##                                                                                                                     inchi
## 1 InChI=1S/C9H13N5O4/c10-9-13-7-5(8(18)14-9)12-3(1-11-7)6(17)4(16)2-15/h4,6,15-17H,1-2H2,(H4,10,11,13,14,18)/t4-,6+/m1/s1
##                      inchikey molecular.mass monoisotopic.mass
## 1 YQIFAMYNGGOTFB-XINAWCOVSA-N       255.2308          255.0967
##                   name                                      smiles chebi.ex.id
## 1 7,8-dihydroneopterin Nc1nc2NCC(=Nc2c(=O)[nH]1)[C@H](O)[C@H](O)CO       17001

Do not forget to terminate your biodb instance once you are done with it:

mybiodb$terminate()
## INFO  [17:32:38.108] Closing BiodbMain instance...
## INFO  [17:32:38.110] Connector "chebi.ex" deleted.

Other types of connectors and entries

We describe here the other types of connectors and entries that _biodb_provide. The generator that we have used to generate the package skeleton for chebi.excan also be used to generate skeleton for all the types described here.

Connector for a local database

With biodb we can also write a connector for a local database. As a matter of fact, all the connectors included in biodb base package are local connectors only: mass.csv.file, comp.csv.file and mass.sqlite. See 9 for a list of methods to implement when writing a local connector.

Connector for a mass spectra database

In the example above, we have implemented a compound database. Another type of database is a mass spectra database. The following connectors included in biodb package are mass spectra database connectors: mass.csv.file and mass.sqlite. See 10 for a list of methods to implement when writing a mass spectra database connector.

Connector for a downloadable database

Some database servers do not propose web services, or other connection to the database, but propose to download the whole database for local processing.

biodb offers the possibility to handle the connection to such database servers, by setting downloadable to TRUE inside the definition of the database connector.

See 11 for a list of methods to implement inside your connector when writing a downloadable database connector.

How to implement other types of entry classes

We have seen in the example how to parse XML entries by writing an entry class that inherits from the BiodbXmlEntry class. As stated before, biodb provides other types of abstract entry classes, that facilitate the parsing of diverse entry content formats. Here is a review of those formats.

HTML content

To parse HTML content, your entry class should inherit from BiodbHtmlEntry. The parsing expressions must be written in XPath language, as for XML content, but it uses a special parsing algorithm since HTML is less strict than XML and allows some “illegal” constructs.

Example of a parsing expression:

path: //input[@id='DATA']

JSON content

To parse JSON content, your entry class should inherit from BiodbJsonEntry. The parsing expressions are written in the form of lists of keys to follow as a path inside the JSON tree. Here is an example:

chrom.col.id:
- liquidChromatography
- columnCode

List content

If your connector gets entry contents directly as an R list object, like in the case of MassSqliteConn, you have interest in making your entry class inherit from BiodbListEntry abstract class. With this class, the entry content is provided as a flat named R list object, although it is also possible to pass a JSON string containing flat key/value pairs instead. The parsing expressions are the names used inside the list object. Here is an example:

accession: id
compound.id: comp_id
formula: chem_form

CSV content

The BiodbCsvEntry class helps you handle entry content in CSV (using comma separator or any other character) format. When declaring the constructor for your own entry class, do not forget to call the mother class constructor to pass it your separator and/or the string values that have to be converted to NA:

MyEntryClass <- R6::R6Class("MyEntryClass", inherit=biodb::BiodbCsvEntry,
    public=list(
        initialize=function() {
            super$initialize(sep=';', na.strings=c('', 'NA'))
        }
))

The parsing expressions are the column names of the CSV file:

accession: id
name: fullname

SDF content

If your entry content is in SDF (Structure Data File) chemical file format, make you entry class inherit from BiodbSdfEntry abstract class. Since the SDF format is an official standard format, the parsing expressions are useless in this case, your class only has to inherit from BiodbSdfEntry.

Text content

The BiodbTxtEntry abstract class allows you to handle any text file content for entries. Parsing expressions are defined as regular expressions, using thestringr package, hence in ICU Regular Expressionsformat.

Here is an example:

accession: ^ENTRY\s+(\S+)\s+Compound
exact.mass: ^EXACT_MASS\s+(\S+)$
formula: ^FORMULA\s+(\S+)$

Implementing your own parsing

If none of the predefined formats fits your needs, your class have to inherit directly from BiodbEntry.

Two methods have to be implemented in this case. The first is doParseContent(), which parses a string into the acceptable format for the second function, doParseFieldsStep1().

Look for instance at the code of BiodbTxtEntry class for a good example. Here is an excerpt:

doParseContent=function(content) {

    # Get lines of content
    lines <- strsplit(content, "\r?\n")[[1]]

    return(lines)
},

doParseFieldsStep1=function(parsed.content) {

    # Get parsing expressions
    parsing.expr <- .self$getParent()$getPropertyValue('parsing.expr')

    .self$.assertNotNull(parsed.content)
    .self$.assertNotNa(parsed.content)
    .self$.assertNotNull(parsing.expr)
    .self$.assertNotNa(parsing.expr)
    .self$.assertNotNull(names(parsing.expr))

    # Loop on all parsing expressions
    for (field in names(parsing.expr)) {

        # Match whole content 
        g <- stringr::str_match(parsed.content, parsing.expr[[field]])

        # Get positive results
        results <- g[ ! is.na(g[, 1]), , drop=FALSE]

        # Any match ?
        if (nrow(results) > 0)
            .self$setFieldValue(field, results[, 2])
    }
}

Extending the parsing of an existing class

When inheriting from one of the abstract class listed above (BiodbTxtEntry,BiodbJsonEntry, BiodbXmlEntry, …), you also have the opportunity to write some custom parsing code by implementing doParseFieldsStep2().

This method will be called just after doParseFieldsStep1(), which is implemented by the abstract class.

See HmdbMetabolitesEntry class inside biodbHmdb extension package for an example. Here is an extract:

doParseFieldsStep2=function(parsed.content) {

    # Remove fields with empty string
    for (f in .self$getFieldNames()) {
        v <- .self$getFieldValue(f)
        if (is.character(v) && ! is.na(v) && v == '')
            .self$removeField(f)
    }

    # Correct InChIKey
    if (.self$hasField('INCHIKEY')) {
        v <- sub('^InChIKey=', '', .self$getFieldValue('INCHIKEY'), perl=TRUE)
        .self$setFieldValue('INCHIKEY', v)
    }

    # Synonyms
    synonyms <- XML::xpathSApply(parsed.content, "//synonym", XML::xmlValue)
    if (length(synonyms) > 0)
        .self$appendFieldValue('name', synonyms)
}

Session information

sessionInfo()
## R version 4.5.0 RC (2025-04-04 r88126)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] biodb_1.16.0     BiocStyle_2.36.0
## 
## loaded via a namespace (and not attached):
##  [1] rappdirs_0.3.3       sass_0.4.10          generics_0.1.3      
##  [4] bitops_1.0-9         stringi_1.8.7        RSQLite_2.3.9       
##  [7] hms_1.1.3            digest_0.6.37        magrittr_2.0.3      
## [10] evaluate_1.0.3       bookdown_0.43        fastmap_1.2.0       
## [13] blob_1.2.4           plyr_1.8.9           jsonlite_2.0.0      
## [16] progress_1.2.3       DBI_1.2.3            BiocManager_1.30.25 
## [19] httr_1.4.7           XML_3.99-0.18        jquerylib_0.1.4     
## [22] cli_3.6.4            rlang_1.1.6          chk_0.10.0          
## [25] crayon_1.5.3         dbplyr_2.5.0         bit64_4.6.0-1       
## [28] withr_3.0.2          cachem_1.1.0         yaml_2.3.10         
## [31] tools_4.5.0          memoise_2.0.1        dplyr_1.1.4         
## [34] filelock_1.0.3       curl_6.2.2           vctrs_0.6.5         
## [37] R6_2.6.1             BiocFileCache_2.16.0 lifecycle_1.0.4     
## [40] stringr_1.5.1        bit_4.6.0            pkgconfig_2.0.3     
## [43] pillar_1.10.2        bslib_0.9.0          glue_1.8.0          
## [46] Rcpp_1.0.14          lgr_0.4.4            xfun_0.52           
## [49] tibble_3.2.1         tidyselect_1.2.1     knitr_1.50          
## [52] htmltools_0.5.8.1    rmarkdown_2.29       compiler_4.5.0      
## [55] prettyunits_1.2.0    askpass_1.2.1        RCurl_1.98-1.17     
## [58] openssl_2.3.2