BioAssay Ontology (BAO): a semantic description of bioassays and high-throughput screening results

Main concepts of the BioAssay Ontology and curation of PubChem assays

BAO describes biological screening assays, in which the perturbation of a biological system or a component thereof (relative to a reference state) by a perturbagen is detected and in many cases quantified. An example of a simple assay is the inhibition of an enzyme by a small molecule, which is detectable by quantifying the product of the enzymatic reaction. For example, inhibition of a kinase could be detected via an antibody specific to the phosphorylated substrate (a kinase catalyzes the phosphorylation of a substrate by ATP). In one assay design, the antibody is linked to a fluorescence resonance energy transfer (FRET) donor and the (kinase) substrate to a FRET acceptor. A fluorescence signal of the FRET acceptor is only generated if donor and acceptor are in proximity, i.e. if the substrate is phosphorylated. If the kinase is inhibited by a small molecule perturbagen, the signal decreases. An implementation as a homogeneous time-resolved FRET (HTRF) assay is applicable to high-throughput screening (HTS). Countless sophisticated biological screening assays have been developed to interrogate simple to complex biological systems.

With BAO we aim to develop an open standard for the description of HTS and microscopy-based high-content screening (HCS) assays and data for the purpose of classification and analysis. To describe biological screening experiments such as those deposited in PubChem, we first identified the main categories that need to be captured in order to meaningfully compare data from different biological screening experiments. These components are perturbagen, format, design, detection technology, meta target, and endpoint, and they are described here:

Perturbagen

Assay "perturbagen" refers to the agent that directly interacts with or indirectly affects the meta target of a bioassay. PubChem assays predominantly have small molecules as perturbagens; however, the concept perturbagen in BAO includes various other perturbing agents, including nucleic acids (e.g. siRNA, cDNA), lipids, or proteins. Perturbagen specifications include the perturbagen source and details on its delivery.

Assay Format

The assay "format" is a higher-level assay category that relates to the biological and chemical features that are common to each test condition in the assay. Assay format includes several broad categories. "Biochemical format" describes assays that are performed with a purified protein, such as the example above. "Cell-based format" relates to assays that are performed with living cells. "Organism-based format" refers to assays with a living organism. Other common formats include "cell-free format", "tissue-based format", and "physicochemical format". Additional format specifications are captured that describe, for example, whether the assay is homogeneous or heterogeneous in nature.

Assay Design and Detection Technology

The assay "design" describes the methodology to report the action of the perturbagen on the target; i.e. how the perturbation is converted into a detectable signal. In BAO, assay design is broadly classified into one of eight categories: "binding reporter", "enzyme reporter", "inducible reporter", "morphology reporter", "viability reporter", "redistribution reporter", "conformation reporter", and "membrane potential reporter". We further annotated the readout "detection technology" used in the assays. These annotations fall into one of several categories, including "spectrophotometry", "fluorescence", "luminescence", "label free technology", "scintillation counting", and "microscopy". Further specifications of assay design and detection technology can include the assay kit or detected wavelength.

Assay Meta Target

Assay "meta target" is a description of the component(s) of the biological system that interact with the perturbagen. Meta target can be directly described as a molecular entity (e.g. a purified protein or a protein complex), or indirectly by a biological process or event (e.g. phosphorylation), or a signaling pathway. An important aspect of our meta target annotations is that they are embedded with semantic information (e.g. "is target of" only "measure group"; disjointness with classes such as "perturbagen" or "endpoint"). Meta target may be further linked to additional terms and external content, such as a pathway database. One of the goals of describing meta targets is to infer possible molecular targets or perturbagen mechanisms of action based on the analysis of results of many related assays. Meta target specifications include protein modifications, cell lines, or details about the mechanism of ligand-protein interaction.

Assay Endpoint

An assay "endpoint" describes a quantitative or qualitative outcome of the bioassay. The main classes that we identified are "perturbagen concentration"- and "response"-type endpoints. Simple examples include IC50, EC50, CC50 and percent inhibition, percent activation, percent viability, respectively. We conducted two stages of endpoint formalization, the first of which was to standardize the endpoint names in PubChem by manual curation. This reduces the number of different representations of each endpoint concept. In the examples illustrated below we have reduced 85 unique PubChem endpoint representations to 18 standardized endpoints. However, it is not possible (by manual curation) to uniquely describe each endpoint by exactly one representation, because the endpoint concept depends on other assay concepts and can even vary among different perturbagens of the same assay. In BAO, we therefore defined the endpoint concepts semantically using description logic to specify relationships among the endpoint types and other BAO concepts (see below, Ontology-facilitated query examples, example 3). This enables us to retrieve inferred results, which could otherwise not be obtained or would require complex Boolean endpoint queries. An excerpt of BAO around the class assay endpoint is shown below (Ontology outline, development and implementation).
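The first formalization stage (collapsing many deposited endpoint names onto a few standardized terms) can be pictured as a lookup table. The following is a minimal sketch, not the curation tooling used for BAO; the raw names in the table and the function `standardize_endpoint` are hypothetical, apart from "potency", which the text later mentions as a PubChem name for an AC50 endpoint.

```python
# Hypothetical synonym table: normalized raw endpoint name -> standardized term.
# The raw names are illustrative and were not taken from PubChem.
STANDARD_ENDPOINTS = {
    "ic50": "IC50",
    "ic50 (um)": "IC50",
    "potency": "AC50",
    "% inhibition": "percent inhibition",
    "percent_inhibition": "percent inhibition",
}


def standardize_endpoint(raw_name):
    """Return the standardized term, or None if the name still needs manual curation."""
    # Normalize case and collapse runs of whitespace before the lookup.
    key = " ".join(raw_name.strip().lower().split())
    return STANDARD_ENDPOINTS.get(key)


raw_names = ["IC50", "IC50 (uM)", "Potency", "% Inhibition", "qualified AC50"]
standardized = {name: standardize_endpoint(name) for name in raw_names}
distinct_terms = {term for term in standardized.values() if term is not None}
```

A table like this reduces the number of representations, but it cannot express that the meaning of an endpoint depends on other assay concepts, which is why the second stage uses description logic.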

For the purpose of demonstrating the semantic querying capabilities facilitated by BAO (which are described below) we curated over 300 bioassays from PubChem and standardized the endpoints using BAO.

Ontology outline, development and implementation

BAO was designed to describe biological screening experiments and their outcomes by the six main components outlined above, in addition to general assay attributes that don't fall into any of these categories. Each BAO component includes multiple levels of sub-classes and specification classes, which are linked via object property relationships to form a knowledge representation. A full description of this schema will be discussed elsewhere; the current version of the ontology (v1.2b868) is available on our website and at the NCBO BioPortal. Our development approach follows established ontology engineering methodologies using a combination of top-down, domain expert-driven and bottom-up, data-driven approaches [20]. The current version of BAO consists of 730 OWL 2.0 [21] classes, 72 object properties (relations), 7 data properties, and 25 individuals (not including any annotated assays).

Several external ontologies contain partial information on some of the components of biological assays described by BAO. To leverage these efforts, we have imported into BAO relevant sections from Gene Ontology (GO) [14], Cell Line Ontology (CLO) [22], Unit Ontology (UO) [23] and others. GO biological process terms, CLO cell line names and additional parameters are used in BAO meta target and meta target specifications. Organism names associated with targets were imported from NCBI taxonomy. Protein target names and IDs were referenced from UniProt. From UO we imported concentration unit and time unit terms. We are currently working on mapping BAO to other OBO ontologies. For example, OBI includes relevant information to describe biological assays [24]. We have mapped some of the BAO relationships to the OBO Relationship Ontology (RO) [25] and we aim to make more use of RO relationships in the future. Additionally, we may be able to use RO to map BAO concepts to other ontologies, in particular OBI.

BAO is "rich" with a DL expressivity of ALCHOIQ(D). This means that the ontology has the basic ALC expressivity [26] with role hierarchies (H), nominals (O), inverse properties (I), qualified cardinality restrictions (Q), and the use of datatype properties, data values or data types (D). It should be noted that three major bioinformatic terminology bases, SNOMED [27], Galen [28], and GO [14], have the expressivity of EL, with additional role properties. In EL, only intersections between concepts and full existential quantification are possible. In comparison, BAO is a significant improvement in expressivity.

Figure 1 illustrates the high-level outline of BAO. It shows the root-level classes described above, the general bioassay specifications, and some of their relationships. Some concepts (format, perturbagen and bioassay specifications) are linked directly to bioassay, while others (endpoint, meta target, design, detection technology) are linked via a measure group to accommodate multiplexed and multi-parametric assays. It is also important to note that the assay components are not modeled as sub-classes of bioassay, because they do not have a formal "is a" relationship to bioassay. The bioassay component specification classes are not shown. Figure 2 shows an excerpt of the BAO classes (and their subsumption hierarchies) that are related to the concept "endpoint". For example, Figure 2 illustrates the different types of endpoints, such as concentration- and response-type, and also the relationships to the specification class, which includes (among others) "endpoint mode of action" with various sub-classes. These concepts are relevant for the semantic querying and reasoning capabilities described in the examples below.

Figure 1. BAO excerpt showing the root-level classes and some of their relationships.

Figure 2. A view on some of BAO's concepts, defined as either primitive (light gray/yellow) or defined classes (dark gray/orange).

The complete specification in OWL 2.0 can be visually explored and downloaded from our web page http://www.bioassayontology.org/visualize/. To illustrate how each of these classes is embedded with semantic information, the following example depicts a detailed specification for the class "IC50", defined as the concentration of the perturbagen that results in 50% inhibition.

Equivalent classes

ic50 ≡ (∃"has mode of action".inhibition) ⊓

(∀"has mode of action".inhibition) ⊓

("has percent response". "50 percent inhibition individual")

Superclasses

ic50 ⊑ (∀"has curvefit spec". "curvefit spec")

ic50 ⊑ "perturbagen concentration"

Inherited anonymous classes

ic50 ⊑ (∃"has perturbagen concentration unit". "concentration unit") ⊓

(∀"has perturbagen concentration unit"."concentration unit") ⊓

(= 1 "has perturbagen concentration value".xsd:float) ⊓

(∀"has specification"."endpoint spec") ⊓

(∃"has perturbagen".perturbagen) ⊓

(= 1 "has perturbagen".T)

Symbols

≡: equivalentClass, ⊑: subClassOf, ∀: allValuesFrom, ∃: someValuesFrom, = N: exactly N, T: Thing.

It is important to note that in OWL 2.0, there are only definitions for equivalent classes (necessary and sufficient conditions) and superclasses (necessary conditions). Necessary and sufficient conditions are used to classify individuals; for example, we might be able to infer that an individual endpoint must be an IC50 because the mode of action is inhibition (among other criteria). With only necessary conditions, the definition is logically different: it says that if an individual is a member of the class IC50, it is necessarily also a member of the class "perturbagen concentration". The equivalent class IC50 specifies "has mode of action" only "inhibition". "Only" here denotes universal quantification, describing all the individuals whose "has mode of action" relationships refer exclusively to members of the class inhibition; or, equivalently, the individuals that do not have "has mode of action" relationships to individuals that are not members of the class "inhibition". There are also existential restrictions, which can be read as "among other things"; the universal restriction is combined with an existential restriction to close a given property, which is necessary for the reasoning process. The keyword "some" denotes existential restrictions. An example in our ontology is "has mode of action" some "inhibition". This specifies the existence of at least one relationship along a given property to an individual that is a member of the class inhibition.

Certain specifications are inherited from classes that are higher up in the class hierarchy. An example of this is the inherited anonymous class definition of individuals having the object property "has perturbagen concentration value". There is also the relationship "has perturbagen", describing that every individual of the IC50 class must have at least one perturbagen.
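To make the combination of "some" and "only" restrictions in the IC50 definition more concrete, the sketch below checks the three equivalent-class conditions on a simple record. It is only an illustration under strong assumptions: the `Endpoint` record and `is_ic50` function are hypothetical, the percent response is modeled as a plain number rather than an individual, and the check treats the listed modes of action as complete (a closed-world shortcut), whereas an OWL reasoner works under open-world semantics.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """Hypothetical, simplified stand-in for an endpoint individual."""
    modes_of_action: set = field(default_factory=set)
    percent_response: float = None  # None when no percent response is recorded


def is_ic50(ep):
    # "some": at least one mode of action is inhibition.
    has_some_inhibition = "inhibition" in ep.modes_of_action
    # "only": no mode of action other than inhibition (vacuously true if empty).
    has_only_inhibition = ep.modes_of_action <= {"inhibition"}
    # Stand-in for the "50 percent inhibition individual".
    is_fifty_percent = ep.percent_response == 50
    return has_some_inhibition and has_only_inhibition and is_fifty_percent
```

The empty-set case shows why both restrictions are stated: the "only" condition alone would accept an endpoint with no mode of action at all, and the "some" condition alone would accept an endpoint that is annotated with inhibition and a second mode of action.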

Ontology implementation and application

The workflow for applying the ontology to real data from PubChem is illustrated in Figure 3. First, we summarized a set of attributes about the assays that needed to be annotated. We considered >120 attributes (e.g., "EndpointStandardized", which takes values of IC50, percent inhibition, fold activation, etc.). These attributes are populated row-by-row in a spreadsheet for the relevant assays using a local mirror of the PubChem data source. A major portion of the spreadsheet is curated manually. To compensate for errors that may have been introduced during the manual work, we wrote a software module to cross-reference each entry in the spreadsheet with the PubChem data source. There was some redundant information between the annotation spreadsheet and the data in PubChem, for example the screening concentration reported in the assay description (which was manually curated) and the screening concentration deposited to PubChem (which was available in the mirror data source). Some information in the annotation template was explicitly repeated from PubChem in order to correctly map annotated (standardized) terms to data in PubChem, for example to standardize endpoints. This redundancy can be seen as a quality control step to uncover any discrepancies between original and curated data. This step revealed some inconsistencies in the PubChem database, such as PubChem table entries that are not atomic and incorrect or missing screening concentrations or units; it also helped to minimize the errors made during the cumbersome curation process.

Second, we developed a core software module, described as Loader/Bootstrap in Figure 3, which reads the curated and quality-checked data and then uses the ontology as well as necessary PubChem data to create a logical model of the domain. The reasoning engine Pellet was used both to create and to query the domain model. Pellet is a server-based OWL-DL reasoner that supports SROIQ(D). We also experimented with other DL reasoners, such as HermiT and FaCT++, but used Pellet because of its existing API (Application Programming Interface), which allows interfacing to other software components that we use.
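The quality control step, in which manually curated values are compared against the deposited data, could look roughly like the following. This is a hedged sketch rather than the module described above: the function `find_discrepancies`, the dictionary layout, and the assay IDs and values in the example are all hypothetical.

```python
import math


def find_discrepancies(curated, deposited, rel_tol=1e-6):
    """Compare screening concentrations keyed by assay ID.

    Both arguments map an assay ID to a (concentration, unit) tuple.
    Returns a list of (assay ID, problem description) pairs.
    """
    problems = []
    for aid, (c_val, c_unit) in sorted(curated.items()):
        if aid not in deposited:
            problems.append((aid, "missing in deposited data"))
            continue
        d_val, d_unit = deposited[aid]
        if d_val is None or d_unit is None:
            problems.append((aid, "missing concentration or unit"))
        elif c_unit != d_unit:
            problems.append((aid, "unit mismatch"))
        elif not math.isclose(c_val, d_val, rel_tol=rel_tol):
            problems.append((aid, "value mismatch"))
    return problems


# Hypothetical example data (assay IDs 1-4 are placeholders, not real AIDs).
curated = {1: (10.0, "uM"), 2: (5.0, "uM"), 3: (4.0, "uM"), 4: (1.0, "uM")}
deposited = {1: (10.0, "uM"), 2: (5.0, "nM"), 3: (None, "uM")}
report = find_discrepancies(curated, deposited)
```

Each reported pair would then be resolved by a curator, since a mismatch can point to an error on either side.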

Figure 3. BAO software modules (orange/dark gray), documents and databases (light green/light gray).

Of particular note here is the BAO expressivity of SROIQ(D). S allows atomic and complex concept negation, concept intersection, universal restrictions, limited existential quantification and transitive roles. R stands for limited complex role inclusion axioms, reflexivity, irreflexivity and role disjointness. O stands for nominals, I for inverse properties, Q for qualified cardinality restrictions and (D) for the use of datatype properties or data values. The reasoner checks the internal consistency of the logical model and infers hidden knowledge. One example of this is the class AC50, which was inferred to be a superclass of IC50 (see Figure 4b). The ontology specification defines AC50 as the concentration of the perturbagen that results in either 50% activation (EC50) or inhibition (IC50). Figures 4a and 4b show the asserted and inferred models of AC50, respectively.

Figure 4. a) Asserted logical taxonomy for AC50 (above) and b) inferred logical taxonomy, where IC50 is classified as a sub-class of AC50.
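The inference that IC50 is a sub-class of AC50 can be imitated with a toy subsumption test. The sketch below is an assumption-laden simplification and not how a DL reasoner such as Pellet works internally: each class is reduced to the set of modes of action it permits plus a percent response, and the names `DEFINITIONS`, `is_subsumed_by` and `inferred_superclasses` are hypothetical.

```python
# Toy encoding of the three definitions discussed in the text.
DEFINITIONS = {
    "IC50": {"modes": frozenset({"inhibition"}), "percent": 50},
    "EC50": {"modes": frozenset({"activation"}), "percent": 50},
    "AC50": {"modes": frozenset({"activation", "inhibition"}), "percent": 50},
}


def is_subsumed_by(sub, sup):
    """True if every endpoint satisfying `sub` also satisfies `sup` in this toy model."""
    a, b = DEFINITIONS[sub], DEFINITIONS[sup]
    return a["percent"] == b["percent"] and a["modes"] <= b["modes"]


def inferred_superclasses(name):
    """Superclasses that follow from the definitions without being asserted."""
    return sorted(
        other for other in DEFINITIONS
        if other != name and is_subsumed_by(name, other)
    )
```

In this model no sub-class link between IC50 and AC50 is written down anywhere; it follows from comparing the definitions, which is the point illustrated by the asserted and inferred taxonomies in Figure 4.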

Ontology-facilitated query examples

We performed a series of experiments based on 194 out of the 300 curated PubChem bioassays that had the (standardized) endpoint terms IC50, EC50, AC50, percent activation, percent stimulation and percent inhibition. Since the entire set of assays and endpoints would have required > 17 GB worth of RDF triples, we decided to limit the number of endpoints considered to 20 for performance reasons. Future versions of the software will focus on optimization and the use of additional annotations. With 20 endpoints, the software generated 45,075 triples (asserted ontology + triple database) in the Jena store. All example queries can be found and tested online at http://baoquery.ccs.miami.edu/joseki/query.html. The reasoner classifies the individuals, and SPARQL allows an efficient search through this inferred graph.

Example 1: This example illustrates a common query for compounds with an IC50 value of less than a certain cutoff (here ≤ 10 μM). Such a query should also return results of differently named IC50 endpoints (e.g. AC50), which a user may not know exist. A user querying the database may also be interested in returning other relevant endpoints, such as IC80 values ≤ 10 μM (if they existed in the repository) or other result types such as potent inhibitors screened at less than the IC50 concentration. With the semantic definition of IC50 above, we can achieve both. Query: return all compounds from assays with an inhibitory mode of action and that have a percentage response of 50% or greater at ≤ 10 μM screening concentration.

The SPARQL query was the following:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX bao: <http://www.bioassayontology.org/bao#>

    # results
    SELECT DISTINCT ?compound ?endpoint ?type ?responseValue ?screeningConc ?assay
    WHERE {
      {
        # from endpoints
        ?endpoint rdf:type bao:BAO_0000179.
        ?endpoint bao:BAO_0000196 ?inhibition.
        # has a mode of action inhibition
        ?inhibition rdf:type bao:BAO_0000091.
        # perturbagen concentration endpoint
        ?endpoint bao:BAO_0000336 ?screeningConc.
        # has concentration unit micro molar
        ?endpoint bao:BAO_0000183 bao:BAO_0000107.
        # has percent response
        ?endpoint bao:BAO_0000337 ?percentResponse.
        ?percentResponse bao:BAO_0000195 ?responseValue.
        ?endpoint rdf:type ?type.
        ?type rdfs:subClassOf bao:BAO_0000180.
      }
      UNION {
        # response endpoint
        ?endpoint bao:BAO_0000196 ?inhibition.
        ?inhibition rdf:type bao:BAO_0000091.
        ?endpoint bao:BAO_0000338 ?pert.
        ?pert bao:BAO_0000336 ?screeningConc.
        ?pert bao:BAO_0000183 bao:BAO_0000107.
        ?endpoint bao:BAO_0000195 ?responseValue.
        ?endpoint rdf:type ?type.
        ?type rdfs:subClassOf bao:BAO_0000181.
      }
      ?endpoint bao:BAO_0000185 ?compound.
      ?endpoint rdf:type ?type.
      ?assay bao:BAO_0000209 ?measureGroup.
      ?measureGroup bao:BAO_0000208 ?endpoint.
      # screening concentration <= 10 micro molar && percent response >= 50%
      FILTER(?screeningConc <= 10 && ?responseValue >= 50)
    }

The BAO software returns 2,741 SPARQL endpoint results from the inferred model residing in the triple store, 4 of which are shown below for illustrative purposes. All results are individuals with a working internal resource identifier (IRI), which corresponds to a URI, but is valid only internally. IRIs are abbreviated due to space limitations, but all complete IRIs are available via http://baoquery.ccs.miami.edu/joseki/query.html

    (5) (?compound=<bao#individual_BAO_0000021_2858522>)
        (?endpoint=<bao#individual_BAO_0000190_2_2357>)
        (?type=<bao#BAO_0000190>)
        (?responseValue="50"^^xsd:float)
        (?screeningConc="4.0"^^xsd:float)
        (?assay=<bao#individual_BAO_0000015_1293>)

    (17) (?compound=<bao#individual_BAO_0000021_133407>)
        (?endpoint=<bao#individual_BAO_0000190_2_2533>)
        (?type=<bao#BAO_0000190>)
        (?responseValue="50"^^xsd:float)
        (?screeningConc="8.59"^^xsd:float)
        (?assay=<bao#individual_BAO_0000015_2409>)

    (24) (?compound=<bao#individual_BAO_0000021_11057>)
        (?endpoint=<bao#individual_BAO_0000190_2_4122>)
        (?type=<bao#BAO_0000186>)
        (?responseValue="50"^^xsd:float)
        (?screeningConc="6.3096"^^xsd:float)
        (?assay=<bao#individual_BAO_0000015_948>)

    (2690) (?compound=<bao#individual_BAO_0000021_657680>)
        (?endpoint=<bao#individual_BAO_0000201_1_1670>)
        (?type=<bao#BAO_0000201>)
        (?responseValue="63.48"^^xsd:float)
        (?screeningConc="4.0"^^xsd:float)
        (?assay=<bao#individual_BAO_0000015_834>)

Results are shown by their unique IRIs, e.g. the first result contains the compound ID (CID) 2858522 [29] of an individual of the class perturbagen (BAO_0000021). The SPARQL query also selects for the endpoints of the perturbagens that fulfill the activity criteria. The query retrieves results that classify as specific types of endpoints (subsumption reasoning). Result (5) (CID 2858522, AID 1293) was found because IC50 (BAO_0000190) is_a perturbagen concentration-type endpoint (as defined above); note that in PubChem AID 1293 this endpoint was incorrectly reported as EC50, which we corrected during the curation process. Result (17) (CID 133407, AID 2409) also returns IC50. Result (18) (not shown) returns the same data as AC50, concordant with the (inferred) subsumption hierarchy (compare Figure 4b). Querying AC50 (instead of IC50) thus would also retrieve this result. Result (24) (CID 11057, AID 948) is an AC50 endpoint (named "potency" in PubChem); result (23) returns the same data as IC50 (not shown), again consistent with the inferred class hierarchy. Result (2690) (CID 657680, AID 1834) is a percentage inhibition endpoint (63.5%) and the screening concentration is 4 μM (i.e. less than the query 10 μM). These different types of results can be retrieved because of the subsumption reasoning of the DL engine using formally defined endpoints. This example illustrates that with the endpoint definition in BAO, we can identify and return relevant query results that are not restricted to a specific endpoint type or endpoint representation specified by the query, as would typically be the case in a relational system.
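The effect described for Example 1, one activity criterion matching both concentration-type and response-type endpoints, can be mimicked on flat records. The sketch below is not the BAO software; the tuple layout and the function `potent_inhibitors` are hypothetical, the first two records reuse the values of results (5) and (2690) above, and the remaining three records are invented counter-examples.

```python
ENDPOINTS = [
    # (compound CID, endpoint type, mode of action, response %, concentration in uM)
    (2858522, "IC50", "inhibition", 50.0, 4.0),                # result (5) above
    (657680, "percent inhibition", "inhibition", 63.48, 4.0),  # result (2690) above
    (111, "IC50", "inhibition", 50.0, 25.0),                   # hypothetical: above cutoff
    (222, "percent inhibition", "inhibition", 12.0, 4.0),      # hypothetical: weak response
    (333, "EC50", "activation", 50.0, 1.0),                    # hypothetical: wrong mode
]


def potent_inhibitors(endpoints, max_conc=10.0, min_response=50.0):
    """Keep inhibitory endpoints with >= min_response % at <= max_conc uM."""
    return [
        e for e in endpoints
        if e[2] == "inhibition" and e[4] <= max_conc and e[3] >= min_response
    ]


hit_cids = [e[0] for e in potent_inhibitors(ENDPOINTS)]
```

The filter never names an endpoint type, so an IC50 of 4 μM (50% response at 4 μM) and a 63.48% inhibition measured at 4 μM are both returned. In the ontology this uniform view comes from the formal endpoint definitions rather than from a hand-built table.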

Example 2: Here, we illustrate an example of constructive reasoning in identifying compounds of a particular pharmacological action. Query: return all assays with compounds that have a mode of action "activation" and show a percentage response of ≥ 50% at ≤ 10 μM screening concentration. The query syntax was the following (we are omitting the PREFIX section this time):

    SELECT DISTINCT ?compound ?endpoint ?type ?moaType ?responseValue ?screeningConc ?assay
    WHERE {
      {
        # from endpoints
        ?endpoint rdf:type bao:BAO_0000179.
        ?endpoint bao:BAO_0000196 ?activation.
        # has a mode of action activation
        ?activation rdf:type bao:BAO_0000087.
        ?activation rdf:type ?moaType.
        ?moaType rdfs:subClassOf bao:BAO_0000084.
        # perturbagen concentration endpoint
        ?endpoint bao:BAO_0000336 ?screeningConc.
        # has concentration unit micro molar
        ?endpoint bao:BAO_0000183 bao:BAO_0000107.
        # has percent response
        ?endpoint bao:BAO_0000337 ?percentResponse.
        ?percentResponse bao:BAO_0000195 ?responseValue.
        ?endpoint rdf:type ?type.
        ?type rdfs:subClassOf bao:BAO_0000180.
      }
      UNION {
        # response endpoint
        ?endpoint bao:BAO_0000196 ?activation.
        ?activation rdf:type bao:BAO_0000087.
        ?activation rdf:type ?moaType.
        ?moaType rdfs:subClassOf bao:BAO_0000084.
        ?endpoint bao:BAO_0000338 ?pert.
        ?pert bao:BAO_0000336 ?screeningConc.
        ?pert bao:BAO_0000183 bao:BAO_0000107.
        ?endpoint bao:BAO_0000195 ?responseValue.
        ?endpoint rdf:type ?type.
        ?type rdfs:subClassOf bao:BAO_0000181.
      }
      ?endpoint bao:BAO_0000185 ?compound.
      ?endpoint rdf:type ?type.
      ?assay bao:BAO_0000209 ?measureGroup.
      ?measureGroup bao:BAO_0000208 ?endpoint.
      # screening concentration <= 10 micro molar && percent response >= 50%
      FILTER(?screeningConc <= 10 && ?responseValue >= 50)
    }

Similar to example 1, the system returns different types of relevant results. In addition to assays with compounds that have an endpoint "percent activation" of 50% at < 10 μM, this query also returns assays with an EC50 or an AC50 value of < 10 μM. Moreover, this example demonstrates one of the constructive reasoning mechanisms in BAO, where "activation" was defined as equivalent to "stimulation" (among other equivalent classes, e.g. agonist). As the reasoning system returns results that satisfy the original query and the inferred query, searching "activation" (BAO_0000087) returns exactly the same results as querying for "stimulation" (BAO_0000093), independent of the specific term used to describe the pharmacological action. Selected results are:

    (1) (?compound=<bao#individual_BAO_0000021_653469>)
        (?endpoint=<bao#individual_BAO_0000188_2_5524>)
        (?type=<bao#BAO_0000180>)
        (?moaType=<bao#BAO_0000087>)
        (?responseValue="50"^^xsd:float)
        (?screeningConc="2.154"^^xsd:float)
        (?assay=<bao#individual_BAO_0000015_695>)

    (5) (?compound=<bao#individual_BAO_0000021_653469>)
        (?endpoint=<bao#individual_BAO_0000188_2_5524>)
        (?type=<bao#BAO_0000188>)
        (?moaType=<bao#BAO_0000093>)
        (?responseValue="50"^^xsd:float)
        (?screeningConc="2.154"^^xsd:float)
        (?assay=<bao#individual_BAO_0000015_695>)

    (5130) (?compound=<bao#individual_BAO_0000021_645132>)
        (?endpoint=<bao#individual_BAO_0000200_1_464>)
        (?type=<bao#BAO_0000181>)
        (?moaType=<bao#BAO_0000087>)
        (?responseValue="132.52"^^xsd:float)
        (?screeningConc="5.7"^^xsd:float)
        (?assay=<bao#individual_BAO_0000015_1318>)

    (5131) (?compound=<bao#individual_BAO_0000021_645132>)
        (?endpoint=<bao#individual_BAO_0000200_1_464>)
        (?type=<bao#BAO_0000200>)
        (?moaType=<bao#BAO_0000093>)
        (?responseValue="132.52"^^xsd:float)
        (?screeningConc="5.7"^^xsd:float)
        (?assay=<bao#individual_BAO_0000015_1318>)

The first result (1) refers to AID 695 [30]. As before, the formal definition of "mode of action" in the ontology and the reasoning system make it possible to retrieve relevant results by inference, which could not be returned from a relational database system (e.g. agonist if one searched for activation).
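The equivalence between "activation" and "stimulation" can be illustrated by expanding a search term to its equivalence set before matching. This is a simplified sketch of the idea only; the helper names `expand_term` and `search` and the three records are hypothetical, while the equivalence of activation, stimulation and agonist is taken from the text above.

```python
# Terms declared equivalent in the text; the list-of-sets layout is an assumption.
EQUIVALENT_TERMS = [{"activation", "stimulation", "agonist"}]

# Hypothetical annotated records: each stores whichever term the depositor used.
RECORDS = [
    {"id": "r1", "moa": "activation"},
    {"id": "r2", "moa": "stimulation"},
    {"id": "r3", "moa": "inhibition"},
]


def expand_term(term, equivalence_sets):
    """Return all terms equivalent to `term`, including the term itself."""
    for group in equivalence_sets:
        if term in group:
            return set(group)
    return {term}


def search(records, term, equivalence_sets):
    """Return IDs of records whose mode of action is equivalent to `term`."""
    wanted = expand_term(term, equivalence_sets)
    return [r["id"] for r in records if r["moa"] in wanted]
```

Searching for either term yields the same record IDs, which mirrors the observation that querying BAO_0000087 and BAO_0000093 returns identical results. In the ontology the expansion is performed by the reasoner from the equivalent-class axioms, not by application code.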

Example 3: With this example, we demonstrate a specific case concerning three concepts: endpoint, bioassay, and perturbagen. Figure 5 shows the relevant relationships between these concepts (note: the concept "measure group" exists to accommodate multiplexed assays; it is not used in this example); it is a more detailed representation of some of the concepts in Figure 1. Of particular interest is the relation "has perturbagen" that holds between endpoint and perturbagen as well as bioassay and perturbagen. The ontology specifies that this property has an inverse relationship with "is perturbagen of". Here we show how these relationships (with their characteristics) are used to retrieve eligible instances (individuals) by inference. This reasoning mechanism thus makes it possible to retrieve perturbagens based on more complex concepts, for example a class of promiscuous compounds (compounds that are active in several assays - see below).

Figure 5. Relationships between BioAssay, EndPoint, and Perturbagen.

To illustrate this, we queried for all perturbagens that have a percentage response of ≥50% in at least three assays. The SPARQL query was as follows:

    SELECT ?pert
    WHERE {
      {
        ?pert rdf:type bao:BAO_0000021.
        ?pert bao:BAO_0000361 ?assay.
        ?assay bao:BAO_0000209 ?measureGroup.
        ?measureGroup bao:BAO_0000208 ?endpoint.
        ?endpoint bao:BAO_0000195 ?percentResponseValue.
      } UNION {
        ?pert rdf:type bao:BAO_0000021.
        ?pert bao:BAO_0000361 ?assay.
        ?assay bao:BAO_0000209 ?measureGroup.
        ?measureGroup bao:BAO_0000208 ?endpoint.
        ?endpoint bao:BAO_0000337 ?percentResponse.
        ?percentResponse bao:BAO_0000195 ?percentResponseValue.
      }
      FILTER (?percentResponseValue >= 50)
    }
    GROUP BY ?pert
    HAVING (count(distinct ?assay) >= 3)

In this query, we used the inferred relation "is perturbagen of", which points to either an endpoint or a bioassay. The query separately checked for bioassay instances and endpoint instances. The syntax allowed for the expression of the notion of "at least" in a simple way. Specifically, we used the syntactic extensions available in the ARQ SPARQL [31] implementation. The "GROUP BY" extended clause grouped the unique "?pert" result set (?pert is a variable here) on a row-by-row basis. The "HAVING" clause applied the filter "count(distinct ?assay) >= 3" to the result set after grouping. The results of the query were as follows. First, we queried for the compound and obtained:

(1) (?pert=<bao#individual_BAO_0000021_646704>)

We then used this result (bao:individual_BAO_0000021_646704) for the next query:

    SELECT ?assay ?percentResponseValue
    WHERE {
      {
        bao:individual_BAO_0000021_646704 bao:BAO_0000361 ?assay.
        ?assay bao:BAO_0000209 ?mg.
        ?mg bao:BAO_0000208 ?endpoint.
        bao:individual_BAO_0000021_646704 bao:BAO_0000361 ?endpoint.
        ?endpoint bao:BAO_0000195 ?percentResponseValue.
      } UNION {
        bao:individual_BAO_0000021_646704 bao:BAO_0000361 ?assay.
        ?assay bao:BAO_0000209 ?mg.
        bao:individual_BAO_0000021_646704 bao:BAO_0000361 ?endpoint.
        ?endpoint bao:BAO_0000337 ?percentResponse.
        ?percentResponse bao:BAO_0000195 ?percentResponseValue.
      }
      FILTER (?percentResponseValue >= 50)
    }

Here are the final results:

    (1) (?assay=<bao#individual_BAO_0000015_1262>)
        (?percentResponseValue="116.84"^^xsd:float)

    (2) (?assay=<bao#individual_BAO_0000015_1306>)
        (?percentResponseValue="106.48"^^xsd:float)

    (3) (?assay=<bao#individual_BAO_0000015_1316>)
        (?percentResponseValue="99.42"^^xsd:float)

Example 3 is a simple illustration of how to identify compounds with a specific profile (here, active in three assays). The query retrieved inferred information, facilitated by the inverse relationship "is perturbagen of". Further specification of this query, e.g. by BAO meta target or design sub-classes, would allow quick identification of individuals based on more complex concepts, for example compounds that are promiscuously active in assays of a specific design and which are therefore likely artifacts.
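The two ingredients of Example 3, deriving the inverse relation and then grouping with an "at least three assays" condition, can be sketched outside SPARQL as follows. This is an illustrative approximation rather than the BAO implementation: the tuple layout and the functions `invert` and `promiscuous` are hypothetical, the first three rows reuse the assay IDs and response values reported above for compound 646704, and the last two rows are invented.

```python
from collections import defaultdict

# Asserted direction: assay "has perturbagen" compound, with a percent response.
# Layout: (assay AID, compound CID, percent response).
ASSERTED = [
    (1262, 646704, 116.84),
    (1306, 646704, 106.48),
    (1316, 646704, 99.42),
    (9001, 646704, 12.0),   # hypothetical assay in which the compound is inactive
    (1262, 999999, 75.0),   # hypothetical compound that is active in one assay
]


def invert(asserted):
    """Derive 'is perturbagen of' (compound -> assays) from 'has perturbagen'."""
    inverse = defaultdict(list)
    for aid, cid, response in asserted:
        inverse[cid].append((aid, response))
    return inverse


def promiscuous(asserted, min_response=50.0, min_assays=3):
    """Compounds with >= min_response % in at least min_assays distinct assays."""
    hits = []
    for cid, links in invert(asserted).items():
        active_assays = {aid for aid, response in links if response >= min_response}
        if len(active_assays) >= min_assays:
            hits.append(cid)
    return sorted(hits)
```

Here `invert` plays the role of the inverse property axiom, the set of active assays corresponds to `count(distinct ?assay)`, and the length check corresponds to the HAVING clause.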

The three query examples illustrate some of the features that can be used in complex search queries with an underlying DL-based ontology. Other features such as role hierarchies, quantifiers, nominals etc. were also used in our ontology.