William Hsiao - Academia.edu

Papers by William Hsiao

There is an increasing awareness within private and public organizations that ontologies (globally accessible and uniquely identified terms that have both natural language definitions and logic relations which can be queried and reasoned over by computers) are useful in solving interoperability quagmires between data silos and the ad hoc data dictionaries that describe them. However, the complexity of implementing evolving ontologies in content management and federated data querying applications is formidable. The Genomic Epidemiology Entity Mart (GEEM) web platform is a proof-of-concept web portal designed to provide non-ontologist users with an ontology-driven interface for examining data standards related to genomic sequence repository records. GEEM provides web forms that show labels and allowed values for easy review. It also provides software developers with downloadable specifications in JSON and other data formats that can be used without the need for ontology expertise. New systems can adopt ontology-driven standards specifications from the start, and the same specifications can be used to facilitate and validate the conversion of legacy data.

JOWO, 2021

The nomenclature of seafood species and their products is an important area that needs to be curated and regularly updated in FoodOn. In this paper, we present a semi-automated, ROBOT template-driven approach we designed for aligning FoodOn with the FDA-issued 'Seafood List', together with other established resources, e.g., NCBI, ITIS, and Wikipedia. The basic data in the FDA Seafood List, which included Type, Common Name, FDA Law, FDA Acceptable Market Name(s) and Scientific Name, was exported in an Excel format. FDA Seafood Labels (Scientific Name) were mapped against NCBITaxons and NCBI GenBank Names using the ETE 3 toolkit, and around 90% of labels were correctly matched. ITIS TSNs, fetched using a locally installed ITIS database, were available for over 85% of seafood labels. Wikipedia-URL was retrieved as a cross-referenced database using the FoodOn-Wikipedia tool. In some cases, Wikidata was also used as an interface to connect to NCBITaxon. The curated seafood data was then converted from a tab-delimited TSV template file to a Web Ontology Language (OWL) file using ROBOT template. This method will not only help FoodOn to regularly update seafood organisms but will also help to maximize seafood product coverage and data interoperability.

As public health laboratories expand their genomic sequencing and bioinformatics capacity for the surveillance of different pathogens, labs must carry out robust validation, training, and optimization of wet- and dry-lab procedures. Achieving these goals for algorithms, pipelines and instruments often requires that lower-quality datasets be made available for analysis and comparison alongside those of higher quality. This range of data quality in reference sets can complicate the sharing of sub-optimal datasets that are vital for the community and for the reproducibility of assays. Sharing of useful but sub-optimal datasets requires careful annotation and documentation of known issues to enable appropriate interpretation, to avoid their being mistaken for better-quality information, and to make these data (and their derivatives) easily identifiable in repositories. Unfortunately, there are currently no standardized attributes or mechanisms for tagging poor-quality datasets, or datasets generated for a specific purpose, to maximize their utility, searchability, accessibility and reuse. The Public Health Alliance for Genomic Epidemiology (PHA4GE) is an international community of scientists from public health, industry and academia focused on improving the reproducibility, interoperability, portability and openness of public health bioinformatic software, skills, tools and data. To address the challenges of sharing lower-quality datasets, PHA4GE has developed a set of standardized contextual data tags, namely fields and terms, that can be included in public repository submissions as a means of flagging pathogen sequence data with known quality issues...

There are numerous past and current examples of ontology-driven projects that provide auto-generated user interfaces for managing entities and relations, each presenting its own varied and complex data model. Our Datum Proof Sheet application aims to simplify the application development landscape by building community consensus about the way basic categorical, textual and numeric datum fields should be described within the OBOFoundry community of ontologies. The proof sheet shows selected datums (grouped under the context of an OBI "data representational model" item) as form inputs on an HTML page, enabling an application ontology's contents to be presented to end users (ranging in our case from epidemiologists to software developers) for review without necessarily having a working application to showcase them in. The basic relations and cases necessary for presenting datums in a user interface are mostly satisfied by OBI's design, but we introduce a few extra elements to bring more clarity to datum specifications, and to provide user interface term labels and definitions that may differ from those that ontologists prefer in the "backend".

The Public Health Alliance for Genomic Epidemiology (PHA4GE) (https://pha4ge.org) is a global coalition that is actively working to establish consensus standards, document and share best practices, improve the availability of critical bioinformatic tools and resources, and advocate for greater openness, interoperability, accessibility and reproducibility in public health microbial bioinformatics. In the face of the current pandemic, PHA4GE has identified a clear and present need for a fit-for-purpose, open source SARS-CoV-2 contextual data standard. As such, we have developed an extension to the INSDC pathogen package, providing a SARS-CoV-2 contextual data specification based on harmonisable, publicly available, community standards. The specification is implementable via a collection template, as well as an array of protocols and tools to support the harmonisation and submission of sequence data and contextual information to public repositories. Well-structured, rich contextual data adds value, promotes reuse, and enables aggregation and integration of disparate data sets. Adoption of the proposed standard and practices will better enable interoperability between datasets and systems, improve the consistency and utility of generated data, and ultimately facilitate novel insights and discoveries in SARS-CoV-2 and COVID-19.

Bioinformatics, Dec 12, 2015

Motivation: There are various reasons for rerunning bioinformatics tools and pipelines on sequencing data, including reproducing a past result, validation of a new tool or workflow using a known dataset, or tracking the impact of database changes. For identical results to be achieved, regularly updated reference sequence databases must be versioned and archived. Database administrators have tried to fill the requirements by supplying users with one-off versions of databases, but these are time consuming to set up and are inconsistent across resources. Disk storage and data backup performance has also discouraged maintaining multiple versions of databases since databases such as NCBI nr can consume 50 Gb or more disk space per version, with growth rates that parallel Moore's law. Results: Our end-to-end solution combines our own Kipper software package (a simple key-value large file versioning system) with BioMAJ (software for downloading sequence databases) and Galaxy (a web-based bioinformatics data processing platform). Available versions of databases can be recalled and used by command-line and Galaxy users. The Kipper data store format makes publishing curated FASTA databases convenient since in most cases it can store a range of versions in a file marginally larger than the size of the latest version. Availability and implementation: Kipper v1.0.
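
The idea of a key-value versioning store that holds many versions in little more space than the latest one can be illustrated with a minimal sketch. This is not Kipper's actual data format or API; the `Record` and `VersionedStore` names, and the choice to tag each value with the version that created it and the version that removed it, are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    key: str                        # e.g. a FASTA sequence identifier
    value: str                      # e.g. the sequence itself
    created: int                    # first version that contains this value
    deleted: Optional[int] = None   # first version that no longer contains it


class VersionedStore:
    """Hypothetical store: unchanged entries are stored once across versions."""

    def __init__(self) -> None:
        self.records: List[Record] = []
        self.latest = 0

    def commit(self, snapshot: Dict[str, str]) -> int:
        """Record a full snapshot as the next version; only changes add rows."""
        self.latest += 1
        live = {r.key: r for r in self.records if r.deleted is None}
        # Retire live entries that are missing or changed in the new snapshot.
        for key, rec in live.items():
            if snapshot.get(key) != rec.value:
                rec.deleted = self.latest
        # Append entries that are new or whose value changed.
        for key, value in snapshot.items():
            old = live.get(key)
            if old is None or old.value != value:
                self.records.append(Record(key, value, self.latest))
        return self.latest

    def recall(self, version: int) -> Dict[str, str]:
        """Rebuild the key-value contents as they were at a given version."""
        return {
            r.key: r.value
            for r in self.records
            if r.created <= version and (r.deleted is None or version < r.deleted)
        }


store = VersionedStore()
store.commit({"seq1": "ACGT", "seq2": "GGCC"})
store.commit({"seq1": "ACGT", "seq2": "GGCA", "seq3": "TTTT"})
print(store.recall(1))
print(len(store.records))
```

In this sketch the unchanged entry `seq1` is stored once even though it appears in both versions, which is why the store grows with the amount of change rather than with the number of versions.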

Microbial genomics, Jun 8, 2017

The recent widespread application of whole-genome sequencing (WGS) for microbial disease investigations has spurred the development of new bioinformatics tools, including a notable proliferation of phylogenomics pipelines designed for infectious disease surveillance and outbreak investigation. Transitioning the use of WGS data out of the research laboratory and into the front lines of surveillance and outbreak response requires user-friendly, reproducible and scalable pipelines that have been well validated. Single Nucleotide Variant Phylogenomics (SNVPhyl) is a bioinformatics pipeline for identifying high-quality single-nucleotide variants (SNVs) and constructing a whole-genome phylogeny from a collection of WGS reads and a reference genome. Individual pipeline components are integrated into the Galaxy bioinformatics framework, enabling data analysis in a user-friendly, reproducible and scalable environment. We show that SNVPhyl can detect SNVs with high sensitivity and specificity, and identify and remove regions of high SNV density (indicative of recombination). SNVPhyl is able to correctly distinguish outbreak from non-outbreak isolates across a range of variant-calling settings, sequencing-coverage thresholds or in the presence of contamination. SNVPhyl is available as a Galaxy workflow, Docker and virtual machine images, and a Unix-based command-line application.
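
Removing regions of high SNV density can be sketched with a simple sliding window over sorted SNV positions. This is an illustrative approximation only, not SNVPhyl's implementation; the function names and the `window` and `min_snvs` parameters are hypothetical and their values below are arbitrary.

```python
from typing import List, Tuple


def dense_regions(positions: List[int], window: int, min_snvs: int) -> List[Tuple[int, int]]:
    """Return merged (start, end) regions holding >= min_snvs SNVs within `window` bp."""
    pos = sorted(positions)
    regions: List[Tuple[int, int]] = []
    left = 0
    for right in range(len(pos)):
        # Shrink the window until it spans fewer than `window` base pairs.
        while pos[right] - pos[left] >= window:
            left += 1
        if right - left + 1 >= min_snvs:
            start, end = pos[left], pos[right]
            if regions and start <= regions[-1][1]:
                # Overlaps the previous dense region: extend it.
                regions[-1] = (regions[-1][0], max(end, regions[-1][1]))
            else:
                regions.append((start, end))
    return regions


def filter_snvs(positions: List[int], window: int, min_snvs: int) -> List[int]:
    """Keep only SNV positions that fall outside every dense region."""
    regions = dense_regions(positions, window, min_snvs)
    return [p for p in sorted(positions) if not any(s <= p <= e for s, e in regions)]


snvs = [100, 105, 110, 500, 900, 905]
print(dense_regions(snvs, window=20, min_snvs=3))
print(filter_snvs(snvs, window=20, min_snvs=3))
```

With these example values, the three SNVs clustered near position 100 form one dense region and are dropped, while the isolated SNV and the pair near position 900 are retained.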

Several resources and standards for indexing food descriptors currently exist, but their content and interrelations are not semantically and logically coherent. Simultaneously, the need to represent knowledge about food is central to many fields including biomedicine and sustainable development. FoodON is a new ontology built to interoperate with the OBO Library and to represent entities which bear a "food role". It encompasses materials in natural ecosystems and food webs as well as human-centric categorization and handling of food. The latter will be the initial focus of the ontology, and we aim to develop semantics for food safety, food security, the agricultural and animal husbandry practices linked to food production, culinary, nutritional and chemical ingredients and processes. The scope of FoodON is ambitious and will require input from multiple domains. FoodON will import or map to material in existing ontologies and standards and will create content to cover gaps in the representation of food-related products and processes. As a robust food ontology can only be created by consensus and wide adoption, we are currently forming an international consortium to build partnerships, solicit domain expertise, and gather use cases to guide the ontology's development. The products of this work are being applied to research and clinical datasets such as those associated with the Canadian Healthy Infant Longitudinal Development (CHILD) study, which examines the causal factors of asthma and allergy development in children, and the Integrated Rapid Infectious Disease Analysis (IRIDA) platform for genomic epidemiology and foodborne outbreak investigation.

Population medicine, Apr 26, 2023

Population Medicine considers the following types of articles:
• Research Papers: reports of data from original research or secondary dataset analyses.
• Review Papers: comprehensive, authoritative reviews within the journal's scope. These include both systematic reviews and narrative reviews.
• Short Reports: brief reports of data from original research.
• Policy Case Studies: brief articles on policy development at a regional or national level.
• Study Protocols: articles describing a research protocol of a study.
• Methodology Papers: papers that present different methodological approaches that can be used to investigate problems in a relevant scientific field and to encourage innovation.
• Letters to the Editor: a response to authors of an original publication, or a very small article that may be relevant to readers.
• Editorials: articles written by the Editorial Board or by invited experts on a specific topic.
Research Papers
Articles reporting research may be full length or brief reports. These should report original research findings within the journal's scope. Papers should generally be a maximum of 4000 words in length, excluding tables, references, and abstract and key points of the article, whilst it is recommended that the number of references should not exceed 36.

GigaScience, 2022

Background: The Public Health Alliance for Genomic Epidemiology (PHA4GE) (https://pha4ge.org) is a global coalition that is actively working to establish consensus standards, document and share best practices, improve the availability of critical bioinformatics tools and resources, and advocate for greater openness, interoperability, accessibility, and reproducibility in public health microbial bioinformatics. In the face of the current pandemic, PHA4GE has identified a need for a fit-for-purpose, open-source SARS-CoV-2 contextual data standard. Results: As such, we have developed a SARS-CoV-2 contextual data specification package based on harmonizable, publicly available community standards. The specification can be implemented via a collection template, as well as an array of protocols and tools to support both the harmonization and submission of sequence data and contextual information to public biorepositories.

PLOS ONE

An application ontology often reuses terms from other related, compatible ontologies. The extent of this interconnectedness is not readily apparent when browsing through larger textual presentations of term class hierarchies, be it Manchester text format OWL files or within an ontology editor like Protege. Users must either note ontology sources in term identifiers, or look at ontology import file term origins. Diagrammatically, this same information may be easier to perceive in two-dimensional network or hierarchical graphs that visually code ontology term origins. However, humans, having stereoscopic vision and navigational acuity around colored and textured shapes, should benefit even more from a coherent three-dimensional interactive visualization of an ontology that takes advantage of perspective to offer both foreground focus on content and a stable background context. We present OntoTrek, a 3D ontology visualizer that enables ontology stakeholders: students, software developers, curati...

BMJ Open, Feb 1, 2023

Objectives: COVID-19 research has significantly contributed to pandemic response and the enhancement of public health capacity. COVID-19 data collected by provincial/territorial health authorities in Canada are valuable for research advancement yet not readily available to the public, including researchers. To inform developments in public health data-sharing in Canada, we explored Canadians' opinions of public health authorities sharing deidentified individual-level COVID-19 data publicly. Design/setting/interventions/outcomes: A national cross-sectional survey was administered in Canada in March 2022, assessing Canadians' opinions on publicly sharing COVID-19 datatypes. Market research firm Léger was employed for recruitment and data collection. Participants: Anyone greater than or equal to 18 years and currently living in Canada. Results: 4981 participants completed the survey with a 92.3% response rate. 79.7% were supportive of provincial/territorial authorities publicly sharing deidentified COVID-19 data, while 20.3% were hesitant/averse/unsure. Datatypes most supported for being shared publicly were symptoms (83.0% in support), geographical region (82.6%) and COVID-19 vaccination status (81.7%). Datatypes with the most aversion were employment sector (27.4% averse), postal area (26.7%) and international travel history (19.7%). Generally supportive Canadians were characterised as being ≥50 years, with higher education, and being vaccinated against COVID-19 at least once. Vaccination status was the most influential predictor of data-sharing opinion, with respondents who were ever vaccinated being 4.20 times more likely (95% CI 3.21 to 5.48, p=0.000) to be generally supportive of data-sharing than those unvaccinated. Conclusions: These findings suggest that the Canadian public is generally favourable to deidentified data-sharing. Identifying factors that are likely to improve attitudes towards data-sharing is useful to stakeholders involved in data-sharing initiatives, such as public health agencies, in informing the development of public health communication and data-sharing policies. As Canada progresses through the COVID-19 pandemic, and with limited testing and reporting of COVID-19 data, it is essential to improve deidentified data-sharing given the public's general support for these efforts. Participants were recruited exclusively through online platforms, which may under-represent groups of people who don't engage with online surveys.

Semantic Web

People often value the sensual, celebratory, and health aspects of food, but behind this experience exist many other value-laden agricultural production, distribution, manufacturing, and physiological processes that support or undermine a healthy population and a sustainable future. The complexity of such processes is evident in both everyday food preparation of recipes and in industrial food manufacturing, packaging and storage, each of which depends critically on human or machine agents, chemical or organismal ingredient references, and the explicit instructions and implicit procedures held in formulations or recipes. An integrated ontology landscape does not yet exist to cover all the entities at work in this farm-to-fork journey. It seems necessary to construct such a vision by reusing expert-curated fit-to-purpose ontology subdomains and their relationship, material, and more abstract organization and role entities. The challenge is to make this merger be, by analogy, one lan...

Nucleic Acids Research

The Comprehensive Antibiotic Resistance Database (CARD; card.mcmaster.ca) combines the Antibiotic Resistance Ontology (ARO) with curated AMR gene (ARG) sequences and resistance-conferring mutations to provide an informatics framework for annotation and interpretation of resistomes. As of version 3.2.4, CARD encompasses 6627 ontology terms, 5010 reference sequences, 1933 mutations, 3004 publications, and 5057 AMR detection models that can be used by the accompanying Resistance Gene Identifier (RGI) software to annotate genomic or metagenomic sequences. Focused curation enhancements since 2020 include expanded β-lactamase curation, incorporation of likelihood-based AMR mutations for Mycobacterium tuberculosis, addition of disinfectants and antiseptics plus their associated ARGs, and systematic curation of resistance-modifying agents. This expanded curation includes 180 new AMR gene families, 15 new drug classes, 1 new resistance mechanism, and two new ontological relationships: evolut...

Pathogen genomics is a critical tool for public health surveillance, infection control, outbreak investigations, as well as research. In order to make use of pathogen genomics data, it must be interpreted using contextual data (metadata). Contextual data includes sample metadata, laboratory methods, patient demographics, clinical outcomes, and epidemiological information. However, the variability in how contextual information is captured by different authorities and how it is encoded in different databases poses challenges for data interpretation, integration, and its use/re-use. The DataHarmonizer is a template-driven spreadsheet application for harmonizing, validating, and transforming genomics contextual data into submission-ready formats for public or private repositories. The tool’s web browser-based JavaScript environment enables validation, and its offline functionality and local installation increase data security. The DataHarmonizer was developed to address the data sharing...

Frontiers in Genetics

COVID-19 was declared to be a pandemic in March 2020 by the World Health Organization. Timely sharing of viral genomic sequencing data accompanied by a minimal set of contextual data is essential for informing regional, national, and international public health responses. Such contextual data is also necessary for developing and improving clinical therapies and vaccines, and enhancing the scientific community’s understanding of the SARS-CoV-2 virus. The Canadian COVID-19 Genomics Network (CanCOGeN) was launched in April 2020 to coordinate and upscale existing genomics-based COVID-19 research and surveillance efforts. CanCOGeN is performing large-scale sequencing of both the genomes of SARS-CoV-2 virus samples (VirusSeq) and affected Canadians (HostSeq). This paper addresses the privacy concerns associated with sharing the viral sequence data with a pre-defined set of contextual data describing the sample source and case attribute of the sequence data in the Canadian context. Curren...

An application ontology often reuses terms from other related, compatible, upper-level or domain-specific ontologies. The extent of this interconnectedness is not readily apparent when browsing through larger textual presentations of term class hierarchies, be it Manchester text format OWL files or as presented in an ontology editor like Stanford Protégé, where one either mentally notes the location or frequency of ontology prefixes in term identifiers as the encompassing ontology is browsed, or one selects an ontology import file to view individually, out of context of the whole. Interconnectedness may be easier to perceive in two-dimensional hierarchical graphs that visually code ontology term origins, but canvas size and multiple inheritance links that break tree layouts become challenging at scale. We present OntoTrek, a visualization tool that explores the benefits of interactive three-dimensional class hierarchy presentation. Our aim is to develop features, such as a consisten...

Ontologies are seen as a possible lingua franca for enabling data sharing across databases from different agencies where database content is actually about the same subject material, as arises, for example, in public health epidemiological food-borne disease outbreak investigations that cut across international borders and different levels of government. However, coming to consensus about how to encode database content according to OWL-driven vocabulary is not finished business. To explore a possible solution, the Hsiao Public Health Bioinformatics Lab has created a system to enable the crafting of ontology-driven data specifications in alignment with the Ontology for Biomedical Investigations data item and value specification framework. An initial vision was presented in an ICBO 2016 poster which led to funding for the effort, followed by a mid-project presentation at JOWO 2017, and now our working first release, called the Genomic Epidemiology Entity Mart (GEEM) portal, named after ...