# MEGGASENSE - The Metagenome/Genome Annotated Sequence Natural Language Search Engine: A Platform for the Construction of Sequence Data Warehouses

CAZymes Analysis Toolkit (CAT): Web service for searching and analyzing carbohydrate-active enzymes in a newly sequenced organism using CAZy database

Glycobiology, 2010

The Carbohydrate-Active Enzyme (CAZy) database provides a rich set of manually annotated enzymes that degrade, modify, or create glycosidic bonds. Despite rich and invaluable information stored in the database, software tools utilizing this information for annotation of newly sequenced genomes by CAZy families are limited. We have employed two annotation approaches to fill the gap between manually curated high-quality protein sequences collected in the CAZy database and the growing number of other protein sequences produced by genome or metagenome sequencing projects. The first approach is based on a similarity search against the entire nonredundant sequences of the CAZy database. The second approach performs annotation using links or correspondences between the CAZy families and protein family domains. The links were discovered using the association rule learning algorithm applied to sequences from the CAZy database. The approaches complement each other and in combination achieved high specificity and sensitivity when cross-evaluated with the manually curated genomes of Clostridium thermocellum ATCC 27405 and Saccharophagus degradans 2-40. The capability of the proposed framework to predict the function of unknown protein domains and of hypothetical proteins in the genome of Neurospora crassa is demonstrated. The framework is implemented as a Web service, the CAZymes Analysis Toolkit, and is available at http://cricket.ornl.gov/cgi-bin/cat.cgi.
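The link-discovery step described in this abstract can be illustrated with a minimal sketch of association rule scoring between protein family domains and CAZy families. The toy records, the thresholds, and the function name `mine_rules` are assumptions made for illustration; this is not the CAT implementation, and support is simply counted as the number of proteins carrying both labels.

```python
from collections import Counter


def mine_rules(records, min_support=2, min_confidence=0.8):
    """Score rules of the form 'domain -> family'.

    records: iterable of (set_of_domains, set_of_families), one pair per protein.
    Returns a sorted list of (domain, family, support, confidence).
    """
    domain_counts = Counter()
    pair_counts = Counter()
    for domains, families in records:
        for domain in domains:
            domain_counts[domain] += 1
            for family in families:
                pair_counts[(domain, family)] += 1

    rules = []
    for (domain, family), support in pair_counts.items():
        # Confidence: fraction of proteins with this domain that also carry the family.
        confidence = support / domain_counts[domain]
        if support >= min_support and confidence >= min_confidence:
            rules.append((domain, family, support, confidence))
    return sorted(rules)


# Hypothetical toy annotations (identifiers are illustrative only).
records = [
    ({"PF00150"}, {"GH5"}),
    ({"PF00150", "PF00553"}, {"GH5", "CBM2"}),
    ({"PF00150"}, {"GH5"}),
    ({"PF00553"}, {"CBM2"}),
    ({"PF00331"}, {"GH10"}),
]
rules = mine_rules(records)
for domain, family, support, confidence in rules:
    print(f"{domain} -> {family}  support={support}  confidence={confidence:.2f}")
```

A rule that passes both thresholds could then be used to propose a CAZy family for a new protein carrying the corresponding domain; the single-protein GH10 pair above is filtered out by the support threshold.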

Data mining of metagenomes to find novel enzymes: a non-computationally intensive method

3 Biotech

Currently, there is a need for non-computationally-intensive bioinformatics tools to cope with the increase of large datasets produced by Next Generation Sequencing technologies. We present a simple and robust bioinformatics pipeline to search for novel enzymes in metagenomic sequences. The strategy is based on pattern searching, using as reference conserved motifs coded as regular expressions. As a case study, we applied this scheme to search for novel S8A proteases in a publicly available metagenome. Briefly, (1) the metagenome was assembled and translated into amino acids; (2) patterns were matched using regular expressions; (3) retrieved sequences were annotated; and (4) diversity analyses were conducted. Following this pipeline, we were able to identify nine sequences containing an S8 catalytic triad, starting from a metagenome containing 9,921,136 Illumina reads. Identity of these nine sequences was confirmed by BLASTp against databases at NCBI and MEROPS. Identities ranged from 62 to 89% to their respective nearest ortholog, which belonged to phyla Proteobacteria, Actinobacteria, Planctomycetes, Bacteroidetes, and Cyanobacteria, consistent with the most abundant phyla reported for this metagenome. All these results support the idea that they are all novel S8 sequences and strongly suggest that our methodology is robust and suitable to detect novel enzymes.
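Step (2) of this pipeline, matching conserved motifs coded as regular expressions, might look like the following sketch. The three patterns and the toy sequences are hypothetical placeholders, not the motifs used in the study; the only idea taken from the abstract is that a sequence is retained when all motifs of the catalytic triad are found.

```python
import re

# Hypothetical motif patterns for illustration only (not the study's patterns).
MOTIFS = [
    ("Asp", re.compile(r"D[ST]G")),
    ("His", re.compile(r"HG[ST]")),
    ("Ser", re.compile(r"G[TS]S[MA]A")),
]


def find_triad(protein):
    """Return {motif: start} if all motifs occur in the listed order, else None."""
    positions = {}
    search_from = 0
    for name, pattern in MOTIFS:
        match = pattern.search(protein, search_from)
        if match is None:
            return None
        positions[name] = match.start()
        search_from = match.end()  # the next motif must lie downstream
    return positions


def scan(proteins):
    """proteins: {sequence_id: amino-acid string}. Return hits keyed by sequence id."""
    hits = {}
    for seq_id, protein in proteins.items():
        triad = find_triad(protein)
        if triad is not None:
            hits[seq_id] = triad
    return hits


# Toy translated sequences (made up).
proteins = {
    "orf_1": "MKKADTGIAAAHGTHVAGTSMAAPLE",
    "orf_2": "MKGTSMAAHGTDTG",  # motifs present but in the wrong order
    "orf_3": "MAAAKLLPQ",
}
hits = scan(proteins)
print(hits)
```

Requiring the motifs in a fixed order is a design choice of this sketch; sequences retained this way would still need the annotation and BLASTp confirmation steps described above.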

MetaBioME: a database to explore commercially useful enzymes in metagenomic datasets

Nucleic Acids Research, 2010

Microbial enzymes have many known applications as biocatalysts in biotechnology, agriculture, medical and other industries. However, only a few enzymes are currently employed for such commercial applications. In this scenario, the current onslaught of metagenomic data provides a new unexplored treasure trove of genomic wealth that can not only enhance the enzyme repertoire by the discovery of novel commercially useful enzymes (CUEs) but can also reveal better functional variants for existing CUEs. We prepared a catalogue of CUEs using text mining of PubMed abstracts and other publicly available information, and manually curated the data to identify 510 CUEs. Further, in order to identify novel homologues of these CUEs, we identified potential ORFs in publicly available metagenomic datasets from 10 diverse sources. Using this strategy, we have developed a resource called MetaBioME (http://metasystems.riken.jp/metabiome/) that comprises (i) a database of CUEs and (ii) a comprehensive platform to facilitate homology-based computational identification of novel homologous CUEs from metagenomic and bacterial genomic datasets. Using MetaBioME, we have identified several novel homologues to known CUEs that can potentially serve as leads for further experimental verification.

Enzyme-specific profiles for genome annotation: PRIAM

Nucleic Acids Research, 2003

The advent of fully sequenced genomes opens the ground for the reconstruction of metabolic pathways on the basis of the identification of enzyme-coding genes. Here we describe PRIAM, a method for automated enzyme detection in a fully sequenced genome, based on the classification of enzymes in the ENZYME database. PRIAM relies on sets of position-specific scoring matrices ('profiles') automatically tailored for each ENZYME entry. Automatically generated logical rules define which of these profiles is required in order to infer the presence of the corresponding enzyme in an organism. As an example, PRIAM was applied to identify potential metabolic pathways from the complete genome of the nitrogen-fixing bacterium Sinorhizobium meliloti. The results of this automated method were compared with the original genome annotation and visualised on KEGG graphs in order to facilitate the interpretation of metabolic pathways and to highlight potentially missing enzymes.
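The idea of logical rules over profile hits can be sketched as follows. The rule shape (a list of alternatives, each a set of profiles that must all be detected), the profile names, and the enzyme labels are assumptions for illustration; the abstract does not specify how PRIAM encodes its rules.

```python
def enzyme_present(rule, detected_profiles):
    """Evaluate a rule written as an OR of ANDs.

    rule: list of alternatives; each alternative is a set of profile IDs
          that must all be detected for that alternative to hold.
    detected_profiles: set of profile IDs with a significant hit in the genome.
    """
    return any(alternative <= detected_profiles for alternative in rule)


# Hypothetical rules and profile identifiers.
RULES = {
    "enzyme_A": [{"P1"}, {"P2", "P3"}],  # P1 alone, or P2 together with P3
    "enzyme_B": [{"P4", "P5"}],          # both P4 and P5 are required
    "enzyme_C": [{"P6"}],
}

detected = {"P2", "P3", "P4"}
predicted = sorted(name for name, rule in RULES.items() if enzyme_present(rule, detected))
print(predicted)
```

Enzymes whose rules fail by a single missing profile (here `enzyme_B`) are natural candidates to flag as potentially missing when the predictions are laid over a pathway map.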

eCAMI: simultaneous classification and motif identification for enzyme annotation

Bioinformatics

Motivation: Carbohydrate-active enzymes (CAZymes) are extremely important to bioenergy, human gut microbiome, and plant pathogen research and industries. Here we developed a new amino acid k-mer-based CAZyme classification, motif identification and genome annotation tool using a bipartite network algorithm. Using this tool, we classified 390 CAZyme families into thousands of subfamilies, each with distinguishing k-mer peptides. These k-mers represented the characteristic motifs (in the form of a collection of conserved short peptides) of each subfamily, and thus were further used to annotate new genomes for CAZymes. This idea was also generalized to extract characteristic k-mer peptides for all the Swiss-Prot enzymes classified by the EC (enzyme commission) numbers and applied to enzyme EC prediction. Results: This new tool was implemented as a Python package named eCAMI. Benchmark analysis of eCAMI against the state-of-the-art tools on CAZyme and enzyme EC datasets found that: (i) e...
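As a rough illustration of annotating with characteristic k-mer peptides, the sketch below keeps the k-mers shared by every member of a group and absent from all other groups, then assigns a query to the group with which it shares the most such k-mers. This is a deliberately simplified stand-in, not eCAMI's bipartite network algorithm; the function names, toy sequences, and the `min_hits` threshold are assumptions.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def characteristic_kmers(groups, k):
    """groups: {name: [sequences]}.

    Returns {name: k-mers present in every member and in no other group}.
    """
    shared, pooled = {}, {}
    for name, seqs in groups.items():
        sets = [kmers(s, k) for s in seqs]
        shared[name] = set.intersection(*sets)
        pooled[name] = set.union(*sets)
    result = {}
    for name in groups:
        others = set().union(*(pooled[n] for n in groups if n != name))
        result[name] = shared[name] - others
    return result


def classify(query, char_kmers, k, min_hits=1):
    """Assign query to the group sharing the most characteristic k-mers, or None."""
    query_kmers = kmers(query, k)
    scores = {name: len(query_kmers & ks) for name, ks in char_kmers.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None


# Toy groups with made-up peptide sequences.
groups = {
    "famA": ["ACDEFGHIK", "LLCDEFGMM"],
    "famB": ["WWYYHHNNQ", "RRYYHHNPP"],
}
char = characteristic_kmers(groups, k=4)
print(char)
print(classify("TTCDEFGTT", char, k=4))
```

Requiring a k-mer to appear in every member is stricter than a real tool would likely be; a frequency threshold would be the obvious relaxation.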

Natural Product Discovery through Improved Functional Metagenomics in Streptomyces

Journal of the American Chemical Society, 2016

Because the majority of environmental bacteria are not easily culturable, access to many bacterially encoded secondary metabolites will be dependent on the development of improved functional metagenomic screening methods. In this study, we examined a collection of diverse Streptomyces species for the best innate ability to heterologously express biosynthetic gene clusters. We then optimized methods for constructing high quality metagenomic cosmid libraries in the best Streptomyces host. An initial screen of a 1.5 million-membered metagenomic library constructed in Streptomyces albus, the species that exhibited the highest propensity for heterologous expression of gene clusters, led to the identification of the novel natural product metatricycloene (1). Metatricycloene is a tricyclic polyene encoded by a reductive, iterative polyketide-like gene cluster. Related gene clusters found in sequenced genomes appear to encode a largely unexplored collection of structurally diverse, polyene-based metabolites.

GeneHunt for rapid domain-specific annotation of glycoside hydrolases

The identification of glycoside hydrolases (GHs) for efficient polysaccharide deconstruction is essential for the development of biofuels. Here, we investigate the potential of sequential HMM-profile identification for the rapid and precise identification of the multi-domain architecture of GHs from various datasets. First, as a validation, we successfully reannotated >98% of the biochemically characterized enzymes listed on the CAZy database. Next, we analyzed the 43 million non-redundant sequences from the M5nr data and identified 322,068 unique GHs. Finally, we searched 129 assembled metagenomes retrieved from MG-RAST for environmental GHs and identified 160,790 additional enzymes. Although most identified sequences corresponded to single domain enzymes, many contained several domains, including known accessory domains and some domains never identified in association with GH. Several sequences displayed multiple catalytic domains and a few of these potential multi-activity proteins combined potentially synergistic domains. Finally, we produced and confirmed the biochemical activities of a GH5-GH10 cellulase-xylanase and a GH11-CE4 xylanase-esterase. Globally, this "gene to enzyme pipeline" provides a rationale for mining large datasets in order to identify new catalysts combining unique properties for the efficient deconstruction of polysaccharides.

Glycoside Hydrolases (GHs) are Carbohydrate-Active Enzymes (CAZy) that catalyze the hydrolysis of the glycosidic linkage in polysaccharides (e.g., cellulose, chitin) and oligosaccharides (e.g., cellobiose, chitobiose) 1. GHs are found as single domain proteins (SDGHs) or associated with accessory domains such as carbohydrate binding modules (CBMs) within multi-domain GHs (MDGHs) 2. In MDGHs, CBMs enhance enzyme-substrate interaction by anchoring the catalytic domain to the substrate 3.
The anchoring reduces diffusion from the substrate and locally increases the concentration of catalytic domains 4, thus improving the overall polysaccharide degradation 3. GHs support essential processes for ecosystem function and for biotechnology. Among others, in land ecosystems, the deconstruction of plant biomass by microbial GHs is essential 5,6, whereas the breakdown of chitin, from arthropods and fungi, is important in both marine 7,8 and terrestrial ecosystems 9-12. Next, in the gut of animals, microbial GHs target polysaccharides, supplement the lack of endogenous enzymes 13,14 and thus contribute to the processing of complex carbohydrates during digestion 15-18. Finally, GHs are essential for the biofuel industry, as plant-based polysaccharides constitute a major source of sustainable and renewable material capable of providing liquid transportation fuel 19-22. Many GH-genes and proteins have been identified in a growing number of sequenced genomes and environmental samples thanks to the use of activity-driven screening 23,24 and bioinformatic annotation systems 12,16,18,25. The precise identification of GH-genes and proteins is essential in order to understand how microbes support key functions across ecosystems 17,25,26 and to identify new enzymes for biotechnological application 18,21,27. In order to identify new catalysts for biomass degradation, we examined the performance of sequential Hidden Markov Model (HMM) identifications 28 combined with publicly accessible HMM-profiles from the PFam database 29, here referred to as the GeneHunt approach 2,30, to detect GH-sequences and investigate their detailed architecture (i.e., the precise domain organization of MDGHs) 2. More precisely, we first validated the GeneHunt approach by re-annotating the biochemically characterized GHs listed on the CAZy database (as of June 2018) 1.
As described for cellulases, xylanases, and chitinases 30, we expected the PFam-based annotation to correctly identify most of the proteins from the major GH families, although rare and recently identified GH-families would display inconsistencies. Next, we identified GHs in the M5nr database (version 13.12.15) containing 43,098,145 non-redundant, mostly microbial, protein sequences 31. This collection of sequences derived from the major sequence database serves as the reference database for the MG-RAST annotation pipeline 31,32. Finally, we identified the detailed multi-domain architecture of GH proteins in assembled, publicly accessible, metagenomes from
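To make the notion of a "detailed multi-domain architecture" concrete, the sketch below turns a list of domain hits on one protein into an ordered architecture string such as GH5-GH10. The hit tuple layout, the greedy score-based overlap resolution, and the example coordinates and scores are assumptions for illustration; they are not taken from the GeneHunt description.

```python
def architecture(hits):
    """Build an N- to C-terminal domain architecture string.

    hits: list of (domain, start, end, score) for a single protein.
    Overlapping hits are resolved greedily in favour of the higher score.
    """
    kept = []
    for hit in sorted(hits, key=lambda h: h[3], reverse=True):
        _, start, end, _ = hit
        # Keep the hit only if it does not overlap any already accepted hit.
        if all(end < k_start or start > k_end for _, k_start, k_end, _ in kept):
            kept.append(hit)
    kept.sort(key=lambda h: h[1])  # order the domains along the sequence
    return "-".join(domain for domain, _, _, _ in kept)


# Made-up hits: two catalytic domains plus a weaker hit overlapping the first.
hits = [
    ("GH5", 30, 310, 250.0),
    ("GH10", 350, 640, 310.5),
    ("GH26", 40, 300, 80.2),
]
arch = architecture(hits)
print(arch)
```

Grouping proteins by such strings is one simple way to count single-domain versus multi-domain enzymes and to spot proteins carrying more than one catalytic domain.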

Microbial metagenomes: moving forward industrial biotechnology

Journal of Chemical Technology & Biotechnology, 2007

Biotechnology, in terms of exploitation of catalytic activities for industrial applications, is increasingly recognized as one of the pillars of the knowledge-based economy that we are heading for. Comprehensive knowledge of enzymology should be of practical importance for effective intervention on whole cell processes and enzymatic networks. Over the last decade metagenome-based technologies have been developed to take us farther and deeper into the enzyme universe from uncultivable microbes. This sophisticated platform, which identifies new enzymes from vast genetic pools available, and assesses their potential for novel chemical applications, should be increasingly important in the discovery of advanced biotechnological resources.

Carbohydrate actives enzymes derived from metagenomes: from microbial ecology to enzymology

"In the last ten years, the intensive mining of various environmental metagenomes has led to the discovery of numerous new genes and corresponding putative enzymes. Some enzymes were isolated for their ability to hydrolyze carbohydrates, including starch, xylan, chitin and cellulose derivatives. The accurate characterization of these proteins highlights their variability and their biophysical adaptation in order to cope with specific environmental conditions. In this perspective, the sampling of extreme environments for metagenomic library construction resulted in the isolation of enzymes harbouring tailor made properties aimed at their implementation in various industrial processes. Although these new catalysts appear to be of particular interest for biotechnological applications, little is known about their physiological functions in their natural host. In the field of glycosides hydrolases, different functions have been suggested including both hydrolysis and synthesis of polymers. On the one hand, indeed in the environment microorganisms compete for ecological niches by producing enzymes active against vast ranges of substrates which allow them to thrive on various carbon sources. On the other hand, production of structural (cellulose) or reserve (glycogen) polymers by bacteria such as Gluconacetobacter sp. was reported. Polysaccharides can be associated with bacterial biofilm and feed stock, compounds that are required for bacteria to live in various environments. Interestingly, the synthesis of these polymers requires enzymes which act on carbohydrate including enzymes referred to as glycoside hydrolases acting as transglycosylases. In this chapter, a review of the representative glycoside hydrolases isolated by metagenomic and their possible physiological functions are presented."

Exploration of metagenomes for new enzymes useful in food biotechnology - a review

Polish Journal of Food and Nutrition Sciences, 2008

Metagenomics is the genomic analysis of the collective genomes of an assemblage of organisms, or the metagenome. Metagenomic analysis has been applied to diverse problems in microbiology and has yielded insight into the physiology of uncultured organisms to access the potentially useful enzymes and secondary metabolites they produce. DNA isolation methods have to be strictly adapted to the type of isolated biological material; of great importance is also the size of the obtained DNA. Small DNA fragments are usually sufficient for an analysis of individual genes or their small groups, whereas large inserts are required for analysing metabolic pathways, genome structures or sequencing large DNA fragments. There are two types of methods of extracting genomic DNA. One of them consists of the direct extraction of nucleic acids from an environmental sample after the cell lysis (in situ), followed by purification of the obtained DNA. The other method consists of the separation of bacterial...