Jean-christophe Gelly | Université Paris Cité
Papers by Jean-christophe Gelly
Briefings in Bioinformatics
In the era of constantly increasing amounts of available protein data, relevant and interpretable visualization becomes crucial, especially for tasks requiring human expertise. Poincaré disk projection has previously proven efficient for the visualization of biological data such as single-cell RNAseq data. Here, we develop a new method, PoincaréMSA, for visual representation of complex relationships between protein sequences based on Poincaré maps embedding. We demonstrate its efficiency and potential for visualization of protein family topology as well as evolutionary and functional annotation of uncharacterized sequences. PoincaréMSA is implemented in open source Python code with available interactive Google Colab notebooks as described at https://www.dsimb.inserm.fr/POINCARE_MSA.
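To make the geometry behind the Poincaré disk concrete, here is a minimal sketch of the standard hyperbolic distance between two points of the unit disk. It is not taken from the PoincaréMSA code; the function name `poincare_distance` and the 2D tuple representation are assumptions made for illustration only.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit disk.

    Uses the textbook formula
        d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    The name and the (x, y) tuple input are illustrative assumptions.
    """
    norm_u = u[0] ** 2 + u[1] ** 2
    norm_v = v[0] ** 2 + v[1] ** 2
    if norm_u >= 1.0 or norm_v >= 1.0:
        raise ValueError("both points must lie strictly inside the unit disk")
    diff = (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - norm_u) * (1.0 - norm_v)))


# Points near the border are far from the centre in hyperbolic terms,
# even though they look close to it on screen.
near_centre = poincare_distance((0.0, 0.0), (0.5, 0.0))
near_border = poincare_distance((0.0, 0.0), (0.9, 0.0))
```

Distances grow rapidly towards the border of the disk, which is the general property that lets hierarchical, tree-like data be laid out with little distortion in such embeddings.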
Methods in Molecular Biology
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2022
β-bulges are irregularities inside the β-sheets. They represent more than 3 percent of the protein residues, i.e., they are as frequent as 3₁₀ helices. In terms of evolution, β-bulges are not more conserved than any other local protein conformations within homologous protein structures. In a first-of-its-kind study, we have investigated the dynamical behaviour of β-bulges using the largest known set of protein molecular dynamics simulations. We observed that more than 50 percent of the existing β-bulges in protein crystal structures remained stable during dynamics, while more than 1/6th were not stable at all and disappeared entirely. Surprisingly, 1.1 percent of β-bulges that appeared remained stable. β-bulges have been categorized into different subtypes. The most common β-bulge types are the smallest insertions in β-strands (namely AC and AG); they are found to be as stable as the whole β-bulge dataset. Low-occurring types (namely PC and AS), which have the largest insertions, are significantly more stable than expected. Thus, this pioneering study allowed us to precisely quantify the stability of β-bulges, demonstrating their structural robustness, with a few unexpected cases raising structural questions.
Copyright information: Taken from "EvDTree: structure-dependent substitution profiles based on decision tree classification of 3D environments". BMC Bioinformatics 2005;6():4-4. Published online 10 Jan 2005. PMCID: PMC545998. Copyright © 2005 Gelly et al; licensee BioMed Central Ltd. Grey bars: Ala with pac > 38 and ss1 = 1; Black bars: Asp with pac > 20 and ss2 = 1. (B) Two profiles for the same amino acid (Leu) in different structural environments. Grey bars: Leu with pol > 53 and ss1 = 1; Black bars: Leu with pol > 53 and ss1 != 1.
Copyright information: Taken from "EvDTree: structure-dependent substitution profiles based on decision tree classification of 3D environments". BMC Bioinformatics 2005;6():4-4. Published online 10 Jan 2005. PMCID: PMC545998. Copyright © 2005 Gelly et al; licensee BioMed Central Ltd. Gap opening and gap extension penalties have been separately optimized for each scoring function.
Copyright information: Taken from "EvDTree: structure-dependent substitution profiles based on decision tree classification of 3D environments". BMC Bioinformatics 2005;6():4-4. Published online 10 Jan 2005. PMCID: PMC545998. Copyright © 2005 Gelly et al; licensee BioMed Central Ltd. The structural environment for a given cluster is defined by the edge labels along its path from the root cluster . For example, the nodes colored in gray indicate the partial classification path of a Leucine observed in a native structural environment whose descriptors , and verify = = 3%, = = 7% and = = 5.4A.
protein fold recognition with hybrid profiles combining sequence and structure evolution
This is an Open Access article distributed under the terms of the Creative Commons Attribution License
F1000Research, 2017
Most of the humoral response in Camelids is mediated by a special group of antibodies known as Heavy Chain only antibodies. The Variable region (VH of HCab-VHH) can be stably expressed and can by itself bind to the epitope, making it a promising agent in therapeutic/biotechnological applications. The architecture of VHH is similar to that of its VH counterparts, with alternating relatively less variable Framework Regions (FRs) and hypervariable Complementarity Determining Regions (CDRs). We use a fine description of local protein structures, namely a Structural Alphabet (SA), to perform the first extensive study on the structural diversity of VHH FRs, where we uncovered the existence of various structural clusters with unexpected variations.
Motivation: The object of this study is to propose a new method to identify small compact units that compose protein three-dimensional structures. These fragments, called ‘protein units (PU)’, are a new level of description to better understand and analyze the organization of protein structures. The method only works from the contact probability matrix, i.e. the inter-Cα distances translated into probabilities. It uses the principle of conventional hierarchical clustering, leading to a series of nested partitions of the 3D structure. Every step aims at optimally dividing a unit into 2 or 3 subunits according to a criterion called ‘partition index’, which assesses the structural independence of the newly defined subunits. Moreover, an entropy-derived squared correlation R is used for globally assessing the protein structure dissection. The method is compared to other splitting algorithms and shows relevant
Journal of Molecular Biology, 2021
Information on protein flexibility is essential to understand crucial molecular mechanisms such as protein stability, interactions with other molecules and protein functions in general. The B-factor obtained in X-ray crystallography experiments is the most common flexibility descriptor available for the majority of resolved protein structures. Since the gap between the number of resolved protein structures and available protein sequences is continuously growing, it is important to provide computational tools for protein flexibility prediction from amino acid sequence. In the current study, we report a Deep Learning based protein flexibility prediction tool, MEDUSA (https://www.dsimb.inserm.fr/MEDUSA). MEDUSA uses evolutionary information extracted from protein homologous sequences and amino acid physico-chemical properties as input for a convolutional neural network to assign a flexibility class to each protein sequence position. Trained on a non-redundant dataset of X-ray structures, MEDUSA provides flexibility prediction in two, three and five classes. MEDUSA is freely available as a web-server providing a clear visualization of the prediction results as well as a standalone utility (https://github.com/DSIMB/medusa). Analysis of the MEDUSA output allows a user to identify the potentially highly deformable protein regions and general dynamic properties of the protein.
Bioinformatics (Oxford, England), Aug 19, 2016
The experimental determination of membrane protein orientation within the lipid bilayer is extremely challenging, such that computational methods are most often the only solution. Moreover, obtaining all-atom 3D structures of membrane proteins is also technically difficult, and many of the available data are either experimental low-resolution structures or theoretical models, whose structural quality needs to be evaluated. Here, to address these two crucial problems, we propose OREMPRO, a web server capable of both (i) positioning α-helical and β-sheet transmembrane domains in the lipid bilayer and (ii) assessing their structural quality. Most importantly, OREMPRO uses only the alpha carbon coordinates, which makes it the only web server compatible with both high and low structural resolutions. Finally, OREMPRO is also able to process coarse-grained protein models, by using coordinates of backbone beads in place of alpha carbons.
Biochimie, 2015
Knowing the structure of a protein is essential to characterize its function and mechanism at the molecular level. Despite major advances in solving structures experimentally, most membrane protein native conformations remain unknown. This lack of available structures, along with the physical constraints imposed by the lipid bilayer environment, constitutes a difficulty for the modelling of membrane protein structures. Assessing the quality of membrane protein models is therefore critical. Using a non-redundant set of 66 membrane protein structures (41 alpha and 25 beta), we have developed an empirical energy function for the structural assessment of alpha-helical and beta-sheet transmembrane domains. This statistical potential quantifies the interatomic distance between residues located in the lipid bilayer. To minimize the problem of insufficient sampling, we have used kernel density estimations of the distance distributions. Following a leave-one-out cross-validation procedure, we show that our method outperforms current statistical potentials in discriminating correct from incorrect membrane protein models. Furthermore, the comparison of our distance-dependent statistical potential with one optimized on globular proteins provides insights into the rules by which residues interact within the lipid bilayer.
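The idea of a distance-dependent statistical potential built from kernel density estimates can be sketched as follows. This is only an illustration under stated assumptions, not the published energy function: the Gaussian kernel, the fixed bandwidth, the inverse-Boltzmann form E(d) = -ln(p_obs(d) / p_ref(d)), the choice of reference sample, and the names `gaussian_kde` and `statistical_potential` are all hypothetical.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated with a Gaussian kernel.

    The kernel and the fixed bandwidth are illustrative assumptions.
    """
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def statistical_potential(observed, reference, bandwidth=0.5, floor=1e-12):
    """Return E(d) = -ln(p_obs(d) / p_ref(d)) from two distance samples.

    `observed` holds distances for one residue pair type, `reference`
    holds distances for the chosen reference state (an assumption here).
    `floor` avoids taking the logarithm of zero in unsampled regions.
    """
    p_obs = gaussian_kde(observed, bandwidth)
    p_ref = gaussian_kde(reference, bandwidth)

    def energy(d):
        return -math.log(max(p_obs(d), floor) / max(p_ref(d), floor))

    return energy


# Toy distances in angstroms, invented purely for the example.
observed = [5.0, 5.2, 4.8, 5.1, 9.0]
reference = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
energy = statistical_potential(observed, reference)
```

With these toy samples, distances that are over-represented relative to the reference (around 5 Å) receive a negative, favourable energy, while under-represented ones (around 7 Å) receive a positive one. Smoothing with a kernel rather than binning is what limits the effect of sparse sampling.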
The noncoding genome plays an important role in de novo gene birth and in the emergence of genetic novelty. Nevertheless, how noncoding sequences’ properties could promote the birth of novel genes and shape the evolution and the structural diversity of proteins remains unclear. Therefore, by combining different bioinformatic approaches, we characterized the fold potential diversity of the amino acid sequences encoded by all intergenic ORFs (Open Reading Frames) of S. cerevisiae with the aim of (i) exploring whether the large structural diversity observed in proteomes is already present in noncoding sequences, and (ii) estimating the potential of the noncoding genome to produce novel protein bricks that can either give rise to novel genes or be integrated into pre-existing proteins, thus participating in protein structure diversity and evolution. We showed that amino acid sequences encoded by most yeast intergenic ORFs contain the elementary building blocks of protein structures. Mor...
EGFR plays key roles in multiple cellular processes such as cell differentiation, cell proliferation, migration and epithelia homeostasis. Phosphorylation of the receptor, intracellular signaling and trafficking are major events regulating EGFR functions. Galectin-7, a soluble lectin expressed in epithelia such as the skin, has been shown to be involved in cell differentiation. Through this study we demonstrate that galectin-7 regulates EGFR function by a direct interaction with its extracellular domain hence modifying its downstream signaling and endocytic pathway. From observations in mice we focused on the molecular mechanisms deciphering the glycosylation dependent interaction between EGFR and galectin-7. Interestingly, we also revealed that galectin-7 is a direct binder of both EGFR and E-cadherin bridging them together. Strikingly this study not only deciphers a new molecular mechanism of EGFR regulation but also points out a novel molecular interaction between EGFR an...
Analysis of the architecture and organization of protein structures is a major challenge to better understand protein flexibility, folding, functions and interactions with their partners and to design new drugs. Protein structures are often described as series of α-helices and β-sheets, or at a higher level as an arrangement of protein domains. Due to the lack of an intermediate vision which could give a good understanding and description of protein structure architecture, we have proposed a novel intermediate view, the Protein Units (PUs). They are a novel level of protein structure description between secondary structures and domains. A PU is defined as a compact sub-region of the 3D structure corresponding to one sequence fragment, defined by a high number of intra-PU contacts and a low number of inter-PU contacts. The methodology to obtain PUs from the protein structures is named Protein Peeling (PP). For the algorithm, the protein structures are described as a succession of Cα. The distances between Cα are translated into contact probabilities using a logistic function. Protein Peeling only uses this contact probability matrix. An optimization procedure, based on the Matthews correlation coefficient (MCC) between contact probability sub-matrices, defines optimal cutting points that separate the region examined into two or three PUs. The process is iterated until the compactness of the resulting PUs reaches a given limit. An index assesses the compactness quality and relative independence of each PU. Protein Peeling is a tool to better understand and analyze the organization of protein structures. We have developed a dedicated bioinformatic web server: Protein Peeling 2 (PP2). Given the 3D coordinates of a protein, it proposes an automatic identification of protein units (PUs).
The interface component consists of a web page (HTML) and a common gateway interface (CGI). The user can set many parameters and upload a given structure in PDB file format to a Perl core instance. This last component is a module that embeds all the information necessary for two other programs (mainly coded in C to perform most of the computation tasks, and R for the analysis). Results are given both textually and graphically using the Jmol applet and PyMOL software. The server can be accessed from http://www.dsimb.inserm.fr/dsimb_tools/peeling/. Only one equivalent online methodology is available.
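The translation of Cα distances into contact probabilities described above can be sketched as follows. This is an illustration under stated assumptions, not the Protein Peeling implementation: the logistic parameters `d0` and `delta`, the function names, and the simple intra/inter sums used to score a cut are hypothetical, and the sketch does not reproduce the MCC-based criterion of the method.

```python
import math


def contact_probabilities(coords, d0=8.0, delta=1.5):
    """Build the matrix p[i][j] = 1 / (1 + exp((d_ij - d0) / delta)).

    `coords` is a list of (x, y, z) C-alpha positions in angstroms.
    `d0` and `delta` are assumed values chosen for illustration.
    """
    n = len(coords)
    probs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            exponent = (math.dist(coords[i], coords[j]) - d0) / delta
            # Very distant pairs would overflow exp(); their probability is ~0.
            probs[i][j] = 0.0 if exponent > 700.0 else 1.0 / (1.0 + math.exp(exponent))
    return probs


def split_contacts(probs, cut):
    """Sum contact probabilities within and between [0, cut) and [cut, n).

    A simplified stand-in for scoring one candidate cutting point.
    """
    n = len(probs)
    intra = inter = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if (i < cut) == (j < cut):
                intra += probs[i][j]
            else:
                inter += probs[i][j]
    return intra, inter


# Toy chain: two groups of three residues separated by a long gap.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (30.0, 0.0, 0.0), (33.8, 0.0, 0.0), (37.6, 0.0, 0.0)]
probs = contact_probabilities(coords)
intra, inter = split_contacts(probs, cut=3)
```

For this toy chain, cutting between the third and fourth residues leaves almost all contact probability inside the two fragments, which matches the intuition of a unit with many internal contacts and few external ones.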