Wim Verleyen - Academia.edu
Papers by Wim Verleyen
Table of contents (excerpt):
Morphology during tumour invasion
4.1 Tumour invasion
4.2 Automated image analysis
4.2.1 Morphological measures
4.3 Bayesian network
4.4 Discriminative capacity of morphological measures
4.4.1 Cell-cell contact
4.4.2 Group area
4.4.3 Surface roughness
4.4.4 Length/width ratio
4.5 Discussion
Conclusion
5.1 Contributions
5.1.1 Biomarker discovery for ovarian cancer
5.1.2 Tumour invasion
5.2 Future work
5.2.1 Biomarker discovery for ovarian cancer
5.2.2 Tumour invasion
Bibliography
Appendix A: data sets
7.1 Edinburgh Ovarian Cancer Register (EOCR)
7.1.1 Original data set for feature selection and training set for 1YM-PFS and 3YM-OS classifiers
7.1.2 Additional validation data set for the validation of 1YM-PFS and 3YM-OS classifiers
7.2 Tumour invasion data set
Index

List of figures (excerpt):
2.14 Analysis of the proportional hazards assumption with the cox.zph function in R. The flatness of the fitted line illustrates that the Stage parameter does not violate the proportional hazards assumption over the survival time (Time).
2.15 Forward phase protein array and reverse phase protein array have a different configuration of analytes and antibodies.
2.16 An example of an RPPA two-by-two grid plate. A set of 9 proteins is measured for a time series with 17 intervals. Each dot on the figure represents the expression of an antibody of a corresponding target.
2.17 Immunofluorescence images of a tissue microarray assay (Blue = DAPI nuclei; Green = cytokeratin tumour mask; Red = antibody-conjugated fluorophores) (From Fig. 1 of Faratian et al. [Faratian et al., 2011]).
3.1 Frequencies of stages and histological types in the data. Not all combinations of stage and histological type are equally distributed in this data set. Later-stage ovarian carcinomas have a higher frequency than early-stage ovarian cancers.
3.2 The difference in time between overall survival (OS) and progression-free survival (PFS) (Dx, Sx: date of histological diagnosis; CRx: date of treatment diagnosis; Re/Prog: date of first signs of disease recurrence; DLS/Death: date of death from any cause).
3.3 A survival function for progression-free survival (PFS) and overall survival (OS) for patients under different treatment regimens (Regimen 1: platinum; Regimen 2: platinum combined with taxane).
3.4 The biological circuit of known interactions active in the proteomics profile for ovarian cancer [Hanahan and Weinberg, 2000; Hanahan and Weinberg, 2011].
3.5 Bayesian network of the proteomics profile.
3.6 Bayesian network of the clinicopathological measurements.
3.7 Three-layered Bayesian network with a first layer of clinicopathological measurements, a second layer of candidate proteomic biomarkers, and a third layer of progression-free survival (PFS) and overall survival (OS) outputs.
3.8 Scheme of the discriminative machine learning methodologies used during this research: survival analysis and classification. First, feature selection is executed on the clinicopathological data, the proteomics data, and the combination of both. The selected features are plugged into the survival analysis and into the classification model. The survival model is verified with the following performance measures: c-index, p-value of a Monte Carlo experiment, and shrinkage. The classification models are verified with the area under the ROC curve (AUC), a hybrid metric (SAR), and the precision-recall F-measure (F).
3.9 These block schemes provide an overview of the selected features and the performance measures (10-fold cross-validated c-index, p-value of the Monte Carlo experiment, and shrinkage) for the Cox proportional hazards regression models for PFS.
3.10 The Monte Carlo distribution and the performance measures (10-fold cross-validated c-index, p-value of the Monte Carlo experiment, and shrinkage) for the Cox proportional hazards regression models for PFS.
3.11 These block schemes provide an overview of the selected features and the performance measures (10-fold cross-validated c-index, p-value of the Monte Carlo experiment, and shrinkage) for the Cox proportional hazards regression models for OS.
3.12 The Monte Carlo distribution and the performance measures (10-fold cross-validated c-index, p-value of the Monte Carlo experiment, and shrinkage) for the Cox proportional hazards regression models for OS.
3.13 The block scheme and the Monte Carlo distribution of the Cox proportional hazards regression model illustrate the performance for predicting PFS in the case of papillary serous in stage 3.
3.14 The block scheme and the Monte Carlo distribution of the Cox proportional hazards regression model illustrate the performance for predicting PFS in the case of papillary serous in stage 4.
3.15 The block scheme and the Monte Carlo distribution of the Cox proportional hazards regression model illustrate the performance for predicting PFS in the case of endometrioid in stage 3.
3.16 The block scheme and the Monte Carlo distribution of the Cox proportional hazards regression model illustrate the performance for predicting PFS in the case of mullerian in stage 3.
3.17 The block scheme and the Monte Carlo distribution of the Cox proportional hazards regression model illustrate the performance for predicting OS in the case of papillary serous in stage 3.
3.18 The block scheme and the Monte Carlo distribution of the Cox proportional hazards regression model illustrate the performance for predicting OS in the case of papillary serous in stage 4.
3.19 The block scheme and the Monte Carlo distribution of the Cox proportional hazards regression model illustrate the performance for predicting OS in the case of endometrioid in stage 3.
3.20 The block scheme and the Monte Carlo distribution of the Cox proportional hazards regression model illustrate the performance for predicting OS in the case of mixed mullerian in stage 3.
3.21 Performance measures (AUC, F-measure, and SAR) for the 1YM-PFS classifier constructed with logistic regression, Cox proportional hazards regression, and support vector machines.
3.22 Performance measures (AUC, F-measure, and SAR) for the 1YM-PFS classifier constructed with logistic regression, Cox proportional hazards regression, and support vector machines. The classifiers are constructed based on the clinicopathological data segmentation for papillary serous in stage 3 and stage 4, endometrioid in stage 3, and mixed mullerian in stage 3.
3.23 Performance measures (AUC, F-measure, and SAR) for the 3YM-OS classifier constructed with logistic regression, Cox proportional hazards regression, and support vector machines.
3.24 Performance measures (AUC, F-measure, and SAR) for the 3YM-OS classifier constructed with logistic regression, Cox proportional hazards regression, and support vector machines. The classifiers are constructed based on the clinicopathological data segmentation for papillary serous in stage 3 and stage 4, endometrioid in stage 3, and mixed mullerian in stage 3.
3.25 The ROC and precision-recall plots for the classification model (1YM-PFS) after 10-fold cross-validation.
3.26 The ROC and precision-recall plots for the classification model (1YM-PFS) after validation with a separate data set.
3.27 The ROC and precision-recall plots for the classification model (3YM-OS) after 10-fold cross-validation.
3.28 The ROC and precision-recall plots for the classification model (3YM-OS) after validation with a separate data set.
4.1 Boxplot for the different invasion types per cell line (C35pool and C35hi) [Katz et al., 2011].
4.2 A fluorescent-stained image from the invasion assay. Pan-cytokeratin rabbit polyclonal antibody is used to select epithelial cells, and visualization is performed by anti-rabbit-Cy3. DAPI counterstain was used to identify nuclei.
4.3 The cognition network technology (CNT) applied for the detection of tumours in the invasion assay.
4.4 Bayesian network constructs a graph of statistical dependencies between morphological measures and tumour invasion types.
4.5 Histogram of cell-cell contact for the comparison between individual and collective invasion.
4.6 Histogram of group area for the comparison between individual and collective invasion.
4.7 [caption not preserved]
4.8 Histogram of roughness for the comparison between individual and collective invasion.
4.9 Histogram of length/width ratio for the comparison between individual and collective invasion.
2022 IEEE Aerospace Conference (AERO)
This framework enables C-suite executive leaders to define a business plan and manage technological dependencies for building AI/ML Solutions. The business plan of this framework provides components and background information to define strategy and analyze cost. Furthermore, the business plan represents the fundamentals of AI/ML Innovation and AI/ML Solutions. Therefore, the framework provides a menu for managing and investing in AI/ML. Finally, this framework is constructed with an interdisciplinary and holistic view of AI/ML Innovation and builds on advances in business strategy in harmony with technological progress for AI/ML. This framework incorporates value chain, supply chain, and ecosystem strategies.
Bioinformatics, 2014
Motivation: Network-based gene function inference methods have proliferated in recent years, but measurable progress remains elusive. We wished to better explore performance trends by controlling data and algorithm implementation, with a particular focus on the performance of aggregate predictions. Results: Hypothesizing that popular methods would perform well without hand-tuning, we used well-characterized algorithms to produce verifiably 'untweaked' results. We find that most state-of-the-art machine learning methods obtain 'gold standard' performance as measured in critical assessments in defined tasks. Across a broad range of tests, we see close alignment in algorithm performances after controlling for the underlying data being used. We find that algorithm aggregation provides only modest benefits, with a 17% increase in area under the ROC (AUROC) above the mean AUROC. In contrast, data aggregation gains are enormous with an 88% improvement in mean AUROC. Altogether, we find substantial evidence to support the view that additional algorithm development has little to offer for gene function prediction. Availability and implementation: The supplementary information contains a description of the algorithms, the network data parsed from different biological data resources and a guide to the source code.
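As a rough illustration of the quantities named in this abstract, the sketch below computes AUROC from ranks (the Mann-Whitney formulation) and aggregates several algorithms by averaging their per-gene ranks. Rank averaging is an assumption made here for illustration; the abstract does not say how the predictions were aggregated, and the function names and toy scores are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks of values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def auroc(scores, labels):
    """AUROC via the rank-sum statistic; labels are 1 (positive) or 0 (negative)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    ranks = average_ranks(scores)
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def aggregate_by_rank(score_lists):
    """Hypothetical aggregation: mean of each item's rank across algorithms."""
    ranked = [average_ranks(scores) for scores in score_lists]
    return [sum(col) / len(col) for col in zip(*ranked)]


# Toy example: two made-up algorithms scoring four genes.
labels = [1, 1, 0, 0]
algo_a = [0.9, 0.2, 0.8, 0.1]
algo_b = [0.3, 0.9, 0.4, 0.1]
combined = aggregate_by_rank([algo_a, algo_b])
print(auroc(algo_a, labels), auroc(algo_b, labels), auroc(combined, labels))
```

Working on ranks rather than raw scores keeps algorithms with different score scales comparable before they are combined.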
Systems pathology attempts to introduce more holistic approaches towards pathology and attempts to integrate clinicopathological information with “-omics” technology. This doctorate researches two examples of a systems approach for pathology: (1) a personalized patient output prediction for ovarian cancer and (2) an analytical approach that differentiates between individual and collective tumour invasion. During the personalized patient output prediction for ovarian cancer study, clinicopathological measurements and proteomic biomarkers are analysed with a set of newly engineered bioinformatic tools. These tools are based upon feature selection, survival analysis with Cox proportional hazards regression, and a novel Monte Carlo approach. Clinical and pathological data prove to have highly significant information content, as expected; however, molecular data have little information content alone, and are only significant when selected most-informative variables are placed in the context of...
Bioinformatics, Dec 14, 2015
Motivation: Gene networks have become a central tool in the analysis of genomic data but are widely regarded as hard to interpret. This has motivated a great deal of comparative evaluation and research into best practices. We explore the possibility that this may lead to overfitting in the field as a whole. Results: We construct a model of 'research communities' sampling from real gene network data and machine learning methods to characterize performance trends. Our analysis reveals an important principle limiting the value of replication, namely that targeting it directly causes 'easy' or uninformative replication to dominate analyses. We find that when sampling across network data and algorithms with similar variability, the relationship between replicability and accuracy is positive (Spearman's correlation, $r_s \sim 0.33$) but where no such constraint is imposed, the relationship becomes negative for a given gene function ($r_s \sim -0.13$). We predict factors driving replicability in some prior analyses of gene networks and show that they are unconnected with the correctness of the original result, instead reflecting replicable biases. Without these biases, the original results also vanish replicably. We show these effects can occur quite far upstream in network data and that there is a strong tendency within protein-protein interaction data for highly replicable interactions to be associated with poor quality control.
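For readers unfamiliar with the statistic quoted in this abstract, Spearman's correlation is the Pearson correlation computed on ranks. The sketch below is a minimal standard-library version; the function names and the toy data are illustrative only and are not taken from the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks of values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's r_s: Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("r_s is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)


# Toy example: hypothetical replicability and accuracy values for four results.
replicability = [1, 2, 3, 4]
accuracy = [1, 3, 2, 4]
print(spearman(replicability, accuracy))
```

Because only ranks enter the calculation, the statistic is unchanged by any monotone rescaling of either variable.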
Scientific Reports, Jan 27, 2015
Current clinical practice in cancer stratifies patients based on tumour histology to determine prognosis. Molecular profiling has been hailed as the path towards personalised care, but molecular data are still typically analysed independently of known clinical information. Conventional clinical and histopathological data, if used, are added only to improve a molecular prediction, placing a high burden upon molecular data to be informative in isolation. Here, we develop a novel Monte Carlo analysis to evaluate the usefulness of data assemblages. We applied our analysis to varying assemblages of clinical data and molecular data in an ovarian cancer dataset, evaluating their ability to discriminate one-year progression-free survival (PFS) and three-year overall survival (OS). We found that Cox proportional hazard regression models based on both data types together provided greater discriminative ability than either alone. In particular, we show that proteomics data assemblages that alo...
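The abstract does not spell out the Monte Carlo procedure, so the following is only a generic sketch of the idea of comparing a survival model's discriminative ability against a random reference distribution. It assumes Harrell's concordance index as the performance measure and a simple shuffle of the risk scores as the null; both choices, the function names, and the toy data are assumptions, not the authors' method.

```python
import random


def c_index(times, events, risks):
    """Harrell's concordance index.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] == 1) strictly before time j. It is concordant when the
    earlier-failing subject has the higher risk score; risk ties count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def monte_carlo_p_value(times, events, risks, n_draws=1000, seed=0):
    """Fraction of shuffled risk assignments that match or beat the observed c-index."""
    rng = random.Random(seed)
    observed = c_index(times, events, risks)
    shuffled = list(risks)
    hits = 0
    for _ in range(n_draws):
        rng.shuffle(shuffled)
        if c_index(times, events, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_draws + 1)


# Toy data: six hypothetical patients, 0 in events marks a censored follow-up.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 0, 1, 1, 0]
risks = [6, 5, 4, 3, 2, 1]
print(c_index(times, events, risks), monte_carlo_p_value(times, events, risks))
```

In a real analysis the risk scores would come from a fitted model, such as the linear predictor of a Cox regression, and would be evaluated on held-out data.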
Bioinformatics, 2015
Motivation: RNA-seq co-expression analysis is in its infancy and reasonable practices remain poorly defined. We assessed a variety of RNA-seq expression data to determine factors affecting functional connectivity and topology in co-expression networks. Results: We examine RNA-seq co-expression data generated from 1970 RNA-seq samples using a Guilt-By-Association framework, in which genes are assessed for the tendency of co-expression to reflect shared function. Minimal experimental criteria to obtain performance on par with microarrays were >20 samples with read depth >10 M per sample. While the aggregate network constructed shows good performance (area under the receiver operator characteristic curve ~0.71), the dependency on number of experiments used is nearly identical to that present in microarrays, suggesting thousands of samples are required to obtain 'gold-standard' co-expression. We find a major topological difference between RNA-seq and microarray co-expression in the form of low overlaps between hub-like genes from each network due to changes in the correlation of expression noise within each technology.
Analytical Cellular Pathology (Amsterdam), 2011
Tumour cells employ a variety of mechanisms to invade their environment and to form metastases. An important property is the ability of tumour cells to transition between individual cell invasive mode and collective mode. The switch from collective to individual cell invasion in the breast was shown recently to determine the site of subsequent metastasis. Previous studies have suggested a range of invasion modes from single cells to large clusters. Here, we use a novel image analysis method to quantify and categorise invasion. We have developed a process using automated imaging for data collection and unsupervised morphological examination of breast cancer invasion using cognition network technology (CNT) to determine how many patterns of invasion can be reliably discriminated. We used Bayesian network analysis to probabilistically connect morphological variables and therefore determine that two categories of invasion are clearly distinct from one another. The Bayesian network separated in...
Scientific reports, Jan 27, 2015
Current clinical practice in cancer stratifies patients based on tumour histology to determine pr... more Current clinical practice in cancer stratifies patients based on tumour histology to determine prognosis. Molecular profiling has been hailed as the path towards personalised care, but molecular data are still typically analysed independently of known clinical information. Conventional clinical and histopathological data, if used, are added only to improve a molecular prediction, placing a high burden upon molecular data to be informative in isolation. Here, we develop a novel Monte Carlo analysis to evaluate the usefulness of data assemblages. We applied our analysis to varying assemblages of clinical data and molecular data in an ovarian cancer dataset, evaluating their ability to discriminate one-year progression-free survival (PFS) and three-year overall survival (OS). We found that Cox proportional hazard regression models based on both data types together provided greater discriminative ability than either alone. In particular, we show that proteomics data assemblages that alo...
Morphology during tumour invasion 127 4.1 Tumour invasion 128 4.2 Automated image analysis 129 4.... more Morphology during tumour invasion 127 4.1 Tumour invasion 128 4.2 Automated image analysis 129 4.2.1 Morphological measures 130 6 4.3 Bayesian network 131 4.4 Discriminative capacity of morphological measures 132 4.4.1 Cell-cell contact 133 4.4.2 Group area 133 4.4.3 Surface roughness 134 4.4.4 Length/width ratio 135 4.5 Discussion 135 Conclusion 137 5.1 Contributions 137 5.1.1 Biomarker discovery for ovarian cancer 137 5.1.2 Tumour invasion 138 5.2 Future work 138 5.2.1 Biomarker discovery for ovarian cancer 138 5.2.2 Tumour invasion 139 Bibliography 141 Appendix A: data sets 161 7.1 Edinburgh Ovarian Cancer Register (EOCR) 161 7.1.1 Original data set for feature selection and training set for 1YM-PFS and 3YM-OS classifiers 161 7.1.2 Additional validation data set for the validation of 1YM-PFS and 3YM-OS classifiers 161 7.2 Tumour invasion data set 162 Index 165 8 2.14 Analysis of the proportional hazards assumption with cox.zph function in R. The flatness of the fitted line illustrates that the Stage parameter does now violates the proportional hazards assumption over the survival time (Time). 78 2.15 Forward phase protein array and reverse phase protein array have a different configuration of analytes and antibodies. 86 2.16 An example of an RPPA two-by-two grid plate. A set of 9 proteins are measure for a time-series with 17 intervals. Each dot on the figure represents the expression of an antibody of a corresponding target. 87 2.17 Immunofluorescence images of a tissue microarrays assay (Blue = DAPI nuclei; Green = cytokeratin tumour mask Red = antibodyconjugated flourophores) (From Fig. 1 of Faratian et al. [Faratian et al., 2011]). 88 3.1 Frequencies of stages, and histological types in the data. Not all the different combinations of stage and histological type are equally distributed in this data set. 
Later stage ovarian carcinoma have a higher frequency compared to the early stage ovarian cancer. 91 3.2 The difference in time between overall survival (OS), and progressionfree survival (PFS) (Dx, Sx: date of histological diagnosis, CRx: date of treatment diagnosis, Re/Prog: date of first signs of disease recurrence, DLS/Death: date of death from any cause). 91 3.3 A survival function for the progression-free survival (PFS) and overall survival (OS) for patients under different treatment regimen (Regimen 1: platinum and Regimen 2: platinum combined with taxane). 92 3.4 The biological circuit of known interactions active in the proteomics profile for ovarian cancer [Hanahan and Weinberg, 2000,Hanahan and Weinberg, 2011]. 94 3.5 Bayesian network of the proteomics profile. 98 3.6 Bayesian network of the clinicopathological measurements. 99 3.7 Three-layered Bayesian network with a first layer of clinicopathological measurements, a second layer of candidate proteomes biomarkers, and third layer of progression-free survival (PFS) and overall survival (OS) outputs. 100 3.8 The following scheme illustrates the different discriminative machine learning methodologies used during this research: survival analysis and classification. First, feature selection is executed on the clinicopathological, proteomics data, and the combination of both. The selected features are plugged into the survival analysis, and into the classification model. The survival model is verified with the following performance measure: c-index, p value of a Monte Carlo experiment, and the shrinkage. The classification models are verified with area under ROC curve (AUC), a hybrid metric (SAR), and precisionrecall F measure (F). 101 3.9 These block schemes provide an overview of the selected features and the performance measures: 10-fold cross validated c-index, pvalue of the Monte Carlo experiment, and the shrinkage for the Cox proportional hazards regression models for PFS. 
103 9 3.10 The Monte Carlo distribution and the performance measures: 10fold cross validated c-index, p-value of the Monte Carlo experiment, and the shrinkage for the Cox proportional hazards regression models for PFS. 105 3.11 These block schemes provide an overview of the selected features, and the performance measures: 10-fold cross validated c-index, pvalue of the Monte Carlo experiment, and the shrinkage for the Cox proportional hazards regression models for OS. 107 3.12 The Monte Carlo distribution and the performance measures: 10fold cross validated c-index, p-value of the Monte Carlo experiment, and the shrinkage for the Cox proportional hazards regression models for OS. 108 3.13 The block scheme and the Monte Carlo distribution of the Cox proportional hazards regression model illustrate the performance for predicting PFS in the case of papillary serous in stage 3. 110 3.14 The block scheme and the Monte Carlo distribution of the Cox proportional hazards regression model illustrate the performance for predicting PFS in the case of papillary serous in stage 4. 110 3.15 The block scheme and the Monte Carlo distribution of the Cox proportional hazards regression model illustrate the performance for predicting PFS in the case of endometrioid in stage 3. 111 3.16 The block scheme and the Monte Carlo distribution of the Cox proportional hazards regression model illustrate the performance for predicting PFS in the case of mullerian in stage 3. 111 3.17 The block scheme and the Monte Carlo distribution of the Cox proportional hazards regression model illustrate the performance for predicting OS in the case of papillary serous in stage 3. 112 3.18 The block scheme and the Monte Carlo distribution of the Cox proportional hazards regression model illustrate the performance for predicting OS in the case of papillary serous in stage 4. 
112 3.19 The block scheme and the Monte Carlo distribution of the Cox proportional hazards regression model illustrate the performance for predicting OS in the case of endometrioid in stage 3. 113 3.20 The block scheme and the Monte Carlo distribution of the Cox proportional hazards regression model illustrate the performance for predicting OS in the case of mixed mullerian in stage 3. 113 3.21 Performance measures, AUC, F-measure, and SAR for 1YM-PFS classifier constructed with logistic regression, Cox proportional hazards regression, and support vector machines. 115 3.22 The performance measures, AUC, F-measure, and SAR, for 1YM-PFS classifier constructed with logistic regression, Cox proportional hazards regression, and support vector machines. The classifiers are constructed based on the clinicopathological data segmentation for papillary serous in stage 3 and stage 4, endometrioid in stage 3, and mixed mullerian in stage 3. 116 3.23 Performance measures, AUC, F-measure, and SAR for 3YM-OS classifier constructed with logistic regression, Cox proportional hazards regression, and support vector machines. 117 10 3.24 The performance measures, AUC, F-measure, and SAR, for 3YM-OS classifier constructed with logistic regression, Cox proportional hazards regression, and support vector machines. The classifiers are constructed based on the clinicopathological data segmentation for papillary serous in stage 3 and stage 4, endometrioid in stage 3, and mixed mullerian in stage 3. 118 3.25 The ROC-and precision/recall plot for the classification model (1YM-PFS) after 10-fold cross validation. 121 3.26 The ROC-and precision-recall plot for the classification model (1YM-PFS) after validation with a separate data set. 122 3.27 The ROC-and precision/recall plot for the classification model (3YM-OS) after 10-fold cross validation. 123 3.28 The ROC-and precision-recall plot for the classification model (1YM-OS) after validation with a separate data set. 
124 4.1 Boxplot for the different invasion types per cell line (C35pool and C35hi) [Katz et al., 2011]. 129 4.2 A fluorescent-stained image from invasion assay. Pan-cytokeratin rabbit polychonal antibody is used to select epithelial cells, and visualization is performed by anti-rabbit-Cy3. DAPI counterstain was used to identify nuclei. 129 4.3 The cognition network technology (CNT) applied for the detection of tumours in the invasion assay. 130 4.4 Bayesian network constructs a graph of statistical dependencies between morphological measures and tumour invasion types. 132 4.5 Histogram cell-cell contact for the comparison between individualand collective invasion. 133 4.6 Histogram group area for the comparison between individual-and collective invasion. 134 4.7-1cm 134 4.8 Histogram roughness for the comparison between individual-and collective invasion. 134 4.9 Histogram length/width ratio for the comparison between individualand collective invasion.
2022 IEEE Aerospace Conference (AERO)
This framework enables C suite executive leaders to define a business plan and manage technologic... more This framework enables C suite executive leaders to define a business plan and manage technological dependencies for building AI/ML Solutions. The business plan of this framework provides components and background information to define strategy and analyze cost. Furthermore, the business plan represents the fundamentals of AI/ML Innovation and AI/ML Solutions. Therefore, the framework provides a menu for managing and investing in AI/ML. Finally, this framework is constructed with an interdisciplinary and holistic view of AI/ML Innovation and builds on advances in business strategy in harmony with technological progress for AI/ML. This framework incorporates value chain, supply chain, and ecosystem strategies.
Bioinformatics, 2014
Motivation: Network-based gene function inference methods have proliferated in recent years, but ... more Motivation: Network-based gene function inference methods have proliferated in recent years, but measurable progress remains elusive. We wished to better explore performance trends by controlling data and algorithm implementation, with a particular focus on the performance of aggregate predictions. Results: Hypothesizing that popular methods would perform well without hand-tuning, we used well-characterized algorithms to produce verifiably 'untweaked' results. We find that most stateof-the-art machine learning methods obtain 'gold standard' performance as measured in critical assessments in defined tasks. Across a broad range of tests, we see close alignment in algorithm performances after controlling for the underlying data being used. We find that algorithm aggregation provides only modest benefits, with a 17% increase in area under the ROC (AUROC) above the mean AUROC. In contrast, data aggregation gains are enormous with an 88% improvement in mean AUROC. Altogether, we find substantial evidence to support the view that additional algorithm development has little to offer for gene function prediction. Availability and implementation: The supplementary information contains a description of the algorithms, the network data parsed from different biological data resources and a guide to the source code
Systems pathology attempts to introduce more holistic approaches towards pathology and attempts t... more Systems pathology attempts to introduce more holistic approaches towards pathology and attempts to integrate clinicopathological information with “-omics” technology. This doctorate researches two examples of a systems approach for pathology: (1) a personalized patient output prediction for ovarian cancer and (2) an analytical approach differentiates between individual and collective tumour invasion. During the personalized patient output prediction for ovarian cancer study, clinicopathological measurements and proteomic biomarkers are analysed with a set of newly engineered bioinformatic tools. These tools are based upon feature selection, survival analysis with Cox proportional hazards regression, and a novel Monte Carlo approach. Clinical and pathological data proves to have highly significant information content, as expected; however, molecular data has little information content alone, and is only significant when selected most-informative variables are placed in the context of...
Bioinformatics, Dec 14, 2015
Motivation: Gene networks have become a central tool in the analysis of genomic data but are wide... more Motivation: Gene networks have become a central tool in the analysis of genomic data but are widely regarded as hard to interpret. This has motivated a great deal of comparative evaluation and research into best practices. We explore the possibility that this may lead to overfitting in the field as a whole. Results: We construct a model of 'research communities' sampling from real gene network data and machine learning methods to characterize performance trends. Our analysis reveals an important principle limiting the value of replication, namely that targeting it directly causes 'easy' or uninformative replication to dominate analyses. We find that when sampling across network data and algorithms with similar variability, the relationship between replicability and accuracy is positive (Spearman's correlation, r s 0.33)butwherenosuchconstraintisimposed,therelationshipbecomesnegativeforagivengenefunction(rs0.33) but where no such constraint is imposed, the relationship becomes negative for a given gene function (r s 0.33)butwherenosuchconstraintisimposed,therelationshipbecomesnegativeforagivengenefunction(rs À0.13). We predict factors driving replicability in some prior analyses of gene networks and show that they are unconnected with the correctness of the original result, instead reflecting replicable biases. Without these biases, the original results also vanish replicably. We show these effects can occur quite far upstream in network data and that there is a strong tendency within protein-protein interaction data for highly replicable interactions to be associated with poor quality control.
Scientific Reports, Jan 27, 2015
Current clinical practice in cancer stratifies patients based on tumour histology to determine prognosis. Molecular profiling has been hailed as the path towards personalised care, but molecular data are still typically analysed independently of known clinical information. Conventional clinical and histopathological data, if used, are added only to improve a molecular prediction, placing a high burden upon molecular data to be informative in isolation. Here, we develop a novel Monte Carlo analysis to evaluate the usefulness of data assemblages. We applied our analysis to varying assemblages of clinical data and molecular data in an ovarian cancer dataset, evaluating their ability to discriminate one-year progression-free survival (PFS) and three-year overall survival (OS). We found that Cox proportional hazards regression models based on both data types together provided greater discriminative ability than either alone. In particular, we show that proteomics data assemblages that alo...
Bioinformatics, 2015
Motivation: RNA-seq co-expression analysis is in its infancy and reasonable practices remain poorly defined. We assessed a variety of RNA-seq expression data to determine factors affecting functional connectivity and topology in co-expression networks. Results: We examine RNA-seq co-expression data generated from 1970 RNA-seq samples using a Guilt-By-Association framework, in which genes are assessed for the tendency of co-expression to reflect shared function. Minimal experimental criteria to obtain performance on par with microarrays were >20 samples with read depth >10 M per sample. While the aggregate network constructed shows good performance (area under the receiver operating characteristic curve ~0.71), the dependency on the number of experiments used is nearly identical to that present in microarrays, suggesting thousands of samples are required to obtain 'gold-standard' co-expression. We find a major topological difference between RNA-seq and microarray co-expression in the form of low overlaps between hub-like genes from each network due to changes in the correlation of expression noise within each technology.
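The Guilt-By-Association evaluation described in this abstract can be sketched roughly as follows. This is an assumption-laden toy, not the authors' pipeline: the neighbor-voting score, the function names, the five-gene network and its weights are all hypothetical. A held-out gene is scored by the fraction of its co-expression weight that goes to genes already annotated with the function, and the area under the ROC curve (AUROC) is computed as the share of (positive, negative) pairs ranked correctly, with ties counted as half.

```python
def neighbor_voting_scores(network, known, candidates):
    """Score each candidate by the share of its edge weight going to known genes.

    network: dict mapping gene -> dict of neighbor -> co-expression weight.
    known: set of genes annotated with the function (training labels).
    """
    scores = {}
    for gene in candidates:
        neighbors = network[gene]
        total = sum(neighbors.values())
        voted = sum(w for n, w in neighbors.items() if n in known)
        scores[gene] = voted / total if total > 0 else 0.0
    return scores


def auroc(scores, labels):
    """AUROC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical symmetric co-expression network (weights are made up).
network = {
    "A": {"B": 0.9, "C": 0.8, "D": 0.1, "E": 0.2},
    "B": {"A": 0.9, "C": 0.7, "D": 0.2, "E": 0.1},
    "C": {"A": 0.8, "B": 0.7, "D": 0.3, "E": 0.2},
    "D": {"A": 0.1, "B": 0.2, "C": 0.3, "E": 0.6},
    "E": {"A": 0.2, "B": 0.1, "C": 0.2, "D": 0.6},
}

# Assume genes A, B and C share a function; hide C and try to recover it.
known = {"A", "B"}
candidates = ["C", "D", "E"]
held_out_labels = [True, False, False]

scores = neighbor_voting_scores(network, known, candidates)
toy_auroc = auroc([scores[g] for g in candidates], held_out_labels)
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect recovery of the held-out genes. A real evaluation would repeat the hold-out across cross-validation folds and across many gene functions, which this sketch omits.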
Analytical cellular pathology (Amsterdam), 2011
Tumour cells employ a variety of mechanisms to invade their environment and to form metastases. An important property is the ability of tumour cells to transition between an individual-cell invasive mode and a collective mode. The switch from collective to individual cell invasion in the breast was shown recently to determine the site of subsequent metastasis. Previous studies have suggested a range of invasion modes from single cells to large clusters. Here, we use a novel image analysis method to quantify and categorise invasion. We have developed a process that uses automated imaging for data collection and unsupervised morphological examination of breast cancer invasion with cognition network technology (CNT) to determine how many patterns of invasion can be reliably discriminated. We used Bayesian network analysis to probabilistically connect morphological variables and thereby determine that two categories of invasion are clearly distinct from one another. The Bayesian network separated in...