Henry Horng-Shing Lu | National Chiao Tung University
Papers by Henry Horng-Shing Lu
2019 3rd International Conference on Informatics and Computational Sciences (ICICoS)
Deep learning has attracted considerable attention because of its effectiveness and strong performance. In medical image analysis, deep learning models can already compete with human experts. However, some experts still believe that deep learning is efficient only on large datasets, because its performance on small datasets has not been satisfying. This study aims to build a deep learning model for image classification that achieves high accuracy on a relatively small dataset of chest X-ray images. We classify chest X-rays into a binary scheme: normal images and images with abnormalities. We built and evaluated our model on the public Shenzhen Hospital dataset, and we used different types of input based on different image preprocessing so that the model can classify accurately. Based on the results, pre-trained CheXNet with a newly trained fully connected network on the cropped dataset achieved an accuracy of 0.8761, a sensitivity of 0.8909, and a specificity of 0.8621. The model's performance is also influenced by certain regions inside the images, such as regions outside the lung area and the black-colored regions outside the body.
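As a hedged illustration of the transfer-learning setup this abstract describes (a pre-trained backbone with a newly trained fully connected head), the Python sketch below freezes a DenseNet-121 backbone, the architecture underlying CheXNet, and trains only a new binary head. The actual CheXNet weights and the Shenzhen preprocessing pipeline are not reproduced here; torchvision's ImageNet weights stand in for them.

    # A minimal transfer-learning sketch, assuming torchvision's ImageNet
    # weights as a stand-in for CheXNet; only the new head is trained.
    import torch
    import torch.nn as nn
    from torchvision import models

    backbone = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
    for p in backbone.parameters():          # freeze the pre-trained features
        p.requires_grad = False
    backbone.classifier = nn.Linear(backbone.classifier.in_features, 1)

    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(backbone.classifier.parameters(), lr=1e-3)

    def train_step(images, labels):
        """One step; images: (N, 3, 224, 224), labels: (N, 1) in {0, 1}."""
        logits = backbone(images)
        loss = criterion(logits, labels.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()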
Biomedicines
Automated glaucoma detection using deep learning may increase the diagnostic rate of glaucoma and help prevent blindness, but generalizable models are currently unavailable despite the use of huge training datasets. This study aims to evaluate the performance of a convolutional neural network (CNN) classifier trained with a limited number of high-quality fundus images in detecting glaucoma, and methods to improve its performance across different datasets. A CNN classifier was constructed using EfficientNet B3 and 944 images collected from one medical center (core model) and externally validated using three datasets. The performance of the core model was compared with (1) an integrated model constructed using all training images from the four datasets and (2) a dataset-specific model built by fine-tuning the core model with training images from the external datasets. The diagnostic accuracy of the core model was 95.62% but dropped to 52.5–80.0% on the external datasets. Data...
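A minimal sketch of the core-versus-dataset-specific strategy described above, assuming TensorFlow/Keras and placeholder dataset objects (center_train_ds, external_train_ds are hypothetical names): the core EfficientNet B3 classifier is trained on the single-center images, then cloned and fine-tuned at a lower learning rate on an external dataset.

    # A hedged sketch of dataset-specific fine-tuning; the fundus datasets
    # themselves are assumed and not reproduced here.
    import tensorflow as tf

    def build_core_model(num_classes=2):
        base = tf.keras.applications.EfficientNetB3(
            include_top=False, weights="imagenet", pooling="avg")
        out = tf.keras.layers.Dense(num_classes, activation="softmax")(base.output)
        return tf.keras.Model(base.input, out)

    core = build_core_model()
    core.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                 loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # core.fit(center_train_ds, ...)        # train on the one-center data

    # Dataset-specific model: clone the core, fine-tune on external images.
    specific = tf.keras.models.clone_model(core)
    specific.set_weights(core.get_weights())
    specific.compile(optimizer=tf.keras.optimizers.Adam(1e-5),  # smaller LR
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])
    # specific.fit(external_train_ds, ...)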
Entropy
Nowadays, deep learning methods with high structural complexity and flexibility inevitably lean on the computational capability of the hardware. A platform with high-performance GPUs and large amounts of memory can support neural networks with many layers and kernels. However, naively pursuing high-cost hardware would probably hold back the technical development of deep learning methods. In this article, we therefore establish a new preprocessing method to reduce the computational complexity of neural networks. Inspired by the band theory of solids in physics, we map the image space isomorphically onto a noninteracting physical system and treat image voxels as particle-like clusters. We then reconstruct the Fermi–Dirac distribution as a correction function for normalizing voxel intensities and as a filter for insignificant cluster components. The filtered clusters can then delineate the morphological heterogeneity of the image voxels. We us...
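The Fermi–Dirac correction can be made concrete with a short sketch: intensities are passed through 1/(exp((x - mu)/T) + 1), which both rescales voxel values and suppresses components far above a threshold. The parameter names mu (a chemical-potential-like threshold) and T (a temperature-like smoothness) are illustrative choices, not necessarily the paper's notation.

    # Fermi-Dirac-style intensity correction; small T sharpens the cutoff
    # around mu, large T makes the transition smoother.
    import numpy as np

    def fermi_dirac_filter(voxels: np.ndarray, mu: float, T: float) -> np.ndarray:
        """Occupation-number weighting of voxel intensities in [0, 1]."""
        return 1.0 / (np.exp((voxels - mu) / T) + 1.0)

    rng = np.random.default_rng(0)
    img = rng.uniform(0.0, 1.0, size=(64, 64, 64))      # toy 3D volume
    filtered = fermi_dirac_filter(img, mu=0.5, T=0.05)  # suppress > ~0.5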
We study the maximum likelihood model in emission tomography and propose a new family of algorithms for its solution, called String-Averaging Expectation-Maximization (SAEM). In the string-averaging algorithmic regime, the index set of all underlying equations is split into subsets, called "strings," and the algorithm proceeds separately along each string, possibly in parallel. The end-points of all strings are then averaged to form the next iterate. SAEM algorithms with several strings present better practical merits than the classical Row-Action Maximum-Likelihood Algorithm (RAMLA). We present numerical experiments showing the effectiveness of the algorithmic scheme in realistic situations. Performance is evaluated from the computational cost and...
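A hedged sketch of the string-averaging scheme follows, using the standard emission-tomography ML-EM step as the sub-update along each string; the paper's exact SAEM sub-operators may differ from this choice.

    # Strings of measurement blocks are processed sequentially from the
    # current iterate; the strings' end-points are averaged.
    import numpy as np

    def em_substep(x, A, y, eps=1e-12):
        """One ML-EM update restricted to the rows (A, y) of one block."""
        ratio = y / (A @ x + eps)
        return x * (A.T @ ratio) / (A.T @ np.ones_like(y) + eps)

    def saem_iteration(x, strings):
        """strings: list of lists of (A_block, y_block) pairs."""
        endpoints = []
        for string in strings:            # each string could run in parallel
            z = x.copy()
            for A_blk, y_blk in string:   # proceed sequentially along it
                z = em_substep(z, A_blk, y_blk)
            endpoints.append(z)
        return np.mean(endpoints, axis=0) # average the end-points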
for identifying transcription factor binding sites in yeast
The solid line refers to the case where the points in each group are chosen randomly, while the dashed line refers to the case where the points in each group are chosen from neighbors. In all cases, the number of points in each group is fixed to 200. Copyright information: taken from "Multidimensional scaling for large genomic data sets", BMC Bioinformatics 2008;9:179, http://www.biomedcentral.com/1471-2105/9/179. Published online 4 Apr 2008. PMCID: PMC2375126.
data dimension. Copyright information: taken from "Multidimensional scaling for large genomic data sets", BMC Bioinformatics 2008;9:179, http://www.biomedcentral.com/1471-2105/9/179. Published online 4 Apr 2008. PMCID: PMC2375126.
2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence), 2021
Intermittent demand data is a data type with a very random pattern. The data has a value (not zero) only when there is a demand; when there is no demand, the data is zero. Intermittent demand data usually refers to customer demand or sales data for an item that is not sold every period. The general problem is that demand is not continuous but intermittent, a natural fact that makes intermittent data hard to predict. The standard method used to predict intermittent demand data is Croston's method; single exponential smoothing (SES) is also commonly used in practice. Croston's method and exponential smoothing generally produce static forecasts. This study proposes deep learning methods, namely a recurrent neural network (RNN) and a deep neural network (DNN), to predict intermittent data. The simulation study was carried out by generating data with 72 different design parameters and performing 50 repetitions; the empirical study uses M5 competition data from the Kaggle website. This study aims to measure the performance of RNN and DNN, compared with Croston's method and SES as benchmarks, in predicting intermittent demand data. Performance is measured with mean absolute error (MAE) and root mean squared scaled error (RMSSE). In the simulation studies, deep learning outperformed Croston's method and SES for several design parameters; in the empirical studies, only the RNN outperformed the benchmark methods. This study also found that MAE is a more robust measurement than RMSSE.
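Croston's method, the benchmark named above, is simple enough to sketch: demand sizes and inter-demand intervals are smoothed separately with exponential smoothing, and the static forecast is their ratio. The initialization below (first observed size and interval) is one common convention, not necessarily the variant used in the study.

    # Croston's method for intermittent demand; demand is a 1-D array with
    # zeros for periods without demand.
    import numpy as np

    def croston(demand, alpha=0.1):
        z = q = None              # smoothed demand size / smoothed interval
        periods_since = 1
        forecasts = np.zeros(len(demand))
        for t, d in enumerate(demand):
            if d > 0:
                if z is None:     # initialise on the first non-zero demand
                    z, q = d, periods_since
                else:
                    z = alpha * d + (1 - alpha) * z
                    q = alpha * periods_since + (1 - alpha) * q
                periods_since = 1
            else:
                periods_since += 1
            forecasts[t] = 0.0 if z is None else z / q
        return forecasts

    print(croston(np.array([0, 3, 0, 0, 2, 0, 5, 0])))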
Scientific Reports, 2021
The extraction of brain tumor tissues in 3D brain Magnetic Resonance Imaging (MRI) plays an important role in diagnosis before gamma knife radiosurgery (GKRS). In this article, post-contrast T1 whole-brain MRI images were collected by Taipei Veterans General Hospital (TVGH) and stored in DICOM format (dated from 1999 to 2018). The proposed method starts with an active contour model to obtain the region of interest (ROI) automatically and enhance the image contrast. The segmentation models are trained on MRI images with tumors to avoid the imbalanced-data problem during model construction. To achieve this objective, a two-step ensemble approach is used to establish the diagnosis: first, classify whether there is any tumor in the image, and second, segment the intracranial metastatic tumors with an ensemble of neural networks based on the 2D U-Net architecture. Ensembling classification and segmentation simultaneously also improves segmentation accuracy. The result of classif...
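A hedged sketch of the two-step ensemble inference described above: a classifier first gates whether an image contains any tumor, and only then does an ensemble of 2D U-Nets segment it by averaging probability maps. Here classifier and unets are assumed, pre-trained callables, not the paper's released models, and the thresholds are illustrative.

    # Two-step inference: gate with a classifier, then ensemble-segment.
    import numpy as np

    def two_step_segment(img, classifier, unets, cls_thr=0.5, seg_thr=0.5):
        """img: (H, W) array; returns a binary tumor mask of shape (H, W)."""
        if classifier(img) < cls_thr:                  # step 1: any tumor?
            return np.zeros_like(img, dtype=bool)
        probs = np.mean([net(img) for net in unets], axis=0)  # step 2
        return probs > seg_thr                         # averaged mask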
Scientific Reports, 2020
Acute lower respiratory infection is the leading cause of child death in developing countries. Current strategies to reduce this problem include early detection and appropriate treatment, and better diagnostic and therapeutic strategies are still needed in poor countries. An artificial-intelligence chest X-ray scheme has the potential to become a screening tool for lower respiratory infection in children, but such schemes for children are rare and limited to a single lung disease. We need a powerful system as a diagnostic tool for the most common lung diseases in children. To address this, we present a computer-aided diagnostic scheme for chest X-ray images of several common pulmonary diseases of children, including bronchiolitis/bronchitis, bronchopneumonia/interstitial pneumonitis, lobar pneumonia, and pneumothorax. The study consists of two main approaches: first, we trained a model based on the YOLOv3 architecture for cropping the appropriate location of the lung fi...
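The cropping stage can be sketched as follows, assuming a pre-trained detector (YOLOv3 in the paper; any model returning a single lung-field bounding box works for illustration) whose output feeds the downstream disease classifier. The detector callable and its (x, y, w, h) output format are assumptions of this sketch.

    # Crop the lung field from a chest X-ray using a detector's bounding box.
    import numpy as np

    def crop_lung_field(image: np.ndarray, detector) -> np.ndarray:
        """image: (H, W) chest X-ray; detector returns (x, y, w, h) pixels."""
        x, y, w, h = detector(image)
        x0, y0 = max(int(x), 0), max(int(y), 0)
        return image[y0:y0 + int(h), x0:x0 + int(w)]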
Applied Sciences, 2019
Techniques of automatic medical image segmentation are among the most important methods for clinical investigation, anatomic research, and modern medicine. The various image structures produced by imaging apparatus support a diversity of medical applications; however, the diversified structures are also a burden for contemporary techniques. Performing image segmentation at a tremendously small size (<25 pixels by 25 pixels) or a tremendously large size (>1024 pixels by 1024 pixels) becomes a challenge from the perspectives of both technical feasibility and theoretical development. Noise and pixel pollution caused by the imaging apparatus further aggravate the difficulty of image segmentation. To overcome these predicaments simultaneously, we propose a new method of medical image segmentation with adjustable computational complexity by introducing data density functionals. Under this theoretical framework, several kernels can be assigned to conquer specific predicaments. A square-r...
Journal of Computational and Applied Mathematics, 2019
The eigenvalue problem of a graph Laplacian matrix L arising from a simple, connected, and undirected graph has received increasing attention due to its extensive applications, such as spectral clustering, community detection, complex networks, and image processing. The associated graph Laplacian matrix is symmetric, positive semi-definite, and usually large and sparse. Computing some of the smallest positive eigenvalues and the corresponding eigenvectors is often of interest. However, the singularity of L makes classical eigensolvers inefficient, since we need to factorize L in order to solve large and sparse linear systems exactly. A further difficulty is that factorizing a large and sparse matrix arising from real network problems with big data, such as social media transactional databases and sensor systems, is usually time consuming or even unavailable, because the connections are in general not only local. In this paper, we propose an eigensolver based on the inexact residual Arnoldi method [18,19], together with an implicit remedy for the singularity and an effective deflation for convergent eigenvalues. Numerical experiments reveal that the integrated eigensolver outperforms the classical Arnoldi/Lanczos method for computing some of the smallest positive eigenpairs when the LU factorization is not available.
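As a hedged illustration of the singularity issue and one standard remedy: for a connected graph the null space of L is spanned by the all-ones vector, so the zero eigenvalue can be deflated by a rank-one shift, after which a factorization-free iterative solver can target the smallest positive eigenvalues. The sketch below uses SciPy's LOBPCG on a cycle graph as a generic stand-in; it is not the paper's inexact residual Arnoldi method.

    # Deflate the known null vector of L, then solve without any LU factor.
    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import LinearOperator, lobpcg

    n = 100                               # cycle graph: simple and connected
    i = np.arange(n)
    A = sp.coo_matrix((np.ones(n), (i, (i + 1) % n)), shape=(n, n))
    A = (A + A.T).tocsr()                 # undirected adjacency
    L = sp.diags(np.ravel(A.sum(axis=1))) - A

    v = np.ones(n) / np.sqrt(n)           # normalized null vector of L
    shift = float(L.diagonal().max()) + 1.0   # pushes eigenvalue 0 upward

    def matvec(x):
        """Apply the deflated operator L + shift * v v^T."""
        x = np.ravel(x)
        return L @ x + shift * v * (v @ x)

    op = LinearOperator((n, n), matvec=matvec, dtype=float)
    X = np.random.default_rng(1).standard_normal((n, 4))
    vals, _ = lobpcg(op, X, largest=False, tol=1e-8, maxiter=1000)
    print(np.sort(vals))  # approximates 2 - 2*cos(2*pi*k/n), k = 1, 1, 2, 2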
Journal of the Formosan Medical Association = Taiwan yi zhi, Jan 27, 2018
To investigate the knowledge and learning ability of glaucoma patients regarding their anti-glaucoma topical medications, patients on regular follow-up at the Glaucoma Clinic at Hsin-Chu General Hospital were recruited. After detailed ocular examinations, the participants were asked to recall and identify their glaucoma eye drops; the same test was repeated 3 months later. The results of both tests, the patients' learning ability regarding their glaucoma drugs, and the relationship between learning ability and demographic variables were evaluated. Two hundred eighty-seven glaucoma patients participated in this study. Of the study population, 25.8% and 57.1% could recall their topical medication at the first and second tests, whereas 72.1% and 88.5% could identify their prescribed eye drops at the first and second tests, respectively. Approximately 34% of the participants showed improved knowledge at the repeat test, whereas 40% showed no improvement. Participant...
PloS one, 2017
The great amount of gene expression data poses a big challenge for the discovery of Gene Regulatory Networks (GRN). For network reconstruction and the investigation of regulatory relations, it is desirable to ensure the directness of links between genes on a map, infer their directionality, and explore candidate biological functions from high-throughput transcriptomic data. To address these problems, we introduce a Boolean Function Network (BFN) model based on techniques of hidden Markov models (HMM), likelihood ratio tests, and Boolean logic functions. BFN consists of two consecutive tests to establish links between pairs of genes and check their directness. We evaluate the performance of BFN through application to S. cerevisiae time course data. BFN produces regulatory relations consistent with the succession of cell cycle phases, and it also improves sensitivity and specificity when compared with alternative methods of genetic network reverse engineering. Moreov...
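A toy in the spirit of the Boolean-logic step: given binarized expression series for a candidate regulator x and a target y, score how well simple one-input Boolean functions explain the lag-one transition. The actual BFN model adds an HMM and a likelihood ratio test, which this sketch omits.

    # Score one-input Boolean functions x[t] -> y[t+1] by match rate.
    import numpy as np

    FUNCS = {
        "identity": lambda x: x,
        "not":      lambda x: 1 - x,
    }

    def best_boolean_fit(x: np.ndarray, y: np.ndarray):
        """x, y: binary arrays; compares f(x[t]) against y[t+1]."""
        scores = {name: np.mean(f(x[:-1]) == y[1:]) for name, f in FUNCS.items()}
        return max(scores.items(), key=lambda kv: kv[1])

    x = np.array([0, 1, 1, 0, 1, 0, 0, 1])
    y = np.array([0, 1, 0, 0, 1, 0, 1, 1])   # toy: y = NOT x, one-step lag
    print(best_boolean_fit(x, y))             # -> ('not', 1.0)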
BMC Bioinformatics, 2016
Background: It has been a challenging task to build a genome-wide phylogenetic tree for a large group of species containing a large number of genes with long nucleotide sequences. The most popular method, called feature frequency profile (FFP-k), finds the frequency distribution of all words of a certain length k over the whole genome sequence using (overlapping) windows of the same length. For a satisfactory result, the recommended word length k ranges from 6 to 15, and it may not be a multiple of 3 (the codon length). The total number of possible words needed for FFP-k can range from 4^6 = 4096 to 4^15. Results: We propose a simple improvement over the popular FFP method using only the typical word length of 3. A new method, called Trinucleotide Usage Profile (TUP), is proposed based only on the (relative) frequency distribution over non-overlapping windows of length 3. The total number of possible words needed for TUP is 4^3 = 64, which is much less than the total count for the recommended optimal "resolution" for FFP. To build a phylogenetic tree, we propose first representing each species by a TUP vector and then using an appropriate distance measure between pairs of TUP vectors for the tree construction. In particular, we propose summarizing a DNA sequence by a matrix of three rows corresponding to the three reading frames, recording the frequency distribution of the non-overlapping words of length 3 in each reading frame. We also provide a numerical measure for comparing trees constructed with various methods. Conclusions: Compared to the FFP method, our empirical study showed that the proposed TUP method is more capable of building phylogenetic trees with stronger biological support. We further provide some justification for this from the information-theory viewpoint. Unlike the FFP method, the TUP method takes advantage of the fact that the start of the first reading frame is (usually) known. Without this information, the FFP method can only rely on the frequency distribution of overlapping words, which is the average (or mixture) of the frequency distributions of the three possible reading frames. Consequently, we show (from the entropy viewpoint) that the FFP procedure could dilute important gene information and therefore provide less accurate classification.
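The TUP representation is easy to make concrete: summarize a sequence by a 3 x 64 matrix of relative non-overlapping trinucleotide frequencies, one row per reading frame, and compare species by a distance between matrices. Plain Euclidean distance is used below as an illustrative choice; the paper's distance measure may differ.

    # TUP matrix: relative frequency of non-overlapping 3-mers per frame.
    from itertools import product
    import numpy as np

    WORDS = ["".join(w) for w in product("ACGT", repeat=3)]  # 4^3 = 64 words
    INDEX = {w: i for i, w in enumerate(WORDS)}

    def tup_matrix(seq: str) -> np.ndarray:
        seq = seq.upper()
        M = np.zeros((3, 64))
        for frame in range(3):
            for start in range(frame, len(seq) - 2, 3):  # non-overlapping
                word = seq[start:start + 3]
                if word in INDEX:                         # skip ambiguous bases
                    M[frame, INDEX[word]] += 1
            M[frame] /= max(M[frame].sum(), 1)
        return M

    def tup_distance(a: str, b: str) -> float:
        return float(np.linalg.norm(tup_matrix(a) - tup_matrix(b)))

    print(tup_distance("ATGGCGTACGATCGATCGTAGC", "ATGGCGTTCGATGGATCGTAGC"))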
Biostatistics (Oxford, England), Jul 24, 2015
Sufficient dimension reduction is widely applied to help model building between the response Y and covariates X. In some situations, we also collect an additional covariate W that predicts Y better, but has a higher cost of collection, than X. While constructing a predictive model for Y based on (X, W) is straightforward, this strategy is not applicable, since W is not available for the future observations to which the constructed model is to be applied. As a result, the aim of the study is to build a predictive model for Y based on X only, where the available data are (Y, X, W). A naive method is to conduct the analysis using (Y, X) directly, but ignoring W can cause a problem of inefficiency. On the other hand, it is not trivial to utilize the information of W...
Springer Handbooks of Computational Statistics
Trends in Genetics, 2002
For more than 30 years, expression divergence has been considered a major reason for retaining duplicated genes in a genome, but how often and how fast duplicate genes diverge in expression has not been studied at the genomic level. Using yeast microarray data, we show that expression divergence between duplicate genes is significantly correlated with their synonymous divergence (K_S) and also with their nonsynonymous divergence (K_A) if K_A ≤ 0.3. Thus, expression divergence increases with evolutionary time, and K_A is initially coupled with expression divergence. More interestingly, a large proportion of duplicate genes have diverged quickly in expression, and the vast majority of gene pairs eventually become divergent in expression. Indeed, more than 40% of gene pairs show expression divergence even when K_S ≤ 0.10, and this proportion exceeds 80% for K_S > 1.5. Only a small fraction of ancient gene pairs do not show expression divergence.