Papers by Shipeng Yu

Extracting Content Structure for Web Pages Based on Visual Representation

Lecture Notes in Computer Science, 2003

A new web content structure based on visual representation is proposed in this paper. Many web applications such as information retrieval, information extraction and automatic page adaptation can benefit from this structure. This paper presents an automatic top-down, tag-tree-independent approach to detect web content structure. It simulates how a user understands web layout structure based on visual perception. Compared with other existing techniques, our approach is independent of the underlying document representation such as HTML and works well even when the HTML structure differs greatly from the layout structure. Experiments show satisfactory results.

Improving pseudo-relevance feedback in web information retrieval using web page segmentation

Proceedings of the twelfth international conference on World Wide Web - WWW '03, 2003

In contrast to traditional document retrieval, a web page as a whole is not a good information unit to search because it often contains multiple topics and much irrelevant information from the navigation, decoration, and interaction parts of the page. In this paper, we propose a VIsion-based Page Segmentation (VIPS) algorithm to detect the semantic content structure in a web page. Compared with simple DOM-based segmentation methods, our page segmentation scheme utilizes useful visual cues to obtain a better partition of a page at the semantic level. By using our VIPS algorithm to assist the selection of query expansion terms in pseudo-relevance feedback for web information retrieval, we achieve a 27% performance improvement on the Web Track dataset.
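
As a rough illustration of how block-level pseudo-relevance feedback can work, the sketch below picks expansion terms from the text of the top-ranked blocks returned by an initial retrieval run. The scoring is a plain term-frequency count and all names are illustrative; it is not the paper's actual VIPS pipeline or weighting scheme.

```python
from collections import Counter

def expansion_terms(top_blocks, num_terms=10):
    """Pick expansion terms from the highest-ranked page blocks
    (a stand-in for VIPS-segmented blocks), scored by a simple
    term-frequency count. Illustrative only, not the paper's formula."""
    counts = Counter()
    for text in top_blocks:
        counts.update(text.lower().split())
    # drop very short tokens as a crude stopword filter
    scored = {t: c for t, c in counts.items() if len(t) > 3}
    return [t for t, _ in sorted(scored.items(), key=lambda kv: -kv[1])[:num_terms]]

# usage: feed the text of the top-k blocks from the initial retrieval run
blocks = ["web page segmentation visual blocks",
          "query expansion pseudo relevance feedback for web pages"]
print(expansion_terms(blocks, num_terms=5))
```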

Block-based web search

Proceedings of the 27th annual international conference on Research and development in information retrieval - SIGIR '04, 2004

Multiple topics and varying lengths of web pages are two negative factors that significantly affect the performance of web search. In this paper, we explore the use of page segmentation algorithms to partition web pages into blocks and investigate how to take advantage of block-level evidence to improve retrieval performance in the web context. Because of the special characteristics of web pages, different page segmentation methods have different impacts on web search performance. We compare four types of methods: fixed-length page segmentation, DOM-based page segmentation, vision-based page segmentation, and a combined method that integrates both semantic and fixed-length properties. Experiments on block-level query expansion and retrieval are performed. Among the four approaches, the combined method achieves the best performance for web search. Our experimental results also show that such a semantic partitioning of web pages effectively deals with the problem of multiple drifting topics and mixed lengths, and thus has great potential to boost the performance of current web search engines.

Coarse-to-fine classification via parametric and nonparametric models for computer-aided diagnosis

Proceedings of the 20th ACM international conference on Information and knowledge management - CIKM '11, 2011

Classification is one of the core problems in Computer-Aided Diagnosis (CAD), targeting early cancer detection through 3D medical image interpretation. High detection sensitivity with a desirably low false positive (FP) rate is critical for a CAD system to be accepted as a valuable or even indispensable tool in radiologists' workflow. Given the various spurious imaging noises that cause observation uncertainties, this remains a very challenging task. In this paper, we propose a novel, two-tiered coarse-to-fine (CTF) classification cascade framework to tackle this problem. We first obtain classification-critical data samples (e.g., samples on the decision boundary) extracted from the holistic data distributions using a robust parametric model (e.g., [13]); then we build a graph-embedding-based nonparametric classifier on the sampled data, which can more accurately preserve or formulate the complex classification boundary. These two steps can also be considered as effective "sample pruning" and "feature pursuing + kNN/template matching", respectively. Our approach is validated comprehensively on colorectal polyp detection and lung nodule detection CAD systems, addressing two of the deadliest cancers, using hospital-scale, multi-site clinical datasets. The results show that our method achieves overall better classification/detection performance than existing state-of-the-art algorithms using single-layer classifiers, such as support vector machine variants, boosting [15], logistic regression [11], relevance vector machine [13], k-nearest neighbor [9], or spectral projections on graph [2].
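
A minimal sketch of the two-tier idea, assuming scikit-learn: a parametric classifier keeps only samples whose predicted probability falls near the decision boundary (the "sample pruning" step), and a nonparametric kNN classifier is then fit on the pruned set and used to refine ambiguous predictions. The probability band, model choices, and function names are stand-ins, not the paper's exact models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

def coarse_to_fine_fit(X, y, band=(0.2, 0.8)):
    # Tier 1: parametric model; samples scored inside the band are
    # treated as classification-critical ("sample pruning").
    coarse = LogisticRegression(max_iter=1000).fit(X, y)
    p = coarse.predict_proba(X)[:, 1]
    keep = (p > band[0]) & (p < band[1])
    # Tier 2: nonparametric kNN fit only on the boundary-region samples.
    k = max(1, min(5, int(keep.sum())))
    fine = KNeighborsClassifier(n_neighbors=k).fit(X[keep], y[keep])
    return coarse, fine, band

def coarse_to_fine_predict(models, X):
    coarse, fine, band = models
    p = coarse.predict_proba(X)[:, 1]
    pred = (p >= 0.5).astype(int)
    near = (p > band[0]) & (p < band[1])
    if near.any():                       # refine only the ambiguous cases
        pred[near] = fine.predict(X[near])
    return pred

# usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)
models = coarse_to_fine_fit(X, y)
print((coarse_to_fine_predict(models, X) == y).mean())
```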

Extracting shared subspace for multi-label classification

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '08, 2008

Multi-label problems arise in various domains such as multi-topic document categorization and protein function prediction. One natural way to deal with such problems is to construct a binary classifier for each label, resulting in a set of independent binary classification problems. Since the multiple labels share the same input space, and the semantics conveyed by different labels are usually correlated, it is essential to exploit the correlation information contained in different labels. In this paper, we consider a general framework for extracting shared structures in multi-label classification. In this framework, a common subspace is assumed to be shared among multiple labels. We show that the optimal solution to the proposed formulation can be obtained by solving a generalized eigenvalue problem, though the problem is nonconvex. For high-dimensional problems, direct computation of the solution is expensive, and we develop an efficient algorithm for this case. One appealing feature of the proposed framework is that it includes several well-known algorithms as special cases, thus elucidating their intrinsic relationships. We have conducted extensive experiments on eleven multi-topic web page categorization tasks, and the results demonstrate the effectiveness of the proposed formulation in comparison with several representative algorithms.
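
The sketch below, assuming NumPy and SciPy, shows what "solving a generalized eigenvalue problem" looks like in this setting: a label-informed scatter matrix is paired with a regularized input scatter matrix, and the top generalized eigenvectors define the shared projection. The particular matrices used here are illustrative stand-ins, not the paper's exact formulation.

```python
import numpy as np
from scipy.linalg import eigh

def shared_subspace(X, Y, dim=2, reg=1e-3):
    """Illustrative shared-subspace extraction: find directions relating
    a label-informed scatter to the input scatter by solving the
    generalized eigenvalue problem A v = lambda B v."""
    A = X.T @ Y @ Y.T @ X                       # label-informed scatter (stand-in)
    B = X.T @ X + reg * np.eye(X.shape[1])      # regularized input scatter
    w, V = eigh(A, B)                           # generalized symmetric eigenproblem
    return V[:, np.argsort(w)[::-1][:dim]]      # top-dim shared directions

# usage with made-up shapes: documents x features, documents x labels
X = np.random.randn(100, 20)
Y = (np.random.rand(100, 5) > 0.7).astype(float)
W = shared_subspace(X, Y, dim=3)
Z = X @ W                                       # project inputs onto shared subspace
```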

Multi-Output Regularized Projection

2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005

Dimensionality reduction via feature projection has been widely used in pattern recognition and machine learning. It is often beneficial to derive the projections not only from the inputs but also from the target values in the training data set. This is of particular importance in predicting multivariate or structured outputs, which is an area of growing interest. In this paper we introduce a novel projection framework which is sensitive to both input features and outputs. Based on the derived features, prediction accuracy can be greatly improved. We validate our approach in two applications. The first is to model users' preferences on a set of paintings. The second application is concerned with image categorization, where each image may belong to multiple categories. The proposed algorithm produces very encouraging results in both settings.

A nonparametric hierarchical Bayesian framework for information filtering

Proceedings of the 27th annual international conference on Research and development in information retrieval - SIGIR '04, 2004

Information filtering has made considerable progress in recent years. The predominant approaches are content-based methods and collaborative methods. Researchers have largely concentrated on one of the two approaches, since a principled unifying framework is still lacking. This paper suggests that both approaches can be combined under a hierarchical Bayesian framework. Individual content-based user profiles are generated, and collaboration between various user models is achieved via a common learned prior distribution. However, it turns out that a parametric distribution (e.g., Gaussian) is too restrictive to describe such a common learned prior. We thus introduce a nonparametric common prior, a sample generated from a Dirichlet process which assumes the role of a hyperprior. We describe effective means to learn this nonparametric distribution and apply it to learn users' information needs. The resulting algorithm is simple and understandable, and offers a principled solution for combining content-based filtering and collaborative filtering. Within our framework, we are now able to interpret various existing techniques from a unifying point of view. Finally, we demonstrate the empirical success of the proposed information filtering methods.
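
To make the nonparametric prior concrete, here is a small truncated stick-breaking draw from a Dirichlet process: the base measure supplies atom locations and the stick-breaking weights make the prior discrete, so different users can share atoms. The base measure and parameter names are illustrative, not the paper's learning procedure.

```python
import numpy as np

def stick_breaking_dp(alpha, base_sampler, n_atoms=50, rng=None):
    """Truncated stick-breaking draw from a Dirichlet process: returns
    atom locations (drawn from the base measure) and their weights."""
    rng = rng or np.random.default_rng(0)
    betas = rng.beta(1.0, alpha, size=n_atoms)
    remaining = np.concatenate([[1.0], np.cumprod(1 - betas[:-1])])
    weights = betas * remaining
    atoms = np.array([base_sampler(rng) for _ in range(n_atoms)])
    return atoms, weights / weights.sum()

# base measure: Gaussian over a 1-D user-profile parameter (illustrative)
atoms, weights = stick_breaking_dp(alpha=2.0, base_sampler=lambda r: r.normal(0, 1))
# a new user's profile parameter sampled from the shared discrete prior
new_user_param = np.random.default_rng(1).choice(atoms, p=weights)
print(new_user_param)
```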

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures

Lecture Notes in Computer Science, 2006

This paper studies a Bayesian framework for density modeling with mixtures of exponential family distributions. Variational Bayesian Dirichlet-Multinomial allocation (VBDMA) is introduced, which performs inference and learning efficiently using variational Bayesian methods and performs automatic model selection. The model is closely related to Dirichlet process mixture models and demonstrates similar automatic model selection in the variational Bayesian context.

Multi-label informed latent semantic indexing

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '05, 2005

Latent semantic indexing (LSI) is a well-known unsupervised approach for dimensionality reduction in information retrieval. However, if the output information (i.e., category labels) is available, it is often beneficial to derive the indexing not only from the inputs but also from the target values in the training data set. This is of particular importance in applications with multiple labels, in which each document can belong to several categories simultaneously. In this paper we introduce the multi-label informed latent semantic indexing (MLSI) algorithm, which preserves the information of the inputs and meanwhile captures the correlations between the multiple outputs. The recovered "latent semantics" thus incorporate the human-annotated category information and can be used to greatly improve the prediction accuracy. Empirical study based on two data sets, Reuters-21578 and RCV1, demonstrates very encouraging results.
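
As a crude stand-in for label-informed indexing, the sketch below appends a weighted label matrix to the document-term matrix and runs a truncated SVD, so the recovered factors reflect both terms and labels; at test time the label columns are zeroed out. This conveys the intuition only and is not the MLSI optimization itself.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def label_informed_lsi(X, Y, k=20, beta=1.0):
    """Fit a truncated SVD on the document-term matrix augmented with
    (weighted) label columns, so the latent factors are label-aware."""
    Z = np.hstack([X, beta * Y])
    return TruncatedSVD(n_components=k).fit(Z)

# usage with illustrative shapes: documents x terms, documents x labels
X = np.random.rand(200, 1000)
Y = (np.random.rand(200, 10) > 0.8).astype(float)
model = label_informed_lsi(X, Y, k=20)

# at test time labels are unknown, so the label columns are zero-padded
X_test = np.random.rand(5, 1000)
Z_test = model.transform(np.hstack([X_test, np.zeros((5, 10))]))
print(Z_test.shape)
```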

A Probabilistic Clustering-Projection Model for Discrete Data

Lecture Notes in Computer Science, 2005

Problem solving with experiences that are recorded in text form requires a mapping from text to structured cases, so that case comparison can provide informed feedback for reasoning. One of the challenges is to acquire an indexing vocabulary to describe cases. We explore the use of machine learning and statistical techniques to automate aspects of this acquisition task. A propositional semantic indexing tool, PSI, which forms its indexing vocabulary from new features extracted as logical combinations of existing keywords, is presented. We propose that such logical combinations correspond more closely to natural concepts and are more transparent than linear combinations. Experiments show PSI-derived case representations to have superior retrieval performance to the original keyword-based representations. PSI also has comparable performance to Latent Semantic Indexing, a popular dimensionality reduction technique for text, which, unlike PSI, generates linear combinations of the original features.

Supervised probabilistic principal component analysis

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '06, 2006

Principal component analysis (PCA) has been extensively applied in data mining, pattern recognition and information retrieval for unsupervised dimensionality reduction. When labels of data are available, e.g., in a classification or regression task, PCA is however not able to use this information. The problem is more interesting if only part of the input data are labeled, i.e., in a semi-supervised setting. In this paper we propose a supervised PCA model called SPPCA and a semi-supervised PCA model called S²PPCA, both of which are extensions of a probabilistic PCA model. The proposed models are able to incorporate the label information into the projection phase, and can naturally handle multiple outputs (i.e., in multi-task learning problems). We derive an efficient EM learning algorithm for both models, and also provide theoretical justifications of the model behaviors. SPPCA and S²PPCA are compared with other supervised projection methods on various learning tasks, and show not only promising performance but also good scalability.
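
A small sketch of the generative structure assumed by supervised probabilistic PCA: a shared latent variable generates both the inputs and the outputs, so directions recovered from the joint covariance of inputs and outputs retain label-relevant structure. The eigendecomposition below is a stand-in for the paper's EM algorithm, and all dimensions and noise levels are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_x, d_y, k = 500, 10, 3, 2          # samples, input dim, output dim, latent dim

# Generative structure: a shared latent z produces both inputs x and outputs y.
Z = rng.normal(size=(n, k))             # latent factors
W_x = rng.normal(size=(k, d_x))         # input loadings
W_y = rng.normal(size=(k, d_y))         # output loadings
X = Z @ W_x + 0.1 * rng.normal(size=(n, d_x))
Y = Z @ W_y + 0.1 * rng.normal(size=(n, d_y))

# Because z is shared, directions recovered from the joint covariance of [X, Y]
# keep label-relevant structure; a plain eigendecomposition stands in for EM here.
C = np.cov(np.hstack([X, Y]).T)
vals, vecs = np.linalg.eigh(C)
top = vecs[:, np.argsort(vals)[::-1][:k]][:d_x, :]   # keep the input-side part
Z_hat = X @ top                          # label-informed projection of the inputs
print(Z_hat.shape)
```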

Robust multi-task learning with t-processes

Proceedings of the 24th international conference on Machine learning - ICML '07, 2007

Most current multi-task learning frameworks ignore the robustness issue, which means that the presence of "outlier" tasks may greatly reduce overall system performance. We introduce a robust framework for Bayesian multi-task learning, t-processes (TP), which are a generalization of Gaussian processes (GP) for multi-task learning. TP allows the system to effectively distinguish good tasks from noisy or outlier tasks. Experiments show that TP not only improves overall system performance, but can also serve as an indicator for the "informativeness" of different tasks.
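
The toy computation below illustrates why a heavy-tailed process helps: an outlier task whose function values deviate strongly from the shared prior incurs a much smaller penalty under a Student-t density than under a Gaussian, so it no longer drags the shared model toward itself. This is only a pointwise density comparison, not the paper's TP inference; the deviations and degrees of freedom are made up.

```python
import numpy as np
from scipy.stats import norm, t

deviations = np.array([0.5, 1.0, 5.0])      # the last one mimics an outlier task
gauss_nll = -norm.logpdf(deviations)        # penalty under a Gaussian prior
t_nll = -t.logpdf(deviations, df=3)         # penalty under a heavy-tailed t prior
for d, g, s in zip(deviations, gauss_nll, t_nll):
    print(f"deviation={d:>4}: Gaussian NLL={g:6.2f}  Student-t NLL={s:6.2f}")
```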

Hierarchical Bayesian Models for Collaborative Tagging Systems

2009 Ninth IEEE International Conference on Data Mining, 2009

Collaborative tagging systems with user-generated content have become a fundamental element of websites such as Delicious, Flickr or CiteULike. By sharing common knowledge, massively linked semantic data sets are generated that provide new challenges for data mining. In this paper, we reduce the data complexity in these systems by finding meaningful topics that serve to group similar users and to recommend tags or resources to users. We propose a well-founded probabilistic approach that can model every aspect of a collaborative tagging system. By integrating both user information and tag information into the well-known Latent Dirichlet Allocation framework, the developed models can be used to solve a number of important information extraction and retrieval tasks.
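
For context, plain LDA over bags of tags (each "document" being the tags one user assigned to one resource) is the baseline that the paper's models extend by additionally tying topics to users and resources. A minimal scikit-learn sketch with invented tag data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each "document" is the bag of tags one user assigned to one resource
# (toy data invented for illustration).
tag_docs = ["python machine-learning tutorial",
            "recipe cooking pasta",
            "python numpy scientific",
            "travel cooking food"]
X = CountVectorizer().fit_transform(tag_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))    # topic mixture per (user, resource) tag bag
```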

Collaborative ordinal regression

Proceedings of the 23rd international conference on Machine learning - ICML '06, 2006

Ordinal regression has become an effective way of learning user preferences, but most research focuses on single regression problems. In this paper we introduce collaborative ordinal regression, where multiple ordinal regression tasks are handled simultaneously. Rather than modeling each task individually, we explore the dependency between ranking functions through a hierarchical Bayesian model and assign a common Gaussian process (GP) prior to all individual functions. Empirical studies show that our collaborative model outperforms the individual counterpart in preference learning applications.
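
A rough sketch of the shared-prior idea, assuming scikit-learn: kernel hyperparameters are fit once on pooled data and then shared across per-task GPs. This is a crude substitute for the paper's hierarchical Bayesian treatment and omits the ordinal likelihood entirely; the data and names are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
# two users (tasks) with similar but not identical preference functions
tasks = [np.sin(X).ravel() + 0.1 * rng.normal(size=40),
         np.sin(X + 0.3).ravel() + 0.1 * rng.normal(size=40)]

# fit kernel hyperparameters once on pooled data: a stand-in for the shared prior
pooled = GaussianProcessRegressor(kernel=RBF(), alpha=1e-2).fit(
    np.vstack([X, X]), np.concatenate(tasks))
shared_kernel = pooled.kernel_

# each task keeps its own function values but reuses the shared hyperparameters
per_task = [GaussianProcessRegressor(kernel=shared_kernel, optimizer=None,
                                     alpha=1e-2).fit(X, y) for y in tasks]
x_new = np.array([[0.5]])
print(per_task[0].predict(x_new), per_task[1].predict(x_new))
```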

Comparison of Bayesian network and support vector machine models for two-year survival prediction in lung cancer patients treated with radiotherapy

Medical physics, 2010

Classic statistical and machine learning models such as support vector machines (SVMs) can be used to predict cancer outcomes, but often only perform well if all the input variables are known, which is unlikely in the medical domain. Bayesian network (BN) models have a natural ability to reason under uncertainty and might handle missing data better. In this study, the authors hypothesize that a BN model can predict two-year survival in non-small cell lung cancer (NSCLC) patients as accurately as an SVM, but will predict survival more accurately when data are missing. A BN and an SVM model were trained on 322 inoperable NSCLC patients treated with radiotherapy from Maastricht and validated on three independent data sets of 35, 47, and 33 patients from Ghent, Leuven, and Toronto. Missing variables occurred in the data sets, with only 37, 28, and 24 patients having a complete data set. The BN model structure and parameter learning identified gross tumor volume size, performance status, and numb...

Predictive Models in Personalized Medicine: Neural Information Processing Systems (NIPS), 2010 workshop report

This workshop report is an overview of the Predictive Models in Personalized Medicine workshop held on Dec. 11, 2010 at the 2010 Neural Information Processing Systems (NIPS) Conference in Whistler, Canada. The workshop included 3 keynote talks and 6 oral and 5 poster presentations on peer-reviewed submissions. The workshop also featured a panel discussion on the growing trends of the

Leveraging rich annotations to improve learning of medical concepts from clinical free text

AMIA Annual Symposium Proceedings, 2011

Information extraction from clinical free text is one of the key elements in medical informatics research. In this paper we propose a general framework to improve learning-based information extraction systems with the help of rich annotations (i.e., annotators provide the medical assertion as well as the evidence that supports the assertion). A special graphical interface was developed to facilitate the annotation process, and we show how to implement this framework with a state-of-the-art context-based question answering system. Empirical studies demonstrate that with about 10% longer annotation time, we can significantly improve the accuracy of the system. An approach to provide supporting evidence for test documents is also briefly discussed, with promising preliminary results.

Categorizing medications from unstructured clinical notes

AMIA Joint Summits on Translational Science Proceedings, 2013

One of the important pieces of information in a patient's clinical record is the information about their medications. Besides administration information, it also includes the category of each medication, i.e., whether the patient was taking the medication at home, was administered it in the Emergency Department, during the course of stay, or on discharge, etc. Unfortunately, much of this information is presently embedded in unstructured clinical notes, e.g., in ER records, History & Physical documents, etc. This information is required for adherence to quality and regulatory guidelines or for retrospective analysis, e.g., CMS reporting. Extracting such information manually is a labor-intensive process. This paper explains in detail a statistical NLP system developed to extract such information. We have trained a Maximum Entropy Markov model to categorize instances of medication names into previously defined categories. The system was tested on a variety of clinical notes from different instit...
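
A toy sketch of the MEMM idea, assuming scikit-learn: a maximum-entropy (logistic regression) classifier over per-mention features that include the previous label, decoded greedily left to right. The features, categories, and training examples are invented for illustration; the real system uses far richer clinical context features.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Invented training data: per-mention features (including the previous label)
# paired with invented medication categories.
train = [({"word": "aspirin", "section": "home_meds", "prev": "START"}, "HOME"),
         ({"word": "morphine", "section": "ed_course", "prev": "HOME"}, "ED"),
         ({"word": "warfarin", "section": "discharge", "prev": "ED"}, "DISCHARGE")]
vec = DictVectorizer()
X = vec.fit_transform([f for f, _ in train])
clf = LogisticRegression(max_iter=1000).fit(X, [y for _, y in train])

def greedy_decode(mentions):
    """Label mentions left to right, feeding each predicted label back
    in as the 'prev' feature of the next mention (greedy MEMM decoding)."""
    prev, out = "START", []
    for m in mentions:
        label = clf.predict(vec.transform([{**m, "prev": prev}]))[0]
        out.append(label)
        prev = label
    return out

print(greedy_decode([{"word": "aspirin", "section": "home_meds"},
                     {"word": "morphine", "section": "ed_course"}]))
```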

Annotation models for crowdsourced ordinal data

In supervised learning, when acquiring good quality labels is hard, practitioners resort to getting the data labeled by multiple noisy annotators. Various methods have been proposed to estimate the consensus labels for binary and categorical labels. A commonly ...

Improving medical predictive models via likelihood gamble pricing

A combination of radiotherapy and chemotherapy is often the treatment of choice for cancer patients. Recent developments in the treatment of patients have led to improved survival. However, traditionally used clinical variables have poor accuracy for the prediction of survival and radiation treatment side effects. The objective of this work is to develop and validate improved predictive models for a large group of non-small cell lung cancer (NSCLC) patients and a group of rectal cancer patients. The main goal is to predict survival for both groups of patients and radiation-induced side effects for the NSCLC patients. Given sufficiently accurate predictions, these models can then be used to optimize the treatment of each individual patient, which is the goal of personalized medicine. Our improved predictive models are obtained by using the recently proposed likelihood gamble pricing (LGP), a decision-theoretic approach to statistical inference that marries the likelihood principle of statistics with Von Neumann-Morgenstern's axiomatic approach to decision making.
