Olivia Liu Sheng - Academia.edu

Papers by Olivia Liu Sheng

Design Science Track

Web site design is critical to the success of electronic commerce and digital government. Effective design requires appropriate evaluation methods and measurement metrics. The current research examines Web site navigability, a fundamental structural aspect of Web site design. We define Web site navigability as the extent to which a visitor can use a Web site's hyperlink structure to locate target contents successfully in an easy and efficient manner. We propose a systematic Web site navigability evaluation method built on Web mining techniques. To complement the subjective self-reported metrics commonly used by previous research, we develop three objective metrics for measuring Web site navigability on the basis of the Law of Surfing. We illustrate the use of the proposed methods and measurement metrics with two large Web sites.
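
The Law of Surfing models the number of pages a visitor views in a session as following an inverse Gaussian distribution. The abstract does not spell out the three objective metrics, so the following is only a minimal sketch of the underlying idea under assumed inputs: fit an inverse Gaussian to observed session lengths (the `session_lengths` data here are synthetic placeholders for values parsed from a web log) and estimate how likely a visitor is to surf deep enough to reach a target page.

```python
# Minimal sketch (not the paper's exact metrics): fit the Law of Surfing
# (inverse Gaussian distribution of session depth) to observed session lengths
# and estimate the probability that a visit lasts long enough to reach a page
# that sits `clicks_needed` links away from the entry page.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical data: number of pages viewed per session, e.g. parsed from a web log.
session_lengths = rng.choice([1, 2, 3, 4, 5, 7, 10, 15], size=5000,
                             p=[0.35, 0.2, 0.15, 0.1, 0.08, 0.06, 0.04, 0.02])

# Fit an inverse Gaussian with the location fixed at 0, as the Law of Surfing assumes.
mu, loc, scale = stats.invgauss.fit(session_lengths, floc=0)

clicks_needed = 4  # assumed depth of a target page in the hyperlink structure
p_reach = stats.invgauss.sf(clicks_needed, mu, loc=loc, scale=scale)
print(f"Estimated probability a session surfs past depth {clicks_needed}: {p_reach:.3f}")
```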

Learning hidden patterns from patient multivariate time series data using convolutional neural networks: A case study of healthcare cost prediction

Journal of Biomedical Informatics, 2020

Objective: To develop an effective and scalable individual-level patient cost prediction method by automatically learning hidden temporal patterns from multivariate time series data in patient insurance claims using a convolutional neural network (CNN) architecture. Methods: We used three years of medical and pharmacy claims data from 2013 to 2016 from a healthcare insurer, where data from the first two years were used to build the model to predict costs in the third year. The data consisted of the multivariate time series of cost, visit and medical features that were shaped as images of patients' health status (i.e., matrices with time windows on one dimension and the medical, visit and cost features on the other dimension). Patients' multivariate time series images were given to a CNN method with a proposed architecture. After hyper-parameter tuning, the proposed architecture consisted of three building blocks of convolution and pooling layers with an LReLU activation function and a customized kernel size at each layer for healthcare data. The temporal patterns learned by the proposed CNN became inputs to a fully connected layer. We benchmarked the proposed method against three other methods: 1) a spike temporal pattern detection method, as the most accurate method for healthcare cost prediction described to date in the literature; 2) a symbolic temporal pattern detection method, as the most common approach for leveraging healthcare temporal data; and 3) the most commonly used CNN architectures for image pattern detection (i.e., AlexNet, VGGNet and ResNet) (via transfer learning). Moreover, we assessed the contribution of each type of data (i.e., cost, visit and medical). Finally, we externally validated the proposed method against a separate cohort of patients. All prediction performances were measured in terms of mean absolute percentage error (MAPE). Results: The proposed CNN configuration outperformed the spike temporal pattern detection and symbolic temporal pattern detection methods with a MAPE of 1.67 versus 2.02 and 3.66, respectively (p<0.01). The proposed CNN also outperformed ResNet, AlexNet and VGGNet, which had MAPEs of 4.59, 4.85 and 5.06, respectively (p<0.01). Removing medical, visit and cost features resulted in MAPEs of 1.98, 1.91 and 2.04, respectively (p<0.01). Conclusions: Feature learning through the proposed CNN configuration significantly improved individual-level healthcare cost prediction. The proposed CNN was able to outperform temporal pattern detection methods that look for a pre-defined set of pattern shapes, since it is capable of extracting a variable number of patterns with various shapes. Temporal patterns learned from medical, visit and cost data made significant contributions to the prediction performance. Hyper-parameter tuning showed that considering three-month data patterns yields the highest prediction accuracy. Our results showed that patients' images extracted from multivariate time series data are different from regular images, and hence require unique designs of CNN architectures. The proposed method for converting multivariate time series data of patients into images and tuning them for convolutional learning could be applied in many other healthcare applications with multivariate time series data.
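
The abstract states the overall shape of the architecture (three convolution and pooling blocks with LReLU feeding a fully connected layer) but not the exact kernel sizes or layer widths, so the PyTorch sketch below uses illustrative values. The patient "image" is laid out as time windows on one axis and cost/visit/medical features on the other, as described.

```python
# Minimal sketch of the described architecture shape; layer widths and kernel
# sizes are illustrative assumptions, not the paper's tuned values.
import torch
import torch.nn as nn

class CostCNN(nn.Module):
    def __init__(self):
        super().__init__()
        def block(c_in, c_out, kernel):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=kernel, padding="same"),
                nn.LeakyReLU(),
                nn.MaxPool2d(kernel_size=(2, 1)),  # pool along the time axis only
            )
        self.features = nn.Sequential(
            block(1, 16, (3, 3)),   # assumed kernel sizes
            block(16, 32, (3, 3)),
            block(32, 64, (3, 3)),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        self.head = nn.Linear(64, 1)  # predicted next-period cost

    def forward(self, x):
        # x: (batch, 1, time_windows, features) - the patient "image"
        return self.head(self.features(x).flatten(1))

model = CostCNN()
dummy = torch.randn(8, 1, 24, 30)   # 8 patients, 24 time windows, 30 features
print(model(dummy).shape)           # torch.Size([8, 1])
```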

Temporal Pattern Detection to Predict Adverse Events in Critical Care: A Case Study with Acute Kidney Injury (Preprint)

BACKGROUND More than 20% of patients admitted to the intensive care unit (ICU) develop an adverse event (AE), increasing the risk of further complications and mortality. Despite substantial research on AE prediction, no previous study has leveraged patients' temporal data to extract features using their structural temporal patterns, i.e., trends. OBJECTIVE To improve AE prediction methods by using structural temporal pattern detection for patients admitted to the ICU, extracting features from their temporal data to capture global and local temporal trends, and to demonstrate these improvements in the detection of Acute Kidney Injury (AKI). METHODS Using the MIMIC dataset, we extracted both global and local trends using structural pattern detection methods to predict AKI. Classifiers were built using state-of-the-art models; the optimal classifier was selected for comparisons with previous approaches. The classifier with structural pattern detection features was compared with ...
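
The abstract is truncated and does not define the trend features precisely, so the sketch below uses one plausible reading as an assumption: the global trend as the least-squares slope over a whole clinical time series and local trends as slopes over fixed-size windows, producing features that could feed an AKI classifier. The window size and the creatinine example are illustrative only.

```python
# Minimal sketch (assumed definitions, not the paper's structural pattern method):
# global trend = least-squares slope over the whole series; local trends = slopes
# over fixed-size windows, used as features for a downstream classifier.
import numpy as np

def slope(y: np.ndarray) -> float:
    x = np.arange(len(y))
    return float(np.polyfit(x, y, 1)[0])

def trend_features(series: np.ndarray, window: int = 6) -> dict:
    feats = {"global_trend": slope(series)}
    local = [slope(series[i:i + window])
             for i in range(0, len(series) - window + 1, window)]
    feats.update({f"local_trend_{i}": s for i, s in enumerate(local)})
    return feats

# Hypothetical creatinine measurements for one ICU stay.
creatinine = np.array([0.9, 1.0, 1.1, 1.0, 1.2, 1.4, 1.5, 1.9, 2.2, 2.6, 2.9, 3.1])
print(trend_features(creatinine))
```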

Integrated Office Information System (IOIS) Summary Report: Integration Strategy for Distributed Environment

The key to integrated office support in a distributed environment is the integration of heterogeneous databases used at different locations for various purposes. The core of heterogeneous database integration is at the logical level. This report presents an approach to logical integration of multiple databases.

A Monitoring Algorithm for Incremental Association Rule Mining

As knowledge generated by traditional data mining methods reflects only the current state of the database, a database needs to be re-mined every time an update occurs. For databases with frequent updates, such operations not only could inflict unbearable costs but also often result in repetitive knowledge identical to that of previous mining, because data in real-world applications overlap considerably [10]. To tackle this problem, we propose a monitoring algorithm for association rule mining that can determine whether or not it is necessary to re-mine an updated database. Pattern difference is defined to measure changes in patterns between original data and incremental data (i.e., new data). Based on the significance of the pattern difference, our algorithm determines the necessity of re-mining. This paper also presents experiments that test the reliability and efficiency of the monitoring algorithm using synthetic data sets generated by the extensively used IBM test data generator. The comparison of the proposed algorithm with the traditional Apriori algorithm shows that the former is not only reliable but also much faster.
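
The abstract does not give the exact formula for pattern difference, so the sketch below uses one plausible reading as an assumption: compare the supports of frequent itemsets mined from the original data with those observed in the incremental data, and trigger re-mining only when the average change exceeds a threshold.

```python
# Minimal sketch of the monitoring idea (the paper's exact pattern-difference
# measure is not given in the abstract): compare itemset supports in the original
# vs. the incremental data and re-mine only when the average change is large.
def pattern_difference(orig_support: dict, incr_support: dict) -> float:
    patterns = set(orig_support) | set(incr_support)
    diffs = [abs(orig_support.get(p, 0.0) - incr_support.get(p, 0.0)) for p in patterns]
    return sum(diffs) / len(diffs) if diffs else 0.0

def needs_remining(orig_support: dict, incr_support: dict, threshold: float = 0.05) -> bool:
    return pattern_difference(orig_support, incr_support) > threshold

# Hypothetical frequent itemsets with their supports.
orig = {("milk", "bread"): 0.30, ("beer", "chips"): 0.12}
incr = {("milk", "bread"): 0.28, ("beer", "chips"): 0.02, ("tea",): 0.15}
print(pattern_difference(orig, incr), needs_remining(orig, incr))
```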

Schema management for large-scale multidatabase systems

Clustering Web Sessions Using Extended General Pages

We study Web session clustering in order to find groups of similar sessions and discover user access patterns on a Web site. We extend the general page concept presented in (Fu, Sandhu and Shih 2000) by including partial document names and dynamic pages, and use an extended general page (EGP) to represent many individual page URLs sharing the same EGP. We present two extensions of a hierarchical clustering algorithm, ROCK (Guha, Rastogi and Shim 2000). One is a notion of EGP count that we add to the session similarity calculation. The other is a goodness threshold we adopt to restrict certain clusters from merging with others. Further, we propose a set of measurements for assessing the results of clustering boolean and categorical data and helping users identify their desired clustering results. In our experiments, we applied the ROCK and the extended ROCK (E-ROCK) algorithms to cluster a half-month's Web log from a customer service Web site at HP. The experimental results showed that E-ROCK alleviated the large-cluster problem of the ROCK algorithm and improved performance in intra-cluster similarity.
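
A minimal sketch of the extended-general-page idea, under assumptions: collapse individual URLs to a coarser page identifier (here simply the first path segments with query strings dropped, which is an illustrative rule rather than the paper's exact one), then measure session similarity over EGP sets with the Jaccard coefficient used in ROCK-style clustering.

```python
# Minimal sketch (assumed EGP rule): map URLs to extended general pages and compute
# a Jaccard similarity between sessions represented as sets of EGPs.
from urllib.parse import urlparse

def to_egp(url: str, depth: int = 2) -> str:
    path = urlparse(url).path.strip("/").split("/")
    return "/".join(path[:depth]) or "/"

def session_similarity(s1: set, s2: set) -> float:
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

# Hypothetical customer-support sessions.
session_a = {to_egp(u) for u in [
    "http://support.example.com/printers/drivers/p1102.html",
    "http://support.example.com/printers/manuals/view.cgi?id=7",
]}
session_b = {to_egp(u) for u in [
    "http://support.example.com/printers/drivers/p4015.html",
    "http://support.example.com/scanners/drivers/s300.html",
]}
print(session_a, session_b, session_similarity(session_a, session_b))
```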

Can Visible Cues of Search Results Tell Vendors' Reliability?

A search engine provides two distinct types of results, organic and paid, each of which uses different mechanisms for selecting and ranking relevant Web pages for a query. For an e-commerce query, vendors represented by websites in these organic and paid results are expected to have varying reliability ratings, such as a satisfactory or unsatisfactory score from the Better Business Bureau (BBB) based on overall customer experiences. In this paper we empirically examine how vendors' BBB reliability ratings are associated with cues (such as the type of result, relative price, and number of sites selling the product) that can be observed or derived from search results, and we further attempt to predict vendors' BBB reliability ratings using those cues.

Investigating physician acceptance of telemedicine technology: a survey study in Hong Kong

Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences. 1999. HICSS-32. Abstracts and CD-ROM of Full Papers

Fast-growing interest in telemedicine and increased investment in its enabling technology have made physician technology acceptance a growing concern for the development and management of telemedicine. In this study, we used the Theory of Planned Behavior to investigate technology acceptance by physicians who practiced in public tertiary hospitals in Hong Kong. Our data supported the investigated theory, although its explanatory power for physicians' technology acceptance was moderate. Overall, physicians showed positive attitudes towards the use of telemedicine technology and exhibited moderate intention to use it, particularly for clinical tasks. Furthermore, several implications for the development and management of telemedicine can be drawn from the findings.

Evaluation of Ontology-based User Interests Modeling

Deriving users' interests from their online searching and browsing behaviors is an important research direction with several applications in content search and management. Manually edited Web directories, such as the Open Directory Project (ODP) or the Yahoo! directory, provide an ontology of concepts (categories) along with pages relevant to those categories. Aiming to evaluate and compare the performance of different cues in searching and browsing activities for user interest modeling, we used four inputs (search query, expanded query, snippet, and page content) and mapped each of them into a set of ODP categories. Then we automatically identified an output category as representative of the user interest that initiated a search. Through a controlled experiment, we compared the performance of the different inputs in interest mapping using two metrics: hit rate and average hit index. We found that the use of page content achieved the best results, i.e., the highest hit rate and lowest average hit index. Also, an expanded query (the original query with a few additional terms) was a better input for identifying user interests than the original query and the snippet. The expanded query produced hit rates 21-121% higher than those achieved by the original query and had a 34% lower average hit index than the original query.
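
The two metrics are not defined in the abstract, so the sketch below interprets them from their names as an assumption: hit rate as the share of searches whose true interest category appears in the ranked list of mapped ODP categories, and average hit index as the mean rank position of the true category when it is found. The category names and rankings are hypothetical.

```python
# Minimal sketch of the two evaluation metrics under assumed definitions:
# hit rate = fraction of searches whose true category appears in the mapped list;
# average hit index = mean 1-based rank of the true category among the hits.
def evaluate(mapped_rankings, true_categories):
    hit_indices = []
    for ranked, truth in zip(mapped_rankings, true_categories):
        if truth in ranked:
            hit_indices.append(ranked.index(truth) + 1)  # 1-based rank
    hit_rate = len(hit_indices) / len(true_categories)
    avg_hit_index = sum(hit_indices) / len(hit_indices) if hit_indices else float("nan")
    return hit_rate, avg_hit_index

# Hypothetical ODP categories mapped from the page-content cue for three searches.
rankings = [["Computers/Software", "Business"], ["Arts/Music", "Shopping/Music"], ["Health"]]
truths = ["Computers/Software", "Shopping/Music", "Sports"]
print(evaluate(rankings, truths))   # (0.666..., 1.5)
```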

A Network-based Approach to Mining Competitor Relationships from Online News

Identifying competitors for an individual company or a group of companies is important for businesses. Although people can consult paid company profile resources such as Hoover's and Mergent, these sources are incomplete in their coverage of company relationships. We present an approach that uses graph-theoretic measures and machine learning techniques to achieve automated discovery of competitor relationships on the basis of the structure of an intercompany network derived from company citations (co-occurrence) in online news articles. We also estimate to what extent our approach could extend the competitor relationships available from these data sources, Hoover's and Mergent.

Avoiding the blind spots: Competitor identification using web text and linkage structure

The importance of identifying competitors and of avoiding "competitive blind spots" in the marketplace has been well emphasized in research and practice. However, identification of competitors is non-trivial and requires active monitoring of a focal company's competitive environment. The difficulty of such identification is amplified manifold when there is more than one focal company of interest. As the web presence of companies, their clients/consumers, and their suppliers continues to grow, it is increasingly realistic to assume that real-world competitive relationships are reflected in the text and linkage structure of the relevant pages on the web. However, finding the appropriate web-based cues that effectively signal competitor relationships remains a challenge. Using web data collected for more than 2500 companies of the Russell 3000 index, we explore the notion that web cues can allow us to discriminate, in a statistically significant manner, between competitors and non-competitors. Based on this analysis, we present an automated technique that uses the most significant web-based cues and applies predictive modeling to identify competitors. We find that several web-based metrics have, on average, significantly different values for companies that are competitors as opposed to non-competitors. We also find that the predictive models built using the web-based metrics that we suggest provide high precision, recall, F-measure, and accuracy in identifying competitors.
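
The abstract names the pipeline (significant web-based cues fed to a predictive model, evaluated by precision, recall and F-measure) but not the cues or the learner, so the sketch below is only an assumed illustration: placeholder pairwise features for company pairs, a logistic regression classifier from scikit-learn, and the stated evaluation measures.

```python
# Minimal sketch of the prediction step (feature names and the learner are
# placeholders, not the paper's actual web-based cues or model).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_pairs = 1000
# Hypothetical pairwise cues, e.g. text similarity of the two sites, co-citation count.
X = rng.random((n_pairs, 3))
y = (0.6 * X[:, 0] + 0.4 * X[:, 1] + 0.1 * rng.standard_normal(n_pairs) > 0.6).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(precision_score(y_te, pred), recall_score(y_te, pred), f1_score(y_te, pred))
```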

Investigation of factors affecting healthcare organization's adoption of telemedicine technology

Proceedings of the 33rd Annual Hawaii International Conference on System Sciences

Recent advances in information and biomedical technology have significantly increased the technical feasibility, clinical viability and economic affordability of telemedicine-assisted service collaboration and delivery. The ultimate success of telemedicine in an adopting organization requires that the organization properly address both technological and managerial challenges. Based on Tornatzky and Fleischer's framework, we developed and empirically evaluated a research model for healthcare organizations' adoption of telemedicine technology, using a survey study that involved public healthcare organizations in Hong Kong. Results of our exploratory study suggested that the research model exhibited reasonable significance and classification accuracy and that the collective attitude of medical staff and perceived service risks were the two most significant factors in organizational adoption of telemedicine technology. Furthermore, several implications for telemedicine management emerged from our study and are discussed as well.

Investigating Predictive Power of Stock Micro Blog Sentiment in Forecasting Future Stock Price Directional Movement

ICIS 2011 Proceedings, 2011

This study attempts to discover and evaluate the predictive power of stock micro blog sentiment on future stock price directional movements. We construct a set of robust models based on sentiment analysis and data mining algorithms. Using 72,221 micro blog ...

When Is the Right Time to Refresh Knowledge Discovered from Data?

Operations Research, 2013

Knowledge discovery in databases (KDD) techniques have been extensively employed to extract knowledge from massive data stores to support decision making in a wide range of critical applications. Maintaining the currency of discovered knowledge over evolving data sources is a fundamental challenge faced by all KDD applications. This paper addresses the challenge from the perspective of deciding the right times to refresh knowledge. We define the knowledge-refreshing problem and model it as a Markov decision process. Based on the identified properties of the Markov decision process model, we establish that the optimal knowledge-refreshing policy is monotonically increasing in the system state within every appropriate partition of the state space. We further show that the problem of searching for the optimal knowledge-refreshing policy can be reduced to the problem of finding the optimal thresholds and propose a method for computing the optimal knowledge-refreshing policy. The effecti...
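
A toy illustration of the threshold structure, not the paper's model: assume the state indexes how far the data have drifted since the last refresh, "wait" incurs a loss that grows with drift, and "refresh" pays a fixed cost and resets the state. Value iteration over discounted cost then recovers a policy that refreshes only once the drift state crosses a threshold, mirroring the monotone-policy result described in the abstract.

```python
# Toy knowledge-refreshing MDP (costs, transition probability and state meaning
# are assumptions for illustration). Value iteration yields a threshold policy.
import numpy as np

n_states, gamma, p_drift, refresh_cost = 10, 0.95, 0.7, 3.0
loss = 0.5 * np.arange(n_states)          # assumed per-period loss of stale knowledge

V = np.zeros(n_states)
for _ in range(500):
    nxt = np.minimum(np.arange(n_states) + 1, n_states - 1)
    q_wait = loss + gamma * (p_drift * V[nxt] + (1 - p_drift) * V)   # keep old knowledge
    q_refresh = refresh_cost + gamma * V[0]                          # re-mine, reset drift
    V = np.minimum(q_wait, q_refresh)

policy = np.where(q_wait <= q_refresh, "wait", "refresh")
print(list(enumerate(policy)))   # refresh appears only above a threshold state
```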

Analysis of the query logs of a Web site search engine

Journal of the American Society for Information Science and Technology, 2005

A large number of studies have investigated the transaction logs of general-purpose search engines such as Excite and AltaVista, but few studies have reported on the analysis of search logs for search engines that are limited to particular Web sites, namely, Web site search engines. In this article, we report our research on analyzing the search logs of the search engine of the Utah state government Web site. Our results show that some user statistics, such as the number of search terms per query, are the same for general-purpose search engines and Web site search engines, but others, such as the search topics and the terms used, are considerably different. Possible reasons for the differences include the focused domain of Web site search engines and users' different information needs. The findings are useful for Web site developers seeking to improve the performance of the services provided on their Web sites and for researchers conducting further research in this area. The analysis can also be applied in e-government research by investigating how information should be delivered to users in government Web sites.
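
A minimal sketch of the kind of log statistics the study reports, under an assumed log format: compute the mean number of search terms per query and the most frequent terms from a list of query strings. The sample queries are hypothetical.

```python
# Minimal sketch (assumed log format): terms per query and most frequent terms.
from collections import Counter

queries = [                      # hypothetical entries from a site search log
    "business license renewal",
    "drivers license",
    "tax forms",
    "state jobs",
    "drivers license renewal",
]
terms_per_query = [len(q.split()) for q in queries]
term_counts = Counter(t for q in queries for t in q.lower().split())

print(sum(terms_per_query) / len(terms_per_query))   # mean terms per query
print(term_counts.most_common(3))                    # most frequent terms
```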

A Data-Driven Approach to Measure Web Site Navigability

Journal of Management Information Systems, 2012

Web site navigability refers to the degree to which a visitor can follow a Web site's hyperlink structure to successfully find information with efficiency and ease. In this study, we take a data-driven approach to measuring Web site navigability using Web data readily available in organizations. Guided by information foraging and information-processing theories, we identify fundamental navigability dimensions that should be emphasized in metric development. Accordingly, we propose three data-driven metrics, namely power, efficiency, and directness, that consider Web structure, usage, and content data to measure a Web site's navigability. We also develop a Web mining-based method that processes Web data to enable the calculation of the proposed metrics. We further implement a prototype system based on the Web mining-based method and use it to assess the navigability of two sizable, real-world Web sites with the metrics. To examine the analysis results produced by the metrics, we perform an evaluation study that involves these two sites and 248 voluntary participants. The evaluation results show that user performance and assessments are consistent with the analysis results revealed by our metrics. Our study demonstrates the viability and practical value of data-driven metrics for measuring Web site navigability, which can be used for evaluative, diagnostic, or predictive purposes.
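
The paper defines power, efficiency, and directness precisely, but the abstract does not, so the following is an illustrative sketch only: an assumed directness-style calculation that combines structure data (the hyperlink graph) with usage data (observed click paths), taking the ratio of the shortest possible number of clicks to the clicks actually taken to reach a target page. The site graph and sessions are hypothetical.

```python
# Illustrative sketch only (assumed metric definition, not the paper's): combine
# the hyperlink graph with observed sessions to score how directly visitors
# reach their target pages.
import networkx as nx

site = nx.DiGraph([("home", "products"), ("products", "printers"),
                   ("printers", "p1102"), ("home", "support"), ("support", "p1102")])

def directness(graph, sessions):
    scores = []
    for path in sessions:                                   # observed click paths
        ideal = nx.shortest_path_length(graph, path[0], path[-1])
        scores.append(ideal / (len(path) - 1))
    return sum(scores) / len(scores)

observed_sessions = [["home", "products", "printers", "p1102"],
                     ["home", "support", "p1102"]]
print(directness(site, observed_sessions))
```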

Understanding relationships among teleworkers' e-mail usage, e-mail richness perceptions, and e-mail productivity perceptions under a software engineering environment

IEEE Transactions on Engineering Management, 2000

This study was undertaken to investigate the use of e-mail and its implications in a telework environment for distributed software engineering. For this, the relative strength of a social influence versus individual attributes in affecting teleworkers' e-mail use was studied. Management support was used as the representative social influence, while age, status, and ease of use represented individual attributes. An examination was also made of how e-mail use, individual attributes, and management support affected perceptions of e-mail's information richness and e-mail productivity. Two different types of surveys, log sheets and perception-based self-reports, as well as interviews and e-mail correspondence, constituted the data sources. Three hierarchical regression models were defined and tested for hypothesis validation. Data analysis indicated that management support was a much more powerful indicator of teleworkers' media use than individual characteristics. Furthermore, although labeled a relatively lean medium from the media richness theory perspective, e-mail could become an effective and richer communication tool through an active social construction process driven by management support. Finally, management support and the perception of e-mail as a rich medium were both highly influential in creating teleworkers' positive perceptions of e-mail productivity. This study provided a strong indication that effective adoption of e-mail by teleworkers as an information-rich medium could benefit distributed work and distributed organizations through enhanced work productivity.
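
A minimal sketch of hierarchical (blockwise) regression of the kind described: enter the individual-attribute block first, then add management support and compare the change in R-squared. The variable names and data are hypothetical, and the actual study used three models over survey measures.

```python
# Minimal sketch of hierarchical regression (hypothetical variables and data):
# block 1 = individual attributes; block 2 adds management support; compare R^2.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 120
df = pd.DataFrame({
    "age": rng.normal(40, 8, n),
    "ease_of_use": rng.normal(3.5, 0.8, n),
    "mgmt_support": rng.normal(3.0, 1.0, n),
})
df["email_use"] = 0.1 * df["ease_of_use"] + 0.6 * df["mgmt_support"] + rng.normal(0, 1, n)

block1 = sm.OLS(df["email_use"], sm.add_constant(df[["age", "ease_of_use"]])).fit()
block2 = sm.OLS(df["email_use"], sm.add_constant(df[["age", "ease_of_use", "mgmt_support"]])).fit()
print(block1.rsquared, block2.rsquared, block2.rsquared - block1.rsquared)
```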

Mining competitor relationships from online news: A network-based approach

Electronic Commerce Research and Applications, 2011

Identifying competitors is important for businesses. We present an approach that uses graph-theoretic measures and machine learning techniques to infer competitor relationships on the basis of the structure of an intercompany network derived from company citations (co-occurrence) in online news articles. We also estimate to what extent our approach complements commercial company profile data sources such as Hoover's and Mergent.

Discovering company revenue relations from news: A network approach

Decision Support Systems, 2009

Large volumes of online business news provide an opportunity to explore various aspects of companies. A news story pertaining to a company often cites other companies. Using such company citations, we construct an intercompany network, employ social network analysis techniques to identify a set of attributes from the network structure, and feed the attributes to machine learning methods to predict the company revenue relation (CRR), which is based on two companies' relative quantitative financial data. Hence, we seek to understand the power of network structural attributes in predicting CRRs that are not described in the news or known at the time the news was published. The network attributes produce close to 80% precision, recall, and accuracy for all 87,340 company pairs in the network. This approach is scalable and can be extended to private and foreign companies for which financial data are unavailable or hard to procure.
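
A minimal sketch of the pipeline described, with the specific network attributes and learner left as assumptions: build an intercompany co-occurrence network from company citations in news stories, derive structural attributes per company with networkx, and form pairwise feature vectors that a downstream classifier could use to predict the CRR for a company pair. The company names and stories are hypothetical.

```python
# Minimal sketch (attribute set and learner are assumptions): co-occurrence network
# from news citations -> structural attributes -> pairwise features for a classifier.
from itertools import combinations
import networkx as nx

news_citations = [                 # hypothetical: companies cited together per story
    ["AcmeCo", "BetaCorp", "GammaInc"],
    ["AcmeCo", "BetaCorp"],
    ["DeltaLtd", "GammaInc"],
]

g = nx.Graph()
for story in news_citations:
    for a, b in combinations(sorted(set(story)), 2):
        w = g.get_edge_data(a, b, {"weight": 0})["weight"]
        g.add_edge(a, b, weight=w + 1)          # edge weight = co-occurrence count

degree = nx.degree_centrality(g)
betweenness = nx.betweenness_centrality(g)

def pair_features(a, b):
    return [degree[a], degree[b], betweenness[a], betweenness[b],
            g[a][b]["weight"] if g.has_edge(a, b) else 0]

print(pair_features("AcmeCo", "BetaCorp"))   # fed to a downstream classifier
```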
