Jiefei Li - Academia.edu
Papers by Jiefei Li
ArXiv, 2015
ArXiv, 2015. The evolution of the Internet has produced massive numbers of Semi-Structured Documents (SSDs). These SSDs contain both unstructured features (e.g., plain text) and metadata (e.g., tags). Most previous work has focused on modeling the unstructured text, and more recently some methods have been proposed to model the unstructured text together with specific tags. Building a general model for SSDs remains an important problem in terms of both model fitness and efficiency. We propose a novel method that models SSDs with a Tag-Weighted Topic Model (TWTM). TWTM is a framework that leverages both tag and word information, not only to learn the document-topic and topic-word distributions, but also to infer the tag-topic distributions for text mining tasks. We present an efficient variational inference method with an EM algorithm for estimating the model parameters. We also propose three large-scale solutions for our model on the MapReduce distributed computing platform.
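A plausible way to write the tag-weighting idea down (the notation here is illustrative, not necessarily the exact TWTM parameterization from the paper) is to let each document's Dirichlet prior over topics be a weighted combination of the tag-topic distributions of its tags:

    \theta_d \sim \mathrm{Dirichlet}\!\left( \textstyle\sum_{t \in T_d} \omega_{d,t}\, \phi_t \right), \qquad z_{d,n} \sim \mathrm{Mult}(\theta_d), \qquad w_{d,n} \sim \mathrm{Mult}(\beta_{z_{d,n}}),

where T_d is the tag set of document d, \omega_{d,t} is the learned weight of tag t in d, \phi_t is the tag-topic distribution of tag t, and \beta_k is the topic-word distribution of topic k. Under such a formulation, variational EM would alternate between fitting a factorized posterior over (\theta_d, z_{d,n}) in the E-step and updating (\omega, \phi, \beta) in the M-step.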
Over the last decade, latent Dirichlet allocation (LDA) has successfully discovered the statistical distribution of topics over unstructured text corpora. Meanwhile, with the evolution of the Internet, more and more document data come with rich human-provided tag information; such data are called semi-structured data. Semi-structured data contain both unstructured data (e.g., plain text) and metadata, such as papers with authors and web pages with tags. In general, different tags in a document play different roles and carry different weights, so modeling such semi-structured documents is nontrivial. In this paper, we propose a novel method to model tagged documents with a topic model, called the Tag-Weighted Topic Model (TWTM). TWTM is a framework that leverages the tags in each document to infer the topic components of that document. This makes it possible not only to learn document-topic distributions, but also to infer tag-topic distributions for text mining tasks (e.g., classification, clustering, and recommendation).
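As a concrete but hypothetical sketch of how such a tag-weighted prior could be assembled in code (assuming per-document tag weights and per-tag topic distributions are already available; none of the names below come from the paper):

    # Hypothetical illustration (not the authors' code): assembling a document's
    # Dirichlet prior over topics from the tag-topic distributions of its tags.
    import numpy as np

    rng = np.random.default_rng(0)

    K = 5                                    # number of topics
    tags = ["nlp", "ml", "web"]
    # Stand-in tag-topic distributions phi_t (each sums to 1).
    phi = {t: rng.dirichlet(np.ones(K)) for t in tags}

    def doc_topic_prior(doc_tags, tag_weights, base=0.1):
        """Weighted combination of the tag-topic distributions, plus smoothing."""
        alpha = np.full(K, base)             # symmetric smoothing term
        for t, w in zip(doc_tags, tag_weights):
            alpha += w * phi[t]              # tag t contributes in proportion to its weight
        return alpha

    # A document tagged "nlp" (weight 2.0) and "web" (weight 0.5).
    alpha_d = doc_topic_prior(["nlp", "web"], [2.0, 0.5])
    theta_d = rng.dirichlet(alpha_d)         # one draw of the document-topic mixture
    print(np.round(theta_d, 3))

The actual TWTM learns the tag weights, tag-topic distributions, and topic-word distributions jointly by variational EM; the snippet only illustrates how the tag side could feed into the document-topic prior.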
Proceedings of the Twenty-Second international …, 2011
Text classification is widely used in many real-world applications. To obtain satisfactory classification performance, most traditional data mining methods require large amounts of labeled data, which can be costly in terms of both time and human effort. In practice, such resources are plentiful in English, which has the largest user population on the Internet, but this is not true for many other languages. In this paper, we present a novel transfer learning approach to the cross-language text classification problem. We first align the feature spaces of the two domains using an online translation service, which places the two feature spaces in the same coordinate system. Although the feature sets of the two domains are then the same, the distributions of the instances in the two domains differ, which violates the i.i.d. assumption of most traditional machine learning methods. To address this issue, we propose an iterative feature and instance weighting (Bi-Weighting) method for domain adaptation. We empirically evaluate the effectiveness and efficiency of our approach. The experimental results show that our approach outperforms several baselines, including four transfer learning algorithms.
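The sketch below illustrates the two ingredients in miniature, with a toy word-level dictionary standing in for the online translation service and a deliberately simple weighting rule; the names and the weighting scheme are assumptions for illustration, not the paper's Bi-Weighting algorithm, which iterates and refines both sets of weights:

    # Hypothetical illustration: align a source-language bag-of-words space with
    # the target language, then down-weight source instances/features that look
    # unlike the unlabeled target-domain data.
    import numpy as np

    # Toy "translation service": a word-level dictionary (hypothetical).
    dictionary = {"bueno": "good", "malo": "bad", "pelicula": "movie"}
    vocab = ["good", "bad", "movie"]
    idx = {w: i for i, w in enumerate(vocab)}

    def vectorize(tokens):
        """Bag-of-words count vector over the shared (target-language) vocabulary."""
        v = np.zeros(len(vocab))
        for tok in tokens:
            tok = dictionary.get(tok, tok)   # translate source words when possible
            if tok in idx:
                v[idx[tok]] += 1
        return v

    # Labeled source documents (Spanish) and unlabeled target documents (English).
    source_docs = [["bueno", "pelicula"], ["malo", "pelicula"]]
    target_docs = [["good", "movie", "movie"], ["bad", "movie"]]
    Xs = np.array([vectorize(d) for d in source_docs])
    Xt = np.array([vectorize(d) for d in target_docs])

    # Feature weights: features whose relative frequency differs less across
    # domains get weights closer to 1.
    ps = Xs.sum(0) / Xs.sum()
    pt = Xt.sum(0) / Xt.sum()
    feature_w = 1.0 - np.abs(ps - pt)

    # Instance weights: cosine similarity of each feature-weighted source
    # document to the target-domain centroid.
    centroid = (Xt * feature_w).mean(0)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    instance_w = np.array([cos(x * feature_w, centroid) for x in Xs])
    print(np.round(feature_w, 2), np.round(instance_w, 2))

In a full pipeline, weights like these would enter the weighted loss of the final classifier and be re-estimated over several iterations.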
Proceedings of the KDD Cup 2013 Workshop (KDD Cup '13), 2013