Bag of Words Research Papers
Image representation using the bag-of-visual-words approach is commonly used in image classification. Features are extracted from images and clustered into a visual vocabulary. Images can then be represented as a normalized histogram of visual words, much as textual documents are represented as weighted vectors of terms. As a result, text categorization techniques are applicable to image classification. In this paper, our contribution is twofold. First, we propose a suitable Term-Frequency and Inverse Document Frequency weighting scheme to characterize the importance of visual words. Second, we present a method to fuse different bags-of-words obtained with different vocabularies. We show that using our tf.idf normalization and the fusion leads to better classification rates than other normalization methods, other fusion schemes or other approaches evaluated on the SIMPLIcity collection.
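To illustrate the general idea of weighting visual words by tf.idf, here is a minimal sketch. The abstract does not give the paper's exact weighting scheme, so the smoothed idf formula, the L2 normalization, and the function name `tfidf_histograms` are assumptions made for illustration only.

```python
import math

def tfidf_histograms(counts):
    """Weight per-image visual-word counts by tf.idf (illustrative variant).

    counts: list of per-image count lists, all of the same length
    (one entry per visual word in the vocabulary).
    """
    n_images = len(counts)
    n_words = len(counts[0])
    # Document frequency: number of images in which each visual word occurs.
    df = [sum(1 for img in counts if img[w] > 0) for w in range(n_words)]
    # Smoothed idf (an assumption): stays finite and positive for every word.
    idf = [math.log((1 + n_images) / (1 + df[w])) + 1.0 for w in range(n_words)]
    weighted = []
    for img in counts:
        total = sum(img) or 1  # avoid division by zero for empty images
        vec = [(img[w] / total) * idf[w] for w in range(n_words)]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        weighted.append([v / norm for v in vec])  # L2-normalize
    return weighted

# Toy data: 3 images, vocabulary of 3 visual words.
counts = [[4, 0, 1], [3, 2, 0], [5, 0, 0]]
vectors = tfidf_histograms(counts)
```

In this toy data, visual word 0 occurs in every image, so its weight is reduced relative to the rarer words 1 and 2.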
This paper presents a novel real-time system for interacting with an application or video game via hand gestures. Our system detects and tracks a bare hand against a cluttered background using skin detection and a hand posture contour comparison algorithm after face subtraction, recognizes hand gestures via bag-of-features and a multiclass support vector machine (SVM), and builds a grammar that generates gesture commands to control an application. In the training stage, after the keypoints are extracted from every training image using the scale-invariant feature transform (SIFT), a vector quantization technique maps the keypoints from every training image into a histogram vector of unified dimension (bag-of-words) after K-means clustering. This histogram is treated as an input vector for a multiclass SVM to build the training classifier. In the testing stage, for every frame captured from a webcam, the hand is detected using our algorithm; the keypoints are then extracted from the small image that contains only the detected hand gesture and fed into the cluster model to map them into a bag-of-words vector, which is finally fed into the multiclass SVM training classifier to recognize the hand gesture.
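The vector quantization step described above can be sketched as follows. This is only an illustration: it assumes the cluster centroids have already been learned (for example by K-means), uses 2-D toy descriptors in place of 128-D SIFT descriptors, and normalizes the histogram, which the abstract does not specify. The names `nearest_centroid` and `bow_histogram` are hypothetical.

```python
def nearest_centroid(descriptor, centroids):
    """Return the index of the centroid closest to the descriptor (squared Euclidean)."""
    best_index, best_dist = 0, float("inf")
    for i, centroid in enumerate(centroids):
        dist = sum((a - b) ** 2 for a, b in zip(descriptor, centroid))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index

def bow_histogram(descriptors, centroids):
    """Map a variable number of descriptors to a histogram with one bin per centroid."""
    hist = [0] * len(centroids)
    for descriptor in descriptors:
        hist[nearest_centroid(descriptor, centroids)] += 1
    total = sum(hist)
    # Normalization is an assumption; it keeps images with many keypoints comparable.
    return [h / total for h in hist] if total else [0.0] * len(centroids)

# Toy vocabulary of 3 visual words and 4 keypoint descriptors from one image.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.8), (0.2, 0.9)]
hist = bow_histogram(descriptors, centroids)
```

Whatever the number of keypoints in an image, the resulting vector has as many entries as there are clusters, which is what allows it to be passed to a classifier such as an SVM.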
This presentation refers to the project done by Ms. Sidra Mehtab as part of her MSc (Data Science & Analytics) minor projects series. The project has two parts. In Part I of the project, we carried out a sentiment analysis on Twitter data based on the reviews written by the customers of six US airlines. The tweets are already classified into three categories: “positive”, “negative”, and “neutral”. Using a supervised classification approach, we trained a Random Forest classifier on the tweet data. We tested the model on the test data and evaluated it on metrics such as “precision”, “recall”, and F1-score. In the second part of the project, we carried out another important text mining task, known as topic modeling, using the Scikit-Learn library of Python. We used a food review dataset consisting of 50K text reviews on various food items and categorized the reviews into topics using a method called Latent Dirichlet Allocation (LDA).
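For readers unfamiliar with the evaluation metrics named above, the following sketch computes per-class precision, recall, and F1-score for a three-class sentiment task. The labels and predictions are invented toy data, not results from the project, and `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} computed one-vs-rest for each label."""
    metrics = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics

# Toy data (invented for illustration only).
y_true = ["positive", "negative", "neutral", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "negative", "neutral", "negative"]
metrics = per_class_metrics(y_true, y_pred, ["positive", "negative", "neutral"])
```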
Feature selection and extraction are frequently used to reduce the computational burden of text classification. One approach introduces an extraction method for each class that summarizes the characteristics of the sample documents, so that the new features capture the amount of evidence contained in a document. The high-dimensional features of the documents are projected onto a new feature space whose dimension equals the number of classes, yielding the abstract features. This paper explores how various methods of feature extraction from text data affect text classification results. Two extraction methods for Bag of Words are studied, specifically the Count Vector and TF-IDF approaches. An embedding method, GloVe, is also investigated. The effectiveness of classifiers is compared on standard text classification test sets. The findings show that the choice of extraction method has a substantial effect on the resulting classifications but that no approach consistently outperforms the others. Instead, the findings indicate the best output for the retrieval methods with GloVe and the best output for the precise measurements with the Bag of Words system. While the main emphasis is on TF-IDF and word embedding methods, various other feature extraction methods are also discussed.
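As a small illustration of the Count Vector form of Bag of Words mentioned above, the sketch below builds a vocabulary and per-document count vectors. Lowercasing and whitespace tokenization are simplifying assumptions, and `count_vectors` is a hypothetical helper, not the paper's implementation.

```python
from collections import Counter

def count_vectors(texts):
    """Return (vocabulary, vectors), where each vector counts vocabulary words in one text."""
    # Simplifying assumption: lowercase and split on whitespace.
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])  # word order is discarded
    return vocab, vectors

texts = ["the food was good", "the service was not good", "good good food"]
vocab, vectors = count_vectors(texts)
```

Each document becomes a vector whose length equals the vocabulary size; TF-IDF then reweights these counts rather than changing the representation itself.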
- by Mihir Jain and +1
- Image Classification, Image Search, Rule Based, Bag of Words
Recent years have seen an increase of interest in vector-based approaches to lexical semantics. These are inspired by the distributional hypothesis, which states that semantic similarity can be modelled by distributional similarity in a corpus. However, while there are a variety ...
Text categorization is a well-known task based essentially on statistical approaches using neural networks, Support Vector Machines and other machine learning algorithms. Texts are generally considered as bags of words without any order. Although these approaches have proven to be efficient, they do not provide users with comprehensive and reusable rules about their data. Such rules are, however, very important
- by Theo Gevers
- CLEF, Large Scale, Bag of Words
Fault Tree Analysis (FTA) is a proven technique for finding the root cause of a problem and for breaking the problem down systematically and logically. In auto parts manufacturing companies, line stoppage is a major problem, and so bottleneck machines are identified. In this case, a honing machine used for honing brake drums was identified as the bottleneck machine. The problem was the Seat Check Alarm, which halted the machine; the machine would restart only after the brake drum surface and holes were cleaned. This was not only time-consuming but also delayed the production of parts with respect to the fixed Takt time. In addition, the burrs of the holes in the fixture seating area used to affect proper seating of the next part on the fixture surface, which caused further delay in production. This could have been avoided if a chamfer operation had been added to the rear face of the drum holes in the initial design and process, but that may have resulted in an additional operation and would have required another machine. The proposed approach solves the problem by changing the fixture plate in such a way that the holes do not fall in the seating area and the burr area is relieved. This requires a new fixture plate design with proper repositioning of the Seat Check Air Hole while keeping the clamping area the same. The functioning of the machine was studied for a month after the newly designed fixture plate was mounted, and the Seat Check Alarm was not triggered; the proposed technique thus eliminated the stoppage issue, thereby improving production efficiency.
Spatial pyramids have been successfully applied to incorporating spatial information into bag-of-words based image representation. However, a major drawback is that it leads to high dimensional image representations. In this paper, we present a novel framework for obtaining compact pyramid representation. First, we investigate the usage of the divisive information theoretic feature clustering (DITC) algorithm in creating a compact pyramid representation. In many cases this method allows us to reduce the size of a high ...
We propose a novel rule-based model that incorporates contextual information and the effect of negation to enhance the performance of sentiment classification performed using bag-of-words models. We employ morphological analysis in feature extraction to ensure that the feature vector contains only the opinionated words in a textual review. This also reduces the dimensionality of the feature vector and, eventually, improves the efficiency of the classification algorithm. Further, we consider grammatical relationships to incorporate the context of adjectives and the scope of negations within a phrase into the feature vector. This enables our model to capture the contextual polarity of adjectives and the impact of negation words. For the morphological analysis we mainly employ Part Of Speech taggers (POS taggers) and grammatical relationships obtained using typed dependency parsers. By using dependency-based rules, we relax the conditional independence assumption of bag-of-words models by combining adjectives and negations with identified target words and hence obtain a sentiment classification accuracy that is significantly better than baseline performance.
With the increasing popularity of microblogging sites, we are in the era of information explosion. As of June 2011, about 200 million tweets were being generated every day. Although Twitter provides a real-time list of the most popular topics people tweet about, known as Trending Topics, it is often hard to understand what these trending topics are about. Therefore, it is important and necessary to classify these topics into general categories with high accuracy for better information retrieval. To address this problem, we classify Twitter Trending Topics into 18 general categories such as sports, politics, technology, etc. We experiment with two approaches for topic classification: (i) the well-known Bag-of-Words approach for text classification and (ii) network-based classification. In the text-based classification method, we construct word vectors from the trending topic definition and tweets, and the commonly used tf-idf weights are used to classify the topics using a Naive Bayes Multinomial classifier. In the network-based classification method, we identify the top 5 similar topics for a given topic based on the number of common influential users. The categories of the similar topics and the number of common influential users between the given topic and its similar topics are used to classify the given topic using a C5.0 decision tree learner. Experiments on a database of 768 randomly selected trending topics (over 18 classes) show that classification accuracy of up to 65% and 70% can be achieved using text-based and network-based classification modeling respectively.
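The text-based approach above pairs a bag-of-words representation with a multinomial Naive Bayes classifier. The sketch below shows that classifier in its simplest form. Note the assumptions: it uses raw word counts rather than the tf-idf weights the abstract mentions, applies Laplace smoothing with a hypothetical `alpha` parameter, ignores out-of-vocabulary words, and runs on invented toy documents.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents."""
    vocab = sorted({word for doc in docs for word in doc})
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {"log_prior": {}, "log_likelihood": {}}
    for label, n_docs in class_doc_counts.items():
        model["log_prior"][label] = math.log(n_docs / len(docs))
        # Laplace smoothing: add alpha to every vocabulary word's count.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            word: math.log((word_counts[label][word] + alpha) / total)
            for word in vocab
        }
    return model

def predict(model, doc):
    """Return the class with the highest posterior log score for a tokenized document."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        likelihood = model["log_likelihood"][label]
        # Words outside the training vocabulary are skipped (an assumption).
        scores[label] = log_prior + sum(likelihood[w] for w in doc if w in likelihood)
    return max(scores, key=scores.get)

# Toy training data (invented for illustration only).
docs = [["match", "goal", "team"], ["election", "vote", "senate"],
        ["goal", "league", "team"], ["vote", "policy", "election"]]
labels = ["sports", "politics", "sports", "politics"]
model = train_multinomial_nb(docs, labels)
prediction = predict(model, ["team", "goal", "win"])
```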
In this paper, we address both standard and focused retrieval tasks based on comprehensible language models and interactive query expansion (IQE). Query topics are expanded using an initial set of Multi Word Terms (MWTs) selected from top n ranked documents. MWTs are special text units that represent domain concepts and objects. As such, they can better represent query topics than ordinary phrases or n-grams. We tested different query representations: bag-of-words, phrases, flat list of MWTs, subsets of MWTs. We also combined the initial set of MWTs obtained in an IQE process with automatic query expansion (AQE) using language models and smoothing mechanism. We chose as baseline the Indri IR engine based on the language model using Dirichlet smoothing. The experiment is carried out on two benchmarks: TREC Enterprise track (TRECent) 2007 and 2008 collections; INEX 2008 Ad-hoc track using the Wikipedia collection.
We present in this paper a new approach for the automatic annotation of medical images, using the "bag-of-words" approach to represent the visual content of the medical image, combined with text descriptors based on tf.idf and reduced by latent semantic to extract the co-occurrence between terms and visual terms. A medical report is composed of a text describing a medical image. First, we index the text and extract all relevant terms using a thesaurus containing MeSH medical concepts. In a second phase, the medical image is indexed by recovering areas of interest that are invariant to changes in scale, light and tilt. To annotate a new medical image, we use the "bag-of-words" approach to recover the feature vector. We then use the vector space model to retrieve similar medical images from the training database. The relevance value of an image to the query image is calculated using the cosine function. We conclude with an experiment carried out on five types of radiological imaging to evaluate the performance of our medical annotation system. The results showed that our approach works better with more images from the radiology of the skull.
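The cosine-based relevance calculation in the vector space model can be sketched as follows. The feature vectors, image names, and the helper `rank_by_cosine` are invented for illustration; the paper's actual features and database are not given in the abstract.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_by_cosine(query, database):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy bag-of-words vectors (hypothetical names and values).
database = {
    "skull_01": [3, 0, 1, 0],
    "chest_01": [0, 4, 0, 2],
    "skull_02": [2, 1, 1, 0],
}
query = [4, 0, 2, 0]
ranking = rank_by_cosine(query, database)
```

Because cosine similarity depends only on the direction of the vectors, images with different numbers of detected regions can still be compared directly.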
Matching the right people to the right job, considering constraints such as qualifications, availability and cost, is the cornerstone of IT project delivery services. We present a study to improve data accuracy and completeness for resource matching by integrating unstructured data sources and introducing text mining techniques to dynamically adapt resource profiles for resource planning decisions. Our approach discovers resource categories by extracting and learning new patterns from employee resumes and by incorporating resource experience into the job-matching optimization during the resource planning exercise.