Word Embedding Research Papers - Academia.edu

The massive volume of comments on websites and social networks has made it possible to gauge people's beliefs and preferences regarding products and services on a large scale. For this purpose, sentiment analysis, which refers to determining the sentiment of texts, has been proposed as an intelligent solution. From a methodological point of view, the recent combination of word embeddings and deep neural networks (DNNs) has become an effective approach to sentiment analysis. In Persian studies, formal corpora such as Wikipedia dumps have been used for word embedding. Because of the fundamental differences between formal and informal texts, vectors derived from formal texts do not achieve desirable accuracy in informal contexts such as social networks. To overcome this drawback, in this paper we provide a large integrated text corpus drawn from several different sources of informal comments, and we use fastText as the word embedding algorithm. We also use an attention-based LSTM, which has been shown to outperform similar methods in sentiment analysis for English. The proposed method is evaluated on the two Persian "Taaghche" and "Filimo" datasets collected in this paper. The experiments on these two datasets show that utilizing informal vectors and applying the attention model improves the prediction accuracy of the DNN in the sentiment analysis of Persian texts.
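The abstract above attributes its gains to fastText's handling of informal text; the key mechanism is that fastText represents each word as a bag of character n-grams, so the misspelled or out-of-vocabulary forms common on social media still receive meaningful vectors. A minimal sketch of the n-gram extraction (the `<`/`>` boundary markers follow the fastText convention; this is an illustration, not the paper's code):

```python
def fasttext_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers, as in fastText's subword model."""
    marked = f"<{word}>"  # boundary markers distinguish prefixes/suffixes
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

# n-grams of length 3-4 for "where"
print(fasttext_ngrams("where", 3, 4))
# → ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']
```

A word's vector is then the sum of the vectors of its n-grams, which is what lets unseen informal spellings share subword components with known words.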

A major challenge in article clustering is high dimensionality, which directly affects accuracy. The problem is becoming more important due to the huge amount of textual information available online. In this paper, we propose using the Arabic WordNet dictionary to extract, select, and reduce features. Additionally, we use the Word2Vec embedding model as a feature-weighting technique. Finally, we apply hierarchical clustering. Our method combines the Arabic WordNet dictionary with word embeddings and additionally uses discretization. As our experimental results show, this method is effective and improves clustering accuracy.
Keywords: Machine Learning, Clustering, CBOW, SKIP-GRAM, Word Embedding, Arabic Word Net Dictionary.
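For context on the CBOW and skip-gram keywords above: skip-gram predicts each context word from the target word, while CBOW predicts the target from its surrounding context. A sketch of how training examples are generated from a sliding window (illustrative only; the sentence and window size are arbitrary):

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs as sampled by the skip-gram objective."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context words, target) examples as used by CBOW."""
    examples = []
    for i in range(len(tokens)):
        ctx = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
        examples.append((ctx, tokens[i]))
    return examples

sent = "clustering arabic documents with embeddings".split()
print(skipgram_pairs(sent)[:3])
```

Real implementations add negative sampling or hierarchical softmax on top of these pairs; the window extraction itself is this simple.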

Word embeddings can be fed to the deep layers of neural networks, which extract features from them to learn the stylometric patterns of authors, based on the context and co-occurrence of words, for the task of Authorship Attribution. In this paper, we investigate the effects of different types of word embeddings on Authorship Attribution of Bengali literature, specifically the skip-gram and continuous-bag-of-words (CBOW) models generated by Word2Vec and fastText, along with the word vectors generated by GloVe. We experiment with neural network models, such as convolutional and recurrent neural networks, analyse how different word embedding models affect the performance of the classifiers, and discuss their properties in this classification task of Authorship Attribution of Bengali literature. The experiments are performed on a dataset we prepared, consisting of 2400 online blog articles from 6 contemporary authors.

Our study explores offensive and hate speech detection for Arabic, for which previous studies are scarce. Using two-class, three-class, and six-class Arabic-Twitter datasets, we develop single and ensemble CNN and BiLSTM classifiers that we train with non-contextual (fastText skip-gram) and contextual (multilingual BERT and AraBERT) word-embedding models. For each hate/offensive classification task, we conduct a battery of experiments to evaluate the performance of single and ensemble classifiers on the test datasets. The average-based ensemble approach performed best, returning F-scores of 91%, 84%, and 80% for the two-class, three-class, and six-class prediction tasks, respectively. We also perform an error analysis of the best ensemble model for each task.
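The average-based ensemble described above corresponds to soft voting: the class-probability vectors produced by the individual CNN/BiLSTM models are averaged, and the highest-scoring class wins. A minimal sketch (the class labels and probabilities here are hypothetical):

```python
def average_ensemble(prob_lists):
    """Average per-class probabilities from several classifiers (soft voting)
    and return the winning class index together with the averaged vector."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c]), avg

# hypothetical outputs of three classifiers on one tweet
# (classes: 0 = neutral, 1 = offensive, 2 = hate -- illustrative only)
probs = [[0.2, 0.5, 0.3],
         [0.1, 0.6, 0.3],
         [0.4, 0.3, 0.3]]
label, avg = average_ensemble(probs)
```

Here two of the three models favor class 1, and the averaged probabilities preserve that majority even though the third model disagrees.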

Word embeddings are real-valued word representations, trained on natural language corpora, that are able to capture lexical semantics. Models producing these representations have gained popularity in recent years, but the question of the most adequate evaluation method remains open. This paper presents an extensive overview of the field of word embedding evaluation, highlighting the main problems and proposing a typology of evaluation approaches, summarizing 16 intrinsic and 12 extrinsic methods. I describe both widely used and experimental methods, systematize information about evaluation datasets, and discuss some key challenges.
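A typical intrinsic method from such a typology is word-similarity evaluation: cosine similarities between word vectors are correlated with human judgments using Spearman's rank correlation. A minimal sketch of both ingredients (no tie handling; not tied to any specific benchmark):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def spearman(xs, ys):
    """Spearman rank correlation (simplified: assumes no tied values)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

In practice one computes `cosine` for every word pair in a benchmark, collects the human scores for the same pairs, and reports `spearman` between the two lists; production code would use `scipy.stats.spearmanr`, which also handles ties.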

The paper presents a free and open-source toolkit whose aim is to quickly deploy web services that handle distributed vector models of semantics. It fills the gap between training such models (many tools are already available for this) and disseminating the results to the general public.
Our toolkit, WebVectors, provides all the routines necessary for organizing online access to querying trained models via a modern web interface. We also describe two demo installations of the toolkit, featuring several efficient models for English, Russian, and Norwegian.

A word embedding is a low-dimensional, dense and real-valued vector representation of a word. Word embeddings have been used in many NLP tasks. They are usually generated from a large text corpus. The embedding of a word captures both its syntactic and semantic aspects. Tweets are short, noisy and have unique lexical and semantic features that are different from other types of text. Therefore, it is necessary to have word embeddings learned specifically from tweets. In this paper, we present ten word embedding data sets. In addition to the data sets learned from just tweet data, we also built embedding sets from the general data and the combination of tweets with the general data. The general data consist of news articles, Wikipedia data and other web data. These ten embedding models were learned from about 400 million tweets and 7 billion words from the general text. In this paper, we also present two experiments demonstrating how to use the data sets in some NLP tasks, such as tweet sentiment analysis and tweet topic classification tasks.

A review of word embedding models through a deconstructive approach reveals several of their shortcomings and inconsistencies. These include the instability of the vector representations, distorted analogical reasoning, geometric incompatibility with linguistic features, and inconsistencies in the corpus data. A new theoretical embedding model, 'Derridian Embedding,' is proposed in this paper. Contemporary embedding models are evaluated qualitatively in terms of how adequate they are relative to the capabilities of a Derridian Embedding.

In this project, we used three different deep neural networks: (1) a densely connected neural network, (2) a convolutional neural network (CNN), and (3) a Long Short-Term Memory (LSTM) network, to carry out sentiment analysis on a large database of textual movie reviews. In our work, we used the "embedding layer" in Keras and GloVe word embeddings to convert the text of the reviews into corresponding numeric values. We tested all the models and compared their performance.
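Before a Keras embedding layer can look up GloVe vectors, review text has to be converted into fixed-length sequences of integer ids. A framework-free sketch of that preprocessing step (the texts and length limit are arbitrary; real pipelines would tokenize more carefully):

```python
def encode_and_pad(texts, max_len=4, pad_id=0):
    """Map tokens to integer ids and pad/truncate each sequence to max_len.
    Id 0 is reserved for padding, the usual convention for embedding layers."""
    vocab = {}
    seqs = []
    for text in texts:
        ids = []
        for tok in text.lower().split():
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1  # ids start at 1; 0 is padding
            ids.append(vocab[tok])
        ids = ids[:max_len] + [pad_id] * max(0, max_len - len(ids))
        seqs.append(ids)
    return seqs, vocab

seqs, vocab = encode_and_pad(["great movie", "not so great at all"])
print(seqs)
# → [[1, 2, 0, 0], [3, 4, 1, 5]]
```

An embedding layer then maps each id to its vector row, with GloVe vectors copied into the rows of the layer's weight matrix for the words in `vocab`.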

Many thousands of patent applications arrive at patent offices around the world every day. One important task when a patent application is submitted is to assign one or more classification codes from the complex and hierarchical patent classification schemes, so that the application can be routed to a patent expert who is knowledgeable about the specific technical field. This task is typically undertaken by patent professionals; however, due to the large number of applications and the potential complexity of an invention, they are often overwhelmed. There is therefore a need for this manual code-assignment task to be supported or even fully automated by classification systems that classify patent applications, ideally with an accuracy close to that of patent professionals. As in many other text analysis problems, in recent years this intellectually demanding task has been studied using word embeddings and deep learning techniques. In this thesis we present the results of different word embeddings and deep learning techniques, considering different parts of the patent document, on automatic patent classification at the sub-class level. Compared with previous works, we focus on single-label classification, exploiting the labels found on patents in the CLEF-IP 2011 collection.
This report is the result of a master's thesis in the information retrieval field at the International Hellenic University during the spring term of 2021.

Students' feedback is an effective mechanism that provides valuable insights into the teaching-learning process. Handling the opinions of students expressed in reviews is quite a labour-intensive and tedious task, as it is typically performed manually. While this may be viable for small-scale courses involving just a few students' feedback, it is impractical for large-scale cases such as online courses in general, and MOOCs in particular. To address this issue, we propose in this paper a framework for automatically analyzing opinions of students expressed in reviews. Specifically, the framework relies on aspect-level sentiment analysis and aims to automatically identify the sentiment or opinion polarity expressed towards a given aspect of the MOOC. The proposed framework takes advantage of weakly supervised annotation of MOOC-related aspects and propagates the weak supervision signal to effectively identify the aspect categories discussed in unlabeled students' reviews. Consequently, it significantly reduces the need for manually annotated data, which is the main bottleneck for all deep learning techniques. Experiments are performed on a large-scale real-world education dataset containing around 105k students' reviews collected from Coursera and a dataset comprising 5,989 students' feedback entries from traditional classroom settings. The experimental results indicate that our framework attains promising performance for both aspect category identification and aspect sentiment classification. Moreover, the results suggest that the framework leads to more accurate results than expensive and labour-intensive sentiment analysis techniques that rely heavily on manually labelled data.

Learning representations for knowledge base entities and concepts is becoming increasingly important for NLP applications. However, recent entity embedding methods have relied on structured resources that are expensive to create for new domains and corpora. We present a distantly-supervised method for jointly learning embeddings of entities and text from an unannotated corpus, using only a list of mappings between entities and surface forms. We learn embeddings from open-domain and biomedical corpora, and compare against prior methods that rely on human-annotated text or large knowledge graph structure. Our embeddings capture entity similarity and relatedness better than prior work, both on existing biomedical datasets and on a new Wikipedia-based dataset that we release to the community. Results on analogy completion and entity sense disambiguation indicate that entities and words capture complementary information that can be effectively combined for downstream use.
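The analogy-completion evaluation mentioned above is commonly implemented with the 3CosAdd rule: given "a is to b as c is to ?", return the word whose vector is closest in cosine terms to b - a + c, excluding the query words. A toy sketch (the 2-dimensional vectors are fabricated purely for illustration):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """3CosAdd: argmax over cosine(x, b - a + c), excluding the query words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# fabricated toy vectors, chosen so the offset man->king matches woman->queen
vecs = {"man": [1.0, 0.0], "woman": [0.0, 1.0],
        "king": [2.0, 1.0], "queen": [1.0, 2.0],
        "throne": [3.0, 0.5]}
print(analogy("man", "king", "woman", vecs))
# → queen
```

Excluding the query words from the candidate set matters: without it, the answer is very often one of the inputs themselves.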

The task of evaluating textual semantic similarity is one of the challenges in the Natural Language Processing area. The literature shows experimentation that gives priority to probabilistic resources, while linguistic aspects are explored only in an incipient way. This paper presents an experiment with a hybrid approach, in which distributed-representation resources as well as lexical and linguistic aspects are integrated for the evaluation of semantic similarity between short sentences in Brazilian Portuguese. The proposed technique was evaluated on a dataset known in the literature and obtained good results.

In the Industry 5.0 era, product reviews are necessary for the sustainability of a company. Product reviews are a User Generated Content (UGC) feature that describes customer satisfaction. We used five hotel aspects, namely location, meals, service, comfort, and cleanliness, to measure customer satisfaction. Each product review was preprocessed into a term-list document. In this context, we propose the Probabilistic Latent Semantic Analysis (PLSA) method to produce hidden topics, and semantic similarity is used to classify the topics into the five hotel aspects. The Term Frequency-Inverse Corpus Frequency (TF-ICF) method was used to weight each term list, which had been expanded from each cluster in the document. We used word embeddings to obtain the vector inputs of a Long Short-Term Memory (LSTM) deep learning model for sentiment classification. The results showed that the combination of PLSA + TF-ICF 100% + semantic similarity was superior, scoring 0.840 on the five-way categorization of hotel aspects; the word embedding + LSTM method achieved the best sentiment classification score of 0.946; the service aspect received a higher positive sentiment value (45.545) than the other aspects; and the comfort aspect received a higher negative sentiment value (12.871) than the other aspects. The results also showed that sentiment was affected by the aspects.

Recently, considerable attention has been paid to word embedding algorithms inspired by neural network models. Given a large textual corpus, these algorithms attempt to derive a set of vectors which represent the corpus vocabulary in a new embedded space. This representation can provide a useful means of measuring the underlying similarity between words. Here we investigate this property in the context of annotated texts of 19th-century fiction by the authors Jane Austen, Charles Dickens, and Arthur Conan Doyle. We demonstrate that building word embeddings on these texts can provide us with an insight into how characters group differently under different conditions, allowing us to make comparisons across different novels and authors. These results suggest that word embeddings can potentially provide a useful tool in supporting quantitative literary analysis.

We present path2vec, a new approach for learning graph embeddings that relies on structural measures of pairwise node similarities. The model learns representations for nodes in a dense space that approximate a given user-defined graph distance measure, such as the shortest path distance or distance measures that take information beyond the graph structure into account. Evaluations of the proposed model on semantic similarity and word sense disambiguation tasks, using various WordNet-based similarity measures, show that our approach yields competitive results, outperforming strong graph embedding baselines. The model is computationally efficient, being orders of magnitude faster than the direct computation of graph-based distances.
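As a concrete example of a user-defined graph distance that a model like path2vec could be trained to approximate, here is plain BFS shortest-path distance on a toy hypernym-style graph (the graph and words are invented for illustration):

```python
from collections import deque

def shortest_path_lengths(adj, source):
    """Breadth-first search: hop distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# tiny undirected hypernym-style graph, purely illustrative
adj = {
    "dog": ["canine"], "canine": ["dog", "carnivore"],
    "cat": ["feline"], "feline": ["cat", "carnivore"],
    "carnivore": ["canine", "feline", "animal"],
    "animal": ["carnivore"],
}
print(shortest_path_lengths(adj, "dog"))
```

Computing such distances on demand for large graphs like WordNet is expensive, which is the motivation for baking them into node vectors whose dot products can be evaluated in constant time.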

This work uses recursive autoencoders (Socher et al., 2011), word embeddings (Pennington et al., 2014), associative matrices (Schuler, 2014) and lexical overlap features to model human judgments of sentential similarity on SemEval-2015 Task 2: English STS (Agirre et al., 2015). Results show a modest positive correlation between system predictions and human similarity scores, ranking 69th out of 74 submitted systems.

How different cultures react and respond to a crisis depends predominantly on a society's norms and its political will to combat the situation. Often, the decisions made are necessitated by events, social pressure, or the need of the hour, and may not represent the nation's will. While some are pleased with them, others may show resentment. Coronavirus (COVID-19) brought out a similar mix of emotions from nations towards the decisions taken by their respective governments. Over the past couple of months, social media has been bombarded with posts containing both positive and negative sentiments on COVID-19, the pandemic, lockdowns, and related hashtags. Despite being geographically close, many neighboring countries reacted differently from one another. For instance, Denmark and Sweden, which share many similarities, stood poles apart on the decisions taken by their respective governments. Yet their nations' support was mostly unanimous, unlike in the neighboring South Asian countries, where people showed a lot of anxiety and resentment. The purpose of this study is to analyze the reactions of citizens from different cultures to the novel coronavirus, and people's sentiment about the subsequent actions taken by different countries. Deep long short-term memory (LSTM) models used for estimating the sentiment polarity and emotions of the extracted tweets were trained to achieve state-of-the-art accuracy on the sentiment140 dataset. The use of emoticons provided a unique and novel way of validating the supervised deep learning models on tweets extracted from Twitter.

Distributed representations of words, or word embeddings, have motivated methods for calculating semantic representations of word sequences such as phrases, sentences, and paragraphs. Most existing methods either use algorithms to learn such representations or improve on calculating weighted averages of the word vectors. In this work, we experiment with spectral methods of signal representation and summarization as mechanisms for constructing such word-sequence embeddings in an unsupervised fashion. In particular, we explore an algorithm rooted in fluid dynamics, known as higher-order Dynamic Mode Decomposition, which is designed to capture the eigenfrequencies, and hence the fundamental transition dynamics, of periodic and quasi-periodic systems. We empirically observe that this approach, which we call EigenSent, can summarize transitions in a sequence of words and generate an embedding that represents the sequence well. To the best of the authors' knowledge, this is the first application of a spectral decomposition and signal summarization technique to text for creating sentence embeddings. We test the efficacy of this algorithm in creating sentence embeddings on three public datasets, where it performs appreciably well. Moreover, we show that, due to the positive combination of their complementary properties, concatenating the embeddings generated by EigenSent with simple word vector averaging achieves state-of-the-art results.
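The word-vector-averaging baseline that EigenSent is concatenated with can be sketched in a few lines (toy 2-dimensional vectors; skipping out-of-vocabulary tokens is one common convention, assumed here):

```python
def average_embedding(tokens, vectors):
    """Average the vectors of in-vocabulary tokens.
    Out-of-vocabulary tokens are skipped (one common convention)."""
    known = [t for t in tokens if t in vectors]
    dim = len(next(iter(vectors.values())))
    acc = [0.0] * dim
    for t in known:
        for i, x in enumerate(vectors[t]):
            acc[i] += x
    return [x / len(known) for x in acc]

# fabricated 2-dimensional vectors, purely illustrative
toy = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
print(average_embedding(["good", "movie", "zzz"], toy))
# → [0.5, 0.5]
```

Despite its simplicity, this averaging is a strong baseline for sentence similarity, which is why pairing it with a complementary spectral summary is attractive.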

Big Data systems collect data from diverse sources that need to be analyzed without knowing anything about the data in advance: certain topological techniques based on homology make this possible. We give a theoretical introduction to these techniques, also mentioning some practical examples.

Sarcasm detection in media text is a binary classification task: text can be written either literally or sarcastically (with irony), where the intended meaning is the opposite of what is seemingly expressed. Sarcasm detection can be very useful for improving the performance of sentiment analysis, where existing models fail to identify sarcasm at all. In this paper we examine the use of deep neural networks to detect sarcasm in social media text (specifically Twitter data) as well as in news headlines, and compare the results. Results show that deep neural networks combining word embeddings, bidirectional LSTMs, and convolutional networks achieve an accuracy of around 88 percent for sarcasm detection.

Analysis of word embedding properties to inform their use in downstream NLP tasks has largely been conducted by assessing nearest neighbors. However, geometric properties of the continuous feature space contribute directly to the use of embedding features in downstream models, and are largely unexplored. We consider four properties of word embedding geometry, namely: position relative to the origin, distribution of features in the vector space, global pairwise distances, and local pairwise distances. We define a sequence of transformations to generate new embeddings that expose subsets of these properties to downstream models, and evaluate the change in task performance to understand the contribution of each property to NLP models. We transform publicly available pre-trained embeddings from three popular toolkits (word2vec, GloVe, and FastText) and evaluate on a variety of intrinsic tasks, which model linguistic information in the vector space, and extrinsic tasks, which use vectors as input to machine learning models. We find that intrinsic evaluations are highly sensitive to absolute position, while extrinsic tasks rely primarily on local similarity. Our findings suggest that future embedding models and post-processing techniques should focus primarily on similarity to nearby points in vector space.
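The sensitivity to absolute position described above can be seen directly: translating all vectors by the same offset leaves pairwise Euclidean distances intact but changes cosine similarities, which are measured relative to the origin. A small sketch with two fabricated vectors:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u, v = [1.0, 0.0], [0.0, 1.0]          # orthogonal: cosine similarity 0
# translate both vectors away from the origin by the same offset
u_t = [x + 5.0 for x in u]
v_t = [x + 5.0 for x in v]

print(euclidean(u, v), euclidean(u_t, v_t))  # unchanged
print(cosine(u, v), cosine(u_t, v_t))        # changes drastically
```

This is why intrinsic benchmarks built on cosine similarity react strongly to translation-style transformations, while models that depend only on local distances do not.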

English. This work aims at evaluating and comparing two different frameworks for the unsupervised topic modelling of the CompWHoB Corpus, our political-linguistic dataset. The first approach applies Latent Dirichlet Allocation (henceforth LDA), whose evaluation we define as the baseline for comparison. The second framework employs the Word2Vec technique to learn word vector representations that are later used to topic-model our data. Compared to the LDA baseline, results show that the use of Word2Vec word embeddings significantly improves topic modelling performance, but only when an accurate and task-oriented linguistic pre-processing step is carried out.

This article presents work aimed at developing a system for automatically summarizing economic and financial texts, paying particular attention to the idiomaticity and fluency of the target language. To this end, the study starts from a corpus of periodic reports by the Banque de France belonging to economic-outlook discourse. The linguistic analysis shows that summary generation restricted to strict terminological and collocational extraction ignores a whole swath of vocabulary, referred to here as the "support lexicon", which plays an important role in the cognitive organization of the domain. On this basis, the work presented on deep-learning language models highlights the relevance of the self-attention mechanism for identifying and extracting lexico-grammatical patterns as well as the support lexicon, and the impact of guiding the CamemBERT abstractive summarization model through data augmentation. A first experiment using this corpus and the extraction method is presented.

English. The automatic misogyny identification (AMI) task proposed at IberEval and EVALITA 2018 is an example of the active involvement of scientific research in countering the online spread of hateful content against women. Considering the encouraging results obtained for Spanish and English in the previous edition of AMI, within the EVALITA framework we tested the robustness of a similar approach based on topic and stylistic information on a new collection of Italian and English tweets. Moreover, to deal with the dynamism of language on social platforms, we also propose an approach based on automatically enriched lexica. Although resources like these lexica prove useful for a specific domain like misogyny, the analysis of the results reveals the limitations of the proposed approaches.

An open issue in the sentiment classification of texts written in Serbian is the effect of different forms of morphological normalization and the usefulness of leveraging large amounts of unlabeled texts. In this paper, we assess the impact of lemmatizers and stemmers for Serbian on classifiers trained and evaluated on the Serbian Movie Review Dataset. We also consider the effectiveness of using word embeddings, generated from a large unlabeled corpus, as classification features.

Within the last decade, substantial advances have been made in the field of computational linguistics, due in part to the evolution of word embedding algorithms inspired by neural network models. These algorithms attempt to derive a set of vectors which represent the vocabulary of a textual corpus in a new embedded space. This new representation can then be used to measure the underlying similarity between words. In this paper, we explore the role an author's gender may play in the selection of words that they choose to construct their narratives. Using a curated corpus of forty-eight 19th century novels, we generate, visualise, and investigate word embedding representations using a list of gender-encoded words. This allows us to explore the different ways in which male and female authors of this corpus use terms relating to contemporary understandings of gender and gender roles.

This vision paper proposes an approach that uses the most advanced word-embedding techniques to bridge the gap between the discourses of experts and non-experts, and more specifically between the terminologies used by the two communities. Word embeddings make it possible to find equivalent terms between experts and non-experts, by approximating the similarity between words or by revealing hidden semantic relations. These controlled vocabularies, with their new semantic enrichments, are then exploited in a hybrid recommendation system combining a content-based ontology and a keyword-based ontology to produce relevant wine recommendations regardless of the end user's level of expertise. The main aim is to derive a non-expert vocabulary from semantic rules in order to enrich the knowledge of the ontology and improve the indexing of the items (i.e., wines) and the recommendation process.

There is an enormous amount of information available from different sources and genres. In order to extract useful information from such massive amounts of data, an automatic mechanism is required. Text summarization systems assist with content reduction by keeping the important information and filtering out the unimportant parts of the text. Good document representation is essential in text summarization for retrieving relevant information. Bag-of-words representations cannot capture word similarity at the syntactic and semantic levels, whereas word embeddings provide a document representation that captures and encodes the semantic relations between words. Therefore, this paper employs centroid-based summarization built on word embedding representations, and proposes Myanmar news summarization based on different word embedding models. Myanmar local and international news are summarized using a centroid-based word embedding summarizer, exploiting the effectiveness of word embeddings as a representation approach. Experiments were conducted on a dataset of Myanmar local and international news using different word embedding models, and the results are compared with the performance of bag-of-words summarization. Centroid summarization using word embeddings performs consistently better than centroid summarization using bag-of-words.
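The centroid idea above can be sketched in a few lines: score each sentence by the cosine similarity of its mean word vector to the centroid of all word vectors in the document. The function names and the tiny 2-d vectors below are illustrative stand-ins for a trained embedding model, not the paper's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid_summarize(sentences, embeddings, top_n=1):
    """Rank sentences by similarity of their mean word vector to the
    document centroid; return the top_n sentences in document order."""
    def mean_vec(words):
        vecs = [embeddings[w] for w in words if w in embeddings]
        if not vecs:
            return None
        dim = len(vecs[0])
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

    all_words = [w for s in sentences for w in s.lower().split()]
    centroid = mean_vec(all_words)
    scored = []
    for idx, s in enumerate(sentences):
        sv = mean_vec(s.lower().split())
        scored.append((cosine(sv, centroid) if sv else 0.0, idx, s))
    top = sorted(scored, reverse=True)[:top_n]
    return [s for _, _, s in sorted(top, key=lambda t: t[1])]

# Toy 2-d embeddings standing in for a trained word2vec model.
emb = {
    "markets": [1.0, 0.1], "fell": [0.9, 0.2],
    "rain": [0.1, 1.0], "tomorrow": [0.2, 0.9],
    "stocks": [1.0, 0.0],
}
doc = ["Markets fell and stocks fell", "Rain tomorrow"]
print(centroid_summarize(doc, emb, top_n=1))
```

With these toy vectors, the finance-themed sentence dominates the centroid, so it is selected as the one-sentence summary.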

In this technical report, we propose an algorithm, called Lex2vec, that exploits lexical resources to inject information into word embeddings and to name the embedding dimensions by means of knowledge bases. We evaluate the optimal parameters for extracting a set of informative labels that is readable and provides good coverage of the embedding dimensions.

Due to the increasing amount of data on the internet, finding a highly informative, low-dimensional representation of text is one of the main challenges for efficient natural language processing tasks, including text classification. This representation should capture the semantic information of the text while retaining its relevance for document classification, mapping documents with similar topics to nearby regions of the vector space. To obtain representations for long texts, we propose the use of deep Siamese neural networks. To embed document-topic relevance in the distributed representation, we use a Siamese neural network to jointly learn document representations. Our Siamese network consists of two sub-networks of multi-layer perceptrons. We evaluate our representation on the text categorization task using the BBC News dataset. The results show that the proposed representations outperform conventional and state-of-the-art representations in text classification on this dataset.
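The defining trait of a Siamese network is that both branches apply the same encoder weights, so a distance in the embedded space can be trained to reflect topic similarity. A minimal sketch of that idea, with an invented toy linear encoder standing in for the paper's multi-layer perceptrons:

```python
import math

# Hypothetical toy weights for the shared encoder; in the paper the
# encoder is a multi-layer perceptron trained jointly on document pairs.
W = [[0.5, -0.2, 0.1],
     [0.3, 0.8, -0.4]]

def encode(x):
    """Shared encoder: both branches of the Siamese network apply the
    SAME weights, so similar inputs map to nearby embedded points."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

def distance(a, b):
    """Euclidean distance between the two encoded documents."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(encode(a), encode(b))))

def contrastive_loss(a, b, same_topic, margin=1.0):
    """Pull same-topic pairs together, push different-topic pairs
    at least `margin` apart."""
    d = distance(a, b)
    return d ** 2 if same_topic else max(0.0, margin - d) ** 2

doc_a = [1.0, 0.0, 0.0]   # e.g. term counts, topic "sport"
doc_b = [0.9, 0.1, 0.0]   # another "sport" document
doc_c = [0.0, 0.0, 1.0]   # a "tech" document
print(distance(doc_a, doc_b) < distance(doc_a, doc_c))  # prints True
```

Training would adjust `W` by gradient descent on the contrastive loss over labelled document pairs; only the inference-time geometry is shown here.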

In this paper, we explore the applicability of word embeddings retrieved from shallow, two-layer neural networks using the Skip-Gram Word2Vec model devised by Mikolov et al. (2013) in the exploration of literary meaning. Word embeddings represent a relational model of meaning where semantic relations between words are represented as geometric relationships between vectors (Kozlowski, Taddy, and Evans 2018). This approach to modelling meaning draws on the structuralist models of Saussure (1916) and Lévi-Strauss (1963), positing that meaning is acquired through the placement of individual signifiers within a complex system of signification. In word embeddings, words sharing a component of meaning tend to cluster together in the vector space. Word embeddings also enable us to explore the relations between words located distally in the vector space, through analogies. For our analysis, we used the Gensim implementation of Word2Vec packaged by Rehurek and Sojka (2010) to explore the cultural concepts related to the notions of 'man' and 'woman' in the novels "Pride and Prejudice" (PP) by Jane Austen and "50 Shades of Grey" (FSG) by Erika Leonard James. Although different in many respects, these novels share a number of fundamental characteristics: both were written by female authors about female protagonists with asymmetric relations to their partners. The similarity between the gender-marking words 'man' and 'woman' in our model shows a closer relation in PP when compared to FSG. In FSG, the vector space shows a degree of sparsity around 'woman' when compared to 'man', reflecting the novel's focus around the description of the male protagonist. The analysis of adjectives associated with 'man' and 'woman' shows a
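The similarity and analogy operations the abstract relies on reduce to simple vector arithmetic. A sketch with hand-crafted toy vectors (gensim's `Word2Vec` trained on the novels would supply the real vectors; these numbers are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hand-crafted toy vectors standing in for a trained Skip-Gram model.
vectors = {
    "man":   [0.9, 0.1, 0.3],
    "woman": [0.8, 0.3, 0.3],
    "king":  [0.9, 0.1, 0.9],
    "queen": [0.8, 0.3, 0.9],
}

def analogy(a, b, c, vocab):
    """Solve a : b :: c : ? by vector arithmetic (b - a + c), as in
    Mikolov et al.'s 'king - man + woman = queen' example."""
    target = [bi - ai + ci
              for ai, bi, ci in zip(vocab[a], vocab[b], vocab[c])]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vocab[w], target))

print(cosine(vectors["man"], vectors["woman"]))  # closeness of the gendered pair
print(analogy("man", "king", "woman", vectors))  # prints "queen"
```

In gensim the same queries are `model.wv.similarity(...)` and `model.wv.most_similar(positive=..., negative=...)`; the cross-novel comparison in the abstract amounts to running the first query in each novel's model.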

In this paper, a new approach to semantic clustering of the results of ambiguous search queries is presented. We propose using distributed vector representations of words, trained with prediction-based neural embedding models, to detect the senses of search queries and to cluster the search engine results page according to these senses. The words from titles and snippets, together with the semantic relationships between them, form a graph, which is then partitioned into components related to different query senses.
This approach to search engine results clustering is evaluated against a new manually annotated evaluation data set of Russian search queries. We show that in the task of semantically clustering search results, prediction-based models slightly but consistently outperform traditional count-based ones, given the same training corpora.

Word embeddings have been shown to be useful in a wide range of NLP tasks. We explore methods of using these embeddings in dependency parsing of Hindi, a MoR-FWO (morphologically rich, relatively free word order) language, and show that they not only help improve the quality of parsing but can even act as a cheap alternative to traditional features, which are costly to acquire. We demonstrate that if we use distributed representations of lexical items instead of features produced by costly tools such as a morphological analyzer, we obtain competitive results. This implies that a monolingual corpus alone suffices to produce good accuracy for resource-poor languages for which such tools are unavailable. We also explore the importance of these representations for domain adaptation.

The dynamics of scientific progress have been widely explored with descriptive and statistical analyses. The topic has also attracted several computational approaches, grouped together under the label of the Computational History of Science, especially with the rise of data science and the development of increasingly powerful computers. Among these approaches, some works have studied dynamism in the scientific literature using text analysis techniques that rely on topic models to study the dynamics of research topics. Unlike topic models, which do not delve deeply into the content of scientific publications, this paper is the first to use temporal word embeddings to automatically track the dynamics of scientific keywords over time. To this end, we propose Vec2Dynamics, a neural computational-history approach that reports the stability of the k-nearest neighbors of scientific keywords over time; the stability indicates whether the keywords are taking new neighborhood du...
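The neighborhood-stability idea can be sketched as the overlap of a keyword's k-nearest neighbors between two temporal embedding spaces. The toy vectors and the use of Jaccard overlap below are illustrative assumptions; the paper's exact stability measure may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def knn(word, vectors, k):
    """The k words whose vectors are closest to `word` by cosine similarity."""
    others = [w for w in vectors if w != word]
    return set(sorted(others, key=lambda w: cosine(vectors[word], vectors[w]),
                      reverse=True)[:k])

def neighborhood_stability(word, vectors_t1, vectors_t2, k=2):
    """Jaccard overlap of the word's k nearest neighbors in two temporal
    embedding spaces: 1.0 means an unchanged neighborhood, 0.0 means the
    keyword has moved to an entirely new semantic neighborhood."""
    n1, n2 = knn(word, vectors_t1, k), knn(word, vectors_t2, k)
    return len(n1 & n2) / len(n1 | n2)

# Toy embeddings for two time periods (invented for illustration).
t1 = {"learning": [1.0, 0.0], "neural": [0.9, 0.1],
      "network": [0.95, 0.05], "grammar": [0.0, 1.0]}
t2 = {"learning": [1.0, 0.0], "neural": [0.0, 1.0],
      "network": [0.9, 0.1], "transformer": [0.95, 0.05]}
print(neighborhood_stability("learning", t1, t2))  # partial neighborhood drift
```

In a real pipeline, `t1` and `t2` would be embedding models trained on publications from consecutive time slices and aligned into a common space before neighbors are compared.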

In recent years, there has been exponential growth in the number of complex documents and texts, which requires a deeper understanding of machine learning methods in order to classify texts accurately in many applications. Understanding the rapidly growing body of short text is extremely important. Short text differs from traditional documents in its length. With the recent explosive growth of e-commerce and online communication, a new genre of text, short text, has been applied extensively in many areas, and numerous research efforts focus on short text mining. Classifying short text is challenging due to its inherent characteristics, such as sparseness, large scale, immediacy, and non-standardization. With the rapid development of the web, users and web services are generating more and more short text, including tweets, search snippets, product reviews, and so on. There is an urgent demand to understand short text. For instance, a good understanding of tweets can help advertisers place relevant advertisements alongside them, generating revenue without hurting the user experience. Short text classification is one of the important tasks in Natural Language Processing (NLP). Unlike paragraphs or documents, short texts are more ambiguous: they lack sufficient contextual information, which poses a challenge for classification. We retrieve knowledge from external knowledge sources to reinforce the semantic representation of short texts, taking conceptual information as a form of knowledge and incorporating it into deep neural networks. Here we survey the different methods available for text classification and categorization.

In this paper, we present an extension, and an evaluation, of existing quantum-like approaches to word embedding for IR tasks that (1) improves the detection of complex features of word use (e.g., syntax and semantics), (2) enhances how the method extends these uses across linguistic contexts (i.e., to model lexical ambiguity), specifically for Question Classification, and (3) reduces the computational resources needed for training and operating quantum-based neural networks compared with existing models. This approach could later be applied to significantly enhance the state of the art across word-level Natural Language Processing (NLP) tasks such as entity recognition and part-of-speech tagging, or sentence-level ones such as textual relatedness and entailment, to name a few.

It is difficult to imagine any large-scale application dealing with human language (whether in research or in industry) that does not use word embeddings in one way or another. Distributional methods of processing meaning enjoy tremendous popularity.
But is the distributional hypothesis a full-fledged scientific theory? That is, can it be properly falsified? It seems not.