WordSpace — Visual Summary of Text Corpora

DocuBurst: Visualizing document content using language structure

Computer Graphics Forum, 2009

Textual data is at the forefront of information management problems today. One response has been the development of visualizations of text data. These visualizations, commonly based on simple attributes such as relative word frequency, have become increasingly popular tools. We extend this direction, presenting the first visualization of document content which combines word frequency with the human-created structure in lexical databases to create a visualization that also reflects semantic content. DocuBurst is ...
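
As a rough illustration of the idea of combining word frequency with lexical-database structure (a sketch, not the authors' implementation), the following propagates token counts up WordNet's hypernym hierarchy with NLTK, so that each synset's count reflects all the words it subsumes; a sunburst layout over these counts would approximate a DocuBurst-style view. The noun-only restriction and the toy token list are assumptions, and NLTK's 'wordnet' corpus must be downloaded.

```python
# Sketch (not the authors' code) of combining word frequency with
# WordNet's hypernym structure: each synset accumulates the frequency
# of every word it subsumes. Assumes nltk with the 'wordnet' corpus
# downloaded; nouns only, for simplicity.
from collections import Counter
from nltk.corpus import wordnet as wn

def hypernym_counts(tokens):
    """Propagate token counts to each synset and all of its ancestors."""
    counts = Counter()
    for token, freq in Counter(tokens).items():
        for synset in wn.synsets(token, pos=wn.NOUN):
            # Deduplicate ancestors shared by multiple hypernym paths.
            ancestors = {a for path in synset.hypernym_paths() for a in path}
            for ancestor in ancestors:
                counts[ancestor] += freq
    return counts

tokens = ["dog", "cat", "car", "truck", "dog"]
for synset, freq in hypernym_counts(tokens).most_common(5):
    print(synset.name(), freq)
```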

Word Clouds with Latent Variable Analysis for Visual Comparison of Documents

2016

The word cloud is a form of text visualization recognized for its aesthetic, social, and analytical values. Here, we are concerned with deepening its analytical value for the visual comparison of documents. To aid comparative analysis of two or more documents, users need to be able to perceive similarities and differences among documents through their word clouds. However, as we are dealing with text, approaches that treat words independently may impede accurate discernment of similarities among word clouds containing different words of related meanings. We therefore motivate the principle of displaying related words in a coherent manner, and propose to realize it by modeling the latent aspects of words. Our WORD FLOCK solution brings together latent variable analysis for embedding and aspect modeling with a calibrated layout algorithm, within a synchronized word cloud generation framework. We present quantitative and qualitative results on real-life text corpora, showcasing ...
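
WORD FLOCK's internals are only summarized above, so the following is a generic sketch of the stated principle rather than the paper's method: embed words via latent semantic analysis of a term-document matrix, then cluster the embeddings so that a word-cloud layout can keep related words together. The toy documents and the choice of LSA plus k-means are assumptions.

```python
# Generic sketch of the principle (not WORD FLOCK itself): embed words
# with latent semantic analysis, then cluster them so a word-cloud
# layout can place related words coherently.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["stock market trading shares equity",
        "football match goal league players",
        "market prices shares investors trading",
        "league season players coach football"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)                   # documents x terms
svd = TruncatedSVD(n_components=2, random_state=0).fit(X)
word_vecs = svd.components_.T                 # terms x latent aspects
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(word_vecs)

for word, label in sorted(zip(vec.get_feature_names_out(), labels),
                          key=lambda t: t[1]):
    print(label, word)  # words sharing a label would be laid out together
```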

Exploration of dimensionality reduction for text visualization

2005

In the text document visualization community, statistical analysis tools (e.g., principal component analysis and multidimensional scaling) and neurocomputation models (e.g., self-organizing feature maps) have been widely used for dimensionality reduction. Often the resulting dimensionality is set to two, as this facilitates plotting the results. The validity and effectiveness of these approaches largely depend on the specific data sets used and the semantics of the targeted applications. To date, there has been little evaluation to assess and compare dimensionality reduction methods and dimensionality reduction processes, either numerically or empirically. The focus of this paper is to propose a mechanism for comparing and evaluating the effectiveness of dimensionality reduction techniques in the visual exploration of text document archives. We use multivariate visualization techniques and interactive visual exploration to study three problems: (a) Which dimensionality reduction technique best preserves the interrelationships within a set of text documents; (b) What is the sensitivity of the results to the number of output dimensions; (c) Can we automatically remove redundant or unimportant words from the vectors extracted from the documents while still preserving the majority of the information, and thus make dimensionality reduction more efficient. To study each problem, we generate supplemental dimensions based on several dimensionality reduction algorithms and parameters controlling these algorithms. We then visually analyze and explore the characteristics of the reduced dimensional spaces as implemented within a multi-dimensional visual exploration tool, XmdvTool. We compare the derived dimensions to features known to be present in the original data. Quantitative measures are also used in identifying the quality of results using different numbers of output dimensions.
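
In the spirit of question (a), here is a minimal sketch of how preservation of interrelationships might be scored numerically, assuming tf-idf document vectors and using the rank correlation between original and reduced pairwise distances. This is not the paper's XmdvTool-based protocol, and the toy documents are placeholders.

```python
# Sketch (not the paper's protocol): score how well each reduction
# preserves document interrelationships, via the rank correlation
# between original and 2-D pairwise distances.
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.manifold import MDS
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stock market shares trading",
        "market investors equity prices",
        "football league season players",
        "players coach match goals"]
X = TfidfVectorizer().fit_transform(docs).toarray()
orig = pdist(X)  # pairwise distances in the full term space

for name, reducer in [("PCA", PCA(n_components=2)),
                      ("MDS", MDS(n_components=2, random_state=0))]:
    low = reducer.fit_transform(X)
    rho, _ = spearmanr(orig, pdist(low))
    print(f"{name}: distance rank correlation = {rho:.3f}")
```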

Visualization of term discrimination analysis

Journal of the American Society for Information Science and Technology, 2001

A visual term discrimination value analysis method is introduced using a document density space within a distance–angle-based visual information retrieval environment. The term discrimination capacity is analyzed using the comparison of the distance and angle-based visual representations with and without a specified term, thereby allowing the user to see the impact of the term on individual documents within the density space. Next, the concept of a “term density space” is introduced for term discrimination analysis. Using this concept, a term discrimination capacity for distinguishing one term from others in the space can also be visualized within the visual space. Applications of these methods facilitate more effective assignment of term weights to index terms within documents and may assist searchers in the selection of search terms.
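
The quantity underlying this kind of analysis can be sketched numerically as Salton's term discrimination value: collection density is the mean pairwise document similarity, and a term's value is the density change when the term is removed. The snippet below illustrates the computation, not the visualization; dropping a tf-idf column without renormalizing is a simplifying assumption, as are the toy documents.

```python
# Numerical sketch of the quantity the paper visualizes: the term
# discrimination value. Density is the mean pairwise cosine similarity
# of the collection; a term's value is the density change when the
# term is removed. Positive values mark good discriminators.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def density(X):
    sims = cosine_similarity(X)
    n = sims.shape[0]
    return (sims.sum() - n) / (n * (n - 1))  # mean off-diagonal similarity

docs = ["apple banana fruit", "banana fruit salad",
        "quantum physics theory", "physics theory experiment"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()
base = density(X)
for i, term in enumerate(vec.get_feature_names_out()):
    X_wo = np.delete(X, i, axis=1)       # drop the term's column
    print(f"{term}: DV = {density(X_wo) - base:+.4f}")
```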

Creating Interactive Document Maps Through Dimensionality Reduction and Visualization Techniques

2005

The sheer volume of information available today often impairs the tasks of searching, browsing, and analysing information pertinent to a topic of interest. This paper presents a methodology to create a meaningful graphical representation of corpora of documents targeted at supporting the exploration of correlated information. The purpose of such an approach is to produce a map of a document body on a research topic or field based on the analysis of document contents and the similarities amongst articles. The document map is generated, after text pre-processing, by projecting the data into two dimensions using Latent Semantic Indexing. The projection is followed by hierarchical clustering to support sub-area identification. The map can be interactively explored, helping to narrow down the search for relevant articles. Tests were performed using a collection of documents pre-classified into three research subject classes: Case-Based Reasoning, Information Retrieval, and Inductive Logic Programming. The map produced was capable of separating the main areas, placing similar documents near one another, revealing possible topics, and identifying boundaries between them. The tool supports the exploration of inter-topic and intra-topic relationships and is useful in many contexts that require deciding which articles are relevant to read, such as scientific research, education, and training.
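
A minimal sketch of the pipeline described above, assuming off-the-shelf components rather than the authors' tool: tf-idf vectors are projected to two dimensions with LSI (truncated SVD), then hierarchically clustered to suggest sub-areas on the map. The toy documents stand in for the pre-classified collection.

```python
# Sketch of the described pipeline with assumed components (not the
# authors' tool): tf-idf vectors projected to 2-D with LSI, then
# hierarchically clustered to suggest sub-areas on the map.
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import AgglomerativeClustering

docs = ["case based reasoning retrieval of past cases",
        "information retrieval ranking of documents",
        "inductive logic programming learning rules",
        "adaptation of retrieved cases in reasoning",
        "query expansion for document retrieval",
        "logic programs induced from examples"]
X = TfidfVectorizer().fit_transform(docs)
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
labels = AgglomerativeClustering(n_clusters=3).fit_predict(coords)

plt.scatter(coords[:, 0], coords[:, 1], c=labels)
for i, (x, y) in enumerate(coords):
    plt.annotate(str(i), (x, y))  # label each document point
plt.title("Document map: LSI projection + hierarchical clusters")
plt.show()
```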

Towards Automated Visualisation of Scientific Literature

2019

Recent years have seen exponential growth in biological data, both structured and unstructured. One of the main computational and scientific challenges of the modern age is to extract useful information from unstructured textual corpora to effectively support decision making. Since the emergence of topic modelling, new and interesting approaches to compactly representing the content of a document collection have been proposed. However, effective exploitation of these strategies requires considerable expertise.

Visualizing topics with multi-word expressions

2009

We describe a new method for visualizing topics, the distributions over terms that are automatically extracted from large text corpora using latent variable models. Our method finds significant $n$-grams related to a topic, which are then used to help understand and interpret the underlying distribution.
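
The paper's significance test is not given above, so the following sketch shows one simple way to surface topic-associated n-grams rather than the authors' exact method: score each bigram by how much more frequent it is in a topic's documents than in a background collection. The toy corpora and the log-ratio score are assumptions.

```python
# One simple way to surface topic-associated n-grams (a sketch, not
# the paper's exact statistical test): score each bigram by how much
# more frequent it is in the topic's documents than in the background.
import math
from sklearn.feature_extraction.text import CountVectorizer

topic_docs = ["neural network training", "deep neural network models"]
background = ["stock market report", "neural network training",
              "football season review", "deep neural network models"]

def ngram_freqs(docs):
    """Relative frequency of each bigram in a document collection."""
    vec = CountVectorizer(ngram_range=(2, 2))
    X = vec.fit_transform(docs)
    total = X.sum()
    return {g: X[:, i].sum() / total
            for i, g in enumerate(vec.get_feature_names_out())}

topic, back = ngram_freqs(topic_docs), ngram_freqs(background)
scores = {g: math.log(p / back.get(g, 1e-9)) for g, p in topic.items()}
for g, s in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{s:+.2f}  {g}")
```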

How to visualize high-dimensional data: a roadmap

Journal of Data Mining & Digital Humanities, Special issue on Visualisations in Historical Linguistics, 2020

Discovery of the chronological or geographical distribution of collections of historical text can be more reliable when based on multivariate rather than on univariate data because multivariate data provide a more complete description. Where the data are high-dimensional, however, their complexity can defy analysis using traditional philological methods. The first step in dealing with such data is to visualize it using graphical methods in order to identify any latent structure. If found, such structure facilitates formulation of hypotheses which can be tested using a range of mathematical and statistical methods. Where, however, the dimensionality is greater than 3, direct graphical investigation is impossible. The present discussion presents a roadmap of how this obstacle can be overcome, and is in three main parts: the first part presents some fundamental data concepts, the second describes an example corpus and a high-dimensional data set derived from it, and the third outlines two approaches to visualization of that data set: dimensionality reduction and cluster analysis.

Keywords: data visualization, multivariate data, high dimensionality, dimensionality reduction, cluster analysis.

Exemplification is based on data abstracted from a corpus of English historical texts with a known temporal distribution, allowing the efficacy of the methods covered in the discussion to be readily verified by the reader. The fundamental data concepts are the nature of data, its representation using vectors and matrices, and its interpretation in terms of the concepts of vector space and manifold. The two approaches to visualization are as follows. The first, dimensionality reduction, reduces high-dimensional data to dimensionality 3 or less to enable graphical representation; the methods presented are (i) variable selection based on variance and (ii) principal component analysis. The second, cluster analysis, represents the structure of data in high-dimensional space directly, without dimensionality reduction.
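
A minimal sketch of the two dimensionality-reduction routes named above, (i) variable selection based on variance and (ii) principal component analysis, using scikit-learn; the synthetic matrix stands in for the historical-corpus feature data.

```python
# Sketch of the two dimensionality-reduction routes named above:
# (i) variable selection by variance, (ii) principal component analysis.
# A synthetic matrix stands in for the historical-corpus features.
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))     # 50 texts x 200 variables
X[:, :5] *= 10                     # make a few variables high-variance

# (i) keep only variables whose variance exceeds the median variance
selected = VarianceThreshold(threshold=np.median(X.var(axis=0))).fit_transform(X)
print("variables kept:", selected.shape[1])

# (ii) project to 2 dimensions for plotting
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)
```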

Visualising Text Co-occurrence Networks

We present a tool for automatically generating a visual summary of unstructured text data retrieved from documents, web sites or social media feeds. Unlike tools such as word clouds, we are able to visualise structures and topic relationships occurring in a document. These relationships are determined by a unique approach to co-occurrence analysis. The algorithm applies a decaying function to the distance between word pairs found in the original text such that words regularly occurring close to each other score highly, but even words occurring some distance apart will make a small contribution to the overall co-occurrence score. This is in contrast to other algorithms which simply count adjacent words or use a sliding window of fixed size. We show, with examples, how the network generated can be presented in tree or graph format. The tree format allows the user to interact with the visualisation and expand or contract the data to a preferred level of detail. The tool is available as a web application and can be viewed using any modern web browser.

1 Background

Visual representations have proved to be useful alternatives to linear text documents. The mind mapping technique was introduced in the 1960s and is thought to encourage learning. However, creating mind maps can be a complex and time-consuming undertaking, and the ability to automatically produce text visualisations has attracted significant research in recent decades. A number of possible benefits have been attributed to such tools, including managing information overload, providing summaries and 'impression formation'. Tools have been developed for identifying topics and topic correlations, displaying knowledge and generating concept clouds [1][2]. Here we will briefly outline a number of existing techniques and then show how we have developed a method based on word co-occurrence which can be used for generating both graphs and trees in various types of diagram. We include a number of example visualisations, all of which are based on the text of a paper concerning conceptual structures [3] (available at http://www.jfsowa.com/pubs/ca4cs.pdf; it may help the reader to briefly read the article before viewing the visualisations).
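
The decaying co-occurrence score described in the abstract can be sketched as follows; the exact decay function is not specified above, so the exponential form and its parameters are illustrative assumptions.

```python
# Sketch of distance-decayed co-occurrence scoring as described above.
# The exact decay function is not given in the text, so the exponential
# here and its parameters are illustrative assumptions.
from collections import Counter
from itertools import combinations
import math

def cooccurrence_scores(tokens, decay=0.5, max_dist=20):
    """Score each word pair by exp(-decay * distance), summed over all
    occurrences: near pairs score highly, distant pairs still
    contribute a little."""
    scores = Counter()
    for (i, w1), (j, w2) in combinations(enumerate(tokens), 2):
        d = j - i
        if d > max_dist:
            continue  # ignore pairs beyond the cutoff distance
        pair = tuple(sorted((w1, w2)))
        scores[pair] += math.exp(-decay * d)
    return scores

text = "conceptual graphs support reasoning over conceptual structures".split()
for pair, score in cooccurrence_scores(text).most_common(5):
    print(f"{score:.3f}  {pair}")
```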