Aleksandr Drozd - Academia.edu
Uploads
Conference papers by Aleksandr Drozd
This paper presents a case study of discovering and classifying verbs in large web corpora. Many tasks in natural language processing require corpora containing billions of words, and at such volumes of data co-occurrence extraction becomes one of the performance bottlenecks in the vector space models of computational linguistics. We propose a co-occurrence extraction kernel based on ternary trees as an alternative (or a complementary stage) to the conventional MapReduce-based approach; this kernel achieves an order-of-magnitude improvement in memory footprint and processing speed. Our classifier successfully and efficiently identified verbs in a 1.2-billion-word untagged corpus of Russian fiction and distinguished between their two aspectual classes. The model proved efficient even for low-frequency vocabulary, including nonce verbs and neologisms.
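The abstract names a ternary-tree co-occurrence kernel; the sketch below illustrates the general idea (a ternary search tree keyed by word, accumulating symmetric-window counts). It is an illustration only, not the paper's kernel: window size, storage layout, and tokenisation are assumptions.

```python
# Illustrative co-occurrence counting with a ternary search tree (not the paper's kernel).
from collections import Counter

class TSTNode:
    __slots__ = ("ch", "lo", "eq", "hi", "counts")
    def __init__(self, ch):
        self.ch, self.lo, self.eq, self.hi, self.counts = ch, None, None, None, None

class TernaryTree:
    """Maps a word to a Counter of its context words, keyed character by character."""
    def __init__(self):
        self.root = None

    def counter_for(self, word):
        if self.root is None:
            self.root = TSTNode(word[0])
        node, i = self.root, 0
        while True:
            c = word[i]
            if c < node.ch:
                node.lo = node.lo or TSTNode(c)
                node = node.lo
            elif c > node.ch:
                node.hi = node.hi or TSTNode(c)
                node = node.hi
            elif i + 1 < len(word):
                node.eq = node.eq or TSTNode(word[i + 1])
                node, i = node.eq, i + 1
            else:
                if node.counts is None:
                    node.counts = Counter()
                return node.counts

def extract_cooccurrences(tokens, window=2):
    """Accumulate symmetric-window co-occurrence counts into a ternary tree."""
    tree = TernaryTree()
    for i, w in enumerate(tokens):
        ctx = tree.counter_for(w)
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                ctx[tokens[j]] += 1
    return tree

tree = extract_cooccurrences("он читал книгу и читал газету".split())
print(tree.counter_for("читал"))
```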
We present a case study of a Python-based workflow for a data-intensive natural language processing problem, namely word classification with vector space model methodology. Problems in natural language processing are typically solved in many steps which require transforming the data into vastly different formats (in our case, raw text to sparse matrices to dense vectors), and a Python implementation of each of these steps requires a different solution. We survey existing approaches to using Python for high-performance processing of large volumes of data, and we propose a sample solution for each step of our case study (aspectual classification of Russian verbs), attempting to preserve both efficiency and user-friendliness. For the most computationally intensive part of the workflow we develop a prototype distributed implementation of the co-occurrence extraction module using an IPython.parallel cluster.
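As a hedged illustration of the IPython.parallel prototype mentioned above, the following sketch farms per-shard co-occurrence counting out to a cluster of engines and merges the partial counts. The shard file names and counting details are made up for the example; the cluster is assumed to have been started with `ipcluster start`.

```python
# Illustrative sketch (not the paper's code) of distributing co-occurrence
# counting over an IPython.parallel cluster and merging the partial results.
from collections import Counter
from IPython.parallel import Client   # `from ipyparallel import Client` in newer setups

def count_shard(path, window=2):
    """Count co-occurrence pairs in one text shard (runs on an engine)."""
    from collections import Counter
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            for i, w in enumerate(tokens):
                for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                    if j != i:
                        counts[(w, tokens[j])] += 1
    return counts

if __name__ == "__main__":
    shards = ["corpus.part00.txt", "corpus.part01.txt"]   # hypothetical shard files
    view = Client()[:]                                    # direct view over all engines
    partials = view.map_sync(count_shard, shards)
    total = sum(partials, Counter())                      # merge per-shard counts
    print(total.most_common(10))
```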
Solving word analogies has become one of the most popular benchmarks for word embeddings, on the assumption that linear relations between word pairs (such as king:man :: queen:woman) are indicative of the quality of the embedding. We question this assumption by showing that information not detected by the linear offset may still be recoverable by a more sophisticated search method, and thus is actually encoded in the embedding. The general problem with the linear offset is its sensitivity to the idiosyncrasies of individual words. We show that simple averaging over multiple word pairs improves over the state of the art. A further improvement in accuracy (up to 30% for some embeddings and relations) is achieved by combining cosine similarity with an estimation of the extent to which a candidate answer belongs to the correct word class. In addition to this practical contribution, this work highlights the problem of the interaction between word embeddings and analogy retrieval algorithms, and its implications for the evaluation of word embeddings and for the use of analogies in extrinsic tasks.
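The two retrieval strategies contrasted above can be sketched in a few lines of numpy: the classic single-pair vector offset versus an offset averaged over several example pairs. This is a sketch under the assumption of unit-normalised vectors stored in a dict, not the exact scoring used in the paper.

```python
# Sketch of single-pair offset vs. averaged offset for analogy retrieval.
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def offset_answer(emb, a, b, c, exclude):
    """Classic vector offset: argmax_d cos(d, b - a + c)."""
    target = normalize(emb[b] - emb[a] + emb[c])
    candidates = {w: v for w, v in emb.items() if w not in exclude}
    return max(candidates, key=lambda w: float(normalize(candidates[w]) @ target))

def averaged_offset_answer(emb, pairs, c, exclude):
    """Average the relation offset over several (a, b) example pairs to wash out
    the idiosyncrasies of any single word pair."""
    avg = np.mean([emb[b] - emb[a] for a, b in pairs], axis=0)
    target = normalize(avg + emb[c])
    candidates = {w: v for w, v in emb.items() if w not in exclude}
    return max(candidates, key=lambda w: float(normalize(candidates[w]) @ target))

# Usage: offset_answer(emb, "king", "man", "queen", exclude={"king", "man", "queen"})
```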
Following up on numerous reports of analogy-based identification of "linguistic regularities" in word embeddings, this study applies the widely used vector offset method to four types of linguistic relations: inflectional and derivational morphology, and lexicographic and encyclopedic semantics. We present a balanced test set with 99,200 questions in 40 categories, and we systematically examine how accuracy for different categories is affected by the window size and dimensionality of SVD-based word embeddings. We also show that GloVe and SVD yield similar patterns of results for different categories, offering further evidence for the conceptual similarity between count-based and neural-net-based models.
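For readers unfamiliar with SVD-based embeddings, the following sketch shows where window size and dimensionality enter the pipeline: co-occurrence counts, PPMI weighting, then a truncated SVD. The weighting scheme and vector convention are illustrative assumptions, not necessarily the preprocessing used in the paper.

```python
# Illustrative SVD-based embedding pipeline parameterised by window and dimensionality.
import numpy as np
from collections import Counter
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def svd_embeddings(sentences, window=2, dim=100):
    vocab = {w: i for i, w in enumerate(sorted({w for s in sentences for w in s}))}
    counts = Counter()
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    counts[(vocab[w], vocab[s[j]])] += 1
    rows, cols, vals = zip(*((r, c, v) for (r, c), v in counts.items()))
    X = csr_matrix((vals, (rows, cols)), shape=(len(vocab), len(vocab))).astype(float)
    # PPMI weighting of the raw counts
    total = X.sum()
    row_sum = np.asarray(X.sum(axis=1)).ravel()
    col_sum = np.asarray(X.sum(axis=0)).ravel()
    X = X.tocoo()
    pmi = np.log((X.data * total) / (row_sum[X.row] * col_sum[X.col]))
    X.data = np.maximum(pmi, 0.0)
    # Truncated SVD; word vectors taken as U * sqrt(S) (one common convention)
    U, S, _ = svds(X.tocsc(), k=min(dim, len(vocab) - 1))
    return vocab, U * np.sqrt(S)
```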
Proceedings of The 1st Workshop on Evaluating Vector Space Representations for NLP
This paper presents an analysis of existing methods for the intrinsic evaluation of word embeddings. We show that the main methodological premise of such evaluations is "interpretability" of word embeddings: a "good" embedding produces results that make sense in terms of traditional linguistic categories. This approach is not only of limited practical use, but also fails to do justice to the strengths of distributional meaning representations. We argue for a shift from abstract ratings of word embedding "quality" to exploration of their strengths and weaknesses.
Papers by Aleksandr Drozd
IEEE Transactions on Big Data, 2016
Proceedings of the NAACL Student Research Workshop, 2016
Artificial Life and Robotics, 2016
Swarming is thought to critically improve the efficiency of group foraging, as it allows for error correction of individual mistakes in collective dynamics. High levels of environmental noise may require a critical mass of agents before collective behavior emerges, so sufficient computing power is needed for these effects to appear in simulations. We extend an abstract agent-based swarming model based on the evolution of neural network controllers in order to explore the emergence of swarming further. Our model is grounded in an ecological situation in which agents can access some information from the environment about the resource location, but only through a noisy channel. Swarming then improves the efficiency of group foraging by allowing agents to reach resource areas much more easily, as individual mistakes are corrected by the group dynamics. Since simulating neural controllers and information exchanges between agents is computationally intensive, scaling the simulations up to a critical mass of individuals requires careful optimization of the implementation. We apply techniques from astrophysics known as treecodes to compute the signal propagation and parallelize them efficiently for multi-core architectures. Our results open up future research on signal-based emergent collective behavior as a valid collective strategy for uninformed search over a domain space.
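The treecode idea borrowed from astrophysics can be illustrated with a small Barnes-Hut-style sketch: emitters are grouped in a quadtree, and cells that are far from the evaluation point are approximated by their centre of mass. The 1/r² decay law, the 2-D setting, and the opening angle below are assumptions made for the example, not the paper's signal model.

```python
# Hedged Barnes-Hut-style treecode sketch for summing a 1/r^2 signal from many emitters.
import numpy as np

def signal_at(point, positions, strengths, theta=0.5, eps=1e-9):
    """Approximate sum_i s_i / (|point - p_i|^2 + eps) with a quadtree."""

    def build(idx, center, half):
        if len(idx) == 0:
            return None
        node = {
            "com": np.average(positions[idx], axis=0, weights=strengths[idx]),
            "strength": float(strengths[idx].sum()),
            "size": 2 * half,
            "children": [],
        }
        if len(idx) > 1 and half > 1e-6:        # subdivide until single emitters
            rel = positions[idx] - center
            for dx in (-1, 1):
                for dy in (-1, 1):
                    mask = ((rel[:, 0] >= 0) == (dx > 0)) & ((rel[:, 1] >= 0) == (dy > 0))
                    child = build(idx[mask], center + np.array([dx, dy]) * (half / 2), half / 2)
                    if child is not None:
                        node["children"].append(child)
        return node

    def evaluate(node):
        d = np.linalg.norm(point - node["com"])
        if not node["children"] or node["size"] < theta * d:
            return node["strength"] / (d * d + eps)           # cell is far enough: lump it
        return sum(evaluate(ch) for ch in node["children"])   # otherwise open the cell

    half = np.abs(positions).max() + 1.0
    root = build(np.arange(len(positions)), np.zeros(2), half)
    return evaluate(root)

rng = np.random.default_rng(0)
pos = rng.uniform(-10.0, 10.0, size=(500, 2))
s = np.ones(500)
exact = (s / ((pos ** 2).sum(axis=1) + 1e-9)).sum()
print(exact, signal_at(np.zeros(2), pos, s))   # approximation should be close to the exact sum
```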
2015 IEEE International Conference on Data Science and Data Intensive Systems, 2015
This paper presents a case study of discovering and classifying verbs in large web corpora. Many tasks in natural language processing require corpora containing billions of words, and at such volumes of data co-occurrence extraction becomes one of the performance bottlenecks in the vector space models of computational linguistics. We propose a co-occurrence extraction kernel based on ternary trees as an alternative (or a complementary stage) to the conventional MapReduce-based approach; this kernel achieves an order-of-magnitude improvement in memory footprint and processing speed. Our classifier successfully and efficiently identified verbs in a 1.2-billion-word untagged corpus of Russian fiction and distinguished between their two aspectual classes. The model proved efficient even for low-frequency vocabulary, including nonce verbs and neologisms.
2014 IEEE International Conference on Big Data (Big Data), 2014
Proceedings of the 5th Workshop on Python for High-Performance and Scientific Computing - PyHPC '15, 2015
We present a case study of a Python-based workflow for a data-intensive natural language processing problem, namely word classification with vector space model methodology. Problems in natural language processing are typically solved in many steps which require transforming the data into vastly different formats (in our case, raw text to sparse matrices to dense vectors), and a Python implementation of each of these steps requires a different solution. We survey existing approaches to using Python for high-performance processing of large volumes of data, and we propose a sample solution for each step of our case study (aspectual classification of Russian verbs), attempting to preserve both efficiency and user-friendliness. For the most computationally intensive part of the workflow we develop a prototype distributed implementation of the co-occurrence extraction module using an IPython.parallel cluster.
Lecture Notes in Computer Science, 2013
This paper describes a performance model for the read alignment problem, one of the most computationally intensive tasks in bioinformatics. We adapted a Burrows-Wheeler-transform-based index for use on GPUs to reduce the overall memory footprint. A mathematical model of computation and communication costs was developed to find the optimal memory partitioning between index and queries. Finally, we explored the possibility of using multiple GPUs to reduce data transfers and achieved super-linear speedup. Performance evaluation of the experimental implementation supports our claims and shows a more than 10-fold performance gain per device.
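The memory-partitioning question can be illustrated with a toy cost model: split device memory between index and query chunks, estimate transfer and kernel-launch costs for each split, and pick the minimum. All formulas and constants below are illustrative assumptions, not the model developed in the paper.

```python
# Toy cost model for splitting device memory between index and query chunks.
import math

def estimated_time(mem_gb, index_gb, queries_gb, index_share,
                   pcie_gb_s=8.0, match_gb_s=20.0, launch_overhead_s=0.05):
    """Estimated runtime for one choice of memory split (all constants are made up)."""
    mem_index = mem_gb * index_share
    mem_query = mem_gb - mem_index
    if mem_index <= 0 or mem_query <= 0:
        return math.inf
    n_index = math.ceil(index_gb / mem_index)      # number of index chunks
    n_query = math.ceil(queries_gb / mem_query)    # query chunks per index chunk
    # Each index chunk is loaded once; all queries are re-streamed for each of them.
    transferred = index_gb + n_index * queries_gb
    kernels = n_index * n_query
    return (transferred / pcie_gb_s
            + kernels * launch_overhead_s
            + n_index * queries_gb / match_gb_s)

def best_split(mem_gb, index_gb, queries_gb, steps=99):
    shares = [i / (steps + 1) for i in range(1, steps + 1)]
    return min(shares, key=lambda s: estimated_time(mem_gb, index_gb, queries_gb, s))

print(best_split(6.0, 12.0, 30.0))   # fraction of device memory to give the index
```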
2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012
Bioinformatics is a quickly emerging area of science with many important applications to human life. Sequence alignment in various forms is one of the main instruments used in bioinformatics. This work is motivated by the ever-increasing amount of sequence data that requires more and more computational power for its processing. This task calls for new GPU-based systems, with their higher computational potential and energy efficiency compared to CPUs. We address the problem of facilitating faster sequence alignment using modern multi-GPU clusters. Our initial step was to develop a fast and scalable exact short-sequence aligner for the GPU. We used a matching algorithm with a small memory footprint, based on the Burrows-Wheeler transform, and developed a mathematical model of computation and communication costs to find the optimal memory partitioning strategy for index and queries. Our solution achieves a 10-times speedup over a previous suffix-array-based implementation on one GPU and scales to multiple GPUs. Our next step will be to adapt the suggested data structure and performance model to multi-node, multi-GPU approximate sequence alignment. We also plan to use exact matching to detect common regions in large sequences as an intermediate step in full-scale genome comparison.
We address the problem of performing faster read alignment on GPU devices. The task of DNA sequence processing is extremely computationally intensive, as constant progress in sequencing technology leads to ever-increasing amounts of sequence data [6]. One possible solution is to use the extreme parallel capacities of modern GPU devices [5]. However, performance characteristics and programming models for GPUs differ from those of traditional architectures and require new approaches. Most importantly, host memory and I/O systems are not directly accessible from a GPU device, and GPU memory is usually an order of magnitude smaller than memory on a host. Considering the size of read alignment data, the memory limit becomes a real problem: when the reference sequence index does not fit into memory, it has to be split into chunks that are processed individually. In most cases the complexity of the algorithm does not depend on the index size, so such index splitting increases computation time tremendously. Analysis of existing solutions for read alignment on GPUs showed that the memory limit is the chief performance issue. One attempt to reduce memory consumption consisted in replacing the commonly used suffix tree, which allows for better theoretical performance of the algorithm [4], with a suffix array, which is less efficient in terms of pure computational complexity but more compact; by doing this, the authors of MummerGPU++ achieved several times better performance [3]. We suggest using the Burrows-Wheeler transform [1] for both the index and the corresponding search algorithm to achieve a much smaller memory footprint. This transform is used mainly in compression algorithms such as bzip2, as it replaces reoccurring patterns in a string by continuous runs of a single symbol, but it can also be used for pattern matching [2]. At the same time, we continue using the more traditional suffix array on the host side to benefit from the computational characteristics of both GPU and CPU. We reduced the index size 12 times and, by doing this alone, achieved a 3-4 times performance improvement compared to the suffix-array-based solution MummerGPU++. Since even with this compressed index the workload can exceed available device memory, we developed a performance model to analyze how overall execution time is affected by the proportions and order in which memory is allocated for chunks of the index and the query set. This model allowed us to find the best balance of memory allocation and to double performance compared to the naive approach of allocating equal shares of memory to index and queries. The model is then applied to show that using multiple GPUs is a way not only to speed up the application, but also to overcome some single-GPU performance issues and achieve super-linear scaling, at least on the number of GPUs typically available on one host.
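The BWT-based matching referred to above is, in essence, FM-index backward search. The sketch below is a plain-CPU illustration with a naive index construction; a real aligner would use a compressed index and GPU kernels.

```python
# Minimal FM-index backward search (BWT-based exact matching), for illustration only.
def build_fm_index(text):
    text += "$"                                    # unique end-of-text sentinel
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)         # last column of the sorted rotations
    alphabet = sorted(set(text))
    # C[c]: number of characters in text strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ

def count_matches(pattern, sa, C, occ):
    """Backward search: number of occurrences of `pattern` in the indexed text."""
    lo, hi = 0, len(sa)                            # current suffix-array interval
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

sa, C, occ = build_fm_index("ACGTACGTGACG")
print(count_matches("ACG", sa, C, occ))            # -> 3
```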
2014 IEEE International Congress on Big Data, 2014