Vanraj Vala - Academia.edu
Papers by Vanraj Vala
2021 IEEE 15th International Conference on Semantic Computing (ICSC)
Natural Language Processing (NLP) has emerged as a crucial research area in the Machine Learning domain. NLP concerns itself with understanding human language and deriving meaningful inferences from textual data. In most NLP models, word embeddings are a good starting point. A word embedding is a vector representation of words in a predetermined space. These word vectors follow typical vector rules and depict the similarity or dissimilarity of words in a vector space. GloVe, Word2Vec and FastText are some of the most popular pre-trained word embeddings. However, since these embeddings are trained on very large corpora, they tend to be very large in size, which makes them difficult to use in a memory-constrained environment. The real values of these embeddings also make downstream tasks slow. In this paper, we present a novel approach to convert continuous pre-trained embeddings to a binary representation without degrading the semantic information they carry. The binary conversion drastically reduces the size of the resultant embedding and facilitates porting to memory-constrained devices. Experiments have shown results comparable to the original embeddings in downstream tasks, with a 20 to 40 times reduction in file size.
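A rough illustration of the binarization idea (a minimal sketch under simple assumptions, not the paper's exact method): each embedding dimension can be thresholded at its mean over the vocabulary, so every float becomes a single bit while similar words keep similar bit patterns.

```python
def binarize(vectors):
    """Threshold each dimension at its mean over the vocabulary:
    1 if the value exceeds that dimension's mean, else 0. Packed
    into bits, each float32 dimension (4 bytes) would shrink to
    1 bit, roughly a 32x size reduction before any file overhead."""
    dims = len(vectors[0])
    means = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    return [[1 if v[d] > means[d] else 0 for d in range(dims)] for v in vectors]

# Toy 3-dimensional "embeddings": two similar words, then two
# words similar to each other but unlike the first pair.
emb = [[0.9, -0.2, 0.1],
       [0.8, -0.1, 0.2],
       [-0.7, 0.5, -0.3],
       [-0.6, 0.4, -0.2]]
codes = binarize(emb)  # similar words get identical bit patterns
```

Real schemes are more careful about preserving semantic structure, but the mechanics of trading float precision for bits are the same.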
2021 IEEE 15th International Conference on Semantic Computing (ICSC)
Recent trends have increasingly indicated a shift in search technologies across all applications from syntactic and lexical matching approaches to semantic methods, which aim to understand the intent and contextual meaning of search queries in order to yield more relevant and accurate results. Such methods often rely on semantic ontologies to map query words to concepts and aid in expansion. However, most applications require a domain-specific language definition in order to overcome issues of ambiguity and misinterpretation of meaning. General-purpose ontologies often fall short here and fail to yield appropriate results in specific applications. In this paper, we propose a novel method of building a domain-specific thesaurus for aiding semantic search: we automatically create a refined general thesaurus, then train a Siamese Network in two phases to classify candidate synonyms as relevant or non-relevant to the particular domain. We focus on the application of tag-based gallery image retrieval and extract and utilise information from Google's Conceptual Captions dataset to improve our model's performance. To investigate and justify our training method and architecture, we conduct an ablation study and compare results with our model. We further demonstrate, analytically and empirically, the advantage of representing terms in a domain-specific environment through semantic vectors fine-tuned on corpora related to the domain. Although our experiments focus on building a word ontology specific to image retrieval, our method is generic and can be generalised to any field requiring a domain-specific semantic language.
2017 IEEE 4th International Conference on Soft Computing & Machine Intelligence (ISCMI), 2017
The expectations from computing systems are increasing every year. For systems to multitask and still be highly responsive, the necessary references and dependencies should be readily available in memory. Since memory is limited, it needs to be freed of relatively old references so that new references can be loaded. In Distributed Systems with remote reference dependencies, Stub-Scion Pair (SSP) creation and recollection is a factor in the responsiveness of the system. In this paper, an Intelligent SSP Forecast and Memory Reclamation Strategy is proposed that learns and adapts memory reclamation according to user behaviour and reference dependencies. The proposed method achieves better management of references and SSPs by learning process dependencies and usage patterns and adapting local and remote reference creation and reclamation. The strategy learns user and process behaviour and builds a Bayesian Belief Net; memory reclamation decisions and predictive SSP forecasts are based on the status of, and inferences from, this Belief Net.
Proceedings of the 28th International Conference on Computational Linguistics, 2020
With growing applications of Machine Learning in daily life, Natural Language Processing (NLP) has emerged as a heavily researched area. Finding applications in tasks ranging from simple Q/A chatbots to fully fledged conversational AI, NLP models are vital. Word and sentence embeddings are among the most common starting points of any NLP task. A word embedding represents a given word in a predefined vector space while maintaining vector relations with similar or dissimilar entities. As such, different pretrained embeddings such as Word2Vec, GloVe and FastText have been developed. These embeddings, generated on millions of words, are however very large in size. Having embeddings with floating-point precision also makes downstream evaluation slow. In this paper, we present a novel method to convert a continuous embedding to its binary representation, reducing the overall size of the embedding while keeping the semantic and relational knowledge intact. This facilitates porting such large embeddings onto devices where space is limited. We also present different approaches suitable for different downstream tasks, based on the requirement for contextual and semantic information. Experiments have shown comparable results in downstream tasks with a 7 to 15 times reduction in file size and about a 5% change in evaluation parameters.
2020 IEEE 14th International Conference on Semantic Computing (ICSC), 2020
Following an exponential increase in the number of applications created every year, there are currently over 2.5 million apps in the Google Play Store. Consequently, there has been a sharp rise in the number of apps downloaded by users on their devices. However, limited research has been done on the navigability, grouping, and searching of applications on these devices. Current methods of app classification require manual labelling or extracting information from the app store. Such methods are not only resource-intensive and time-consuming but also not scalable. To overcome these issues, the authors propose a novel architecture for classifying applications into categories using only the information available in their application packages (APKs), consequently removing any external dependency and making the entire process completely on-device. A multimodal deep learning approach is followed in a 2-phase training scheme: neural models are independently trained on distinct sets of information extracted from the APKs, and the learned weights are then assimilated and fine-tuned to incorporate the combined knowledge. Our experiments show significant improvement in the evaluation metrics for app classification and clustering over the set benchmarks. The proposed architecture enables a fully on-device solution for app categorization.
2020 IEEE 14th International Conference on Semantic Computing (ICSC), 2020
Incremental classification has become a hot research topic due to the ever-increasing availability of data, finding application in fields like image and sound analysis. As more data is collected, we may get instances which do not belong to any of the classes previously seen by the classification model. To incorporate the new class, typical multi-class classification methods require retention of the complete previous data, making the training process memory-intensive. This paper puts forth a novel approach to class-incremental classification using support vector machines (SVM) and a subset of the training data known as candidate support vectors (CSV). Using these CSVs, the proposed method facilitates the addition of a new class to a multiclass SVM classifier trained in one-vs-all fashion. Experiments on different datasets achieve accuracy comparable to complete batch training, while retaining only a small subset of the training data.
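A toy sketch of the data-retention idea behind such methods: keep only a small per-class subset of points likely to matter for the decision boundary, and reuse those (together with the new class's data) when retraining. The nearest-to-other-classes heuristic below is purely illustrative and is not the paper's selection criterion.

```python
def candidate_subset(points, labels, k=2):
    """For each class, retain the k points closest to the centroid
    of all *other* classes -- a crude stand-in for 'points near the
    decision boundary'. (Illustrative heuristic only.)"""
    kept = []
    for c in sorted(set(labels)):
        others = [p for p, l in zip(points, labels) if l != c]
        cx = sum(p[0] for p in others) / len(others)
        cy = sum(p[1] for p in others) / len(others)
        own = [(p, l) for p, l in zip(points, labels) if l == c]
        own.sort(key=lambda pl: (pl[0][0] - cx) ** 2 + (pl[0][1] - cy) ** 2)
        kept.extend(own[:k])
    return kept

# Two well-separated 2-D classes; only the boundary-facing
# points of each cluster are retained.
pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
lab = [0, 0, 0, 1, 1, 1]
kept = candidate_subset(pts, lab, k=2)
```

Retraining on `kept` plus the new class's samples is the memory saving: the bulk of the old data never needs to be stored.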
2019 IEEE 9th International Conference on Advanced Computing (IACC), 2019
The huge volume, variance and velocity of data due to the digitalization of various sectors have led to an information explosion. Storing and retrieving huge volumes of data requires appropriate data structures that contribute to the performance optimization of computing systems. The trie and its variants are popular in applications ranging from sub-string search to auto-completion, where strings are used as keys. The trie data structure is characterised by huge memory requirements, making it infeasible to store in primary memory, especially on embedded devices where memory is a constraint; implementing a trie in secondary storage is therefore a feasible solution for such devices. B-trees, B+ trees, B-tries and Burst tries are reported data structures designed to minimize read and write latency in secondary memory. We propose a File-based trie that performs insert, search and delete operations without explicitly loading trie data into primary memory. We compare the performance of the File-based Linked list-trie with the B-trie, B-trees and FlatBuffers on strings from a standard dictionary. Our results demonstrate that the Linked list-trie takes 40 percent less lookup time than the B-trie and B-tree, and 29 percent less lookup time than FlatBuffers. The Linked list-trie consumes 7 KB of main memory and provides support for characters of multiple languages. The implementation can be further investigated for applications involving query completion, prefix matching, etc.
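For reference, a minimal in-memory trie showing the operations involved (the paper's contribution is keeping this structure in secondary storage without loading it into RAM; the node layout and operations below are only the in-memory equivalent):

```python
class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}   # char -> TrieNode; dict keys allow any Unicode
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def search(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None

    def _walk(self, s):
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

t = Trie()
for w in ["tea", "ten", "தமிழ்"]:   # multilingual keys work unchanged
    t.insert(w)
```

A file-based variant replaces the child dictionaries with node records at file offsets, so each `_walk` step is a seek-and-read rather than a pointer dereference.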
2021 IEEE 15th International Conference on Semantic Computing (ICSC), 2021
With increasing connectivity, there has been an exponential surge in the creation and availability of textual content in the form of news articles, blogs, social media posts and product reviews. A large portion of this data is consumed on mobile devices and, more recently, through wearables and smart speakers. Text summarization involves generating a brief description of a text which captures the overall intention and the vital information conveyed in its content. Common techniques for automatic text summarization follow extractive or abstractive approaches and involve large-scale models with millions of parameters. While such models can be utilized in web or cloud-based applications, they are impractical for deployment on devices with limited storage and computational capabilities. In this paper, we propose a novel character-level neural architecture for extractive text summarization, with the model size reduced by 97.98% to 99.64% compared to existing methods, making it suitable for on-device deployment on mobiles, tablets and smart speakers. We tested the performance of our model on various benchmark datasets and compared it with several strong baselines and models. Despite using only a fraction of the space, our model outperformed the baselines and several state-of-the-art models, while coming close in performance to others. On-device text summarization remains a largely unexplored area, and our model's results show a promising approach towards building summarization models suitable for a constrained environment.
Natural Language Processing and Information Systems, 2019
The last few years have seen a consistent increase in the availability and usage of mobile applications (apps). Mobile operating systems have dedicated stores to host these apps and make them easily discoverable. App developers depict their core features in textual descriptions, while consumers share their opinions in the form of user reviews. Apart from these inputs, applications hosted on app stores also carry indicators such as category, app ratings, and age ratings, which affect the retrieval mechanisms and discoverability of these applications. An attempt is made in this paper to jointly model app descriptions and reviews to evaluate their use in predicting other indicators like app category and ratings. A multi-task neural architecture is proposed to learn and analyze the influence of an application's textual data in predicting other categorical parameters. During the training process, the neural architecture also learns generic app-embeddings, which aid in other unsupervised tasks like nearest neighbor analysis and app clustering. Various qualitative and quantitative experiments on these learned embeddings achieve promising results.
Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, 2019
Most Learning To Rank (LTR) algorithms, like Ranking SVM, RankNet, LambdaRank and LambdaMART, use only relevance label judgments as ground truth for training. But in common scenarios like the ranking of information cards (Google Now, other personal assistants), mobile notifications, Netflix recommendations, etc., there is additional information which can be captured from user behavior and how the user interacts with the retrieved items. Within the relevance labels, there might be different sets whose information (i.e. cluster information) can be derived implicitly from user interaction (positive, negative, neutral, etc.) or from explicit user feedback ('Do not show again', 'I like this suggestion', etc.). This additional information provides significant knowledge for training any ranking algorithm using a two-dimensional output variable. This paper proposes a novel method to use the relevance label along with cluster information to better train ranking models. Results on a user-trial Notification Ranking dataset and standard datasets like LETOR 4.0, MSLR-WEB10K and YahooLTR further support this claim.
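One way to picture the two-dimensional label idea (an illustrative sketch, not the paper's training procedure): turn (relevance grade, feedback cluster) pairs into pairwise training preferences, with feedback breaking ties between items of equal relevance. The cluster names and ordering below are assumptions for the example.

```python
# Hypothetical ordering of feedback clusters, strongest first.
FEEDBACK_RANK = {"positive": 2, "neutral": 1, "negative": 0}

def preference_pairs(items):
    """items: list of (doc_id, relevance_grade, feedback_cluster).
    Returns (a, b) pairs meaning 'a should rank above b', comparing
    first on relevance, then on feedback to break ties."""
    pairs = []
    for i, (id_a, rel_a, fb_a) in enumerate(items):
        for id_b, rel_b, fb_b in items[i + 1:]:
            key_a = (rel_a, FEEDBACK_RANK[fb_a])
            key_b = (rel_b, FEEDBACK_RANK[fb_b])
            if key_a > key_b:
                pairs.append((id_a, id_b))
            elif key_b > key_a:
                pairs.append((id_b, id_a))
    return pairs

# d1 and d2 share a relevance grade; user feedback separates them.
docs = [("d1", 1, "negative"), ("d2", 1, "positive"), ("d3", 0, "neutral")]
pairs = preference_pairs(docs)
```

Such pairs can then feed any pairwise LTR objective; plain relevance labels alone would have produced no preference between d1 and d2.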
2020 International Joint Conference on Neural Networks (IJCNN), 2020
Increased connectivity has led to a sharp rise in the creation and availability of structured and unstructured text content, with millions of new documents being generated every minute. Key-phrase extraction is the process of finding the most important words and phrases which best capture the overall meaning and topics of a text document. Common techniques follow supervised or unsupervised methods for extractive or abstractive key-phrase extraction, but struggle to perform well and generalize to different datasets. In this paper, we follow a supervised, extractive approach and model the key-phrase extraction problem as a sequence labeling task. We utilize the power of transformers on sequential tasks and explore the effect of initializing the embedding layer of the model with pre-trained weights. We test our model on different standard key-phrase extraction datasets, and our results significantly outperform all baselines as well as state-of-the-art scores on all the datasets.
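Framing extraction as sequence labeling can be illustrated with standard BIO tags, where each token is labeled as beginning (B), inside (I), or outside (O) a key-phrase span. The tag names and matching logic below are a minimal sketch, not the paper's exact preprocessing.

```python
def bio_tags(tokens, keyphrases):
    """Label each token with B-KP/I-KP/O, marking where the given
    key-phrases occur in the token sequence. This turns extraction
    into per-token classification, as in NER-style sequence labeling."""
    tags = ["O"] * len(tokens)
    for phrase in keyphrases:
        words = phrase.split()
        n = len(words)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == words:
                tags[i] = "B-KP"
                for j in range(i + 1, i + n):
                    tags[j] = "I-KP"
    return tags

sent = "we model key phrase extraction as sequence labeling".split()
tags = bio_tags(sent, ["key phrase extraction", "sequence labeling"])
```

A sequence model (such as a transformer tagger) is then trained to predict these per-token tags, and contiguous B/I runs are decoded back into key-phrases.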
Computación y Sistemas, 2019
Text generation based on comprehensive datasets has been a well-known problem for several years. The biggest challenge is in creating readable and coherent personalized text for a specific user. Deep learning models have had huge success in different text generation tasks such as script creation, translation, caption generation, etc. Most of the existing methods require large amounts of data to perform simple sentence generation that may be used to greet the user or to give a unique reply. This research presents a novel and efficient method to generate sentences using a combination of Context Free Grammars and Hidden Markov Models. We evaluated it using two different methods: the first uses a score similar to the BLEU score, on which the proposed implementation achieved 83% precision on the tweets dataset; the second is a subjective evaluation of the generated messages, in which our method is observed to perform better than others.
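The grammar half of such an approach can be sketched as plain CFG expansion over a made-up toy grammar (an assumption for illustration; the paper additionally uses a Hidden Markov Model, which is not shown here):

```python
import random

def generate(grammar, symbol="S", rng=None):
    """Expand a symbol with a context-free grammar: pick one
    production for a non-terminal and recurse; anything not in the
    grammar is treated as a terminal and emitted as-is."""
    rng = rng or random
    if symbol not in grammar:            # terminal word
        return [symbol]
    production = rng.choice(grammar[symbol])
    words = []
    for sym in production.split():
        words.extend(generate(grammar, sym, rng))
    return words

# Hypothetical toy grammar for short greeting messages.
GRAMMAR = {
    "S": ["GREET NAME"],
    "GREET": ["hello", "good morning"],
    "NAME": ["alice", "bob"],
}
sentence = " ".join(generate(GRAMMAR, rng=random.Random(0)))
```

This produces grammatical but unweighted output; the HMM component's role would be to prefer word sequences that are statistically plausible, which is where the data efficiency claim comes in.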
Computación y Sistemas, 2019
Information Retrieval Systems have revolutionized the organization and extraction of information. In recent years, mobile applications (apps) have become primary tools for collecting and disseminating information. However, limited research is available on how to retrieve and organize mobile apps on users' devices. In this paper, the authors propose a novel method to estimate app-embeddings, which are then applied to tasks like app clustering, classification, and retrieval. Using app-embeddings for query expansion and nearest neighbor analysis enables unique and interesting use cases that enhance the end-user experience with mobile apps.