Vibhu Mittal - Academia.edu

Papers by Vibhu Mittal

Contributing writers

Copyright and permissions should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. To obtain permission(s) to use materials from this work, please submit a written request to Pearson Higher Education, Permissions Department, 1 Lake Street, Upper Saddle River, NJ 07458. The author and publisher of this book have used their best efforts in preparing this book. These

Web search: finding information in billions of pages

Information retrieval, especially in the context of the Web, presents a host of challenges that must be addressed in order to better help people find relevant information in a growing sea of text. Such challenges include not only important issues in building large, scalable systems, but also providing intelligence to these systems to sift, organize, and present relevant information to users. We look at how many of the assumptions in traditional IR systems are challenged in the context of the Web. Moreover, we specifically consider how using the richness of information available on the Web, as well as the structures afforded in a hyperlinked environment, can considerably impact the efficacy of Web retrieval systems.

Ad rendering parameters, such as size, style, and/or layout, of online ads

Generating Hyperlinks and Anchor Text in HTML and Non-HTML Documents

Search Query Categorization for Business Listings Search

Describing complex charts in natural language

Computational Linguistics, Sep 1, 1998

Selecting Text Spans for Document Summaries: Heuristics and Metrics

Human-quality text summarization systems are difficult to design, and even more difficult to evaluate, in part because documents can differ along several dimensions, such as length, writing style and lexical usage. Nevertheless, certain cues can often help suggest the selection of sentences for inclusion in a summary. This paper presents an analysis of news-article summaries generated by sentence extraction. Sentences are ranked for potential inclusion in the summary using a weighted combination of linguistic features derived from an analysis of news-wire summaries. This paper evaluates the relative effectiveness of these features. In order to do so, we discuss the construction of a large corpus of extraction-based summaries, and characterize the underlying degree of difficulty of summarization at different compression levels on articles in this corpus. Results on our feature set are presented after normalization by this degree of difficulty.
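The ranking step described in this abstract can be illustrated with a minimal sketch. The abstract does not list the actual features or weights, so the two features below (`position_score`, `length_score`), their thresholds, and the function names are hypothetical stand-ins chosen only to show how a weighted combination produces a ranking.

```python
from typing import Callable, Dict, List

# A feature maps (sentence, index in document, total sentences) to a score.
Feature = Callable[[str, int, int], float]


def position_score(sentence: str, index: int, total: int) -> float:
    """Hypothetical feature: earlier sentences score higher (1.0 for the first)."""
    return 1.0 - index / total


def length_score(sentence: str, index: int, total: int) -> float:
    """Hypothetical feature: 0 below 5 words, then word count / 20, capped at 1.0."""
    n = len(sentence.split())
    return 0.0 if n < 5 else min(n, 20) / 20


def rank_sentences(sentences: List[str],
                   features: Dict[str, Feature],
                   weights: Dict[str, float]) -> List[int]:
    """Return sentence indices ordered by descending weighted feature score."""
    total = len(sentences)
    scored = []
    for i, sentence in enumerate(sentences):
        score = sum(weights[name] * feature(sentence, i, total)
                    for name, feature in features.items())
        scored.append((score, i))
    # Sort by score, breaking ties in favor of the earlier sentence.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored]


def extract(sentences: List[str],
            features: Dict[str, Feature],
            weights: Dict[str, float],
            k: int) -> List[str]:
    """Keep the k best-ranked sentences, restored to document order."""
    keep = sorted(rank_sentences(sentences, features, weights)[:k])
    return [sentences[i] for i in keep]
```

Varying `k` relative to the document length corresponds to summarizing at different compression levels.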

Form Approved: Report Documentation Page

Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503.

Stemming and its effects on TFIDF Ranking

High precision IR is often hard for a variety of reasons; one

Multi-Document Summarization By Sentence Extraction

This paper discusses a text extraction approach to multi-document summarization that builds on single-document summarization methods by using additional, available information about the document set as a whole and the relationships between the documents. Multi-document summarization differs from single in that the issues of compression, speed, redundancy and passage selection are critical in the formation of useful summaries. Our approach addresses these issues by using domain-independent techniques based mainly on fast, statistical processing, a metric for reducing redundancy and maximizing diversity in the selected passages, and a modular framework to allow easy parameterization for different genres, corpora characteristics and user requirements.
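One way to picture a selection metric that reduces redundancy while maximizing diversity is a greedy loop that trades relevance against similarity to passages already chosen. The sketch below is an illustration in that spirit, not the paper's metric: the Jaccard word-overlap similarity, the trade-off parameter `lam`, and all function names are assumptions.

```python
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens; a deliberately crude tokenizer."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1] (assumed stand-in for a real metric)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def select_passages(query: str, passages: List[str], k: int,
                    lam: float = 0.5) -> List[int]:
    """Greedily pick up to k passage indices.

    Each step picks the candidate maximizing
        lam * relevance_to_query - (1 - lam) * max_similarity_to_selected,
    so lam = 1.0 ignores redundancy and smaller values favor diversity.
    """
    selected: List[int] = []
    candidates = list(range(len(passages)))

    def marginal(i: int) -> float:
        relevance = jaccard(query, passages[i])
        redundancy = max((jaccard(passages[i], passages[j]) for j in selected),
                         default=0.0)
        return lam * relevance - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=marginal)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Because the similarity function is only called through `jaccard`, it could be swapped for a different measure without touching the selection loop, which loosely mirrors the modular parameterization the abstract mentions.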

4. Title and Subtitle 5. Funding Numbers

This document has been approved for public release and sale; its distribution is unlimited. 93-21822

Bridging the Lexical Chasm: Statistical Approaches to Answer-Finding

This paper investigates whether a machine can automatically learn the task of finding, within a large collection of candidate responses, the answers to questions. The learning process consists of inspecting a collection of answered questions and characterizing the relation between question and answer with a statistical model. For the purpose of learning this relation, we propose two sources of data: Usenet FAQ documents and customer service call-center dialogues from a large retail company. We will show that the task of "answer-finding" differs from both document retrieval and traditional question-answering, presenting challenges different from those found in these problems. The central aim of this work is to discover, through theoretical and empirical investigation, those statistical techniques best suited to the answer-finding problem.

Generating Explanatory Captions for Information Graphics

Graphical presentations can be used to communicate information in relational data sets succinctly and effectively. However, novel graphical presentations about numerous attributes and their relationships are often difficult to understand completely until explained. Automatically generated graphical presentations must therefore either be limited to simple, conventional ones, or risk incomprehensibility. One way of alleviating this problem is to design graphical presentation systems that can work in conjunction with a natural language generator to produce "explanatory captions." This paper presents three strategies for generating explanatory captions to accompany information graphics based on: (1) a representation of the structure of the graphical presentation, (2) a framework for identifying the perceptual complexity of graphical elements, and (3) the structure of the data expressed in the graphic. We describe an implemented system and illustrate how it is used to generate ex...

An Evaluation Road Map for Summarization Research

one can apply text similarity metrics in order to automatically create the Extract, i.e., the set of sentence fragments in the Text that were used to write the Abstract, at levels of performance that are close to those of humans. Assuming that one manually or semi-automatically identifies sets of documents on specific topics that already have human generated abstracts (see Figure 1), one can then apply the algorithm described by Jing and McKeown [1999] in order to identify in single documents the sentence fragments that were used in order to produce the abstracts; or the algorithms described by Marcu [1999] and Banko et al. [1999] in order to automatically identify in single documents the clauses/sentences that were used in order to produce the abstracts. Multi-document abstracts at different levels of compression can then be produced manually; it is unlikely that we will be able to find naturally occurring corpora of multi-document summaries. If the selected documents do not have a...

Systems and methods for searching using queries written in a character set and/or language different from the target pages

The invention relates to methods and apparatus that allow a user to submit an ambiguous search query and receive suitable search results. Queries can be expressed using character sets and/or languages that differ from the character set and/or language of at least some of the data to be searched. A translation between the character sets and/or languages can be performed by examining the use of terms in aligned text. Probabilities can be associated with each possible translation. These probabilities can be refined by examining user interactions with the search results.

Decision Class Analysis with Incomplete Information

Automatic Text Summarization of Multiple Documents (Thesis Proposal)

In this era, where electronic text information is exponentially growing and where time is a critical resource, it has become virtually impossible for any user to browse or read large numbers of individual documents. It is therefore important to explore methods of allowing users to locate and browse information quickly within collections of documents. Automatic text summarization of multiple documents fulfills such information seeking goals by providing a method for the user to quickly view highlights and/or relevant portions of document collections. As of yet, there has been little work with multi-document summarization, although single document summarization has been a subject of focus in the last few years. Multi-document summarization differs from single in that the issues of compression, speed, redundancy and passage selection are critical in the formation of useful summaries. If multi-document summarization is to be useful across subject areas and languages, it must be relatively...

Content based Sentence Ordering using Spanning Tree Algorithm for Improved Multi Document Summarization

Because required information is often spread across multiple documents on the web, summarizing these documents and efficiently ordering the sentences in the summary has become a relevant task in data mining. We present a novel sentence ordering method based on a maximum-cost spanning tree algorithm to improve the readability and cohesion of the summary obtained by an extraction method from multiple related documents. It is

Systems and methods to search using written questions in a character set and/or language different from the target pages

A method comprising: identifying (904) a first set of anchor text written in a first format and containing a given term; identifying (906) a set of documents to which the first set of anchor text points; identifying (908) a second set of anchor text written in a second format and pointing to the identified set of documents; analyzing (910) the second set of anchor text to determine that a representation of the given term in the first format corresponds to a representation of the given term in the second format.
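The four claimed steps (904 to 910) can be walked through in a short sketch. The claim does not say how the second set of anchor text is analyzed, so the most-frequent-word heuristic in the last step is an assumption, as are the `(anchor text, target URL)` data shape, the caller-supplied format predicates `in_first` and `in_second`, and the function name `align_term`.

```python
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical data shape: one (anchor text, target URL) pair per hyperlink.
Anchor = Tuple[str, str]


def align_term(term: str,
               anchors: Iterable[Anchor],
               in_first: Callable[[str], bool],
               in_second: Callable[[str], bool]) -> Optional[str]:
    """Guess the second-format counterpart of `term` from shared link targets."""
    anchors = list(anchors)

    # Step 904: first-format anchor text that contains the given term.
    first = [(text, url) for text, url in anchors
             if in_first(text) and term in text.split()]

    # Step 906: the documents those anchors point to.
    targets = {url for _, url in first}

    # Step 908: second-format anchor text pointing to the same documents.
    second = [text for text, url in anchors
              if in_second(text) and url in targets]

    # Step 910 (assumed heuristic): the word seen most often in the
    # second-format anchors is taken as the corresponding representation.
    counts = Counter(word for text in second for word in text.split())
    return counts.most_common(1)[0][0] if counts else None
```

For example, with `in_first=str.isascii` and `in_second=lambda t: not t.isascii()`, romanized anchors can be aligned with native-script anchors that link to the same pages. A fuller treatment would attach probabilities to each candidate rather than returning only the top one.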

Scatter-Brain: an experiment in distributed problem solving applied to load balancing

[1989] Proceedings of the Thirteenth Annual International Computer Software & Applications Conference

A framework is presented for distributed problem solving. As an illustration of this system, the implementation of a job scheduler to do load balancing in distributed systems is developed and discussed. The framework is developed around a distributed blackboard system with multiple planes in a hierarchy. Sites in a distributed system have an incorrect and inconsistent picture of the states
