Ganesh Ramesh - Academia.edu

Papers by Ganesh Ramesh

Can Attackers Learn from Samples

Sampling is often used to achieve disclosure limitation for categorical and microarray datasets. The motivation is that while the public gets a snapshot of what is in the data, the entire dataset is not revealed and hence complete disclosure is prevented. However, the presence of prior knowledge is often overlooked in risk assessment. A sample plays an important role in risk analysis and can be used by a malicious user to construct prior knowledge of the domain. In this paper, we focus on formalizing the various kinds of prior knowledge an attacker can develop using samples and make the following contributions. We abstract various types of prior knowledge and define measures of quality which enable us to quantify how good the prior knowledge is with respect to the true knowledge given by the database. We propose a lightweight, general-purpose sampling framework with which a data owner can assess the impact of various sampling methods on the quality of prior knowledge. Finally, through a systematic set of experiments using real benchmark datasets, we study the effect of various sampling parameters on the quality of prior knowledge obtained from these samples. Such an analysis can help the data owner make informed decisions about releasing samples to achieve disclosure limitation.
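To make the sampling-and-assessment idea concrete, here is a minimal sketch, not taken from the paper: the toy dataset, the simple-random-sampling choice, and the mean-absolute-error quality measure are all illustrative assumptions. It draws a sample from a categorical dataset and measures how closely the item frequencies an attacker could estimate from it track the true frequencies.

```python
import random
from collections import Counter

def item_frequencies(rows):
    """Relative frequency of each item across all rows."""
    counts = Counter(item for row in rows for item in row)
    n = len(rows)
    return {item: c / n for item, c in counts.items()}

def sample_quality(rows, sample_fraction, seed=0):
    """Draw a simple random sample and compare the item frequencies an
    attacker could estimate from it against the true frequencies.
    Returns the mean absolute error over all items (a lower value means
    the sample leaks a more accurate picture of the data).
    This is an illustrative proxy, not the paper's quality measure."""
    rng = random.Random(seed)
    k = max(1, int(sample_fraction * len(rows)))
    sample = rng.sample(rows, k)
    true_freq = item_frequencies(rows)
    est_freq = item_frequencies(sample)
    return sum(abs(true_freq[i] - est_freq.get(i, 0.0))
               for i in true_freq) / len(true_freq)

# Example: a toy categorical dataset of "transactions" (hypothetical).
data = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}, {"c"}] * 20
for frac in (0.05, 0.2, 0.5):
    print(frac, round(sample_quality(data, frac), 3))
```

A data owner could run this kind of loop over several sampling methods and sample sizes to see how quickly an attacker's estimates converge to the true distribution.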

Feasible itemset distributions in data mining: theory and application

Computing frequent itemsets and maximal frequent itemsets in a database are classic problems in data mining. The resource requirements of all extant algorithms for both problems depend on the distribution of frequent patterns, a topic that has not been formally investigated. In this paper, we study properties of length distributions of frequent and maximal frequent itemset collections and provide novel solutions for computing tight lower bounds for feasible distributions. We show how these bounding distributions can help in generating realistic synthetic datasets, which can be used for algorithm benchmarking.
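For intuition about the object being bounded, the following small sketch is brute force and unrelated to the paper's algorithms; the toy transactions and the 0.4 support threshold are assumptions. It enumerates all frequent itemsets of a tiny transaction database and tallies their length distribution.

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, minsup):
    """Brute-force enumeration of all itemsets whose support (fraction of
    transactions containing them) is at least minsup."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    frequent = []
    for size in range(1, len(items) + 1):
        found_any = False
        for cand in combinations(items, size):
            s = set(cand)
            support = sum(1 for t in transactions if s <= t) / n
            if support >= minsup:
                frequent.append(s)
                found_any = True
        if not found_any:
            # Anti-monotonicity: no frequent itemset of this size means
            # there can be no longer one either.
            break
    return frequent

def length_distribution(itemsets):
    """Number of itemsets of each length -- the kind of distribution the paper studies."""
    return dict(sorted(Counter(len(s) for s in itemsets).items()))

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
freq = frequent_itemsets(transactions, minsup=0.4)
print(length_distribution(freq))  # {1: 3, 2: 3, 3: 1}
```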

Multi-Source Combined-Media Video Tracking for Summarization

Video summarization is receiving increasing attention due to the large amount of video content made available on the Internet. We present an approach to tracking video from multiple sources for video summarization. An algorithm that takes advantage of both video and closed-caption text information for video scene clustering is described. Experimental results are given, followed by a discussion of future directions.
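As a rough illustration of the text side of such clustering only, the sketch below is hypothetical: the paper combines visual features as well, and the word-overlap similarity, threshold, and segmentation rule here are my own assumptions. It groups consecutive closed-caption segments into scenes whenever their similarity stays above a threshold.

```python
def jaccard(a, b):
    """Word-overlap similarity between two caption segments."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def cluster_captions(captions, threshold=0.2):
    """Group consecutive closed-caption segments into 'scenes': start a new
    scene whenever similarity to the previous segment drops below the threshold."""
    scenes, current = [], [captions[0]]
    for prev, cur in zip(captions, captions[1:]):
        if jaccard(prev, cur) >= threshold:
            current.append(cur)
        else:
            scenes.append(current)
            current = [cur]
    scenes.append(current)
    return scenes

captions = [
    "the senate passed the budget bill today",
    "the budget bill now moves to the house",
    "in sports the local team won again",
    "the team celebrated the win downtown",
]
print(len(cluster_captions(captions)))  # 2 scenes
```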

Indexing and Data Access Methods for Database Mining

Most of today's techniques for data mining, and association rule mining (ARM) in particular, are really "flat-file mining", since the database is typically dumped to an intermediate flat file that is input to the mining software. Previous research on integrating ARM with databases mainly looked at exploiting the query language (SQL) as a tool for implementing mining algorithms. In this paper we explore an alternative approach, using various data access methods and systems programming techniques to study the efficiency of mining data.
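For context on the SQL-based style of integration the abstract refers to, here is a minimal sketch; it is not from the paper, and the in-memory SQLite database, table layout, and support threshold are illustrative assumptions. It counts the support of 2-itemsets with a self-join, the kind of query-level mining the paper contrasts with lower-level data access methods.

```python
import sqlite3

# Transactions stored as (tid, item) rows; table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trans (tid INTEGER, item TEXT)")
rows = [(1, "a"), (1, "b"), (2, "a"), (2, "b"), (2, "c"),
        (3, "a"), (3, "c"), (4, "b"), (4, "c")]
conn.executemany("INSERT INTO trans VALUES (?, ?)", rows)

min_support = 2  # absolute support threshold (assumed)

# Support of every 2-itemset via a self-join and GROUP BY.
query = """
    SELECT t1.item AS item1, t2.item AS item2, COUNT(*) AS support
    FROM trans t1 JOIN trans t2
      ON t1.tid = t2.tid AND t1.item < t2.item
    GROUP BY t1.item, t2.item
    HAVING COUNT(*) >= ?
"""
for item1, item2, support in conn.execute(query, (min_support,)):
    print(item1, item2, support)
```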

k-Anonymization of Social Networks By Vertex Addition

With an abundance of social network data being released, the need to protect sensitive information within these networks has become an important concern of data publishers. In this paper we focus on the popular notion of k-anonymization as applied to node degrees in a social network. Given such a network N, the problem we study is to transform N to N′ such that the degree of each node in N′ is attained by at least k − 1 other nodes in N′. In contrast to previous work, we permit modifications to the node set rather than the edge set, and this offers unique advantages with respect to the utility of the released anonymized network. We study both vertex-labeled and unlabeled graphs, since instances of each occur in real-world social networks. Under the constraint of minimum node additions, we show that the problem is NP-complete on vertex-labeled graphs. For unlabeled graphs, we give an efficient (near-linear) algorithm and show that it gives solutions that are optimal modulo k, a guarantee that is novel in the literature. Additionally, we demonstrate empirically that commonly studied structural properties of the network, such as the clustering coefficient, are only slightly distorted by the anonymization procedure.
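The anonymity condition itself is easy to state in code. Below is a minimal sketch, illustrative only and not the paper's algorithm; the adjacency-dict representation and the toy graph are assumptions. It checks whether every degree value in a graph is shared by at least k nodes, which is exactly the degree k-anonymity property the transformation must achieve.

```python
from collections import Counter

def is_degree_k_anonymous(adjacency, k):
    """True if every node's degree is shared by at least k-1 other nodes,
    i.e. every degree value occurs at least k times in the graph."""
    degree_counts = Counter(len(neighbors) for neighbors in adjacency.values())
    return all(count >= k for count in degree_counts.values())

# Toy undirected graph given as an adjacency dict (hypothetical).
graph = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2, 4},
    4: {3},
}
print(is_degree_k_anonymous(graph, 2))  # False: degrees are 2, 2, 3, 1
```

A vertex-addition approach would add new nodes (and edges incident to them) until this check passes, which is the setting the paper analyzes.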

Distribution-Based Synthetic Database Generation Techniques for Itemset Mining

The resource requirements of frequent pattern mining algorithms depend mainly on the length distribution of the mined patterns in the database. Synthetic databases, which are used to benchmark the performance of algorithms, tend to have distributions far different from those observed in real datasets. In this paper we focus on the problem of synthetic database generation, propose algorithms to effectively embed any given set of maximal pattern collections within the database, and make the following contributions: 1. A database generation technique is presented which takes k maximal itemset collections as input and constructs a database that produces these maximal collections as output when mined at k levels of support. To analyze the efficiency of the procedure, upper bounds are provided on the number of transactions output in the generated database. 2. A compression method is used and extended to reduce the size of the output database. An optimization to the generation procedure is provided which can potentially reduce the number of transactions generated. 3. Preliminary experimental results are presented to demonstrate the feasibility of using the generation technique.
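To see what "embedding a maximal collection" means in the simplest case, here is a naive baseline sketch; it is my own illustration, not the paper's technique, which handles multiple collections at multiple support levels and compresses the output. If the collection is an antichain (no set contains another), repeating each maximal itemset as a transaction `support` times yields a database whose maximal frequent itemsets at that absolute support are exactly the given sets.

```python
def naive_embed(maximal_collection, support):
    """Naive baseline: emit each maximal itemset as a transaction, repeated
    `support` times. If the collection is an antichain, mining the result at
    absolute support `support` recovers exactly these sets as the maximal
    frequent itemsets. The database is much larger than necessary, which is
    the inefficiency a real generation/compression technique would address."""
    transactions = []
    for itemset in maximal_collection:
        transactions.extend([set(itemset)] * support)
    return transactions

# Hypothetical target collection (an antichain) and support level.
maximal = [{"a", "b", "c"}, {"b", "d"}, {"c", "d", "e"}]
db = naive_embed(maximal, support=3)
print(len(db), "transactions")  # 9 transactions
```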
