Joseph JáJá - Academia.edu
Drafts by Joseph JáJá
BMVC, 2022
Text-VQA aims at answering questions that require understanding the textual cues in an image. Despite the great progress of existing Text-VQA methods, their performance suffers from insufficient human-labeled question-answer (QA) pairs. However, we observe that, in general, the scene text is not fully exploited in the existing datasets: only a small portion of the text in each image participates in the annotated QA activities, which results in a huge waste of useful information. To address this deficiency, we develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the rich text available in the scene context of each image. Specifically, we propose TAG, a text-aware visual question-answer generation architecture that learns to produce meaningful and accurate QA samples using a multimodal transformer. The architecture exploits underexplored scene text information and enhances the scene understanding of Text-VQA models by combining the generated QA pairs with the initial training data. Extensive experimental results on two well-known Text-VQA benchmarks (TextVQA and ST-VQA) demonstrate that TAG effectively enlarges the training data and improves Text-VQA performance without extra labeling effort. Moreover, our model outperforms state-of-the-art approaches that are pre-trained with extra large-scale data. Code is available at https://github.com/HenryJunW/TAG.
Papers by Joseph JáJá
ArXiv, 2016
Diffusion Magnetic Resonance Imaging (MRI) exploits the anisotropic diffusion of water molecules in the brain to enable the estimation of the brain's anatomical fiber tracts at a relatively high resolution. In particular, tractographic methods can be used to generate a whole-brain anatomical connectivity matrix where each element provides an estimate of the connectivity strength between the corresponding voxels. Structural brain networks are built using the connectivity information and a predefined brain parcellation, where the nodes of the network represent the brain regions and the edge weights capture the connectivity strengths between the corresponding brain regions. This paper introduces a number of novel scalable methods to generate and analyze structural brain networks with a varying number of nodes. In particular, we introduce a new parallel algorithm to quickly generate large-scale connectivity-based parcellations for which voxels in a region possess highly similar connectivity...
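The network construction described above (region nodes, edge weights aggregated from voxel-level connectivity) can be sketched in a few lines; the function name and the dense-matrix representation are illustrative assumptions, since the paper works with sparse matrices at a far larger scale:

```python
import numpy as np

def region_network(C, labels, k):
    """Aggregate a voxel-level connectivity matrix C (v x v) into a
    k-region network: edge (a, b) sums the connectivity between all
    voxel pairs assigned to regions a and b."""
    v = len(labels)
    # membership matrix M (v x k): M[i, r] = 1 iff voxel i is in region r
    M = np.zeros((v, k))
    M[np.arange(v), labels] = 1.0
    W = M.T @ C @ M
    np.fill_diagonal(W, 0.0)  # drop within-region self-connectivity
    return W
```

For example, a 3-voxel matrix with voxels {0, 1} in region 0 and voxel 2 in region 1 yields a 2x2 network whose off-diagonal entry sums the four cross-region voxel connections.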
Written by an authority in the field, this book provides an introduction to the design and analysis of parallel algorithms. The emphasis is on the application of the PRAM (parallel random access machine) model of parallel computation, with all its variants, to algorithm analysis. Special attention is given to the selection of relevant data structures and to algorithm design principles that have proved to be useful. Features: uses the PRAM model for parallel computation; covers all essential classes of parallel algorithms; rich exercise sets; written by a highly respected author within the field.
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019
Adversarial training has been successfully applied to build robust models at a certain cost. While the robustness of a model increases, the standard classification accuracy declines. This phenomenon is suggested to be an inherent trade-off. We propose a model that employs feature prioritization by a nonlinear attention module and L2 feature regularization to improve the adversarial robustness and the standard accuracy relative to adversarial training. The attention module encourages the model to rely heavily on robust features by assigning larger weights to them while suppressing non-robust features. The regularizer encourages the model to extract similar features for the natural and adversarial images, effectively ignoring the added perturbation. In addition to evaluating the robustness of our model, we provide justification for the attention module and propose a novel experimental strategy that quantitatively demonstrates that our model is almost ideally aligned with salient data...
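The L2 feature regularizer described above can be sketched as a penalty on the distance between natural and adversarial features; the function name and the mean-over-batch reduction are assumptions, not the paper's exact formulation:

```python
import numpy as np

def feature_reg_loss(f_nat, f_adv):
    """L2 feature regularization (sketch): penalize the squared distance
    between the features of a natural image and those of its adversarial
    counterpart, pushing the network to ignore the added perturbation.
    Inputs are (batch, feature_dim) arrays of extracted features."""
    return float(np.mean(np.sum((f_nat - f_adv) ** 2, axis=1)))
```

In training, this term would be added to the usual adversarial classification loss with a weighting coefficient.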
Supercomputing Frontiers and Innovations, 2017
Parallel Processing Letters, 2016
In this paper, we illustrate the possibility of developing strategies to carry out matrix computations on heterogeneous platforms that achieve native GPU performance on very large data sizes, up to the capacity of the CPU memory. More specifically, we present a dense matrix multiplication strategy on a heterogeneous platform, specifically tailored for the case when the input is too large to fit in the device memory, which achieves near-peak GPU performance. Our strategy involves the development of CUDA-stream-based software pipelines that effectively overlap PCIe data transfers with kernel executions. As a result, we are able to achieve over 1 TFLOPS and 2 TFLOPS on a single node using one and two GPUs, respectively.
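The tiling structure behind such a pipeline can be sketched on the CPU with NumPy; the tile size and function name are illustrative, and the comment marks where the asynchronous copy/compute overlap would occur on a real GPU:

```python
import numpy as np

def tiled_matmul(A, B, tile=2):
    """Multiply A (m x k) by B (k x n) one row-tile of A at a time.

    This mimics, on the CPU, the shape of an out-of-core GPU strategy:
    each tile would be staged to the device while the previous tile's
    kernel runs (double buffering over CUDA streams). Here the tiles
    are simply processed in sequence."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.empty((m, n), dtype=np.result_type(A, B))
    for i in range(0, m, tile):
        # on a GPU: enqueue the H2D copy of the next tile on one stream
        # while another stream computes this tile's product
        C[i:i + tile] = A[i:i + tile] @ B
    return C
```

The result is identical to a single full-size multiplication; only the staging order changes, which is what makes the transfer/compute overlap possible.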
2015 IEEE International Conference on Big Data (Big Data), 2015
Diffusion Tensor Imaging (DTI) is an effective tool for the analysis of structural brain connectivity in normal development and in a broad range of brain disorders. However, efforts to derive inherent characteristics of structural brain networks have been hampered by the very high dimensionality of the data, relatively small sample sizes, and the lack of widely accepted connectivity-based regions of interest (ROIs). Typical approaches have focused either on regions defined by standard anatomical atlases, which do not incorporate anatomical connectivity, or on voxel-wise analysis, which results in a loss of statistical power relative to structure-wise connectivity analysis. In this work, we propose a novel, computationally efficient iterative clustering method to generate connectivity-based whole-brain parcellations that converge to a stable parcellation in a few iterations. Our algorithm is based on a sparse representation of the whole-brain connectivity matrix, which reduces the number of edges from around a half billion to a few million while incorporating the necessary spatial constraints. We show that the resulting regions in a sense capture the inherent connectivity information present in the data, and are stable with respect to initialization and the randomization scheme within the algorithm. These parcellations provide consistent structural regions across the subjects of population samples that are homogeneous with respect to anatomic connectivity. Our method also derives connectivity structures that can be used to distinguish between population samples with known differences in structural connectivity. In particular, new results on structural differences between population samples, such as Females vs Males, Normal Controls vs Schizophrenia, and different age groups in Normal Controls, are also shown.
Index Terms: data-driven whole-brain parcellation; structural connectivity; clustering; statistical analysis; parcellation stability and reproducibility.
Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, 1998
Clusters of symmetric multiprocessors (SMPs) have emerged as the primary candidates for large-scale multiprocessor systems. In this paper, we introduce an efficient sorting algorithm for clusters of SMPs. This algorithm relies on a novel scheme for stably sorting on a single SMP coupled with balanced regular communication on the cluster. Our SMP algorithm seems to be asymptotically faster than any of the published algorithms we are aware of. The algorithms were implemented in C using POSIX threads and the SIMPLE library of communication primitives, and run on a cluster of DEC AlphaServer 2100A systems. Our experimental results verify the scalability and efficiency of our proposed solution and illustrate the importance of considering both the memory hierarchy and the overhead of shifting to multiple nodes.
19th International Conference on Scientific and Statistical Database Management (SSDBM 2007), 2007
We discuss a new efficient out-of-core multidimensional indexing structure, the information-aware 2^n-tree, for indexing very large multidimensional volumetric data. Building a series of (n−1)-dimensional indexing structures on n-dimensional data causes a scalability problem as the resolution continually grows in every dimension. However, building a single n-dimensional indexing structure can cause an indexing-effectiveness problem compared to the former case. The information-aware 2^n-tree is an effort to maximize indexing-structure efficiency by ensuring that the subdivisions of space have coherence as similar as possible along each dimension. It is particularly useful when the data distribution along each dimension consistently shows a different degree of coherence from the other dimensions. Our preliminary results show that our new tree can achieve higher indexing-structure efficiency than previous methods.
Algorithms and Computation, 2004
We present linear-space sub-logarithmic algorithms for handling the 3-dimensional dominance reporting and the 2-dimensional dominance counting problems. Under the RAM model as described in [M. L. Fredman and D. E. Willard, "Surpassing the information theoretic bound with fusion trees", Journal of Computer and System Sciences, 47:424–436, 1993], our algorithms achieve O(log n/log log n + f) query time for the 3-dimensional dominance reporting problem, where f is the output size, and O(log n/log log n) query time for the 2-dimensional dominance counting problem. We extend these results to any constant dimension d ≥ 3, achieving O(n (log n/log log n)^(d−3)) space and O((log n/log log n)^(d−2) + f) query time for the reporting case, and O(n (log n/log log n)^(d−2)) space and O((log n/log log n)^(d−1)) query time for the counting case.
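As a concrete baseline for 2-dimensional dominance counting, here is the textbook offline approach (a sweep over x with a Fenwick tree over y-ranks), which answers queries in O(log n) time per query rather than the paper's O(log n/log log n); all names here are illustrative:

```python
import bisect

class Fenwick:
    """Fenwick (binary indexed) tree counting inserted y-ranks."""
    def __init__(self, n):
        self.t = [0] * (n + 1)
    def add(self, i):
        i += 1
        while i < len(self.t):
            self.t[i] += 1
            i += i & -i
    def prefix(self, i):  # number of inserted ranks <= i
        i += 1
        s = 0
        while i > 0:
            s += self.t[i]
            i -= i & -i
        return s

def dominance_counts(points, queries):
    """For each query (qx, qy), count points (x, y) with x <= qx and y <= qy.
    Offline: process queries in increasing x, inserting each point's
    y-rank as the sweep passes its x-coordinate."""
    ys = sorted({y for _, y in points})
    rank = {y: r for r, y in enumerate(ys)}
    pts = sorted(points)
    order = sorted(range(len(queries)), key=lambda i: queries[i])
    bit = Fenwick(len(ys))
    out = [0] * len(queries)
    j = 0
    for qi in order:
        qx, qy = queries[qi]
        while j < len(pts) and pts[j][0] <= qx:
            bit.add(rank[pts[j][1]])
            j += 1
        r = bisect.bisect_right(ys, qy) - 1  # highest y-rank <= qy
        out[qi] = bit.prefix(r) if r >= 0 else 0
    return out
```

The sub-logarithmic bounds in the paper come from replacing the binary Fenwick tree with higher-fanout word-RAM structures; the sweep-and-count skeleton is the same.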
Scientific Programming, 2009
Interactive high-quality volume rendering is becoming increasingly important as the amount of ever more complex volumetric data steadily grows. While a number of volume rendering techniques have been widely used, ray casting has been recognized as an effective approach for generating high-quality visualizations. However, for most users, the use of ray casting has been limited to very small datasets because of its high demands on computational power and memory bandwidth. The recent introduction of the Cell Broadband Engine (Cell B.E.) processor, which consists of 9 heterogeneous cores designed to handle extremely demanding computations with large streams of data, provides an opportunity to put ray casting into practical use. In this paper, we introduce an efficient parallel implementation of volume ray casting on the Cell B.E. The implementation is designed to take full advantage of the computational power and memory bandwidth of the Cell B.E. using an intricate...
Theoretical Computer Science, 2005
Given a set of n objects, each characterized by d attributes specified at m fixed time instances, we are interested in the problem of designing space-efficient indexing structures such that arbitrary temporal range search queries can be handled efficiently. When m = 1, our problem reduces to the d-dimensional orthogonal search problem. We establish efficient data structures to handle several classes of the general problem. Our results include a linear-size data structure that enables a query time of O(log n log m/log log n + f) for one-sided queries when d = 1, where f is the number of objects satisfying the query. A similar result is shown for counting queries. We also show that the most general problem can be solved with a polylogarithmic query time using nonlinear-space data structures.
SIAM Journal on Computing, 1986
The paper is divided into two main sections. The first deals with a multidimensional search technique of Megiddo [J. Assoc. Comput. Mach., 31 (1984), pp. 114–127], and suggests an improvement. The second gives an application of the technique to the Euclidean one-centre problem in R^d. An algorithm of time complexity O(3^((d+2)^2) n) is derived for this problem. This improves the best previous bound even in the case d = 2.
The Journal of Supercomputing, 1996
This paper presents efficient and portable implementations of a useful image enhancement process, the Symmetric Neighborhood Filter (SNF), and an image segmentation technique which makes use of the SNF and a variant of the conventional connected components algorithm, which we call δ-Connected Components. Our general framework is a single-address-space, distributed-memory programming model. We use efficient techniques for distributing and coalescing data as well as efficient combinations of task and data parallelism. The image segmentation algorithm makes use of an efficient connected components algorithm which uses a novel approach for parallel merging. The algorithms have been coded in Split-C and run on a variety of platforms, including the Thinking Machines CM-5, IBM SP-1 and SP-2, Cray Research T3D, Meiko Scientific CS-2, Intel Paragon, and workstation clusters. Our experimental results are consistent with the theoretical analysis (and provide the best known execution times for segmentation, even when compared with machine-specific implementations). Our test data include difficult images from Landsat Thematic Mapper (TM) satellite data. More efficient implementations of Split-C will likely result in even faster execution times.
Journal of Parallel and Distributed Computing, 2007
Journal of Parallel and Distributed Computing, 1996
This paper presents efficient and portable implementations of two useful primitives in image processing algorithms, histogramming and connected components. Our general framework is a single-address-space, distributed-memory programming model. We use efficient techniques for distributing and coalescing data as well as efficient combinations of task and data parallelism. Our connected components algorithm uses a novel approach for parallel merging which performs drastically limited updating during iterative steps, and concludes with a total consistency update at the final step. The algorithms have been coded in Split-C and run on a variety of platforms. Our experimental results are consistent with the theoretical analysis and provide the best known execution times for these two primitives, even when compared with machine-specific implementations. More efficient implementations of Split-C will likely result in even faster execution times.
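A sequential reference for the connected components primitive can be sketched with union-find (the parallel version merges such structures across processors); the 4-connectivity and the function name are illustrative assumptions:

```python
def label_components(grid):
    """Count 4-connected components of 1-cells in a 0/1 grid using
    union-find with path compression."""
    h, w = len(grid), len(grid[0])
    parent = list(range(h * w))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for i in range(h):
        for j in range(w):
            if grid[i][j]:
                if i and grid[i - 1][j]:
                    union(i * w + j, (i - 1) * w + j)
                if j and grid[i][j - 1]:
                    union(i * w + j, i * w + j - 1)
    # distinct roots among the 1-cells = number of components
    return len({find(i * w + j) for i in range(h) for j in range(w) if grid[i][j]})
```

In the parallel setting, each processor would run this on its own sub-image and then merge only the labels along sub-image boundaries, which is where the "drastically limited updating" pays off.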
Journal of Parallel and Distributed Computing, 1998
Previous schemes for sorting on general-purpose parallel machines have had to choose between poor load balancing and irregular communication or multiple rounds of all-to-all personalized communication. In this paper, we introduce a novel variation on sample sort which uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead. Moreover, unlike previous variations, our algorithm efficiently handles the presence of duplicate values without the overhead of tagging each element with a unique identifier. This algorithm was implemented in Split-C and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2, and the Cray Research T3D. We ran our code using widely different benchmarks to examine the dependence of our algorithm on the input distribution. Our experimental results illustrate the efficiency and scalability of our algorithm across different platforms. In fact, it seems to outperform all similar algorithms known to the authors on these platforms, and its performance is invariant over the set of input distributions, unlike previous efficient algorithms. Our results also compare favorably with those reported for the simpler ranking problem posed by the NAS Integer Sorting (IS) Benchmark.
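The overall shape of sample sort can be sketched sequentially, with the p "processors" simulated by lists; the splitter selection below is deliberately simple and the names are illustrative (the paper's regular-sampling scheme comes with load-balance guarantees that this sketch does not):

```python
import bisect

def sample_sort(data, p):
    """Sample sort over p simulated processors.

    Round 1: each processor sorts its block and contributes regularly
    spaced samples; p-1 global splitters are drawn from the combined
    sample. Round 2: every element of bucket i goes to processor i
    (the single all-to-all), which sorts what it receives."""
    n = len(data)
    blocks = [sorted(data[i * n // p:(i + 1) * n // p]) for i in range(p)]
    sample = sorted(s for b in blocks
                    for s in b[::max(1, len(b) // p)][:p])
    splitters = sample[p - 1::p][:p - 1]
    buckets = [[] for _ in range(p)]
    for b in blocks:
        for x in b:
            buckets[bisect.bisect_left(splitters, x)].append(x)
    return [x for bucket in buckets for x in sorted(bucket)]
```

Because the splitters are globally ordered, concatenating the sorted buckets always yields a fully sorted sequence; only the balance of bucket sizes depends on the sampling quality.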
Journal of Parallel and Distributed Computing, 1999
We describe a methodology for developing high-performance programs running on clusters of SMP nodes. Our methodology is based on a small kernel (SIMPLE) of collective communication primitives that make efficient use of the hybrid shared-memory and message-passing environment. We illustrate the power of our methodology by presenting experimental results for sorting integers, two-dimensional fast Fourier transforms (FFT), and constraint-satisfied searching. Our testbed is a cluster of DEC AlphaServer 2100 4/275 nodes interconnected by an ATM switch.
Journal of Parallel and Distributed Computing, 2001
ACM Journal of Experimental Algorithmics, 1998
We introduce a new deterministic parallel sorting algorithm for distributed memory machines based on the regular sampling approach. The algorithm uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead. Moreover, unlike previous variations, our algorithm efficiently handles the presence of duplicate values without the overhead of tagging each element with a unique identifier. This algorithm was implemented in Split-C and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2-WN, and the Cray Research T3D. We ran our code using widely different benchmarks to examine the dependence of our algorithm on the input distribution. Our experimental results illustrate the efficiency and scalability of our algorithm across different platforms. In fact, the performance compares closely to that of our random sample sort algorithm, which seems to outperform all similar algorithms known...
BMVC, 2022
Text-VQA aims at answering questions that require understanding the textual cues in an image. Des... more Text-VQA aims at answering questions that require understanding the textual cues in an image. Despite the great progress of existing Text-VQA methods, their performance suffers from insufficient human-labeled question-answer (QA) pairs. However, we ob- serve that, in general, the scene text is not fully exploited in the existing datasets– only a small portion of the text in each image participates in the annotated QA activities. This results in a huge waste of useful information. To address this deficiency, we develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image. Specifically, we propose, TAG, a text-aware visual question-answer generation architecture that learns to produce meaningful, and accurate QA samples using a multimodal transformer. The architec- ture exploits underexplored scene text information and enhances scene understanding of Text-VQA models by combining the generated QA pairs with the initial training data. Extensive experimental results on two well-known Text-VQA benchmarks (TextVQA and ST-VQA) demonstrate that our proposed TAG effectively enlarges the training data that helps improve the Text-VQA performance without extra labeling effort. Moreover, our model outperforms state-of-the-art approaches that are pre-trained with extra large- scale data. Code is available at https://github.com/HenryJunW/TAG.
ArXiv, 2016
Diffusion Magnetic Resonance Imaging (MRI) exploits the anisotropic diffusion of water molecules ... more Diffusion Magnetic Resonance Imaging (MRI) exploits the anisotropic diffusion of water molecules in the brain to enable the estimation of the brain's anatomical fiber tracts at a relatively high resolution. In particular, tractographic methods can be used to generate whole-brain anatomical connectivity matrix where each element provides an estimate of the connectivity strength between the corresponding voxels. Structural brain networks are built using the connectivity information and a predefined brain parcellation, where the nodes of the network represent the brain regions and the edge weights capture the connectivity strengths between the corresponding brain regions. This paper introduces a number of novel scalable methods to generate and analyze structural brain networks with a varying number of nodes. In particular, we introduce a new parallel algorithm to quickly generate large scale connectivity-based parcellations for which voxels in a region possess highly similar connec...
Written by an authority in the field, this book provides an introduction to the design and analys... more Written by an authority in the field, this book provides an introduction to the design and analysis of parallel algorithms. The emphasis is on the application of the PRAM (parallel random access machine) model of parallel computation, with all its variants, to algorithm analysis. Special attention is given to the selection of relevant data structures and to algorithm design principles that have proved to be useful. Features *Uses PRAM (parallel random access machine) as the model for parallel computation. *Covers all essential classes of parallel algorithms. *Rich exercise sets. *Written by a highly respected author within the field. 0201548569B04062001
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019
Adversarial training has been successfully applied to build robust models at a certain cost. Whil... more Adversarial training has been successfully applied to build robust models at a certain cost. While the robustness of a model increases, the standard classification accuracy declines. This phenomenon is suggested to be an inherent trade-off. We propose a model that employs feature prioritization by a nonlinear attention module and L2 feature regularization to improve the adversarial robustness and the standard accuracy relative to adversarial training. The attention module encourages the model to rely heavily on robust features by assigning larger weights to them while suppressing non-robust features. The regularizer encourages the model to extract similar features for the natural and adversarial images, effectively ignoring the added perturbation. In addition to evaluating the robustness of our model, we provide justification for the attention module and propose a novel experimental strategy that quantitatively demonstrates that our model is almost ideally aligned with salient data ...
Supercomputing Frontiers and Innovations, 2017
Parallel Processing Letters, 2016
In this paper, we illustrate the possibility of developing strategies to carry out matrix computa... more In this paper, we illustrate the possibility of developing strategies to carry out matrix computations on heterogeneous platforms which achieve native GPU performance on very large data sizes up to the capacity of the CPU memory. More specifically, we present a dense matrix multiplication strategy on a heterogeneous platform, specifically tailored for the case when the input is too large to fit on the device memory, which achieves near peak GPU performance. Our strategy involves the development of CUDA stream based software pipelines that effectively overlap PCIe data transfers with kernel executions. As a result, we are able to achieve over 1 and 2 TFLOPS performance on a single node using 1 and 2 GPUs respectively.
2015 IEEE International Conference on Big Data (Big Data), 2015
Diffusion Tensor Imaging (DTI) is an effective tool for the analysis of structural brain connecti... more Diffusion Tensor Imaging (DTI) is an effective tool for the analysis of structural brain connectivity in normal development and in a broad range of brain disorders. However efforts to derive inherent characteristics of structural brain networks have been hampered by the very high dimensionality of the data, relatively small sample sizes, and the lack of widely acceptable connectivity-based regions of interests (ROIs). Typical approaches have focused either on regions defined by standard anatomical atlases that do not incorporate anatomical connectivity, or have been based on voxel-wise analysis, which results in loss of statistical power relative to structurewise connectivity analysis. In this work, we propose a novel, computationally efficient iterative clustering method to generate connectivity-based whole-brain parcellations that converge to a stable parcellation in a few iterations. Our algorithm is based on a sparse representation of the whole brain connectivity matrix, which reduces the number of edges from around a half billion to a few million while incorporating the necessary spatial constraints. We show that the resulting regions in a sense capture the inherent connectivity information present in the data, and are stable with respect to initialization and the randomization scheme within the algorithm. These parcellations provide consistent structural regions across the subjects of population samples that are homogeneous with respect to anatomic connectivity. Our method also derives connectivity structures that can be used to distinguish between population samples with known different structural connectivity. In particular, new results in structural differences for different population samples such as Females vs Males, Normal Controls vs Schizophrenia, and different age groups in Normal Controls are also shown. 
Index Terms-data-driven whole-brain parcellation; structural connectivity; clustering; statistical analysis; parcellation stability and reproducibility.
Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, 1998
Clusters of symmetric multiprocessors (SMPs) have emerged as the primary candidates for large sca... more Clusters of symmetric multiprocessors (SMPs) have emerged as the primary candidates for large scale multiprocessor systems. In this paper, we introduce an e cient sorting algorithm for clusters of SMPs. This algorithm relies on a novel scheme for stably sorting on a single SMP coupled with balanced regular communication on the cluster. Our SMP algorithm seems to be asymptotically faster than any of the published algorithms we are aware of. The algorithms were implemented in C using Posix Threads and the SIMPLE library of communication primitives and run on a cluster of DEC AlphaServer 2100A systems. Our experimental results verify the scalability and e ciency of our proposed solution and illustrate the importance of considering both memory hierarchy and the overhead of shifting to multiple nodes.
19th International Conference on Scientific and Statistical Database Management (SSDBM 2007), 2007
We discuss a new efficient out-of-core multidimensional indexing structure, information-aware 2 n... more We discuss a new efficient out-of-core multidimensional indexing structure, information-aware 2 n-tree, for indexing very large multidimensional volumetric data. Building a series of (n-1)-Dimensional indexing structures on n-Dimensional data causes a scalability problem in the situation of continually growing resolution in every dimension. However, building a single n-Dimensional indexing structure can cause an indexing effectiveness problem compared to the former case. The informationaware 2 n-tree is an effort to maximize the indexing structure efficiency by ensuring that the subdivision of space have as similar coherence as possible along each dimension. It is particularly useful when data distribution along each dimension constantly shows a different degree of coherence from each other dimension. Our preliminary results show that our new tree can achieve higher indexing structure efficiency than previous methods.
Algorithms and Computation, 2004
We present linear-space sub-logarithmic algorithms for handling the 3-dimensional dominance repor... more We present linear-space sub-logarithmic algorithms for handling the 3-dimensional dominance reporting and the 2-dimensional dominance counting problems. Under the RAM model as described in [M. L. Fredman and D. E. Willard. "Surpassing the information theoretic bound with fusion trees", Journal of Computer and System Sciences, 47:424-436, 1993], our algorithms achieve O(log n/ log log n + f) query time for the 3-dimensional dominance reporting problem, where f is the output size, and O(log n/ log log n) query time for the 2-dimensional dominance counting problem. We extend these results to any constant dimension d ≥ 3, achieving O(n(log n/ log log n) d−3) space and O((log n/ log log n) d−2 + f) query time for the reporting case and O(n(log n/ log log n) d−2) space and O((log n/ log log n) d−1) query time for the counting case.
Scientific Programming, 2009
Interactive high quality volume rendering is becoming increasingly more important as the amount o... more Interactive high quality volume rendering is becoming increasingly more important as the amount of more complex volumetric data steadily grows. While a number of volumetric rendering techniques have been widely used, ray casting has been recognized as an effective approach for generating high quality visualization. However, for most users, the use of ray casting has been limited to datasets that are very small because of its high demands on computational power and memory bandwidth. However the recent introduction of the Cell Broadband Engine (Cell B.E.) processor, which consists of 9 heterogeneous cores designed to handle extremely demanding computations with large streams of data, provides an opportunity to put the ray casting into practical use. In this paper, we introduce an efficient parallel implementation of volume ray casting on the Cell B.E. The implementation is designed to take full advantage of the computational power and memory bandwidth of the Cell B.E. using an intrica...
Theoretical Computer Science, 2005
Given a set of n objects, each characterized by d attributes specified at m fixed time instances, we are interested in the problem of designing space-efficient indexing structures such that arbitrary temporal range search queries can be handled efficiently. When m = 1, our problem reduces to the d-dimensional orthogonal search problem. We establish efficient data structures to handle several classes of the general problem. Our results include a linear-size data structure that enables a query time of O(log n log m / log log n + f) for one-sided queries when d = 1, where f is the number of objects satisfying the query. A similar result is shown for counting queries. We also show that the most general problem can be solved with a polylogarithmic query time using nonlinear-space data structures.
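One way to read the one-sided query for d = 1 is: given a time instance t and a threshold a, report the objects whose attribute value at time t is at most a. Under that interpretation (my reading, not a definition from the abstract), a naive linear scan defines the semantics that the indexing structure accelerates:

```python
def one_sided_query(values, t, a):
    """Naive scan for a one-sided temporal query (d = 1).

    values[i][t] holds object i's single attribute at time instance t.
    Reports every object whose value at time t is at most a; the paper's
    structure answers the same query in O(log n log m / log log n + f) time.
    """
    return [i for i, row in enumerate(values) if row[t] <= a]
```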
SIAM Journal on Computing, 1986
The paper is divided into two main sections. The first deals with a multidimensional search technique of Megiddo [J. Assoc. Comput. Mach., 31 (1984), pp. 114–127], and suggests an improvement. The second gives an application of the technique to the Euclidean one-centre problem in R^d. An algorithm of time-complexity O(3^((d+2)^2) n) is derived for this problem. This improves the best previous bound even in the case d = 2.
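For intuition, the Euclidean one-centre problem asks for the centre of the smallest ball enclosing all points. The sketch below uses the simple "move toward the farthest point" iteration (the Bâdoiu–Clarkson approximation), which is not the paper's exact algorithm but makes the objective concrete; the function name is illustrative.

```python
import math

def approx_one_centre(points, iters=1000):
    """Approximate Euclidean one-centre: repeatedly step toward the
    farthest point with shrinking step size 1/(k+1). Returns (centre, radius).
    This is the Badoiu-Clarkson heuristic, not the paper's exact O(3^((d+2)^2) n)
    algorithm."""
    c = list(points[0])
    for k in range(1, iters + 1):
        far = max(points, key=lambda p: sum((pi - ci) ** 2 for pi, ci in zip(p, c)))
        step = 1.0 / (k + 1)
        c = [ci + step * (pi - ci) for ci, pi in zip(c, far)]
    radius = max(math.dist(c, p) for p in points)
    return c, radius
```

After k iterations the returned radius is within a factor of roughly 1 + 1/sqrt(k) of optimal.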
The Journal of Supercomputing, 1996
This paper presents efficient and portable implementations of a useful image enhancement process, the Symmetric Neighborhood Filter (SNF), and an image segmentation technique which makes use of the SNF and a variant of the conventional connected components algorithm which we call δ-Connected Components. Our general framework is a single-address space, distributed memory programming model. We use efficient techniques for distributing and coalescing data as well as efficient combinations of task and data parallelism. The image segmentation algorithm makes use of an efficient connected components algorithm which uses a novel approach for parallel merging. The algorithms have been coded in Split-C and run on a variety of platforms, including the Thinking Machines CM-5, IBM SP-1 and SP-2, Cray Research T3D, Meiko Scientific CS-2, Intel Paragon, and workstation clusters. Our experimental results are consistent with the theoretical analysis and provide the best known execution times for segmentation, even when compared with machine-specific implementations. Our test data include difficult images from the Landsat Thematic Mapper (TM) satellite data. More efficient implementations of Split-C will likely result in even faster execution times.
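The variant connected-components idea can be sketched sequentially: label 4-connected pixels whose grey levels differ by at most a tolerance δ. This is a simple BFS sketch of that notion under my reading of the abstract; the paper's version is parallel with a novel merging step.

```python
from collections import deque

def delta_connected_components(image, delta):
    """Label 4-connected pixels whose grey levels differ by at most `delta`.
    Sequential BFS sketch; returns a grid of component labels."""
    rows, cols = len(image), len(image[0])
    labels = [[-1] * cols for _ in range(rows)]
    next_label = 0
    for sr in range(rows):
        for sc in range(cols):
            if labels[sr][sc] != -1:
                continue
            labels[sr][sc] = next_label
            q = deque([(sr, sc)])
            while q:
                r, c = q.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and labels[nr][nc] == -1
                            and abs(image[nr][nc] - image[r][c]) <= delta):
                        labels[nr][nc] = next_label
                        q.append((nr, nc))
            next_label += 1
    return labels
```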
Journal of Parallel and Distributed Computing, 2007
Journal of Parallel and Distributed Computing, 1996
This paper presents efficient and portable implementations of two useful primitives in image processing algorithms, histogramming and connected components. Our general framework is a single-address space, distributed memory programming model. We use efficient techniques for distributing and coalescing data as well as efficient combinations of task and data parallelism. Our connected components algorithm uses a novel approach for parallel merging which performs drastically limited updating during iterative steps, and concludes with a total consistency update at the final step. The algorithms have been coded in Split-C and run on a variety of platforms. Our experimental results are consistent with the theoretical analysis and provide the best known execution times for these two primitives, even when compared with machine-specific implementations. More efficient implementations of Split-C will likely result in even faster execution times.
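The two primitives have small sequential cores. In the parallel setting, each processor would compute a local histogram like the one below and the results would be combined by a reduction, while component merging is typically built on a union-find structure; this is a generic sketch, not the paper's Split-C code.

```python
def histogram(values, num_bins):
    """Sequential histogramming over integer values in [0, num_bins)."""
    h = [0] * num_bins
    for v in values:
        h[v] += 1
    return h

class UnionFind:
    """Union-find with path halving, the core of most connected-components
    merge phases."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra
```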
Journal of Parallel and Distributed Computing, 1998
Previous schemes for sorting on general-purpose parallel machines have had to choose between poor load balancing and irregular communication, or multiple rounds of all-to-all personalized communication. In this paper, we introduce a novel variation on sample sort which uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead. Moreover, unlike previous variations, our algorithm efficiently handles the presence of duplicate values without the overhead of tagging each element with a unique identifier. This algorithm was implemented in Split-C and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2, and the Cray Research T3D. We ran our code using widely different benchmarks to examine the dependence of our algorithm on the input distribution. Our experimental results illustrate the efficiency and scalability of our algorithm across different platforms. In fact, it seems to outperform all similar algorithms known to the authors on these platforms, and its performance is invariant over the set of input distributions unlike previous efficient algorithms. Our results also compare favorably with those reported for the simpler ranking problem posed by the NAS Integer Sorting (IS) Benchmark.
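The sample sort skeleton can be simulated sequentially: draw a random sample, choose p−1 splitters from it, and partition the input into p buckets that are sorted independently. In the parallel version the two all-to-all rounds exchange the sample and the bucket contents; this sketch omits the paper's specific duplicate-handling scheme, and the parameter names are illustrative.

```python
import random
from bisect import bisect_right

def sample_sort(data, p, oversample=8):
    """Sequential simulation of random sample sort for p 'processors'.
    Falls back to a plain sort when the input is smaller than the sample."""
    if len(data) < p * oversample:
        return sorted(data)
    sample = sorted(random.sample(data, p * oversample))
    splitters = [sample[i * oversample] for i in range(1, p)]
    buckets = [[] for _ in range(p)]
    for x in data:
        # bucket i holds keys in (splitters[i-1], splitters[i]]
        buckets[bisect_right(splitters, x)].append(x)
    return [x for bucket in buckets for x in sorted(bucket)]
```

Oversampling keeps the bucket sizes balanced with high probability, which is what gives the good load balancing.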
Journal of Parallel and Distributed Computing, 1999
We describe a methodology for developing high performance programs running on clusters of SMP nodes. Our methodology is based on a small kernel (SIMPLE) of collective communication primitives that make efficient use of the hybrid shared and message passing environment. We illustrate the power of our methodology by presenting experimental results for sorting integers, two-dimensional fast Fourier transforms (FFT), and constraint-satisfied searching. Our testbed is a cluster of DEC AlphaServer 2100 4/275 nodes interconnected by an ATM switch.
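The hybrid structure such primitives exploit can be illustrated with a two-level reduction: combine values within each SMP node first (shared memory), then combine the per-node partial results across nodes (message passing). This is a generic sketch of the idea, not the SIMPLE kernel's API.

```python
def hierarchical_reduce(node_values, op=sum):
    """Two-level reduction over a cluster of SMP nodes.

    `node_values` is a list of per-node value lists. Each node first reduces
    its own values (intra-node, shared memory); the partial results are then
    reduced across nodes (inter-node, message passing).
    """
    partials = [op(values) for values in node_values]  # intra-node phase
    return op(partials)                                # inter-node phase
```

Doing the intra-node phase in shared memory cuts the number of messages on the network from one per value to one per node.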
Journal of Parallel and Distributed Computing, 2001
ACM Journal of Experimental Algorithmics, 1998
We introduce a new deterministic parallel sorting algorithm for distributed memory machines based on the regular sampling approach. The algorithm uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead. Moreover, unlike previous variations, our algorithm efficiently handles the presence of duplicate values without the overhead of tagging each element with a unique identifier. This algorithm was implemented in Split-C and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2-WN, and the Cray Research T3D. We ran our code using widely different benchmarks to examine the dependence of our algorithm on the input distribution. Our experimental results illustrate the efficiency and scalability of our algorithm across different platforms. In fact, the performance compares closely to that of our random sample sort algorithm, which seems to outperform all similar algorithms known to the authors on these platforms.
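What distinguishes regular sampling from random sample sort is the deterministic splitter selection: each processor sorts its block locally and contributes p evenly spaced samples, and the splitters are evenly spaced picks from the combined sorted sample. A sequential sketch of that step, assuming one sorted block per processor:

```python
def regular_sample_splitters(blocks, p):
    """Deterministic splitter selection by regular sampling.

    `blocks` is a list of p locally sorted blocks. Each block contributes
    p evenly spaced samples; the combined sample is sorted, and every p-th
    element is picked as a splitter (p - 1 splitters in total).
    """
    sample = []
    for block in blocks:
        n = len(block)
        sample.extend(block[(i * n) // p] for i in range(p))
    sample.sort()
    return [sample[i * p] for i in range(1, p)]
```

Because the samples are regular rather than random, the resulting bucket-size bound (and hence the load balance) holds deterministically.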
Journal of Experimental Algorithmics, 1996
A fundamental challenge for parallel computing is to obtain high-level, architecture-independent algorithms which efficiently execute on general-purpose parallel machines. With the emergence of message passing standards such as MPI, it has become easier to design efficient and portable parallel algorithms by making use of these communication primitives. While existing primitives allow an assortment of collective communication routines, they do not handle an important communication event when most or all processors have non-uniformly sized personalized messages to exchange with each other. We focus in this paper on the h-relation personalized communication whose efficient implementation will allow high performance implementations of a large class of algorithms. While most previous h-relation algorithms use randomization, this paper presents a new deterministic approach for h-relation personalized communication. As an application, we present an efficient algorithm for stable integer sorting. The algorithms presented in this paper have been coded in Split-C and run on a variety of platforms, including the Thinking Machines CM-5, IBM SP-1 and SP-2, Cray Research T3D, Meiko Scientific CS-2, and the Intel Paragon. Our experimental results are consistent with the theoretical analysis and illustrate the scalability and efficiency of our algorithms across different platforms. In fact, they seem to outperform all similar algorithms known to the authors on these platforms.
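The stable integer sorting application rests on counting sort, whose sequential core is shown below; in the parallel version the counting and the routing of records to their destinations are realized through h-relation personalized communication. The function name is illustrative.

```python
def stable_integer_sort(pairs, num_keys):
    """Stable counting sort of (key, payload) pairs with keys in [0, num_keys).

    Builds a prefix-sum of key counts, then places each record at its final
    position in input order, which preserves stability."""
    count = [0] * (num_keys + 1)
    for key, _ in pairs:
        count[key + 1] += 1
    for k in range(num_keys):
        count[k + 1] += count[k]
    out = [None] * len(pairs)
    for key, payload in pairs:
        out[count[key]] = (key, payload)
        count[key] += 1
    return out
```

Stability matters here because it lets multi-digit radix sorting be built from repeated passes of this primitive.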