Srinivas Vemuri - Academia.edu
Papers by Srinivas Vemuri
Submission of an original paper with copyright agreement and authorship responsibility. I (corresponding author) certify that I have participated sufficiently in the conception and design of this work and the analysis of the data (wherever applicable), as well as the writing of the manuscript, to take public responsibility for it. I believe the manuscript represents valid work. I have reviewed the final version of the manuscript and approve it for publication. Neither this manuscript nor one with substantially similar content under my authorship has been published or is being considered for publication elsewhere, except as described in an attachment. Furthermore, I attest that I shall produce the data upon which the manuscript is based for examination by the editors or their assignees, if requested. Thanking you.
Citeseer
Using a DBFS SecureFiles Store File System.
Proceedings of the VLDB Endowment, Aug 1, 2014
Analytics on Big Data is critical to derive business insights and drive innovation in today's Internet companies. Such analytics involve complex computations on large datasets, and are typically performed on MapReduce based frameworks such as Hive and Pig. However, in our experience, these systems are still quite limited in performing at scale. In particular, calculations that involve complex joins and aggregations, e.g. statistical calculations, scale poorly on these systems. In this paper we propose novel primitives for scaling such calculations. We propose a new data model for organizing datasets into calculation data units that are organized based on user-defined cost functions. We propose new operators that take advantage of these organized data units to significantly speed up joins and aggregations. Finally, we propose strategies for dividing the aggregation load uniformly across worker processes that are very effective in avoiding skews and reducing (or in some cases even removing) the associated overheads. We have implemented all our proposed primitives in a framework called Rubix, which has been in production at LinkedIn for nearly a year. Rubix powers several applications and processes TBs of data each day. We have seen remarkable improvements in speed and cost of complex calculations due to these primitives.
This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited. The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing. If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, the following notice is applicable: U.S. GOVERNMENT END USERS: Oracle programs, including any operating system, integrated software, a...
Proceedings of the VLDB Endowment, 2014
Analytics on Big Data is critical to derive business insights and drive innovation in today's Internet companies. Such analytics involve complex computations on large datasets, and are typically performed on MapReduce based frameworks such as Hive and Pig. However, in our experience, these systems are still quite limited in performing at scale. In particular, calculations that involve complex joins and aggregations, e.g. statistical calculations, scale poorly on these systems. In this paper we propose novel primitives for scaling such calculations. We propose a new data model for organizing datasets into calculation data units that are organized based on user-defined cost functions. We propose new operators that take advantage of these organized data units to significantly speed up joins and aggregations. Finally, we propose strategies for dividing the aggregation load uniformly across worker processes that are very effective in avoiding skews and reducing (or in some cases even...