Three Approaches to Data Analysis with Hadoop
Related papers
Log Analysis with Hadoop MapReduce
2020
Nearly every part of life now results in the generation of data. Logs are records of events or system activities, created automatically by IT systems, and log data analysis is the process of making sense of these records. Log data often grows quickly, and conventional database solutions fall short when dealing with large volumes of log files. Hadoop, with its wide range of applications in Big Data analysis, provides a solution to this problem. In this study, Hadoop was installed on two virtual machines, and log files generated by a Python script were analyzed in order to evaluate system activities. The aim was to validate the importance of Hadoop in meeting the challenge of dealing with Big Data. The performed experiments show that analyzing logs with Hadoop MapReduce makes data processing and the detection of malfunctions and defects faster and simpler.
Keywords: Hadoop, MapReduce, Big Data, log analysis, distributed file systems.
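The paper's own Python scripts are not reproduced here, but the pattern it describes maps naturally onto Hadoop Streaming. Below is a minimal sketch, assuming log lines whose third whitespace-separated field is a severity token (INFO, WARNING, ERROR); the format and file names are illustrative, not taken from the paper.

```python
#!/usr/bin/env python3
# mapper.py -- emit (severity, 1) for every log line.
# Assumes lines like "2020-01-15 12:00:01 ERROR disk quota exceeded",
# i.e. the third whitespace-separated field is the severity (an assumption).
import sys

for line in sys.stdin:
    parts = line.split()
    if len(parts) >= 3:
        print(f"{parts[2]}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sum counts per severity key.
# Hadoop Streaming sorts mapper output by key, so equal keys arrive
# contiguously and a running total is enough.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current_key and current_key is not None:
        print(f"{current_key}\t{count}")
        count = 0
    current_key = key
    count += int(value)
if current_key is not None:
    print(f"{current_key}\t{count}")
```

Under those assumptions the job could be launched with something like `hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /logs -output /log-counts` (jar path and HDFS directories hypothetical); the sort-by-key guarantee between the map and reduce phases is what the running-total reducer relies on.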
Analysis of Apache Logs Using Hadoop and Hive
TEM Journal, 2018
In this paper we consider an analysis of Apache web logs using the Cloudera Hadoop distribution, with Hive for querying the data in the logs. We used publicly available web logs from the NASA Kennedy Space Center server. HDFS (the Hadoop Distributed File System) served as the container for the logs, which were copied into HDFS from the local file system. We analyzed the total number of hits, the number of unique IPs, the most common hosts that made requests to the NASA server in Florida, and the most common types of errors. We also examined the ratio between the number of rows in the logs and the execution time.
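The paper does not list its exact commands or schema, but the workflow it describes (copy logs into HDFS, then query them with Hive) can be sketched as below. The paths, table name, and column names are hypothetical, and the sketch assumes an external Hive table has already been defined over the log directory.

```python
import subprocess

# Copy the NASA access logs from the local file system into HDFS.
# Paths and file names are hypothetical, not taken from the paper.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data/nasa_logs"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "access_log_Jul95", "/data/nasa_logs/"], check=True)

# Run a Hive query over an external table assumed to be defined on top of
# that directory (table and column names are illustrative).
query = """
SELECT host, COUNT(*) AS hits
FROM nasa_logs
GROUP BY host
ORDER BY hits DESC
LIMIT 10;
"""
subprocess.run(["hive", "-e", query], check=True)
```

Variants of the same query (COUNT(DISTINCT host) for unique IPs, grouping by status code for error types) cover the other statistics the paper reports.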
Analysis of Log Data and Statistics Report Generation Using Hadoop
International Journal of Innovative Research in Computer and Communication Engineering, 2014
A web log analyzer is a tool for deriving statistics about web sites. Through the web log analyzer, web log files are uploaded into the Hadoop distributed framework, where the log files are processed in parallel in a master-slave structure. Pig scripts are written over the classified log files to answer specific queries. The log files themselves are maintained by the web servers, and analyzing them reveals, for example, which IP addresses have generated the most errors or which users visit a web page frequently. This paper discusses these log files, their formats, access procedures, and uses, the additional parameters that can be included in log files to enable more effective mining, and the tools used to process them. It also presents the idea of creating an extended log file and learning user behavior. Analyzing user activities is particularly useful for studying user behavior when using highly interactive ...
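The paper's Pig scripts are not shown; as a stand-in, here is a minimal plain-Python sketch of one of the queries mentioned above (which IP addresses generated the most errors), assuming logs in the Apache Common Log Format. The file name is illustrative.

```python
import re
from collections import Counter

# Apache Common Log Format, e.g.
# 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 404 2326
CLF = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) \S+')

errors_by_ip = Counter()
with open("access.log") as fh:  # hypothetical file name
    for line in fh:
        m = CLF.match(line)
        if m and m.group(2).startswith(("4", "5")):  # 4xx/5xx responses
            errors_by_ip[m.group(1)] += 1

for ip, n in errors_by_ip.most_common(10):
    print(ip, n)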
Analysis of Web Log Data Using Apache Pig in Hadoop
2018
The widespread use of the internet and the growth of web applications have accelerated the rampant expansion of web content. Every organization produces huge amounts of data in different forms, such as text, audio, and video, from multiple sources. The log data stored in web servers is a great source of knowledge, and the real challenge for any organization is to understand the behavior of its customers. Analyzing such web log data helps organizations understand the navigational patterns and interests of their users. As the logs grow in size day by day, existing database technologies face a bottleneck in processing such massive unstructured data. Hadoop provides an effective solution to this problem: the Hadoop framework offers the Hadoop Distributed File System, a reliable distributed store for data, and MapReduce, a distributed parallel processing model for executing computations over large volumes of complex data. The Hadoop ecosystem comprises several other tools, such as Pig, Hive, Flume, and Sqoop, for effective ana...
Web Server Log Processing using Hadoop
Big Data refers to emerging, growing datasets beyond the ability of traditional database tools. Hadoop handles big data by processing massive quantities of information on a cluster of commodity hardware. A web server log file is a text file written as activity is generated by the web server; log files collect a variety of data about information requests to your web server, acting as a visitor sign-in sheet. Server log files can tell you which pages get the most and the least traffic, which sites refer visitors to your site, which pages your visitors view, and which browsers and operating systems are used to access your site. Web server log processing has a bright, vibrant scope in the field of information technology, and it can be enhanced and expanded for use in the many fields that handle enormous amounts of data on a daily basis. It is a reliable, fast, and scalable approach for handling large numbers of logs, transforming log data into statistical data, and generating reports accordingly.
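As an illustration of the statistics listed above, a minimal sketch follows, assuming logs in the Apache Combined Log Format (which carries the referrer and user agent); the file name is hypothetical and the parsing is deliberately simplified.

```python
import re
from collections import Counter

# Apache Combined Log Format adds the referrer and user agent, e.g.
# 10.0.0.1 - - [date] "GET /page HTTP/1.1" 200 1234 "http://referrer" "Mozilla/..."
COMBINED = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+) [^"]*" \d{3} \S+ "([^"]*)" "([^"]*)"'
)

pages, referrers, agents = Counter(), Counter(), Counter()
with open("access.log") as fh:  # hypothetical file name
    for line in fh:
        m = COMBINED.match(line)
        if m:
            pages[m.group(2)] += 1
            referrers[m.group(3)] += 1
            agents[m.group(4)] += 1

print("Most requested pages:", pages.most_common(5))
print("Top referring sites:", referrers.most_common(5))
print("Top user agents:", agents.most_common(5))
```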
IJARCCE, 2019
We live in an on-demand, on-command digital universe, with data proliferating from institutions, individuals, and machines at a very high rate. This data is categorized as "Big Data" due to its sheer volume, variety, and velocity. Most of it is unstructured, quasi-structured, or semi-structured, and it is heterogeneous in nature. The volume and heterogeneity of the data, together with the speed at which it is generated, make it difficult for present computing infrastructure to manage. Traditional data management, warehousing, and analysis systems fall short of tools for analyzing this data. Because of its specific nature, Big Data is stored in distributed file system architectures; Hadoop and HDFS by Apache are widely used for storing and managing it. Analyzing Big Data remains a challenging task, as it involves large distributed file systems.
PATTERN ANALYSIS THROUGH ACCESS LOGS AND ERROR LOGS USING HIVE WITH HADOOP
Log files [3] provide valuable information about the functioning and performance of applications and devices; developers use them to monitor, debug, and troubleshoot errors that may have occurred in an application. Manual processing of log data requires a huge amount of time and can be a tedious task, and the structure of error logs varies from one application to another. Since volume, velocity, and variety are all being dealt with here, Big Data techniques using Hadoop are applied. Analytics [2] involves the discovery of meaningful and understandable patterns in the various types of log files. Error log analytics deals with the conversion of data from a semi-structured to a uniform structured format, so that analytics can be performed over it. Business Intelligence (BI) functions such as predictive analytics are used to predict and forecast the future status of an application based on the current scenario, so that proactive rather than reactive measures can be taken to ensure the efficient maintainability of applications and devices. Log files are an example of semi-structured data: all the activities of web servers, application servers, database servers, operating systems, firewalls, and networking devices are recorded in them.
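The paper does not give its conversion rules, but the step it describes (normalizing differently shaped error-log lines into one structured format) might look like the sketch below; both input formats, the field names, and the file names are invented for illustration.

```python
import csv
import re

# Two hypothetical error-log shapes from different applications:
#   "2019-03-01 10:22:07 ERROR AuthService: token expired"
#   "[01/Mar/2019 10:22:07] [error] [client 10.0.0.5] File does not exist"
PATTERNS = [
    re.compile(r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ERROR (?P<msg>.+)$"),
    re.compile(r"^\[(?P<ts>[^\]]+)\] \[error\] (?P<msg>.+)$"),
]

def normalize(line):
    """Return a uniform {timestamp, message} record, or None if no pattern matches."""
    for pat in PATTERNS:
        m = pat.match(line.strip())
        if m:
            return {"timestamp": m.group("ts"), "message": m.group("msg")}
    return None

# Write the normalized records to CSV so downstream analytics sees one schema.
with open("error.log") as src, open("errors.csv", "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=["timestamp", "message"])
    writer.writeheader()
    for line in src:
        record = normalize(line)
        if record:
            writer.writerow(record)
```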
Log Mining Based on Hadoop's Map and Reduce Technique
2013
In the world of cloud and grid computing, Virtual Database Technology (VDB) is one of the effective solutions for integrating data from heterogeneous sources. Hadoop is a large-scale distributed batch-processing infrastructure designed to distribute large amounts of work efficiently across a set of machines; it is an implementation of MapReduce. This paper proposes an application for deciding where to open a new branch of a pizza chain in a particular area according to hits from customers. We take the log files for a particular website, which are stored on a web mining server. These data are passed to the cloud server for region-wise distribution across the virtual servers, mapping and reduction are performed on the region-wise data, and the final output is sent back to the server and client. This paper utilizes the parallel and distributed processing capability of Hadoop MapReduce to handle heterogeneous query execution on large datasets. So Virtual Databas...
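The paper's region-wise MapReduce job is not reproduced; the selection logic it describes (count hits per region, open the branch where hits peak) can be sketched locally as below. The IP-prefix-to-region table and file name are stand-ins; a real deployment would resolve regions with a GeoIP database and run the counting as a distributed MapReduce job.

```python
from collections import Counter

# Hypothetical mapping from IP prefixes to delivery regions.
REGION_BY_PREFIX = {"10.1.": "north", "10.2.": "south", "10.3.": "east"}

def region_of(ip):
    for prefix, region in REGION_BY_PREFIX.items():
        if ip.startswith(prefix):
            return region
    return "unknown"

hits = Counter()
with open("website.log") as fh:  # hypothetical log, client IP as first field
    for line in fh:
        parts = line.split()
        if parts:
            hits[region_of(parts[0])] += 1

best_region, count = hits.most_common(1)[0]
print(f"Open the new branch in {best_region} ({count} hits)")
```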
A Framework for Event Log Analysis using Hadoop MapReduce
2017
Event log files are among the most common datasets exploited by companies for customer behavior analysis. Oftentimes these records are unordered and need to be grouped by a certain key for effective analysis; one example is grouping the records of the same user that carry different session IDs to facilitate further analysis. This kind of analysis is known as user sessionization. In this paper, we propose a distributed framework combining Hadoop and MapReduce to analyze event log files and sessionize users based on IP address and timestamp.
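The paper's framework itself is not given here, but the rule it names (split a user's events into sessions by IP address and timestamp) is commonly implemented with an inactivity threshold. A minimal local sketch follows, with hard-coded events and an assumed 30-minute gap (the paper does not state its threshold).

```python
from datetime import datetime, timedelta
from itertools import groupby

# Events as (ip, timestamp) pairs; hard-coded here for illustration.
events = [
    ("10.0.0.1", datetime(2017, 5, 1, 9, 0)),
    ("10.0.0.1", datetime(2017, 5, 1, 9, 10)),
    ("10.0.0.1", datetime(2017, 5, 1, 11, 0)),  # >30 min gap: new session
    ("10.0.0.2", datetime(2017, 5, 1, 9, 5)),
]
GAP = timedelta(minutes=30)  # assumed inactivity threshold

# Sort by (ip, time) so each user's events are contiguous and ordered,
# mirroring what the shuffle/sort phase of MapReduce would provide.
events.sort(key=lambda e: (e[0], e[1]))

sessions = []
for ip, group in groupby(events, key=lambda e: e[0]):
    current = []
    for _, ts in group:
        if current and ts - current[-1] > GAP:
            sessions.append((ip, current))
            current = []
        current.append(ts)
    sessions.append((ip, current))

for ip, times in sessions:
    print(ip, [t.isoformat() for t in times])
```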
Analysis of Web Server Log File Using Hadoop
International Journal for Research in Applied Science and Engineering Technology, 2018
Web usage mining is concerned with finding user navigational patterns on the World Wide Web by extracting knowledge from web usage logs. This paper discusses these log files, the tools used to process them, and how they give way to effective mining; it also presents the idea of creating an extended log file and learning user behavior. Analyzing user activities is particularly useful for studying user behavior when using highly interactive systems. The paper presents the details of the methodology used, with a focus on studying the information-seeking process and on finding log errors and exceptions. The final part of the paper describes the workings and techniques of the web log analyzer.