Three Approaches to Data Analysis with Hadoop

Log Analysis with Hadoop MapReduce

2020

Nearly every part of life now generates data. Logs are records of events or system activities that IT systems create automatically, and log data analysis is the process of making sense of these records. Log data often grows quickly, and conventional database solutions fall short when dealing with a large volume of log files. Hadoop, which is widely applied in Big Data analysis, provides a solution to this problem. In this study, Hadoop was installed on two virtual machines. Log files generated by a Python script were analyzed in order to evaluate system activities. The aim was to validate the importance of Hadoop in meeting the challenge of dealing with Big Data. The experiments show that analyzing logs with Hadoop MapReduce makes data processing and the detection of malfunctions and defects faster and simpler. Keywords: Hadoop, MapReduce, Big Data, log analysis, distributed file systems.
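
As a rough illustration of the MapReduce pattern this abstract refers to, the following Python sketch counts log entries per severity level. It runs in a single process rather than on a Hadoop cluster. The log line layout (`<date> <time> <LEVEL> <message>`) and the names `map_line`, `shuffle`, `reduce_counts`, and `run_job` are assumptions made for illustration, not details taken from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_line(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (level, 1) for each line that has a level field.

    Assumed (hypothetical) layout: '<date> <time> <LEVEL> <message>'.
    Lines with fewer than three fields are skipped.
    """
    parts = line.split(maxsplit=3)
    if len(parts) >= 3:
        yield parts[2], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence, in one process."""
    pairs = (pair for line in lines for pair in map_line(line))
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())
```

On a real cluster the map and reduce functions would run on different nodes and the framework would perform the shuffle; here the three steps are kept as separate functions only to make that division visible.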

Analysis of Apache Logs Using Hadoop and Hive

TEM Journal, 2018

In this paper we analyze Apache web logs using the Cloudera Hadoop distribution, with Hive for querying the data in the web logs. We used publicly available web logs from the NASA Kennedy Space Center server. HDFS (Hadoop Distributed File System) was used as the log container. The Apache web logs were copied to HDFS from the local file system. We analyzed the total number of hits, the unique IPs, the most common hosts that made requests to the NASA server in Florida, and the most common types of errors. We also examined the ratio between the number of rows in the logs and the execution time.
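
The metrics listed in this abstract (total hits, unique hosts, most common hosts, most common error types) can be sketched in plain Python as a stand-in for the Hive queries, which the abstract does not reproduce. The sketch assumes the logs follow the Apache Common Log Format and treats 4xx and 5xx status codes as errors; the names `CLF_PATTERN` and `summarize` are hypothetical.

```python
import re
from collections import Counter
from typing import Any, Dict, Iterable

# Common Log Format: host ident user [timestamp] "request" status size
CLF_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(.*)" (\d{3}) (\S+)$')


def summarize(lines: Iterable[str], top_n: int = 3) -> Dict[str, Any]:
    """Compute hit, host, and error statistics over access log lines.

    Lines that do not match the Common Log Format are ignored.
    """
    hits = 0
    hosts: Counter = Counter()
    errors: Counter = Counter()
    for line in lines:
        match = CLF_PATTERN.match(line.strip())
        if match is None:
            continue
        host, _timestamp, _request, status, _size = match.groups()
        hits += 1
        hosts[host] += 1
        # Assumption: any 4xx or 5xx response counts as an error.
        if status.startswith(("4", "5")):
            errors[status] += 1
    return {
        "total_hits": hits,
        "unique_hosts": len(hosts),
        "top_hosts": hosts.most_common(top_n),
        "error_statuses": errors.most_common(),
    }
```

In Hive, each of these figures would typically be a separate aggregate query over a table defined on the log files; the single pass above is only meant to show what is being counted.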

Analysis of Log Data and Statistics Report Generation Using Hadoop

International Journal of Innovative Research in Computer and Communication Engineering, 2014

A web log analyser is a tool for finding the statistics of web sites. Through the web log analyser, web log files are uploaded into the Hadoop distributed framework, where the log files are processed in parallel in a master-slave structure. Pig scripts are written on the classified log files to satisfy certain queries. The log files are maintained by the web servers. Analysing these log files gives an idea about the users, such as which IP address has generated the most errors and which user visits a web page frequently. This paper discusses these log files, their formats, access procedures, and uses, the additional parameters that can be used in the log files, which in turn give way to effective mining, and the tools used to process the log files. It also provides the idea of creating an extended log file and learning the user behaviour. Analysing the user activities is particularly useful for studying user behaviour when using highly interactive ...

Analysis of Web Log Data Using Apache Pig in Hadoop

2018

The widespread use of the internet and the growing number of web applications accelerate the rapid growth of web content. Every organization produces a huge amount of data in different forms, such as text, audio, and video, from multiple sources. The log data stored in web servers is a great source of knowledge. The real challenge for any organization is to understand the behavior of its customers. Analyzing such web log data helps organizations understand the navigational patterns and interests of their users. As the logs grow in size day by day, existing database technologies face a bottleneck in processing such massive unstructured data. Hadoop provides a suitable solution to this problem. The Hadoop framework comes with the Hadoop Distributed File System, a reliable distributed store for data, and MapReduce, a distributed parallel processing model for large volumes of complex data. The Hadoop ecosystem consists of several other tools, such as Pig, Hive, Flume, and Sqoop, for effective ana...

Web Server Log Processing using Hadoop

Big Data refers to emerging, growing datasets beyond the ability of traditional database tools. Hadoop handles big data by processing massive quantities of information on clusters of commodity hardware. A web server log file is a text file that is written as activity is generated by the web server. Log files collect a variety of data about information requests to your web server. Server logs act as a visitor sign-in sheet. Server log files can show which pages get the most and the least traffic, which sites refer visitors to your site, which pages your visitors view, and which browsers and operating systems are used to access your site. Web server log processing has broad scope in the field of information technology. It can be enhanced and expanded for use in various fields that handle enormous amounts of data on a daily basis. It is a reliable, fast, and scalable approach for handling large numbers of logs, transforming log data into statistical data, and generating reports accordingly.

Data Analysis with Hadoop

IJARCCE, 2019

We live in an on-demand, on-command digital universe, with data proliferating from institutions, individuals, and machines at a very high rate. This data is categorized as "Big Data" due to its sheer volume, variety, and velocity. Most of this data is unstructured, quasi-structured, or semi-structured, and it is heterogeneous in nature. The volume and heterogeneity of the data, together with the speed at which it is generated, make it difficult for present computing infrastructure to manage Big Data. Traditional data management, warehousing, and analysis systems fall short of tools to analyze this data. Due to the specific nature of Big Data, it is stored in distributed file system architectures. Hadoop and HDFS by Apache are widely used for storing and managing Big Data. Analyzing Big Data is a challenging task, as it involves large distributed file systems.

PATTERN ANALYSIS THROUGH ACCESS LOGS AND ERROR LOGS USING HIVE WITH HADOOP

Log files [3] provide valuable information about the functioning and performance of applications and devices. These files are used by the developer to monitor, debug, and troubleshoot the errors that may have occurred in the application. Manual processing of log data requires a huge amount of time, and hence it can be a tedious task. The structure of error logs varies from one application to another. Since Volume, Velocity, and Variety are being dealt with here, Big Data using Hadoop is used. Analytics [2] involves the discovery of meaningful and understandable patterns from the various types of log files. Error Log Analytics deals with the conversion of data from a semi-structured to a uniform structured format, such that Analytics can be performed over it. Business Intelligence (BI) functions such as Predictive Analytics are used to predict and forecast the future status of the application based on the current scenario. Proactive measures can be taken rather than reactive measures in order to ensure efficient maintainability of the applications and the devices. Log files are an example of semi-structured data. All the activities of web servers, application servers, database servers, operating systems, firewalls, and networking devices are recorded in these log files.

Log Mining Based on Hadoop's Map and Reduce Technique

2013

In the world of cloud and grid computing, Virtual Database Technology (VDB) is one of the effective solutions for integrating data from heterogeneous sources. Hadoop is a large-scale distributed batch processing infrastructure designed to efficiently distribute large amounts of work across a set of machines. Hadoop is an implementation of MapReduce. This paper proposes an application for deciding where to open a new pizza branch in a particular area according to hits from customers. In this paper we take the log files for a particular website, which are stored on a web mining server. These data are passed on to the cloud server for region-wise distribution on the virtual servers. Mapping and reduction are done on these region-wise data. The final output is then sent back to the server and client. This paper utilizes the parallel and distributed processing capability of Hadoop MapReduce for handling heterogeneous query execution on large datasets. So Virtual Databas...

A Framework for Event Log Analysis using Hadoop MapReduce

2017

Event log files are among the most common datasets exploited by companies for customer behavior analysis. These records are often unordered and need to be grouped by a certain key for effective analysis. One such example is grouping similar users with different session IDs to facilitate further analysis. This kind of analysis is known as user sessionization. In this paper, we propose a distributed framework combining Hadoop and MapReduce to analyze event log files and sessionize users based on IP address and timestamp.
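
A minimal single-process sketch of sessionization by IP address and timestamp follows. The abstract does not state how session boundaries are chosen, so the 30-minute inactivity timeout is an assumption, as are the names `Event` and `sessionize`. Grouping by IP plays the role of the MapReduce key, and the sort-and-split loop plays the role of the reduce step.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, datetime]  # (ip_address, timestamp)


def sessionize(
    events: Iterable[Event],
    timeout: timedelta = timedelta(minutes=30),  # assumed inactivity gap
) -> Dict[str, List[List[datetime]]]:
    """Group unordered events into per-IP sessions.

    A new session starts whenever the gap between two consecutive
    events from the same IP is longer than `timeout`.
    """
    # "Map/shuffle": collect timestamps under their IP address key.
    by_ip: Dict[str, List[datetime]] = defaultdict(list)
    for ip, timestamp in events:
        by_ip[ip].append(timestamp)

    # "Reduce": order each IP's events and split them on long gaps.
    sessions: Dict[str, List[List[datetime]]] = {}
    for ip, stamps in by_ip.items():
        stamps.sort()
        ip_sessions = [[stamps[0]]]
        for timestamp in stamps[1:]:
            if timestamp - ip_sessions[-1][-1] > timeout:
                ip_sessions.append([timestamp])
            else:
                ip_sessions[-1].append(timestamp)
        sessions[ip] = ip_sessions
    return sessions
```

Keying on IP address alone merges distinct users behind a shared address; a fuller key (for example IP plus user agent) would be a design choice outside what the abstract describes.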

Analysis of Web Server Log File Using Hadoop

International Journal for Research in Applied Science and Engineering Technology, 2018

Web usage mining is concerned with finding user navigational patterns on the World Wide Web by extracting knowledge from web usage logs. The paper discusses the log files, which in turn give way to effective mining, and the tools used to process them. It also provides the idea of creating an extended log file and learning the user behavior. Analyzing the user activities is particularly useful for studying user behavior when using highly interactive systems. This paper presents the details of the methodology used, in which the focus is on studying the information-seeking process and on finding log errors and exceptions. The next part of the paper describes the working and techniques used by the web log analyzer.