Hadoop Research Papers - Academia.edu

A step-by-step explanation of how to install Hadoop on CentOS.

Companies are beginning to recognise the value of having vast volumes of data available in order to make the best choices in support of their policies. The generation of digital data keeps increasing as emerging technologies, the Internet, security, and social networks develop. The term "Big Data" refers to the heterogeneous mass of digital data generated by businesses and individuals, which, due to its characteristics (large scale, various formats, and processing speed), necessitates the use of specialised and increasingly advanced machine storage and analysis methods. The aim of this essay is to describe Big Data and Hadoop, their principles, problems, and implementations, as well as the significance of Big Data Analytics.

The widespread use of digital images has led to a new challenge in digital image forensics. These images can be used in court as evidence in criminal cases. However, digital images are easily manipulated, which creates the need for a method to verify the authenticity of an image. One such method is identifying the source camera. However, this process takes a large amount of time on traditional desktop computers. To tackle the problem, we aim to increase the performance of the process by implementing it in a distributed computing environment. We evaluate the camera identification process using conditional probability features and Apache Hadoop. The evaluation used 6000 images from six mobile phones of different models and classified them using Apache Mahout, a scalable machine learning tool that runs on Hadoop. We ran the source camera identification process on a cluster of up to 19 computing nodes. The experimental results demonstrate an exponential decrease in processing times and a slight decrease in accuracy as the processes are distributed across the cluster. Our prediction accuracies range from 85% to 95% across varying numbers of mappers.

While advanced analysis of large datasets is in high demand, data sizes have surpassed the capabilities of conventional software and hardware. The Hadoop framework distributes large datasets over multiple commodity servers and performs parallel computations. We discuss the I/O bottlenecks of the Hadoop framework and propose methods for enhancing I/O performance. A proven approach is to cache data to maximize the memory locality of all map tasks. We introduce an approach to optimize I/O, the in-node combining design, which extends the traditional combiner to the node level. The in-node combiner reduces the total number of intermediate results and curtails network traffic between mappers and reducers.
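The effect of moving the combiner from the task level to the node level can be illustrated with a small in-process simulation. This is a minimal sketch, not the authors' implementation: it assumes a word-count job, and the function names (`map_task`, `per_task_combining`, `in_node_combining`) are hypothetical. It only counts how many intermediate records would leave a node under each design.

```python
from collections import Counter


def map_task(split):
    """One map task with a traditional combiner: one (word, count) pair per distinct word."""
    return Counter(split.split())


def per_task_combining(splits_on_node):
    """Every map task emits its own combined output to the reducers."""
    records = []
    for split in splits_on_node:
        records.extend(map_task(split).items())
    return records


def in_node_combining(splits_on_node):
    """Outputs of all map tasks on the node are merged in a shared buffer before emission."""
    node_buffer = Counter()
    for split in splits_on_node:
        node_buffer.update(map_task(split))  # Counter.update adds counts
    return list(node_buffer.items())


# Three input splits processed by three map tasks on the same node (made-up data).
splits = ["a b a", "b c", "a c c"]
per_task = per_task_combining(splits)
per_node = in_node_combining(splits)
print(len(per_task), "records with per-task combiners")
print(len(per_node), "records with an in-node combiner")
```

In this toy input the node-level buffer halves the number of intermediate records, because words shared between splits are merged before anything crosses the network. The saving in a real job depends on how much key overlap exists between the map tasks on a node.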

Pretty much every part of life now results in the generation of data. Logs document events or record system activities and are created automatically by IT systems. Log data analysis is the process of making sense of these records. Log data often grows quickly, and conventional database solutions fall short when dealing with a large volume of log files. Hadoop, which has a wide range of applications in Big Data analysis, provides a solution to this problem. In this study, Hadoop was installed on two virtual machines. Log files generated by a Python script were analyzed in order to evaluate the system activities. The aim was to validate the importance of Hadoop in meeting the challenge of dealing with Big Data. The experiments show that analyzing logs with Hadoop MapReduce makes data processing and the detection of malfunctions and defects faster and simpler. Keywords: Hadoop, MapReduce, Big Data, log analysis, distributed file systems.
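The kind of log analysis described above can be sketched as a map step and a reduce step. The example below is an illustration only: the log line format (date, time, severity level, message) is an assumption, since the abstract does not describe the generated logs, and the shuffle phase is simulated in a single process rather than run on Hadoop.

```python
from collections import defaultdict


def mapper(line):
    """Emit (severity, 1) for each well-formed line; malformed lines are skipped."""
    parts = line.split(maxsplit=3)
    if len(parts) >= 3:
        yield parts[2], 1


def reducer(key, values):
    """Sum the counts collected for one severity level."""
    return key, sum(values)


def run_job(lines):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical log lines in the assumed format.
sample_log = [
    "2024-01-01 12:00:00 ERROR disk full",
    "2024-01-01 12:00:01 INFO service started",
    "2024-01-01 12:00:02 ERROR request timeout",
    "truncated",
]
counts = run_job(sample_log)
print(counts)
```

The same mapper and reducer logic could be adapted to Hadoop Streaming, where each step reads lines from standard input and writes tab-separated key/value pairs to standard output.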

Anti-phishing is an intriguing challenge for Internet users, especially for online business or e-pay users. Tracking phishing is quite difficult because most victims are not aware of phishing attacks until their accounts are compromised and monetary losses occur. Most web browsers provide plug-ins to protect users from phishing websites, but a client-side solution cannot provide detailed forensic information on phishing attacks. In this paper, we propose an offline phishing detection system named LARX (acronym for Large-scale Anti-phishing by Retrospective data-eXploration). LARX uses network traffic data archived at a vantage point and analyzes these data for phishing attacks. All of LARX's phishing filtering operations use cloud computing platforms and work in parallel. As an offline solution for phishing attack detection, LARX can be effectively scaled up to analyze a large volume of trace data when enough computing power and storage capacity are provided.

We present a virtualized setup of a Hadoop cluster that provides greater computing capacity with fewer resources, since a virtualized cluster requires fewer physical machines. The master node of the cluster is set up on a physical machine, and slave nodes are set up on virtual machines (VMs) that may share a common physical machine. Hadoop-configured VM images are created by cloning VMs, which allows nodes to be added to and removed from the cluster quickly and without much overhead. We have also configured the virtualized Hadoop cluster to use the capacity scheduler instead of the default FIFO scheduler. The capacity scheduler schedules tasks based on the availability of RAM and virtual memory (VMEM) in slave nodes before allocating any job. So instead of being queued up, jobs are efficiently allocated to the VMs based on the memory available. Various configuration parameters of Hadoop are analyzed, and the virtualized cluster is fine-tuned to ensure the best performance and maximum scalability.

The cloud is the best method for the utilization and organization of data. The cloud provides many resources to us via the Internet. Many technologies are used in cloud computing systems, and each one uses different protocols and methods. Many tasks that cannot run on a user's own computer can be executed every second on different servers. The most popular technologies used in cloud systems are Hadoop, Dryad, and other MapReduce frameworks. There are also many tools used to optimize the performance of cloud systems, such as Cap3, HEP, and Cloudburst. This paper reviews in detail the cloud computing system, the technologies it uses, and the best technologies to use with it according to multiple factors and criteria, such as processing cost, speed, and pros and cons. Moreover, a comprehensive comparison of the tools used for the utilization of cloud computing systems is presented.

DataFlair's Big Data Hadoop Tutorial PPT for Beginners takes you through various concepts of Hadoop. This Hadoop tutorial PPT covers:
1. Introduction to Hadoop
2. What is Hadoop
3. Hadoop History
4. Why Hadoop
5. Hadoop Nodes
6. Hadoop Architecture
7. Hadoop data flow
8. Hadoop components – HDFS, MapReduce, YARN
9. Hadoop Daemons
10. Hadoop characteristics & features
Related Blogs:
Hadoop Introduction – A Comprehensive Guide: https://goo.gl/QadBS4
Wish to Learn Hadoop & Carve your career in Big Data, Contact us: info@data-flair.training +91-7718877477, +91-9111133369 Or visit our website. https://data-flair.training/

We have implemented a Hadoop compatible framework that helps in detection of suspected hardware failures in DataNodes within a Hadoop cluster and in signalling the nodes accordingly. Based on the status of the various hardware components of the DataNodes, the master node signals the DataNode so that appropriate actions can be taken. In a Hadoop cluster, the knowledge of the network is important for the master node to configure itself to meet varying needs of the DataNodes. It is important to keep track of the current topology of the Hadoop network and track the status of the network services in Hadoop. We have also added the functionality to discover the network topology in a Hadoop cluster. This discovered topology information will be useful in performing load balancing in the network, or in making intelligent decisions in data replication in case of node failures.

Data outsourcing allows data owners to keep their data at untrusted clouds that do not ensure the privacy of data and/or computations. One useful framework for fault-tolerant data processing in a distributed fashion is MapReduce, which was developed for trusted private clouds. This paper presents algorithms for data outsourcing based on Shamir's secret-sharing scheme and for executing privacy-preserving SQL queries such as count, selection (including range selection), projection, and join, using MapReduce as the underlying programming model. The proposed algorithms prevent the untrusted cloud from learning the database or the query, while also preventing output-size and access-pattern attacks. Interestingly, our algorithms do not need the database owner, who only creates and distributes secret shares once, to be involved in answering any query; hence, the database owner cannot learn the query either. We evaluate the efficiency of the algorithms on three parameters: (i) the number of communication rounds (between a user and a cloud), (ii) the total amount of bit flow (between a user and a cloud), and (iii) the computational load at the user side and the cloud side.
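For readers unfamiliar with Shamir's secret-sharing scheme, the sketch below shows the basic primitive: a secret is hidden as the constant term of a random polynomial over a prime field, and any `threshold` shares recover it by Lagrange interpolation. This is a generic illustration, not the query algorithms of the paper; the field size, function names, and parameters are assumptions chosen for the example.

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime, used here as the field modulus (assumption)


def make_shares(secret, threshold, num_shares, rng=random.SystemRandom()):
    """Split `secret` into `num_shares` points of a random degree-(threshold-1) polynomial."""
    coeffs = [secret % PRIME] + [rng.randrange(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner's rule
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


shares_a = make_shares(1234, threshold=3, num_shares=5)
shares_b = make_shares(4321, threshold=3, num_shares=5)
# Shares are additively homomorphic: each server can add its two shares locally.
shares_sum = [(x, (ya + yb) % PRIME) for (x, ya), (_, yb) in zip(shares_a, shares_b)]
print(reconstruct(shares_a[:3]), reconstruct(shares_sum[1:4]))
```

The additive property shown at the end is what makes aggregate-style computation over shares possible without any single server seeing the underlying values. The modular inverse via `pow(den, -1, PRIME)` requires Python 3.8 or later.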

There is a growing trend of applications that need to handle Big Data, as many corporations and organizations are required to collect more data from their operations. Recently, processing Big Data with the MapReduce model has become very popular, because traditional data warehousing solutions are not feasible for handling such datasets. Hadoop provides an environment for executing MapReduce programs over distributed-memory clusters, which supports the processing of large datasets in a distributed computing environment. Information retrieval systems facilitate searching the content of books or journals based on metadata or indexing. An inverted index is a data structure that stores a mapping from content, such as words or numbers, to its locations in one or more documents. In this paper, we propose three different implementations of inverted indexes (Indexer, IndexerCombiner, and IndexerMap) in the Hadoop environment using the MapReduce programming model, and compare their performance to evaluate the impact of factors such as data format and output file format in MapReduce.
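The inverted index described above maps each word to the documents that contain it. The following minimal sketch builds one with a map step and a reduce step simulated in a single process. It does not reproduce the paper's Indexer, IndexerCombiner, or IndexerMap implementations; the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit one (word, doc_id) pair per token."""
    for word in text.lower().split():
        yield word, doc_id


def reduce_postings(word, doc_ids):
    """Reduce step: turn the collected ids into a sorted, de-duplicated postings list."""
    return word, sorted(set(doc_ids))


def build_inverted_index(documents):
    """Simulate map, shuffle (group by word), and reduce in one process."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, emitted_id in map_document(doc_id, text):
            grouped[word].append(emitted_id)
    return dict(reduce_postings(word, ids) for word, ids in grouped.items())


documents = {
    "d1": "hadoop stores data",
    "d2": "MapReduce processes data",
    "d3": "Hadoop runs MapReduce",
}
index = build_inverted_index(documents)
print(index["data"], index["hadoop"])
```

A combiner-based variant would de-duplicate document ids on the map side before the shuffle, which is one way the number of intermediate pairs can be reduced.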

Even though many existing systems implement customer analytics, it is still an emerging and largely unexplored market with great potential for further advancement. Big data is one of the fastest-rising technology trends and has the capability to significantly change the way business organizations analyze customer behaviour and transform it into valuable insights. Decision trees can also be used efficiently for decision-making analysis under uncertainty, which provides a variety of essential results. To this end, we propose a MapReduce implementation of a well-known statistical classifier, the C4.5 decision tree algorithm. The paper additionally explains why C4.5 is preferred over ID3. Apart from this, our system aims to implement customer data visualization using Data-Driven Documents (d3.js), which allows well-customized graphics to be built.
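One commonly cited difference between the two algorithms is the split criterion: ID3 ranks attributes by information gain, whereas C4.5 uses the gain ratio, which divides the gain by the entropy of the split itself and so penalizes attributes with many distinct values. The sketch below computes both on a made-up customer table. It is a single-machine illustration, not the MapReduce implementation proposed in the paper, and the column names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr, target):
    """ID3 criterion: reduction in target entropy after splitting on `attr`."""
    base = entropy([row[target] for row in rows])
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr], []).append(row[target])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return base - remainder


def gain_ratio(rows, attr, target):
    """C4.5 criterion: information gain normalized by the split information."""
    split_info = entropy([row[attr] for row in rows])
    if split_info == 0:
        return 0.0
    return information_gain(rows, attr, target) / split_info


# Made-up customer records; customer_id is unique per row.
customers = [
    {"customer_id": 1, "plan": "basic", "churn": "yes"},
    {"customer_id": 2, "plan": "basic", "churn": "yes"},
    {"customer_id": 3, "plan": "premium", "churn": "no"},
    {"customer_id": 4, "plan": "premium", "churn": "no"},
]
for attr in ("customer_id", "plan"):
    print(attr, information_gain(customers, attr, "churn"), gain_ratio(customers, attr, "churn"))
```

Here information gain cannot distinguish the useless unique identifier from the genuinely predictive `plan` attribute, while the gain ratio ranks `plan` higher.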

In this digitalised world, where all information is stored, data are growing exponentially. It is estimated that the volume of data doubles every two years. Geospatial data are one of the prime contributors to the big data scenario. There are numerous big data analytics tools, but not all of them are capable of handling geospatial big data. This paper discusses two recent and popular open-source geospatial big data analytics tools, SpatialHadoop and GeoSpark, which can be used to analyse and process geospatial big data efficiently. It compares the architectures of SpatialHadoop and GeoSpark. Through this architectural comparison, it also summarises the merits and demerits of these tools according to execution times and the volume of data used.
Index Terms: Big Data; Geospatial big data; GIS; SpatialHadoop; GeoSpark

An ontology is a common place where shared knowledge about common concepts can be found. Applications of ontologies can be seen in medical domains, university databases, and botanical research. The main objective of our research is to propose a job recommendation system based on an ontology. The process consists of constructing the ontology from various job portals and notifying both job seekers and employers of the result set. The key contribution described here is making the ontology dynamic through Slowly Changing Source detection, based on a lookup and update strategy. The lookup identifies whether a change has happened, whereas the update strategy tracks the change in terms of updates, deletions, or insertions in the source. The benefit of our work is that it provides dynamic data management for the constructed ontology. We believe that our method will give the best results in terms of accuracy and time for job notifications.

The computer industry is being challenged to develop methods and techniques for affordable data processing on large datasets at optimum response times. The technical challenges in dealing with the increasing demand to handle vast quantities of data are daunting and on the rise. One recent processing model that offers a more efficient and intuitive way to rapidly process large amounts of data in parallel is MapReduce. It is a framework that defines a template approach to programming for performing large-scale data computation on clusters of machines in a cloud computing environment. MapReduce provides automatic parallelization and distribution of computation across several processors. It hides the complexity of writing parallel and distributed programming code. This paper provides a comprehensive systematic review and analysis of large-scale dataset processing and of dataset handling challenges and requirements in a cloud computing environment, using the MapReduce framework and its open-source implementation, Hadoop. We define requirements for MapReduce systems to perform large-scale data processing. We also present the MapReduce framework and one implementation of this framework on Amazon Web Services. At the end of the paper, we present an experiment in running a MapReduce system in a cloud environment. This paper shows that MapReduce is one of the best techniques for processing large datasets and that it can help developers perform parallel and distributed computation in a cloud environment.

Business Intelligence (BI), or Decision Support Systems (DSS), is IT for decision-makers and business leaders. It describes the means, tools, and methods that make it possible to collect, consolidate, model, and restore a company's data, material or immaterial, in order to offer decision support and give a decision-maker an overview of the activity being treated. Given the large volume, variety, and velocity of data, we have entered the era of Big Data, yet most of today's BI tools are designed to handle structured data. In our research project, we aim to consolidate a BI system for Big Data. This paper is a progress report on our first contribution, which applies the techniques of model engineering to propose a universal approach to dealing with Big Data in order to help decision-makers make decisions.

Big Data is the large volume of data generated by the public web, social media and other networks, business applications, scientific instruments, various types of mobile devices, and different sensor technologies. Data mining involves knowledge discovery from these large datasets. Different types of clustering methods are used in data mining. To process these large amounts of data in an inexpensive and efficient way, new architectures, techniques, algorithms, and analytics are required to manage the data and extract value and hidden knowledge from it. Hadoop is the core platform for storing large volumes of data in the Hadoop Distributed File System (HDFS), and that data is processed in parallel by the MapReduce model. Hadoop is designed to scale up from a single server to thousands of machines with a very high degree of fault tolerance. This paper presents a survey of big data, issues with big data, clustering, and how Hadoop works.

This document gives detailed instructions for working with the Hadoop HDFS file system. I have explained every available command.

Because big data is too large to handle easily, it requires special technology. Hadoop is the Apache Foundation's framework that aims to provide efficient storage and analytics for big data; it is also open-source software. Two core technologies are associated with Hadoop: HDFS and MapReduce. HDFS stands for Hadoop Distributed File System; it is a special file system that provides efficient storage for big data on a cluster of commodity hardware and is based on a streaming access pattern. The HDFS cluster architecture is based on a distributed file system and therefore has a client-server architecture. Since an HDFS cluster has no way to check the authenticity of a client, a method to incorporate the Kerberos protocol between the client and the HDFS cluster is proposed to secure the system. Kerberos is a network authentication protocol that provides secure communication between client and server over an unsecured network. Moreover, an agent has been incorporated that is authorized to access the client's buffer, take data out of it, and load it into the HDFS cluster.

A flexible, efficient, and secure networking architecture is required in order to process big data. However, existing network architectures are mostly unable to handle big data. As big data pushes network resources to their limits, it results in network congestion, poor performance, and detrimental user experiences. This paper presents the current state-of-the-art research challenges and possible solutions in big data networking theory. More specifically, we present the state of networking issues of big data related to capacity, management, and data processing. We also present the architectures of the MapReduce and Hadoop paradigm along with research challenges, fabric networks, and software-defined networks (SDN) that are used to handle today's rapidly growing digital world, and we compare and contrast them to identify relevant problems and solutions.

The tremendous growth of data on the Internet and its prevalence as a common communication medium have resulted in the evolution of web data. The continuing growth of the World Wide Web and online text collections makes a large volume of information available to users. This information overload means that users either spend significant time browsing all the information or miss useful information. Web forums are online bulletin boards, the simplest way for like-minded people with a common interest to come together and share their thoughts and opinions online. But the explosion in the number of Internet users has made it very difficult to draw an efficient conclusion about a given query because of the immense number of different answers. This paper focuses on the efficient summarization of terabytes, perhaps petabytes, of such forum web pages to get an optimal final result.
The main concept used in tackling the problem of massive amounts of redundant forum data is MapReduce [9], [19]. It was developed by Google in 2004 [9] and has proved to be a very efficient way to handle huge amounts of data. It uses the idea of divide and conquer to split the problem into many parts and then tackle each part individually. It has been widely used for services such as Google News [16]. Its ability to provide a distributed processing environment makes MapReduce a very efficient method for handling web data. In this paper, the inverted index for the forum data is generated using a MapReduce framework, Hadoop. The inverted index is then used with the given query to filter out the relevant documents, which are then summarized, ranked, and displayed according to weights assigned to each document using statistical approaches. The results show that this method is able to produce a coherent summary at an average length of ranked webpages, which is at a compression rate of 60%.

IEEE Hadoop Big Data Project Titles 2017 | 2018 IEEE Big Data Projects
A Scalable Data Chunk Similarity based Compression Approach for Efficient Big Sensing Data Processing on Cloud
A Systematic Approach Toward Description and Classification of Cybercrime Incidents
Achieving Efficient and Privacy-Preserving Cross-Domain Big Data Deduplication in Cloud
aHDFS: An Erasure-Coded Data Archival System for Hadoop Clusters
Big Data Privacy in Biomedical Research
Cost-Aware Big Data Processing across Geo-distributed Datacenters
Disease Prediction by Machine Learning over Big Data from Healthcare Communities
DyScale: a MapReduce Job Scheduler for Heterogeneous Multicore Processors
Efficient Processing of Skyline Queries Using MapReduce
FiDoop-DP: Data Partitioning in Frequent Itemset Mining on Hadoop Clusters
Hadoop MapReduce for Mobile Clouds
Mining Human Activity Patterns from Smart Home Big Data for Healthcare Applications
PPHOPCM: Privacy-preserving High-order Possibilistic c-Means Algorithm for Big Data Clustering with Cloud Computing
Practical Privacy-Preserving MapReduce Based K-means Clustering over Large-scale Dataset
Public Interest Analysis Based on Implicit Feedback of IPTV Users
Ring: Real-Time Emerging Anomaly Monitoring System over Text Streams
Robust Big Data Analytics for Electricity Price Forecasting in the Smart Grid
Scalable Uncertainty-Aware Truth Discovery in Big Data Social Sensing Applications for Cyber-Physical Systems
Self-Adjusting Slot Configurations for Homogeneous and Heterogeneous Hadoop Clusters
Service Rating Prediction by Exploring Social Mobile Users’ Geographical Locations

Assemblages of large and complex datasets are generally called Big Data. These large datasets cannot be stored, managed, or analyzed easily by current tools and methodologies because of their size and complexity. However, such datasets provide various opportunities, such as modelling or predicting the future. This overwhelming growth of data is now coupled with various new challenges. The massive rate of data growth has created exciting opportunities for researchers in the upcoming years. In this paper, we discuss the topic in detail, sushisen algorithms in HEPMASS dataset, its current scenario, characteristics, and challenges in forecasting the future. This paper also discusses the tools and technologies used to manage these large datasets, as well as some of the problems and essential issues, such as security management and data privacy.

Hadoop is a framework to implement MapReduce. used between the Job Tracker and HDFS for overlapping the execution and write operation. It provides load balancing and improves data locality, which maximizes throughput and minimizes delay. Experiments were conducted on web log files from NASA and academic websites using the Click Count and Sessionization applications. The experimental results show that the data write operation is 1.49 times faster than with the conventional method, and that turnaround time using the Dynamic method improves by 12.4% for Sessionization and 25.6% for Click Count compared with conventional Hadoop MapReduce. With the proposed Dynamic method, an average throughput of 23.74 mbps for Sessionization and 27.7 mbps for Click Count is obtained.

NoSQL databases outperform traditional RDBMSs due to their faster retrieval of large volumes of data, scalability, and high performance. The need for these databases has been increasing in recent years because data collection is growing tremendously. NoSQL allows the storage of structured, unstructured, and semi-structured data, which is not possible in a traditional database. However, NoSQL pays for its faster data access and large data storage with weaker security features. The main concern is the sensitive information stored in the data. Protecting this sensitive data is crucial for confidentiality and privacy. To understand the importance of preserving sensitive data, it is necessary to recognize the security issues. These security issues, if not resolved, will cause data loss, unauthorized access, database crashes caused by hackers, and security breaches. This paper investigates the security issues common to the top twenty NoSQL databases of the following types: document, key-value, column, graph, object-oriented, and multi-model. The top twenty NoSQL databases studied were MongoDB,

Co-locating the computation as close as possible to the data is an important consideration in current data-intensive systems. This is known as the data locality problem. In this paper, we analyze the impact of data locality on YARN, the resource manager of the new version of Hadoop. We investigate the behavior of the YARN delay scheduler with respect to data locality for a variety of workloads and configurations. We address three problems related to data locality. First, we study the trade-off between data locality and job completion time. Secondly, we observe that there is an imbalance of resource allocation when considering data locality, which may under-utilize the cluster. Thirdly, we address the redundant I/O operations that occur when different YARN containers request input data blocks on the same node. Additionally, we propose the YARN Locality Simulator (YLocSim), a simulator tool that simulates the interactions between YARN components in a real cluster and reports the data locality percentages in real time. We validate YLocSim against a real cluster setup and use it in our study.
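
As a rough illustration of what a data locality percentage measures, the sketch below classifies container placements into the three locality levels YARN distinguishes (node-local, rack-local, off-switch) and reports the share of each. It is not YLocSim; the topology, the node names, and the assignment list are hypothetical.

```python
from collections import Counter

# Hypothetical topology: node name -> rack name.
RACK_OF = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}

LEVELS = ("node-local", "rack-local", "off-switch")


def locality_level(container_node, block_nodes):
    """Classify one container placement against the nodes holding its input block."""
    if container_node in block_nodes:
        return "node-local"
    block_racks = {RACK_OF[node] for node in block_nodes}
    if RACK_OF[container_node] in block_racks:
        return "rack-local"
    return "off-switch"


def locality_percentages(assignments):
    """Return the share of placements at each locality level, in percent."""
    counts = Counter(locality_level(node, blocks) for node, blocks in assignments)
    total = sum(counts.values())
    return {level: 100.0 * counts[level] / total for level in LEVELS}


# Each entry: (node the container ran on, nodes holding replicas of its input block).
ASSIGNMENTS = [
    ("n1", {"n1", "n3"}),  # replica on the same node
    ("n2", {"n1"}),        # replica on the same rack only
    ("n3", {"n1", "n2"}),  # replicas on another rack
    ("n4", {"n4"}),        # replica on the same node
]

percentages = locality_percentages(ASSIGNMENTS)
```

A delay scheduler raises the node-local share by letting a container wait briefly for a preferred node, which is where the trade-off with job completion time comes from.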

In today's age of information technology, processing data is a very important issue. Even storage measured in terabytes and petabytes is no longer sufficient for large databases. The data is too big, moves too fast, or doesn't fit the structures of current database architectures. Big Data is typically a large volume of unstructured and structured data created by various organized and unorganized applications and activities such as emails, web logs, Facebook, etc. The main difficulties with Big Data include capture, storage, search, sharing, analysis, and visualization. Hence companies today use Hadoop in their applications. Even sufficiently large data warehouses are unable to satisfy the needs of data storage. Hadoop is designed to store large data sets reliably. It is open source software that supports parallel and distributed data processing. Along with reliability and scalability features, Hadoop also provides fault tolera...

A process that handles a large amount of data is affected in its operation. Because of this large growth in data, industries are struggling to store, handle, and analyse it. Conventional database systems are not adequate for these activities. This is where Hadoop comes in: it stores and processes enormous data effectively and efficiently. The Hadoop ecosystem includes many frameworks for data integration, management, orchestration, monitoring, data serialization, data intelligence, storage, and access. One of them is Sqoop, a command-line interface application for transferring data between relational databases and Hadoop, used for both import and export. Another tool, Ambari, simplifies Hadoop management and is used for provisioning, managing, and monitoring Apache Hadoop clusters. In this paper the Sqoop and Ambari frameworks are analysed against various parameters.

Data quality comes from consistency, which depends on a reliable source. An ontology represents common knowledge shared among various users in a distributed manner. The literature describes ontologies in domains such as medicine, biotechnology, and automobiles. All of these ontologies allow users at various levels to share knowledge, but automatic ontology construction, change management that tracks modifications at the source, and the updating of changes that leads to dynamic ontologies are all very complex. This article explains automatic ontology construction in a simple and efficient manner, along with a flexible method of change management. Dynamic ontology construction is the leading aspect of the article: changes detected at the source are merged into the ontology. The aim of the research is to build a recommendation system from job portal data that helps recruiters and job seekers in job searches; moreover, there is currently no common framework of activities covering automatic ontology construction, change tracking to identify modifications in source data, and construction of dynamic ontologies. We believe that the work outperforms other ontology construction methods. The innovation is a framework that holds all of these activities: ontology construction, change management, and dynamic ontology generation.

MapReduce has gained remarkable significance as a prominent parallel data processing tool in the research community, academia, and industry with the spurt in the volume of data to be analyzed. MapReduce is used in applications such as data mining and data analytics where massive data analysis is required, but it is still being explored with respect to parameters such as performance and efficiency. This survey explores large-scale data processing using MapReduce and its various implementations to help database, research, and other communities develop a technical understanding of the MapReduce framework. Different MapReduce implementations are explored and their inherent features are compared on different parameters. The survey also addresses the open issues and challenges raised by a fully functional DBMS/data warehouse on MapReduce. The various MapReduce implementations are compared with the most popular implementation, Hadoop, and with similar implementations on other platforms.

Nowadays, the stock market attracts more and more attention with its challenging risks and high returns. A stock exchange market channels savings and investments in ways that increase the effectiveness of the national economy. Future stock returns have some predictive relationship with publicly available information on present and historical stock market indices. ARIMA is a statistical model known to be efficient for time series forecasting, especially short-term prediction. In this paper, we propose a model for forecasting stock market trends based on technical analysis using historical stock market data and the ARIMA model. This model automates the prediction of the direction of future stock price indices and helps financial specialists choose better timing for purchasing and selling stocks. The results are presented as visualizations using the R programming language. They show that the ARIMA model has strong potential for short-term prediction of stock market trends.
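
For reference, the general ARIMA(p, d, q) model can be written in its standard textbook form as follows. The abstract does not state which orders p, d, and q the authors selected, so none are assumed here. B is the backshift operator (B y_t = y_{t-1}), the phi_i are the autoregressive coefficients, the theta_j are the moving-average coefficients, and epsilon_t is white noise.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^{i}\right) (1 - B)^{d} \, y_t
  = \left(1 + \sum_{j=1}^{q} \theta_j B^{j}\right) \varepsilon_t
```

The factor (1 - B)^d differences the series d times to remove trend, which is why the model suits non-stationary series such as price indices.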

Pretty much every part of life now results in the generation of data. Logs are documentation of events or records of system activities and are created automatically by IT systems. Log data analysis is the process of making sense of these records. Log data often grows quickly, and conventional database solutions fall short when dealing with a large volume of log files. Hadoop, which has a wide range of applications in Big Data analysis, provides a solution to this problem. In this study, Hadoop was installed on two virtual machines. Log files generated by a Python script were analyzed in order to evaluate the system activities. The aim was to validate the importance of Hadoop in meeting the challenge of dealing with Big Data. The experiments show that analyzing logs with Hadoop MapReduce makes data processing and the detection of malfunctions and defects faster and simpler.

Hadoop is a software framework that supports data-intensive distributed applications. Hadoop creates clusters of machines and coordinates the work among them. It includes two major components, HDFS (Hadoop Distributed File System) and MapReduce. HDFS is designed to store large amounts of data reliably and to provide high availability of data to user applications running at the client. It splits data into multiple blocks and stores each block redundantly across a pool of servers to enable reliable, extremely rapid computation. MapReduce is a software framework for analyzing and transforming a very large data set into the desired output. This paper introduces Hadoop and describes the types of Hadoop, the architecture of HDFS and MapReduce, and the benefits of HDFS and MapReduce.
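
The block-and-replica idea described above can be sketched in a few lines. The block size (128 MB) and replication factor (3) are the HDFS defaults in Hadoop 2 and later. The round-robin placement and the DataNode names are simplifying assumptions, since real HDFS placement is rack-aware.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default dfs.blocksize in Hadoop 2 and later
REPLICATION = 3                 # default dfs.replication


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks that a file of file_size bytes occupies."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Deliberately simplified: HDFS itself uses a rack-aware placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return [
        [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    ]


# A 300 MB file stored on four hypothetical DataNodes.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Here the file occupies two full blocks and one 44 MB block, and losing any single node still leaves two replicas of every block.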

The book “Grid and Cloud Computing” explores how large-scale scientific problems can be solved through Grid and Cloud Computing. It presents virtualization as a fundamental concept of cloud computing. It provides a preliminary study of Grid Computing and then a detailed study of Cloud Computing, covering features such as security, virtualization, and environment setup. It provides step-by-step procedures for setting up a Grid environment (Globus Toolkit), a Cloud environment (OpenNebula), and a Hadoop environment on Ubuntu Linux for a Grid and Cloud Computing laboratory, along with sample programs and guidelines.

SQL is a set-based, keyword-based declarative programming language, not an imperative programming language like C or BASIC, for accessing and manipulating database systems. This research paper covers the basic concepts of SQL with its advantages, disadvantages, and architecture, and introduces Apache Hive with its features, advantages, disadvantages, and architecture. It also introduces HiveQL and compares SQL with HiveQL.

Nowadays, companies are starting to realize the importance of having data available in large amounts in order to make the right decisions and support their strategies. With the development of new technologies, the Internet, and social networks, the production of digital data is constantly growing. The term "Big Data" refers to the heterogeneous mass of digital data produced by companies and individuals whose characteristics (large volume, different forms, speed of processing) require specific and increasingly sophisticated computer storage and analysis tools. This article defines Big Data and its concepts, challenges, and applications, as well as the importance of Big Data Analytics.

Twitter, a micro-blogging service, generates a large amount of data every minute because it gives people the chance to express their thoughts and feelings quickly and clearly about any topic. Obtaining the desired information from this big data requires high-performance parallel computing tools supported by machine learning algorithms. Emerging big data processing frameworks (e.g. Hadoop) can handle such data effectively. In this paper, we first introduce a novel approach that uses the MapReduce algorithm to automatically classify Twitter data obtained from the British Geological Survey (BGS), collected using specific keywords such as landslide, landslides, mudslide, landfall, land-slip, and soil sliding, by tweet post date and by the country where each tweet was posted. We then propose a model that determines whether tweets are landslide-related using the Naïve Bayes machine learning algorithm with an n-gram language model on Mahout. This paper also describes an algorithm for the pre-processing steps that make the semi-structured Twitter text data ready for classification. The proposed methods let the BGS and other interested people see the names and number of countries from which the tweets are sent, the number of tweets sent from each country, and the dates and time intervals of the tweets, and classify whether the tweets are related to landslides.

In today's world, many players in digital technology produce endless quantities of data. Sensors, social networks, and e-commerce all generate information that grows in real time along Gartner's 3Vs: Volume, Velocity, and Variability.
Optimizing the processing time of database operations has always been a widely studied subject for both developers and researchers. The emergence of distributed systems has steered studies and research toward increasingly optimal resolution techniques that divide the work into a set of tasks distributed across different nodes or clusters.
With the emergence of data-intensive applications, approaches for evaluating distributed operations based on the MapReduce programming model have gained popularity. Nevertheless, these approaches do not avoid the problem of imbalance in attribute values in the first phase of the MapReduce model.
To address this question, and after a study of the proposed techniques that handle database operations (join, selection, and projection) with MapReduce, the present work proposes an approach to balance processing loads across different machines. This approach exploits the mechanism of the second version of distributed MapReduce processing, MapReduce 2.
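
The abstract does not describe the balancing mechanism itself. As a generic illustration of the skew problem only, the sketch below compares plain hash partitioning with key salting, a common technique that is swapped in here and is not the author's method. The reducer count, the key names, and the 90/10 skew are hypothetical.

```python
import zlib
from collections import Counter

NUM_REDUCERS = 4


def base_partition(key, num_reducers=NUM_REDUCERS):
    """Deterministic hash partitioner (crc32 rather than Python's randomized hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def plain_loads(keys):
    """Records per reducer when every key goes to its hash partition."""
    loads = Counter(base_partition(key) for key in keys)
    return [loads[r] for r in range(NUM_REDUCERS)]


def salted_loads(keys, hot_keys):
    """Records per reducer when hot keys are spread with a rotating salt."""
    loads = Counter()
    seen = Counter()
    for key in keys:
        salt = 0
        if key in hot_keys:
            salt = seen[key] % NUM_REDUCERS  # rotate through all reducers
            seen[key] += 1
        loads[(base_partition(key) + salt) % NUM_REDUCERS] += 1
    return [loads[r] for r in range(NUM_REDUCERS)]


# Skewed input: one attribute value accounts for 90 of 100 records.
KEYS = ["hot"] * 90 + [f"k{i}" for i in range(10)]
before = plain_loads(KEYS)
after = salted_loads(KEYS, hot_keys={"hot"})
```

With plain partitioning one reducer receives all 90 hot records. With salting they are split almost evenly, at the cost that a join must replicate the matching rows of the other relation to every salted partition.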

The book “Information Management” introduces the basics of information, database design, and modelling, and addresses issues in information governance and integration. It provides a first study of core relational database design and modelling. It gives an overview of creating, maintaining, and evaluating the performance of Big Data environments such as master data management and data warehouses.

Data has always been one of the most valuable resources for organizations. From it we can extract information and, with enough information on a subject, we can build knowledge. However, the data must first be stored for later processing. Over the last decades we have witnessed what has been called the “information explosion”. With the advent of new technologies, the volume, velocity, and variety of data have increased exponentially, becoming what is known today as big data. Telecommunications operators gather, using network monitoring equipment, millions of network event records, the Call Detail Records (CDRs) and the Event Detail Records (EDRs), commonly known as xDRs. These records are stored and later processed to compute network performance and quality of service metrics. With the ever-increasing number of telecommunications subscribers, the volume of generated xDRs that must be stored and processed has increased exponentially, so the current solutions based on relational databases are no longer suitable and operators face a big data problem. To handle that problem, many contributions in recent years have resulted in solid and innovative solutions. Among them, Hadoop and its vast ecosystem stand out. Hadoop integrates new methods of storing and processing high volumes of data in a robust and cost-effective way, using commodity hardware.
This dissertation presents a platform that enables current systems that insert data into relational databases to keep doing so transparently when migrating to Hadoop. Like the relational databases, the platform has to give delivery guarantees, support unique constraints, and be fault tolerant.
As a proof of concept, the developed platform was integrated with Altaia, a system specifically designed for computing performance and quality of service metrics from xDRs. The performance tests showed that the platform fulfills and exceeds the requirements for the record insertion rate. The tests also evaluated the behaviour of the platform when inserting duplicated records and in failure scenarios. The results for both situations were as expected.