Posts

Sqoop vs Flume vs HDFS in Hadoop

Sqoop vs Flume vs HDFS

Purpose
Sqoop: Used for importing data from structured data sources such as an RDBMS.
Flume: Used for moving bulk streaming data into HDFS.
HDFS: A distributed file system used by the Hadoop ecosystem to store data.

Architecture
Sqoop: Connector-based architecture. Connectors know how to connect to the respective data source and fetch the data.
Flume: Agent-based architecture. Code called an 'agent' is written to take care of fetching the data.
HDFS: Distributed architecture in which data is spread across multiple data nodes.

Relationship to HDFS
Sqoop: HDFS is a destination for data imported using Sqoop.
Flume: Data flows to HDFS through zero or more channels.
HDFS: HDFS is the ultimate destination for data storage.

Event handling
Sqoop: Data load is not event driven.
Flume: Data load can be driven by events.
HDFS: Simply stores whatever data is provided to it, by whatever means.

In order to import data from structured data sources, one has to use Sqoop only, because its connector...

What is Sqoop? What is FLUME - Hadoop Tutorial

Before we learn more about Flume and Sqoop, let's study the issues with loading data into Hadoop.

Analytical processing using Hadoop requires loading huge amounts of data from diverse sources into Hadoop clusters. This process of bulk-loading data into Hadoop from heterogeneous sources and then processing it comes with a certain set of challenges. Maintaining data consistency and ensuring efficient utilization of resources are some of the factors to consider before selecting the right approach for data load.

Major Issues:

1. Data load using scripts
The traditional approach of using scripts to load data is not suitable for bulk data load into Hadoop; it is inefficient and very time consuming.

2. Direct access to external data via MapReduce applications
Giving MapReduce applications direct access to data residing on external systems (without loading it into Hadoop) complicates these applications. So, this approach is not feasible.

3. In addition to having a...

What is MapReduce? How it Works - Hadoop MapReduce Tutorial

MapReduce is a programming model suitable for processing huge amounts of data. Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++. MapReduce programs are parallel in nature, and are therefore very useful for performing large-scale data analysis using multiple machines in the cluster.

MapReduce programs work in two phases:

Map phase
Reduce phase

The input to each phase is a set of key-value pairs. In addition, every programmer needs to specify two functions: a map function and a reduce function.

The whole process goes through three phases of execution, namely,

How MapReduce works

Let's understand this with an example. Consider that you have the following input data for your MapReduce program:

Welcome to Hadoop Class Hadoop is good Hadoop is bad

The final output of the MapReduce task is:

bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1

The data goes through fol...
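The word-count example above can be imitated in plain Python to show how a map function and a reduce function cooperate on key-value pairs. This is only a single-process sketch, not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, the grouping step stands in for what the framework does between map and reduce, and splitting the input into three lines is an assumption, since the excerpt shows it as one run of text.

```python
from collections import defaultdict


def map_phase(line):
    """Map function: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key (done by the framework in a real job)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce function: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, grouping, and reduce in sequence inside one process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    results = [reduce_phase(key, values) for key, values in grouped.items()]
    # Case-insensitive sort, matching the order of the output listed above.
    return sorted(results, key=lambda item: item[0].lower())


if __name__ == "__main__":
    # Assumed line breaks; the totals do not depend on where lines are split.
    sample = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]
    for word, count in word_count(sample):
        print(word, count)
```

On a real cluster each call to the map function could run on a different machine, which is why the map and reduce functions here share no state and communicate only through key-value pairs.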

HDFS Tutorial: Read & Write Commands using Java API

Hadoop comes with a distributed file system called HDFS (Hadoop Distributed File System). Hadoop-based applications make use of HDFS. HDFS is designed for storing very large data files and runs on clusters of commodity hardware. It is fault tolerant, scalable, and extremely simple to expand.

Do you know? When data exceeds the storage capacity of a single physical machine, it becomes essential to divide it across a number of separate machines. A file system that manages storage-specific operations across a network of machines is called a distributed file system.

In this tutorial we will learn:

Read Operation
Write Operation
Access HDFS using JAVA API
Access HDFS Using COMMAND-LINE INTERFACE

An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data.

NameNode: The NameNode can be considered the master of the system. It mainta...
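The split between metadata on the NameNode and actual data on the DataNodes can be illustrated with a toy in-memory model. This is a rough sketch under several assumptions and is not the HDFS API: the class and function names (`ToyNameNode`, `ToyDataNode`, `write_file`, `read_file`) are hypothetical, blocks are placed round-robin purely for simplicity, the block size is a few bytes instead of megabytes, and replication is left out entirely.

```python
class ToyNameNode:
    """Holds metadata only: which DataNode stores each block of each file."""

    def __init__(self, datanode_ids):
        self.datanode_ids = list(datanode_ids)
        self.block_map = {}  # file name -> list of DataNode ids, one per block

    def allocate(self, filename, num_blocks):
        # Round-robin placement is an assumption made for this sketch.
        placement = [self.datanode_ids[i % len(self.datanode_ids)]
                     for i in range(num_blocks)]
        self.block_map[filename] = placement
        return placement

    def locate(self, filename):
        return self.block_map[filename]


class ToyDataNode:
    """Holds the actual block contents and knows nothing about whole files."""

    def __init__(self):
        self.blocks = {}  # (file name, block index) -> bytes

    def store(self, filename, index, payload):
        self.blocks[(filename, index)] = payload

    def fetch(self, filename, index):
        return self.blocks[(filename, index)]


def write_file(namenode, datanodes, filename, data, block_size):
    """Split data into fixed-size blocks and store each on its assigned node."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = namenode.allocate(filename, len(blocks))
    for index, (node_id, payload) in enumerate(zip(placement, blocks)):
        datanodes[node_id].store(filename, index, payload)


def read_file(namenode, datanodes, filename):
    """Ask the NameNode where the blocks live, then fetch them in order."""
    placement = namenode.locate(filename)
    return b"".join(datanodes[node_id].fetch(filename, index)
                    for index, node_id in enumerate(placement))


if __name__ == "__main__":
    nodes = {"dn1": ToyDataNode(), "dn2": ToyDataNode(), "dn3": ToyDataNode()}
    master = ToyNameNode(nodes.keys())
    write_file(master, nodes, "/demo.txt", b"Hadoop is good", block_size=4)
    print(master.locate("/demo.txt"))
    print(read_file(master, nodes, "/demo.txt"))
```

The point of the sketch is only the division of labour: the NameNode object never sees file contents, and a reader must consult it before contacting any DataNode.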

Hadoop Setup Tutorial - Installation & Configuration

Prerequisites:

You must have Ubuntu installed and running.
You must have Java installed.

Step 1) Add a Hadoop system user using the commands below:

    sudo addgroup hadoop_
    sudo adduser --ingroup hadoop_ hduser_

Enter your password, name, and other details.

NOTE: The following error may occur during this setup and installation process:

"hduser is not in the sudoers file. This incident will be reported."

This error can be resolved as follows:

Log in as the root user.
Execute the command:

    sudo adduser hduser_ sudo

Re-login as hduser_.

Step 2) Configure SSH

In order to manage nodes in a cluster, Hadoop requires SSH access.

First, switch user by entering the following command:

    su - hduser_

This command will create a new key:

    ssh-keygen -t rsa -P ""

Enable SSH access to the local machine using this key:

    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Now test the SSH setup by connecting to localhost as...