XML Parsing with Map Reduce

www.edureka.co/big-data-and-hadoop
Hadoop: the ultimate data storage and processing, together

Objectives
At the end of this module, you will be able to:
» Analyze different use cases where MapReduce is used
» Differentiate between the traditional way and the MapReduce way
» Learn about Hadoop 2.x MapReduce architecture and components
» Understand the execution flow of a YARN MapReduce application
» Implement basic MapReduce concepts
» Run a MapReduce program

Where MapReduce is Used?
• Weather Forecasting
» Problem statement: Find the maximum temperature recorded in a year.
• Healthcare
» Problem statement: De-identify personal health information.
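
The maximum-temperature problem can be sketched as a map step and a reduce step. This is a plain-Python illustration, not Hadoop code; the "year,temperature" record layout, the sample values, and the function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample records in the form "year,temperature".
RECORDS = ["1949,111", "1949,78", "1950,0", "1950,22", "1950,-11"]


def map_record(line):
    """Map step: parse one record and emit a (year, temperature) pair."""
    year, temp = line.split(",")
    yield year, int(temp)


def reduce_max(year, temps):
    """Reduce step: keep the highest temperature seen for one year."""
    return year, max(temps)


def max_temperature_by_year(lines):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for line in lines:
        for year, temp in map_record(line):
            grouped[year].append(temp)
    return dict(reduce_max(y, ts) for y, ts in sorted(grouped.items()))


result = max_temperature_by_year(RECORDS)
```

The grouping dictionary plays the role that the framework's shuffle would play on a cluster.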

Where MapReduce is Used?
[Mind map centered on MapReduce]
» Features: large-scale distributed model, function, design pattern, parallel programming, a program model
» Used in: classification, analytics, recommendation, index and search
» Examples: classification (top-N records), analytics (join, selection), recommendation (sort), summarization (inverted index)
» Implemented by: Google, Apache Hadoop
» For: HDFS, Pig, Hive, HBase

The Traditional Way
[Diagram: Very Big Data is divided into four pieces of split data; a separate grep runs on each piece and produces matches; cat combines the partial results into "All matches".]

MapReduce Way
[Diagram: Very Big Data is divided into four pieces of split data; the MapReduce framework runs MAP over the splits and REDUCE over the map output to produce "All matches".]
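
The split, grep, and cat pipeline maps naturally onto a map step and a reduce step. The sketch below runs in a single process and only mirrors the shape of the computation; the helper names and the sample lines are hypothetical.

```python
import re


def split_data(lines, n_splits):
    """Divide the input into at most n_splits roughly equal chunks."""
    size = max(1, -(-len(lines) // n_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_grep(chunk, pattern):
    """Map step: emit every line of one split that matches the pattern."""
    regex = re.compile(pattern)
    return [line for line in chunk if regex.search(line)]


def reduce_concat(partial_matches):
    """Reduce step: concatenate the per-split matches (the role of cat)."""
    return [line for part in partial_matches for line in part]


LINES = ["error: disk full", "ok", "error: timeout", "ok", "warning", "error: retry"]
splits = split_data(LINES, 3)
all_matches = reduce_concat(map_grep(chunk, r"^error") for chunk in splits)
```

On a cluster, each call to `map_grep` would run where its split is stored, which is the part the framework takes over from hand-written scripts.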

MapReduce Paradigm
The Overall MapReduce Word Count Process
» Input: Deer Bear River / Car Car River / Deer Car Bear
» Splitting (K1, V1): "Deer Bear River"; "Car Car River"; "Deer Car Bear"
» Mapping, List(K2, V2): (Deer, 1) (Bear, 1) (River, 1); (Car, 1) (Car, 1) (River, 1); (Deer, 1) (Car, 1) (Bear, 1)
» Shuffling, (K2, List(V2)): Bear, (1,1); Car, (1,1,1); Deer, (1,1); River, (1,1)
» Reducing, List(K3, V3): Bear, 2; Car, 3; Deer, 2; River, 2
» Final result: Bear, 2; Car, 3; Deer, 2; River, 2
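
The word count stages can be traced in a few lines of plain Python. This is a single-process sketch that uses the three input lines from the slide; the function names are illustrative and are not part of any Hadoop API.

```python
from collections import defaultdict

# The three splits from the word count example.
SPLITS = ["Deer Bear River", "Car Car River", "Deer Car Bear"]


def mapping(split):
    """Mapping: emit (word, 1) for every word in one split."""
    return [(word, 1) for word in split.split()]


def shuffling(mapped):
    """Shuffling: group all values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for pairs in mapped:
        for word, one in pairs:
            grouped[word].append(one)
    return dict(sorted(grouped.items()))


def reducing(grouped):
    """Reducing: sum the list of ones for each word."""
    return {word: sum(ones) for word, ones in grouped.items()}


mapped = [mapping(s) for s in SPLITS]
grouped = shuffling(mapped)
counts = reducing(grouped)
```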

Anatomy of a MapReduce Program
MapReduce operates on key-value pairs:
» Map: (K1, V1) -> List(K2, V2)
» Reduce: (K2, List(V2)) -> List(K3, V3)
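
The two signatures can be written down as type hints on a small in-memory driver. This is a sketch under the assumption that all data fits in one process; `run_job` and the inverted-index example around it are hypothetical names, not Hadoop classes.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")


def run_job(
    records: Iterable[Tuple[K1, V1]],
    mapper: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],
    reducer: Callable[[K2, List[V2]], Iterable[Tuple[K3, V3]]],
) -> List[Tuple[K3, V3]]:
    """Apply mapper to each (K1, V1), group by K2, then apply reducer."""
    grouped: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            grouped[k2].append(v2)
    output: List[Tuple[K3, V3]] = []
    for k2 in sorted(grouped):  # reducers see keys in sorted order
        output.extend(reducer(k2, grouped[k2]))
    return output


# Example: a tiny inverted index (word -> documents that contain it).
def index_mapper(doc_id, text):
    for word in text.split():
        yield word, doc_id


def index_reducer(word, doc_ids):
    yield word, sorted(set(doc_ids))


DOCS = [("d1", "car river"), ("d2", "deer car")]
index = run_job(DOCS, index_mapper, index_reducer)
```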

Why MapReduce?
Two biggest advantages:
» Taking processing to the data
» Processing data in parallel
[Diagram labels: Data Center, Rack, Node, HDFS Block (a, b, c), Map Task]

Hadoop 2.x MapReduce Components
• Client
» Submits a MapReduce job
• Resource Manager
» Cluster-level resource manager
» Long life, high-quality hardware
• Node Manager
» One per data node
» Monitors resources on the data node
• Job History Server
» Maintains information about submitted MapReduce jobs after their ApplicationMaster terminates
• ApplicationMaster
» One per application
» Short life
» Coordinates and manages MapReduce jobs
» Negotiates with the Resource Manager to schedule tasks
» The tasks are started by NodeManager(s)
• Container
» Created by the NM when requested
» Allocates a certain amount of resources (memory, CPU, etc.) on a slave node

YARN – Moving beyond MapReduce
Application types that run on YARN:
» BATCH (MapReduce)
» INTERACTIVE (Tez)
» ONLINE (HBase)
» STREAMING (Storm, S4, …)
» GRAPH (Giraph)
» IN-MEMORY (Spark)
» HPC MPI (OpenMPI)
» OTHER (Search) (Weave..)
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html

MapReduce Application Execution
Executing a MapReduce application on YARN

www.edureka.co/big-data-and-hadoop YARN MR Application Execution Flow MapReduce Job Execution » Job Submission » Job Initialization » Tasks Assignment » Memory Assignment » Status Updates » Failure Recovery

YARN MR Application Execution Flow
[Diagram: Client; Resource Manager on the management node; HDFS; a Node Manager hosting the App Master on a data node; Node Managers hosting map tasks next to their blocks on data node-1 and data node-2]
1. Notify start application
2. Get new application ID
3. Prepare the application submit context
   3.1 App JAR
   3.2 Job resources (block locations)
   3.3 User information
4. Submit application context
5. Start AppMaster container / allocate context for AppMaster
6. Allocate container for AppMaster
7. Request resources
8. Notify with resource availability
9. Start container in the worker node
10. NM allocates container

YARN MR Application Execution Flow
11. The tasks are executed.
12. If the job has any reducers, the AppMaster again requests the Node Manager to allocate and start a container for them.
13. The output of all the maps is given to the reducers, and the reducers are executed.
14. Once the job is finished, the Application Master notifies the Resource Manager and the client library.
15. The Application Master is closed.

Hadoop 2.x: YARN Workflow
[Diagram: the Resource Manager contains the Scheduler and the Applications Manager (AsM). Twelve Node Managers host App Master 1 with Containers 1.1 and 1.2, and App Master 2 with Containers 2.1, 2.2, and 2.3.]

Summary: Application Workflow
Execution sequence:
1. Client submits an application
2. RM allocates a container to start AM
3. AM registers with RM
4. AM asks for containers from RM
5. AM notifies NM to launch containers
6. Application code is executed in the container
7. Client contacts RM/AM to monitor the application's status
8. AM unregisters with RM
[Diagram: Client, RM, NM, and AM connected by arrows numbered 1 to 8]
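
The eight-step sequence can be recorded as data, which makes it easy to ask which component initiates which step. This is only a model of the slide, not of the YARN protocol itself; the `Step` class and the sender and receiver assigned to step 6 are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    number: int
    sender: str
    receiver: str
    action: str


# Steps as listed in the summary; step 6 sender/receiver are assumed.
WORKFLOW = [
    Step(1, "Client", "RM", "submit application"),
    Step(2, "RM", "NM", "allocate container to start AM"),
    Step(3, "AM", "RM", "register"),
    Step(4, "AM", "RM", "ask for containers"),
    Step(5, "AM", "NM", "launch containers"),
    Step(6, "NM", "Container", "execute application code"),
    Step(7, "Client", "RM/AM", "monitor application status"),
    Step(8, "AM", "RM", "unregister"),
]


def steps_started_by(component):
    """Return the step numbers that the given component initiates."""
    return [s.number for s in WORKFLOW if s.sender == component]


am_steps = steps_started_by("AM")
client_steps = steps_started_by("Client")
```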

Input Splits
INPUT DATA is divided in two ways:
» Physical division: HDFS blocks
» Logical division: input splits

Relation Between Input Splits and HDFS Blocks
[Diagram: a file of lines 1 to 11 laid across block boundaries and grouped into three splits]
• Logical records do not fit neatly into the HDFS blocks.
• Here the logical records are lines, and some lines cross a block boundary.
• The first split contains line 5 even though that line spans two blocks.
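
One way a line-oriented reader can give every line to exactly one split is sketched below: a reader whose split does not start at offset 0 skips its first (possibly partial) line, and every reader finishes the line that is in progress at the end of its split. This is a simplified plain-Python model, not Hadoop's implementation; the function names and the sample data are hypothetical.

```python
def make_splits(total, size):
    """Cut [0, total) into (start, length) ranges of at most size bytes."""
    return [(s, min(size, total - s)) for s in range(0, total, size)]


def read_split(data, start, length):
    """Return (byte offset, line) records owned by one split."""
    end = start + length
    pos = start
    if start != 0:
        # The previous split's reader owns the line in progress here.
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1
    records = []
    # "<= end": a line that begins exactly at the boundary is read here,
    # because the next reader always skips its first line.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        line_end = len(data) if nl == -1 else nl
        records.append((pos, data[pos:line_end].decode("utf-8")))
        pos = line_end + 1
    return records


DATA = b"line1\nline2\nline3\nline4\n"  # 24 bytes, 6 bytes per line
records_per_split = [read_split(DATA, s, n) for s, n in make_splits(len(DATA), 10)]
```

With 10-byte splits, "line2" starts in the first split and ends in the second, and it is read only by the first reader.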

MapReduce Job Submission Flow
» Input data is distributed to nodes
» Each map task works on a “split” of data
» Mapper outputs intermediate data
» Data is exchanged between nodes in a “shuffle” process
» Intermediate data of the same key goes to the same reducer
» Reducer output is stored
[Diagram: INPUT DATA feeds Map on Node 1 and Node 2, which feed Reduce on Node 1 and Node 2]

Combiner
• Block 1
» Mapper input: B C D E D B
» Mapper output: (B,1) (C,1) (D,1) (E,1) (D,1) (B,1)
» Combiner output: (B,2) (C,1) (D,2) (E,1)
• Block 2
» Mapper input: D A A C B D
» Mapper output: (D,1) (A,1) (A,1) (C,1) (B,1) (D,1)
» Combiner output: (D,2) (A,2) (C,1) (B,1)
• Shuffle: (A, [2]) (B, [2,1]) (C, [1,1]) (D, [2,2]) (E, [1])
• Reducer output: (A,2) (B,3) (C,2) (D,4) (E,1)
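
The combiner example can be reproduced with the same two blocks. The sketch below is plain Python with illustrative function names; it shows that the combiner applies the reduce logic locally to each mapper's output, so fewer pairs reach the shuffle.

```python
from collections import defaultdict

BLOCK1 = "B C D E D B".split()
BLOCK2 = "D A A C B D".split()


def mapper(block):
    """Emit (key, 1) for every item in the block."""
    return [(key, 1) for key in block]


def combiner(pairs):
    """Sum values per key on the mapper side, in first-seen key order."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())


def shuffle(outputs):
    """Group the combined values by key across all mappers."""
    grouped = defaultdict(list)
    for pairs in outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reducer(grouped):
    """Sum the partial counts for each key."""
    return [(key, sum(values)) for key, values in grouped.items()]


combined = [combiner(mapper(block)) for block in (BLOCK1, BLOCK2)]
shuffled = shuffle(combined)
final = reducer(shuffled)
```

Without the combiner, twelve pairs would be shuffled; with it, eight are.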

Partitioner – Redirecting Output from Mapper
[Diagram: three Map tasks, each followed by a Partitioner that routes its output to one of three Reducers]
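
A hash-based partitioner can be sketched as "hash of the key, modulo the number of reducers", which is the idea behind Hadoop's default hash partitioning. In this plain-Python illustration, `zlib.crc32` stands in for the key's hash code (an assumption, chosen because it is stable across runs), and the function names are hypothetical.

```python
import zlib


def partition(key, num_reducers):
    """Pick the reducer index for a key."""
    # crc32 is an assumed stand-in for the key's hash code; the mask
    # keeps the value non-negative before taking the modulo.
    return (zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF) % num_reducers


def route(pairs, num_reducers):
    """Send each (key, value) pair to the bucket of its reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


PAIRS = [("Deer", 1), ("Bear", 1), ("River", 1), ("Car", 1), ("Car", 1), ("Deer", 1)]
buckets = route(PAIRS, 3)
```

Because the partition depends only on the key, every pair with the same key lands in the same bucket, whichever mapper produced it.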

Getting Data to the Mapper
[Diagram: two input files are divided into four input splits; each input split is read by a RecordReader that feeds one Mapper; each Mapper emits intermediates]

Partition and Shuffle
[Diagram: four Mappers emit intermediates; a Partitioner after each Mapper assigns them to partitions; the partitioned intermediates are shuffled to three Reducers]

Demo
Demo of Word Count Program, to illustrate the default input format (TextInputFormat)

Input Format
[Diagram: the InputFormat divides the input file into input splits; each input split has a Record Reader that feeds a Mapper, which emits intermediates]

Input Format – Class Hierarchy (org.apache.hadoop.mapreduce)
InputFormat<K,V>
» FileInputFormat<K,V>
   » CombineFileInputFormat<K,V>
   » TextInputFormat
   » KeyValueTextInputFormat
   » NLineInputFormat
   » SequenceFileInputFormat<K,V>
      » SequenceFileAsBinaryInputFormat
      » SequenceFileAsTextInputFormat
      » SequenceFileInputFilter<K,V>
» ComposableInputFormat<K,V> <<interface>>
   » CompositeInputFormat<K,V>
» DBInputFormat<T>
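
The practical difference between input formats is the (key, value) pair that the mapper receives for the same line of text. The plain-Python sketch below imitates two of the formats in the hierarchy: a text-style reader that uses the byte offset as the key, and a key-value style reader that splits each line at the first separator (a tab is assumed here). The function names are hypothetical.

```python
def text_records(text):
    """Text-style records: key = byte offset of the line, value = the line."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))


def key_value_records(text, separator="\t"):
    """Key-value style records: split each line at the first separator."""
    for line in text.splitlines():
        key, _, value = line.partition(separator)
        yield key, value


SAMPLE = "a\t1\nbb\t2\n"
as_text = list(text_records(SAMPLE))
as_key_value = list(key_value_records(SAMPLE))
```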

Output Format
[Diagram: each of three Reducers passes its output to a RecordWriter, provided by the OutputFormat, which writes one output file per Reducer]

Output Format – Class Hierarchy (org.apache.hadoop.mapreduce)
OutputFormat<K,V>
» FileOutputFormat<K,V>
   » TextOutputFormat<K,V>
   » SequenceFileOutputFormat<K,V>
      » SequenceFileAsBinaryOutputFormat
» DBOutputFormat<K,V>
» NullOutputFormat<K,V>
» FilterOutputFormat<K,V>
   » LazyOutputFormat<K,V>
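
A record writer turns each reducer output pair into bytes in an output file. The sketch below imitates a text-style writer that puts one "key, separator, value" line per record into a file-like object; the tab separator and the function name are assumptions for this illustration.

```python
import io


def write_text_records(pairs, out, separator="\t"):
    """Write one line per (key, value) pair and return the record count."""
    count = 0
    for key, value in pairs:
        out.write(f"{key}{separator}{value}\n")
        count += 1
    return count


buffer = io.StringIO()  # stands in for one reducer's output file
written = write_text_records([("Bear", 2), ("Car", 3)], buffer)
output_text = buffer.getvalue()
```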

Demo
Demo: Custom Input Format
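
The demo code itself is not part of these slides. As a rough idea of what a custom input format for XML has to do, the sketch below scans for a start tag and an end tag and hands each complete element to the mapper as one record. It is plain Python, not the Hadoop API; the tag names, the sample document, and the function names are all hypothetical, and keys are character offsets.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input.
SAMPLE = (
    "<employees>"
    "<employee><name>Asha</name><dept>ops</dept></employee>"
    "<employee><name>Ravi</name><dept>dev</dept></employee>"
    "</employees>"
)


def xml_records(data, start_tag, end_tag):
    """Yield (offset, fragment) for each start_tag...end_tag element."""
    pos = 0
    while True:
        begin = data.find(start_tag, pos)
        if begin == -1:
            return
        finish = data.find(end_tag, begin + len(start_tag))
        if finish == -1:
            return  # an incomplete trailing record is dropped
        finish += len(end_tag)
        yield begin, data[begin:finish]
        pos = finish


def map_employee(offset, fragment):
    """Map step: parse one XML record and emit (dept, name)."""
    element = ET.fromstring(fragment)
    yield element.findtext("dept"), element.findtext("name")


pairs = [
    kv
    for offset, fragment in xml_records(SAMPLE, "<employee>", "</employee>")
    for kv in map_employee(offset, fragment)
]
```

Treating a whole element as one record keeps each mapper call self-contained, which a line-by-line reader cannot guarantee for XML.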
