DESIGN OF HADOOP DISTRIBUTED FILE SYSTEM
Dr. C.V. Suresh Babu
(CentreforKnowledgeTransfer) institute
DISCUSSION TOPICS
• Hadoop Distributed File System (HDFS)
• How does HDFS work?
• HDFS Architecture
• Features of HDFS
• Benefits of using HDFS
• Examples: Target Marketing
• HDFS data replication
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
• The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications.
• HDFS employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
• Hadoop itself is an open source distributed processing framework that manages data processing and storage for big data applications.
• HDFS is a key part of many Hadoop ecosystem technologies.
• It provides a reliable means of managing pools of big data and supporting related big data analytics applications.
HOW DOES HDFS WORK?
• HDFS enables the rapid transfer of data between compute nodes.
• It is closely coupled with MapReduce, a data processing framework that filters and divides work among the nodes in a cluster, then organizes and condenses the results into a cohesive answer to a query.
• Similarly, when HDFS ingests data, it breaks the information down into separate blocks and distributes them to different nodes in the cluster.
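The ingestion step above can be sketched in a few lines. This is a toy simulation, not HDFS client code: the node names and the tiny 4-byte block size are illustrative (real HDFS uses a 128 MB default block size), and the round-robin placement stands in for the NameNode's actual placement policy.

```python
BLOCK_SIZE = 4  # bytes; deliberately tiny for demonstration

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Break a payload into fixed-size blocks, as an HDFS client does on write."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def distribute(blocks, nodes):
    """Assign blocks to DataNodes round-robin and return the resulting block map."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

blocks = split_into_blocks(b"hello hdfs world!")
layout = distribute(blocks, ["datanode1", "datanode2", "datanode3"])
print(len(blocks))            # 5 blocks of at most 4 bytes each
print(layout["datanode1"])    # blocks 0 and 3 land on the first node
```

The point of the sketch is that no single node holds the whole file; each DataNode sees only its share of the blocks, which is what makes parallel processing and replication possible.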
HDFS ARCHITECTURE
(Architecture diagram)
FEATURES OF HDFS
• Data replication. This ensures that data is always available and prevents data loss. For example, when a node crashes or there is a hardware failure, replicated data can be pulled from elsewhere within the cluster, so processing continues while the data is recovered.
• Fault tolerance and reliability. HDFS' ability to replicate file blocks and store them across nodes in a large cluster ensures fault tolerance and reliability.
• High availability. As mentioned earlier, because of replication across nodes, data remains available even if the NameNode or a DataNode fails.
• Scalability. Because HDFS stores data on various nodes in the cluster, a cluster can scale to hundreds of nodes as requirements increase.
• High throughput. Because HDFS stores data in a distributed manner, the data can be processed in parallel on a cluster of nodes. This, plus data locality (see next bullet), cuts processing time and enables high throughput.
• Data locality. With HDFS, computation happens on the DataNodes where the data resides, rather than the data moving to where the computational unit is. By minimizing the distance between the data and the computing process, this approach decreases network congestion and boosts the system's overall throughput.
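The fault-tolerance bullets above can be made concrete with a small simulation. This is a hypothetical sketch, not real NameNode logic: it places each block on three distinct nodes (real HDFS placement also accounts for racks and free space) and then checks that every block is still readable after one DataNode fails.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Place each block on `replication` distinct nodes, round-robin.
    A stand-in for the NameNode's rack-aware placement policy."""
    return {
        block: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i, block in enumerate(block_ids)
    }

def readable_blocks(placement, failed_node):
    """Return the blocks that still have at least one live replica."""
    return [block for block, replicas in placement.items()
            if any(node != failed_node for node in replicas)]

nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_replicas(["blk_1", "blk_2", "blk_3"], nodes)
print(readable_blocks(placement, "dn2"))  # all blocks survive the failure
```

With a replication factor of 3, any single-node failure leaves at least two live replicas per block, which is exactly the availability property the bullets describe.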
BENEFITS OF USING HDFS
• Cost effectiveness. The DataNodes that store the data rely on inexpensive off-the-shelf hardware, which cuts storage costs. Also, because HDFS is open source, there is no licensing fee.
• Large data set storage. HDFS stores data of any size, from megabytes to petabytes, and in any format, including structured and unstructured data.
• Fast recovery from hardware failure. HDFS is designed to detect faults and recover automatically on its own.
• Portability. HDFS is portable across hardware platforms and compatible with several operating systems, including Windows, Linux and macOS.
• Streaming data access. HDFS is built for high data throughput, which makes it best suited to streaming data access.
EXAMPLES: TARGET MARKETING
• Targeted marketing campaigns depend on marketers knowing a lot about their target audiences.
• Marketers can get this information from several sources, including CRM systems, direct mail responses, point-of-sale systems, Facebook and Twitter.
• Because much of this data is unstructured, an HDFS cluster is the most cost-effective place to put it before analyzing it.
HDFS DATA REPLICATION
• Data replication is an integral part of HDFS, as it ensures data remains available if there is a node or hardware failure.
• As previously mentioned, data is divided into blocks and replicated across numerous nodes in the cluster.
• Therefore, when one node goes down, the user can access the data that was on that node from other machines.
• HDFS maintains the replication process at regular intervals.
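In practice, the cluster-wide default replication factor described above is controlled by the `dfs.replication` property in `hdfs-site.xml` (its stock default is 3). A minimal configuration fragment:

```xml
<!-- hdfs-site.xml: cluster-wide default replication factor -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default number of replicas kept for each block.</description>
  </property>
</configuration>
```

The replication factor of a file that already exists can also be changed per path with the shell command `hdfs dfs -setrep -w 3 /path/to/file`, where `-w` waits for re-replication to finish.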