HBase is a column-oriented, open-source NoSQL database management system derived from Google's Bigtable, designed to handle vast amounts of structured and semi-structured data. It features a horizontally scalable architecture and automatic failure support, and it is optimized for fast querying and data processing, making it suitable for industries such as telecommunications, medicine, and e-commerce. HBase runs on top of HDFS, using a distributed architecture in which regions are managed by region servers and coordinated by ZooKeeper.

What’s in it for you?
1. What is HBase?
2. HBase Use Case
3. Applications of HBase
4. HBase vs RDBMS
5. HBase Storage
6. HBase Architectural Components
7. Demo on HBase

Introduction to HBase
Back in the day, data volumes were small and the data was mostly structured. This data could easily be stored in a relational database (RDBMS).
Then the Internet evolved, and huge volumes of structured and semi-structured data were generated. Storing and processing this data in an RDBMS became a major problem.
Apache HBase was the solution for this.

HBase History
1. Nov 2006: Google released the paper on Bigtable
2. Feb 2007: HBase prototype was created as a Hadoop contribution
3. Oct 2007: First usable HBase was released along with Hadoop 0.15.0
4. Jan 2008: HBase became a subproject of Hadoop
5. Oct 2008 – Sep 2009: HBase 0.18.1, 0.19.0 and 0.20.0 were released
6. May 2010: HBase became an Apache top-level project

What is HBase?
HBase is a column-oriented database management system derived from Google’s NoSQL database Bigtable that runs on top of HDFS.
1. Open-source project that is horizontally scalable
2. NoSQL database written in Java that supports fast querying
3. Well suited for sparse data sets (which can contain missing or NA values)

HBase Use Case
A telecommunication company that provides mobile voice and multimedia services across China generated billions of Call Detail Records (CDRs).
Problem: Storing and analyzing billions of call records in real time was a major problem. Traditional database systems were unable to scale up to the vast volumes of data and provide a cost-effective solution.
Solution: HBase stores billions of rows of detailed call records and processes them quickly. HBase itself has no native SQL interface, so SQL queries over these records rely on a layer on top of it, such as Apache Phoenix or Apache Hive.

Applications of HBase
Medical
• HBase is used for storing genome sequences
• Storing the disease history of people or an area
E-Commerce
• HBase is used for storing logs of customer search history
• Performs analytics and targeted advertising for better business insights
Sports
• HBase stores match details and the history of each match
• Uses this data for better prediction

HBase vs RDBMS
HBase | RDBMS
Does not have a fixed schema (schema-less); defines only column families | Has a fixed schema that describes the structure of the tables
Works well with structured and semi-structured data | Works well with structured data
Can hold denormalized data (can contain missing or NA values) | Designed around normalized data
Built for wide tables that can be scaled horizontally | Built for thin tables that are hard to scale

Features of HBase
• Scalable: data can be spread across many nodes because it is stored in HDFS
• Consistent read and write: HBase provides consistent reads and writes of data
• Java API for client access: provides an easy-to-use Java API for clients
• Automatic failure support: a Write Ahead Log (WAL) across the cluster provides automatic support against failure
• Block cache and Bloom filters: supports block cache and Bloom filters for high-volume query optimization

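The Bloom filter feature is easier to picture with a small example. A Bloom filter answers "definitely not here" or "possibly here" for a key, which lets a reader skip files that cannot contain the requested row. The following is a generic, minimal Python sketch of that idea, not HBase's own implementation; the class name, bit-array size, number of hashes and salted SHA-256 scheme are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: reports 'definitely absent' or 'possibly present'."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key was never added; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


# Hypothetical row keys stored in one file.
file_filter = BloomFilter()
for row_key in ("row-001", "row-002", "row-003"):
    file_filter.add(row_key)

print(file_filter.might_contain("row-002"))  # True: added keys are never missed
```

A filter never misses a key that was added, so a `False` answer is enough to skip the file without reading it.
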
HBase column-oriented storage
Each row is identified by a row key. Columns are grouped into column families, each column inside a family is named by a column qualifier, and the value stored at a given row and column is a cell.
Example with two column families, "Personal data" and "Professional data":
Rowid (empid) | Personal data: name | Personal data: city | Personal data: age | Professional data: designation | Professional data: salary
1 | Angela | Chicago | 31 | Data Analyst | $70,000
2 | Dwayne | Boston | 35 | Web Developer | $65,000
3 | David | Seattle | 29 | Big Data Architect | $55,000

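One way to picture the row key, column family, column qualifier and cell layout is as nested maps. The sketch below models the employee example in plain Python; it is only a mental model, not how HBase stores data on disk, and the shortened family names `personal` and `professional`, the helper functions and the fourth row are hypothetical.

```python
# Toy model: row key -> column family -> column qualifier -> cell value.
# "personal" and "professional" are shortened, hypothetical family names.
table = {
    "1": {
        "personal": {"name": "Angela", "city": "Chicago", "age": "31"},
        "professional": {"designation": "Data Analyst", "salary": "$70,000"},
    },
    "2": {
        "personal": {"name": "Dwayne", "city": "Boston", "age": "35"},
        "professional": {"designation": "Web Developer", "salary": "$65,000"},
    },
    "3": {
        "personal": {"name": "David", "city": "Seattle", "age": "29"},
        "professional": {"designation": "Big Data Architect", "salary": "$55,000"},
    },
}


def put_cell(table, row_key, family, qualifier, value):
    """Write one cell, creating the row and family entries on demand."""
    table.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value


def get_cell(table, row_key, family, qualifier):
    """Read one cell; a cell that was never written simply does not exist."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)


# A hypothetical fourth row with a single cell: nothing is stored for the rest.
put_cell(table, "4", "personal", "name", "Priya")

print(get_cell(table, "2", "personal", "city"))        # Boston
print(get_cell(table, "4", "professional", "salary"))  # None
```

Because missing cells take no space in this model, it also shows why the layout suits sparse data sets.
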
HBase Architectural Components
[Diagram: an HMaster and three Region Servers on top of HDFS. Each Region Server holds an HLog and its Regions; each Region contains Stores, and each Store has a MemStore and StoreFiles (HFiles).]
• HBase Master (HMaster) assigns regions and handles load balancing
• ZooKeeper is used for monitoring
• Region Servers serve data for reads and writes

HBase Architectural Components - Regions
[Diagram: a client sends a get request to Region Server 1 and Region Server 2, which host Regions 1 to 4; each region spans the rows from its startKey to its endKey.]
• HBase tables are divided horizontally by row key range into "Regions"
• A region contains all rows in the table between the region's start key and end key
• Regions are assigned to the nodes in the cluster, called "Region Servers"
• These servers serve data for reads and writes

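Finding the region for a row key amounts to a search over sorted start keys. The Python sketch below illustrates this with a binary search; the split points, the mapping of regions to servers and the `locate` helper are hypothetical, and it treats the start key as inclusive and the end key as exclusive.

```python
from bisect import bisect_right

# Hypothetical split points: each region covers [start_key, next_start_key).
# The empty string marks the start of the first region.
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_NAMES = ["Region 1", "Region 2", "Region 3", "Region 4"]
REGION_TO_SERVER = {
    "Region 1": "Region Server 1",
    "Region 2": "Region Server 1",
    "Region 3": "Region Server 2",
    "Region 4": "Region Server 2",
}


def locate(row_key):
    """Return the (region, server) pair whose key range contains row_key."""
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    region = REGION_NAMES[index]
    return region, REGION_TO_SERVER[region]


print(locate("apple"))  # ('Region 1', 'Region Server 1')
print(locate("g"))      # ('Region 2', 'Region Server 1'), start key is inclusive
print(locate("zebra"))  # ('Region 4', 'Region Server 2')
```

Keeping rows sorted by key is what makes this lookup cheap and lets a range scan touch only the regions that overlap the requested range.
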
HBase Architectural Components - HMaster
[Diagram: a client sends create, delete and update table requests to the HMaster, which monitors Region Server 1 and Region Server 2 and assigns Regions 1 to 4 to them.]
• Region assignment and Data Definition Language (DDL) operations (create, delete) are handled by the HMaster
• The HMaster assigns and re-assigns regions for recovery or load balancing, and monitors all region servers
HBase is a distributed environment in which the HMaster alone is not sufficient to manage everything. Hence, ZooKeeper was introduced.

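The assignment and re-assignment duties of the HMaster can be sketched with a simple least-loaded policy. This Python example is only an illustration under that assumption: the function names are hypothetical, and HBase's real balancer weighs far more than region counts.

```python
def assign_regions(regions, servers):
    """Spread regions over servers, always picking the least-loaded server."""
    assignment = {server: [] for server in servers}
    for region in regions:
        target = min(servers, key=lambda s: len(assignment[s]))
        assignment[target].append(region)
    return assignment


def reassign_after_failure(assignment, failed_server):
    """Move the regions of a failed server onto the surviving servers."""
    orphaned = assignment.pop(failed_server)
    survivors = list(assignment)
    for region in orphaned:
        target = min(survivors, key=lambda s: len(assignment[s]))
        assignment[target].append(region)
    return assignment


assignment = assign_regions(
    ["Region 1", "Region 2", "Region 3", "Region 4"],
    ["Region Server 1", "Region Server 2"],
)
print(assignment)
# {'Region Server 1': ['Region 1', 'Region 3'], 'Region Server 2': ['Region 2', 'Region 4']}

# Region Server 2 fails: its regions are re-assigned for recovery.
print(reassign_after_failure(assignment, "Region Server 2"))
# {'Region Server 1': ['Region 1', 'Region 3', 'Region 2', 'Region 4']}
```
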
HBase Architectural Components - ZooKeeper
[Diagram: an active HMaster, an inactive HMaster, Region Server 1 and Region Server 2, each sending heartbeats to ZooKeeper.]
• ZooKeeper is a distributed coordination service that maintains server state in the cluster
• ZooKeeper tracks which servers are alive and available, and provides server failure notifications
• The active HMaster sends a heartbeat signal to ZooKeeper indicating that it is active
• Region servers send their status to ZooKeeper indicating that they are ready for read and write operations
• The inactive HMaster acts as a backup: if the active HMaster fails, it takes over

How do the components work together?
[Diagram: the HMaster and Region Servers 1 and 2 send heartbeats to ZooKeeper, which holds one ephemeral node per active session; one master is active.]
• ZooKeeper handles active HMaster selection and Region Server sessions
• The active HMaster and the Region Servers each connect to ZooKeeper with a session
• ZooKeeper maintains ephemeral nodes for active sessions, kept alive by heartbeats, to indicate that region servers are up and running

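The interplay of heartbeats, ephemeral nodes and master failover can be simulated in a few lines. The sketch below is an in-memory stand-in written in Python, not the ZooKeeper API: the `ToyCoordinator` class, its method names, the explicit `now` timestamps and the "first live candidate wins" election rule are all assumptions made for illustration.

```python
class ToyCoordinator:
    """In-memory stand-in for ZooKeeper: a node vanishes when heartbeats stop."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_heartbeat = {}  # server name -> time of last heartbeat
        self.active_master = None

    def heartbeat(self, server, now):
        # A heartbeat creates or refreshes the server's ephemeral node.
        self.last_heartbeat[server] = now

    def expire_sessions(self, now):
        """Drop the ephemeral node of every server whose session timed out."""
        expired = [server for server, seen in self.last_heartbeat.items()
                   if now - seen > self.session_timeout]
        for server in expired:
            del self.last_heartbeat[server]
            if server == self.active_master:
                self.active_master = None
        return expired

    def elect_master(self, candidates):
        """Keep the current master if alive, else promote the first live candidate."""
        if self.active_master is None:
            for candidate in candidates:
                if candidate in self.last_heartbeat:
                    self.active_master = candidate
                    break
        return self.active_master


zk = ToyCoordinator(session_timeout=3)
masters = ["HMaster A", "HMaster B"]
for server in masters + ["Region Server 1", "Region Server 2"]:
    zk.heartbeat(server, now=0)
print(zk.elect_master(masters))  # HMaster A

# HMaster A goes silent; everyone else keeps sending heartbeats.
for server in ["HMaster B", "Region Server 1", "Region Server 2"]:
    zk.heartbeat(server, now=4)
print(zk.expire_sessions(now=5))  # ['HMaster A']
print(zk.elect_master(masters))   # HMaster B
```

Passing the clock in explicitly keeps the example deterministic; a real coordinator would track session timeouts itself.
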
HBase Read or Write
There is a special HBase catalog table called the META table, which holds the location of the regions in the cluster. The location of the META table is stored in ZooKeeper.
Here is what happens the first time a client reads or writes data to HBase:
1. The client gets the Region Server that hosts the META table from ZooKeeper
2. The client queries that META server to get the region server corresponding to the row key it wants to access
3. The client caches this information along with the META table location
4. The client gets the row from, or puts the row to, the corresponding Region Server

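The first-access flow and the effect of the client-side cache can be traced with a small simulation. The Python sketch below is a toy under stated assumptions: `ToyCluster` and `ToyClient` are hypothetical classes, the two-region layout is invented, and counters stand in for network round trips.

```python
from bisect import bisect_right


class ToyCluster:
    """Stand-in for ZooKeeper plus the META table, counting each lookup."""

    def __init__(self):
        self.meta_server = "Region Server 1"  # what "ZooKeeper" stores
        self.start_keys = ["", "m"]           # META: region start keys...
        self.servers = ["Region Server 1", "Region Server 2"]  # ...and hosts
        self.zookeeper_lookups = 0
        self.meta_lookups = 0

    def ask_zookeeper_for_meta(self):
        self.zookeeper_lookups += 1
        return self.meta_server

    def ask_meta_for_region(self, row_key):
        self.meta_lookups += 1
        i = bisect_right(self.start_keys, row_key) - 1
        end_key = self.start_keys[i + 1] if i + 1 < len(self.start_keys) else None
        return self.start_keys[i], end_key, self.servers[i]


class ToyClient:
    def __init__(self, cluster):
        self.cluster = cluster
        self.meta_location = None  # cached META table location
        self.region_cache = []     # cached (start_key, end_key, server) entries

    def locate(self, row_key):
        # Cached region entries answer repeat lookups without any round trip.
        for start, end, server in self.region_cache:
            if start <= row_key and (end is None or row_key < end):
                return server
        if self.meta_location is None:  # step 1: ask ZooKeeper once
            self.meta_location = self.cluster.ask_zookeeper_for_meta()
        entry = self.cluster.ask_meta_for_region(row_key)  # step 2
        self.region_cache.append(entry)                    # step 3
        return entry[2]                                    # step 4 goes here


cluster = ToyCluster()
client = ToyClient(cluster)
print(client.locate("apple"))   # Region Server 1
print(client.locate("banana"))  # Region Server 1, served from the cache
print(client.locate("zebra"))   # Region Server 2
print(cluster.zookeeper_lookups, cluster.meta_lookups)  # 1 2
```

Only the first request pays for the ZooKeeper lookup, and only requests for a not-yet-seen region pay for a META lookup.
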
HBase Meta Table
[Diagram: the META table maps a row key of (table, key, region) to a value of (region server); Regions 1 and 2 sit on one Region Server and Regions 3 and 4 on another.]
• Special HBase catalog table that maintains a list of all the regions in the HBase storage system and the Region Servers that host them
• The META table is used to find the region for a given table key

HBase Write Mechanism
[Diagram: a client writes to a Region Server, where data passes through the WAL and the MemStore of a Region before reaching HFiles on an HDFS DataNode.]
1. When the client issues a put request, the data is first written to the write-ahead log (WAL). The WAL is a file used to store new data that is yet to be put on permanent storage. It is used for recovery in the case of failure.
2. Once the data is written to the WAL, it is copied to the MemStore. The MemStore is the write cache that stores new data that has not yet been written to disk. There is one MemStore per column family per region.
3. Once the data is placed in the MemStore, the client receives the acknowledgment (ACK).
4. When the MemStore reaches its threshold, it flushes (commits) the data into an HFile. HFiles store the rows of data as sorted KeyValues on disk.

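The four write steps map naturally onto a small class. The Python sketch below is a toy model of the write path for one column family of one region; the `ToyRegionStore` class is hypothetical, the threshold counts cells rather than bytes, and the flush runs inline where a real system would do it in the background.

```python
class ToyRegionStore:
    """Toy write path for one column family of one region."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # assumed: max cells kept in memory
        self.wal = []       # append-only list, stand-in for the WAL file
        self.memstore = {}  # in-memory write cache
        self.hfiles = []    # each flush produces one immutable, sorted "file"

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # step 1: log first, for recovery
        self.memstore[row_key] = value     # step 2: then the write cache
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                   # step 4: threshold reached (inline here)
        return "ACK"                       # step 3: acknowledge the client

    def flush(self):
        # Persist the cached cells sorted by key, then start a fresh MemStore.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, row_key):
        # Newest data wins: check the MemStore, then the files from newest to oldest.
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row_key:
                    return value
        return None


store = ToyRegionStore(flush_threshold=3)
for key, value in [("row-c", "3"), ("row-a", "1"), ("row-b", "2")]:
    store.put(key, value)
store.put("row-d", "4")

print(store.hfiles)    # [[('row-a', '1'), ('row-b', '2'), ('row-c', '3')]]
print(store.memstore)  # {'row-d': '4'}
print(len(store.wal))  # 4
```

The rows arrive unsorted but land in the file sorted by key, which is the property that later makes lookups and scans efficient.
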