NoSQL Database: Apache Cassandra www.folio3.com @folio_3
Folio3 – Overview www.folio3.com @folio_3
Who We Are  We are a Development Partner for our customers  Design software solutions, not just implement them  Focus on the solution – Platform and technology agnostic  Expertise in building applications that are: Mobile Social Cloud-based Gamified
What We Do  Areas of Focus  Enterprise  Custom enterprise applications  Product development targeting the enterprise  Mobile  Custom mobile apps for iOS, Android, Windows Phone, BB OS  Mobile platform (server-to-server) development  Social Media  CMS based websites for consumers and enterprise (corporate, consumer, community & social networking)  Social media platform development (enterprise & consumer)
Folio3 At a Glance  Founded in 2005  Over 200 full time employees  Offices in the US, Canada, Bulgaria & Pakistan  Palo Alto, CA.  Sofia, Bulgaria  Karachi, Pakistan Toronto, Canada
Areas of Focus: Enterprise  Automating workflows  Cloud based solutions  Application integration  Platform development  Healthcare  Mobile Enterprise  Digital Media  Supply Chain
Some of Our Enterprise Clients
Areas of Focus: Mobile  Serious enterprise applications for Banks, Businesses  Fun consumer apps for app discovery, interaction, exercise gamification and play  Educational apps  Augmented Reality apps  Mobile Platforms
Some of Our Mobile Clients
Areas of Focus: Web & Social Media  Community Sites based on Content Management Systems  Enterprise Social Networking  Social Games for Facebook & Mobile  Companion Apps for games
Some of Our Web Clients
Agenda  What is NOSQL?  Motivations for NOSQL  Brewer’s CAP Theorem  Taxonomy of NOSQL databases  Apache Cassandra  Features  Data Model  Consistency  Operations  Cluster Membership  What Does NOSQL Mean for RDBMS?
What is NOSQL?  Refers to databases that differ from traditional relational database management systems (RDBMS)  Distributed, flexible, horizontally scalable data stores  Confusion with the term NOSQL  NOSQL != No SQL (or Anti-SQL)  NOSQL = Not Only SQL  NOSQL is an inaccurate term, since it is commonly used to mean "non-relational" databases, but the term has stuck
Motivations for NOSQL  Classical RDBMS unsuitable for today's web applications because:  Performance (Latency): Variable  Flexibility: Low  Scalability: Variable  Functionality
Brewer's CAP Theorem  Consistency (C)  Availability (A)  Partition Tolerance (P)  Pick any two  Most NOSQL databases sacrifice Consistency in favor of high Availability and Performance
Taxonomy of NOSQL  Key/Value Stores - Distributed Hash Tables (DHT)  Memcached, Amazon’s Dynamo, Redis, PStore  Document Stores  Semi structured data (stores entire documents)  CouchDB, MongoDB, RDDB, Riak  Graph Databases *  Based on graph theory  ActiveRDF, AllegroGraph, Neo4J  Object Database *  Versant, Objectivity  Column-oriented Stores  * these are considered soft NOSQL databases and are usually in NOSQL category because of being "non-relational".
Column-Oriented Data Stores  Semi-structured column-based data stores  Stores each column separately so that aggregate operations for one column of the entire table are significantly quicker than the traditional row storage model  Popular examples  Hadoop/HBASE  Apache Cassandra  Google's BigTable  HyperTable  Amazon's SimpleDB
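The claim that per-column storage speeds up single-column aggregates can be illustrated with a toy in-memory layout. This is only a sketch: the sample records and the `to_columns` helper are made up for illustration and say nothing about how any of the listed stores lay out data on disk.

```python
# Toy comparison of a row-oriented and a column-oriented layout.
# The records and helper names are hypothetical, for illustration only.
rows = [
    {"user": "alice", "age": 30, "city": "Karachi"},
    {"user": "bob", "age": 25, "city": "Sofia"},
    {"user": "carol", "age": 35, "city": "Toronto"},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every full record is visited to aggregate a single field.
avg_age_rows = sum(record["age"] for record in rows) / len(rows)

# Column layout: the aggregate touches only the values of that one column.
avg_age_cols = sum(columns["age"]) / len(columns["age"])
```

Both averages are the same; the difference is how much unrelated data has to be read to compute them.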
Apache Cassandra  Fully distributed column oriented data store  Also provides Map Reduce implementation using Hadoop (increased performance)  Based on Google's BigTable (Data Model) and Amazon's Dynamo (Consistency & Partition Tolerance)  Cassandra values Availability and Partitioning tolerance (AP) while providing tunable consistency levels.
History  Developed at Facebook  Released as open source project on Google Code in July 2008  Became an Apache Incubator Project in March 2009  Became a top level Apache project in February 2010 Performance  Rumors of Facebook having started working on its own separate version of Cassandra
Features  Fully Distributed  Highly Scalable  Fault Tolerant (No single point of failure)  Tunable Consistency (Eventually Consistent)  Semi-structured key-value store  High Availability  No Referential Integrity  No Joins
Data Model  KeySpace (Uppermost namespace)  Column Family / Super Column Family (analogous to table)  Super Column  Column (Name, Value, Timestamp)  Rows are referenced through keys  Each column family is stored in its own set of physical files
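The nesting described on this slide (keyspace, column family, row key, column) can be pictured as nested maps. The sketch below is a rough mental model only, not Cassandra's storage format; the `insert` helper, the `Users` column family and the sample values are hypothetical.

```python
import time
from collections import namedtuple

# A column is a (name, value, timestamp) triple, as on the slide.
Column = namedtuple("Column", ["name", "value", "timestamp"])


def insert(keyspace, column_family, row_key, name, value, timestamp=None):
    """Store a column under keyspace[column_family][row_key][name].

    Assumption for this sketch: when two writes target the same column,
    the one with the newer timestamp is kept.
    """
    ts = timestamp if timestamp is not None else time.time_ns() // 1000
    row = keyspace.setdefault(column_family, {}).setdefault(row_key, {})
    current = row.get(name)
    if current is None or ts >= current.timestamp:
        row[name] = Column(name, value, ts)


keyspace = {}  # uppermost namespace
insert(keyspace, "Users", "alice", "email", "alice@example.com", timestamp=1)
insert(keyspace, "Users", "alice", "email", "alice@new.example.com", timestamp=2)
insert(keyspace, "Users", "alice", "city", "Karachi", timestamp=1)
insert(keyspace, "Users", "bob", "email", "bob@example.com", timestamp=5)
```

Rows in the same column family need not share the same columns: here `alice` has a `city` column and `bob` does not. A super column family would add one more level of nesting between the row key and the columns.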
Standard Column Family
Super Column Family
Super Column Family: Static/Static
Super Column Family: Static/Static
Super Column Family: Static/Dynamic
Super Column Family: Static/Dynamic
Super Column Family: Dynamic/Static
Super Column Family: Dynamic/Static
Super Column Family: Dynamic/Dynamic
Super Column Family: Dynamic/Dynamic
Apache Cassandra: Consistency  Consistency refers to whether a system is left in a consistent state after an operation. In distributed data systems like Cassandra, this usually means that once a writer has written, all readers will see that write.  If W + R > N, you will have strong consistent behavior; that is, readers will always see the most recent write  W is the number of nodes to block for on write  R is the number to block for on reads  N is the replication factor (number of replicas)
Apache Cassandra: Consistency  Relational databases provide strong consistency (ACID)  Cassandra provide eventual consistency (BASE) meaning the database will eventually reach a consistent state  QUORUM reads and writes gives consistency while still allowing availability  Q = (N / 2) + 1 (simple majority)  If latency is more important than consistency, you can lower values for either or both W and R.
Apache Cassandra: Consistency Levels  Write  ZERO  ANY  ONE  QUORUM  ALL  Read  ONE  QUORUM  ALL  (ZERO and ANY apply to writes only)
Write Operation  Client sends a write request to a random node; that node forwards the request to the proper node (1st replica responsible for the partition - coordinator)  Coordinator sends requests to N replicas  If W replicas confirm the write operation then OK  Always writable thanks to hinted handoff: if a replica node for the key is down, Cassandra writes a hint to a live replica node indicating that the write needs to be replayed to the unavailable node.
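The write path above can be sketched as a small in-memory simulation. Everything here is an assumption made for illustration: the `Replica` class, `coordinate_write` and `replay_hints` are hypothetical names, the hint is simply parked on the first live replica, and real Cassandra differs in many details (commit log, memtables, timeouts).

```python
class Replica:
    """A toy replica holding key -> (value, timestamp) plus pending hints."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}
        self.hints = []  # (target replica name, key, value, timestamp)

    def write(self, key, value, timestamp):
        """Apply the write if this replica is reachable; report success."""
        if not self.up:
            return False
        self.data[key] = (value, timestamp)
        return True


def coordinate_write(replicas, key, value, timestamp, w):
    """Send the write to all N replicas; succeed if at least w confirm."""
    live, down = [], []
    for replica in replicas:
        if replica.write(key, value, timestamp):
            live.append(replica)
        else:
            down.append(replica)
    # Hinted handoff: remember on a live replica what each down node missed.
    if live:
        for missed in down:
            live[0].hints.append((missed.name, key, value, timestamp))
    return len(live) >= w


def replay_hints(holder, replicas_by_name):
    """Replay stored hints to targets that are reachable again."""
    remaining = []
    for target, key, value, timestamp in holder.hints:
        if not replicas_by_name[target].write(key, value, timestamp):
            remaining.append((target, key, value, timestamp))
    holder.hints = remaining
```

With three replicas and one of them down, a write at W = 2 still succeeds, and the missed write reaches the third replica once it is back up and the hints are replayed.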
Read Operation  Coordinator sends requests to N replicas; if R replicas respond then OK  If different versions are returned, reconcile them and write the reconciled version back to the replicas (Read Repair)
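Read repair can be sketched in a few lines. This is a simplified model with assumptions stated up front: replicas are plain dicts mapping a key to a (value, timestamp) pair, an unreachable replica is represented by `None`, reconciliation means "newest timestamp wins", and the function name `read_with_repair` is made up.

```python
def read_with_repair(replicas, key, r):
    """Read key from the reachable replicas and repair stale copies.

    replicas: list of dicts (key -> (value, timestamp)); None = unreachable
    r: number of replicas that must respond for the read to succeed
    """
    responders = [replica for replica in replicas if replica is not None]
    if len(responders) < r:
        raise RuntimeError("not enough replicas responded")

    versions = [replica[key] for replica in responders if key in replica]
    if not versions:
        return None

    # Reconcile: the version with the highest timestamp wins.
    newest = max(versions, key=lambda version: version[1])

    # Read repair: write the reconciled version back to stale replicas.
    for replica in responders:
        if replica.get(key) != newest:
            replica[key] = newest
    return newest[0]
```

After a read that sees both an old and a new version, every responding replica ends up holding the new one, so later reads agree even at a lower R.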
Cluster Membership  Gossip Protocol  Every T seconds each node increments its heartbeat counter and gossips to another node about the state of the cluster; the receiving node merges the cluster info with its own copy  Cluster state (node in/out, failure) propagates quickly: in O(log N) gossip rounds, where N is the number of nodes in the cluster
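A toy simulation makes the heartbeat-and-merge step concrete. It is a sketch under simplifying assumptions: each node's view is a dict of node name to the highest heartbeat seen, every node pushes to exactly one random peer per round, and failure detection is left out. The names `merge`, `gossip_round` and `rounds_until_converged` are hypothetical.

```python
import random


def merge(local, remote):
    """Merge a received view into the local one, keeping newer heartbeats."""
    for node, heartbeat in remote.items():
        if heartbeat > local.get(node, -1):
            local[node] = heartbeat


def gossip_round(views, rng):
    """One round: each node bumps its heartbeat and gossips to one peer."""
    for node in list(views):
        views[node][node] += 1
        peer = rng.choice([name for name in views if name != node])
        merge(views[peer], views[node])


def rounds_until_converged(num_nodes, seed=0):
    """Count rounds until every node has heard of every other node."""
    rng = random.Random(seed)
    names = [f"n{i}" for i in range(num_nodes)]
    views = {name: {name: 0} for name in names}
    rounds = 0
    while any(len(view) < num_nodes for view in views.values()):
        gossip_round(views, rng)
        rounds += 1
    return rounds
```

Calling `rounds_until_converged` for growing cluster sizes gives a feel for how slowly the number of rounds increases compared with the number of nodes.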
Storage Ring  Cassandra cluster nodes are organized in a virtual ring.  Each node has a single unique token that defines its place in the ring and which keys it is responsible for  Key ranges are adjusted when the nodes join or leave
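The token ring can be sketched with a sorted list of tokens. The assumptions here are for illustration only: a node with token T owns the keys whose token falls in the range after the previous node's token up to and including T, keys are mapped to tokens with an MD5 hash reduced to a small ring, and the `Ring` class and its methods are hypothetical names.

```python
import bisect
import hashlib


def token_for(key, ring_size=2**32):
    """Map a key to a position on the ring (hash choice is an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % ring_size


class Ring:
    """Nodes placed on a ring by token; each owns the range ending at its token."""

    def __init__(self):
        self.tokens = []  # kept sorted
        self.owners = {}  # token -> node name

    def add_node(self, name, token):
        if token in self.owners:
            raise ValueError("token already taken")
        bisect.insort(self.tokens, token)
        self.owners[token] = name

    def remove_node(self, name):
        token = next(t for t, owner in self.owners.items() if owner == name)
        self.tokens.remove(token)
        del self.owners[token]

    def owner_of_token(self, token):
        """First node clockwise whose token is >= the given token."""
        index = bisect.bisect_left(self.tokens, token)
        if index == len(self.tokens):
            index = 0  # wrap around the ring
        return self.owners[self.tokens[index]]

    def owner(self, key):
        return self.owner_of_token(token_for(key))
```

Adding or removing a node only changes ownership of the key range next to that node's token, which is why key ranges can be adjusted without reshuffling the whole cluster.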
Apache Cassandra: MySQL Comparison  MySQL (> 50 GB data)  Read Average: ~ 350 ms  Write Average: ~ 300 ms  Cassandra (> 50 GB data)  Read Average: 15 ms  Write Average: 0.12 ms
Apache Cassandra: Client API  Low level API  Thrift  High Level API  Java  Hector, Pelops, Kundera  .NET  FluentCassandra, Aquiles  Python  Telephus, Pycassa  PHP  phpcassa, SimpleCassie
Apache Cassandra: Where to Use?  Use Cassandra if you need  High write throughput  Near-linear scalability  Automated replication/fault tolerance  and you can tolerate  Weaker (eventual) consistency  Missing RDBMS features
Apache Cassandra: Users  Facebook (of course)  To power inbox search (previously)  Twitter  To handle user relationships, analytics (but not for tweets)  Digg & Reddit  Both use Cassandra to handle user comments and votes  Rackspace  IBM  To build scalable email system  Cisco's WebEx  To store user feed and activity in near real time
What does NOSQL mean for the future of RDBMS?  No worries! RDBMSs are here to stay for the foreseeable future  NOSQL data stores can be used in combination with RDBMS in some situations  NOSQL still has a long way to go to reach the widespread (mainstream) use and support that RDBMSs enjoy
Weaknesses of NOSQL  No or limited support for complex queries  No multi-operation transactions (only individual operations are atomic)  No standard interface for NOSQL databases (like SQL in relational databases)  No or limited administrative features available for NOSQL databases  Not suitable (yet) for mainstream use
Why Still Use RDBMS?  All the weaknesses of NOSQL  Relational databases are widely used and understood  RDBMS DBAs and developers are easily available in the market  For big businesses, relational databases are a safe choice because these businesses have already invested heavily in relational technology  Many database design and development tools are available
References  http://www.allthingsdistributed.com/2008/12/eventually_consistent.html  http://wiki.apache.org/cassandra/FrontPage  http://en.wikipedia.org/wiki/Apache_Cassandra  http://www.slideshare.net/gdusbabek/cassandra-presentation-for-san-antonio-jug  http://www.slideshare.net/Eweaver/cassandra-presentation-at-nosql  http://nosql-database.org/  http://nosqlpedia.com/
Contact  For more details about our services, please get in touch with us. contact@folio3.com US Office: (408) 365-4638 www.folio3.com
