Overview of NoSQL ...motivation, technologies, should you care?

Overview
● Evolution of/motivation for NoSQL databases
● Characterization of NoSQL databases
● Classification of NoSQL databases
● Popularity/usage of NoSQL systems

A brief history of NoSQL
● Originally coined in 1998 by Strozzi for a specific non-relational database
  ○ easy to use, free, text-based data storage, easy manipulation of contents of db
● Reintroduced by Evans (Rackspace) in 2009 for a conference on open source distributed databases
  ○ in response to increased interest in non-RDBMS solutions
    ■ bringing together Cassandra, Mongo, Couch, etc
● Has grown as a movement over the last 3 years

Current status
● Significant buzz within community in 2010
  ○ initial development of technology
  ○ pioneer deployments
  ○ lots of meetups/conferences/birds-of-a-feather sessions
● Many key technologies evolved in late 2010 and 2011
  ○ more large deployments for some technologies
  ○ small companies with no legacy systems basing operations on NoSQL

Current status
● 2012
  ○ buzz/hype is fading
  ○ technology continues to mature
  ○ increased number of deployments
  ○ skills sought in job market

NoSQL - a negative definition
● NoSQL simply defined by being non-relational
  ○ diverse set of technologies fall into NoSQL camp
● Motivations mixed
  ○ open source
  ○ scale - TB, PB - particularly for read/write latency
  ○ increased flexibility over RDBMS systems
  ○ ability to work with raw data
  ○ ACID not always the most appropriate design choice
    ■ analytics data is an excellent example
● Results in many different NoSQL technologies

Typical characteristics
● Don't use SQL!
● Open source
● Intended to deliver performance
  ○ in some dimension
● Typically JOIN not supported
  ○ JOINs carry a performance hit
● Consistency often relaxed
  ○ eventual consistency
● More flexibility in schema
  ○ if schema used at all!
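
To make "eventual consistency" concrete, here is a minimal sketch in Python. It assumes two in-memory replicas and a last-write-wins merge keyed on a version number; the names `Replica` and `sync` are hypothetical and not taken from any particular NoSQL product. Real systems use more elaborate schemes, so treat this only as an illustration of the idea that a read may be stale until replicas exchange state.

```python
class Replica:
    """Toy replica holding key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        # Last-write-wins: accept the write only if it is newer than what we hold.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Exchange state in both directions so the replicas converge."""
    for src, dst in ((a, b), (b, a)):
        for key, (version, value) in list(src.data.items()):
            dst.write(key, value, version)


r1, r2 = Replica(), Replica()
r1.write("profile:alice", "v1", version=1)

stale = r2.read("profile:alice")   # r2 has not seen the write yet
sync(r1, r2)
fresh = r2.read("profile:alice")   # after syncing, r2 returns the new value
```

The write is acknowledged by one replica immediately, and the other replica catches up later; that gap is what a relaxed consistency model accepts in exchange for availability and latency.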

Diversity of NoSQL databases
● 122 separate technologies listed on http://nosql-database.org/
  ○ mix of commercial, open source and some in between
● Vary in many dimensions:
  ○ architecture
  ○ interfaces
    ■ api/languages
  ○ internal data storage
  ○ distribution mechanisms
    ■ redundancy, reliability
  ○ usage - deployments & support community
  ○ maturity

Classification of NoSQL systems
● Column based solutions
● Document store solutions
● Key/Value solutions
● Graph based solutions
● Less significantly:
  ○ XML databases
  ○ Object databases
  ○ Multivalue databases

Column based solutions
● Structured data
  ○ similar to classical tables
● Generally much more flexible
  ○ no rigorous schema necessary
  ○ can typically add columns in ad hoc fashion
    ■ often without explicitly declaring column
● However, can result in very different usage
  ○ e.g. can have millions of columns associated with given row
● Examples: Hadoop/HBase, Cassandra, Hypertable, SimpleDB
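
The ad hoc column idea can be sketched in a few lines of Python. This is a toy model only: `ColumnFamily`, `insert` and `get_row` are hypothetical names chosen for illustration, not the API of HBase, Cassandra or any other product listed above. Each row key owns its own set of columns, so two rows need not share any column names.

```python
from collections import defaultdict


class ColumnFamily:
    """Toy wide-row table: each row key maps to its own set of columns."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def insert(self, row_key, column, value):
        # No schema: a column exists as soon as a value is written to it.
        self.rows[row_key][column] = value

    def get_row(self, row_key):
        # Return the row's columns sorted by column name.
        return dict(sorted(self.rows.get(row_key, {}).items()))


users = ColumnFamily()
users.insert("alice", "email", "alice@example.com")
users.insert("alice", "city", "Zurich")
users.insert("bob", "email", "bob@example.com")

# A "wide row": one column per event, added without declaring anything first.
for minute in range(3):
    users.insert("bob", f"login:2012-01-01T10:0{minute}", "ok")
```

Here the row for `bob` grows one column per login event, which is the usage pattern that can lead to very large numbers of columns on a single row.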

Document based solutions
● Less structured data
  ○ DB composed of 'documents' containing arbitrary data
    ■ usually containing longer form content, e.g. CMS
● Documents contain some structure to support query/search/filter, etc
● Somewhat less emphasis on a key
  ○ can be autogenerated
● Quite unlike classical databases
● Examples: MongoDB, CouchDB
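
A minimal sketch of the document model, in Python, may help. `DocumentStore`, `insert` and `find` are hypothetical names for illustration and do not mirror the MongoDB or CouchDB APIs. It shows the two points from the slide: the key is autogenerated, and queries filter on whatever structure the documents happen to contain.

```python
import uuid


class DocumentStore:
    """Toy document store: arbitrary dicts stored under autogenerated ids."""

    def __init__(self):
        self._docs = {}

    def insert(self, document):
        doc_id = uuid.uuid4().hex          # the key is generated, not supplied
        self._docs[doc_id] = dict(document)
        return doc_id

    def find(self, **criteria):
        # Return every document whose fields match all the given criteria.
        return [
            doc for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


pages = DocumentStore()
page_id = pages.insert(
    {"type": "article", "title": "Intro to NoSQL", "body": "...", "tags": ["nosql"]}
)
pages.insert({"type": "comment", "author": "bob", "body": "Nice post"})

articles = pages.find(type="article")
```

The two documents have different fields, and nothing had to be declared in advance for the query on `type` to work.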

Key/value stores
● DBs inspired by memcached
  ○ simple, fast key/value stores
● Attempt to retain most of DB in memory
  ○ fast response times
● Different designs for scalability
  ○ single node/multi node
● Much emphasis on the keys in this type of DB
● Write usually overwrites entire previous entry
● Examples: Redis, Couchbase/Membase, DynamoDB, Riak
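
The overwrite semantics are easy to miss, so here is a small Python sketch. `KeyValueStore` is a hypothetical in-memory class written for this example; it is not the client API of Redis, Riak or any other store named above.

```python
class KeyValueStore:
    """Toy in-memory key/value store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # A write replaces the whole previous entry; there is no partial update.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


cache = KeyValueStore()
cache.put("session:42", {"user": "alice", "cart": ["book"]})
cache.put("session:42", {"user": "alice"})   # the earlier "cart" field is gone

current = cache.get("session:42")
```

Because the store treats the value as opaque, the application has to read, modify and rewrite the full value if it wants to change one field, and the design of the keys (here `session:<id>`) carries most of the modelling effort.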

Graph based solutions
● Obviously different from previous categories
  ○ focus specifically on graphs
● Queries supported are graph-specific
  ○ e.g. get nodes related to specified node
● Typically support for solving standard graph problems
  ○ e.g. shortest path, general graph traversal
● Can deliver very significant performance gains over non-graph-specific solutions
  ○ for graph problems!
● Examples: Neo4j
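
The two query types mentioned above (related nodes and shortest path) can be sketched in plain Python over an adjacency list. The sample graph and the function names `neighbours` and `shortest_path` are made up for illustration; a graph database such as Neo4j exposes these operations through its own query language and storage engine, not through code like this.

```python
from collections import deque

# Hypothetical undirected "knows" graph as an adjacency list.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}


def neighbours(graph, node):
    """Nodes directly related to the given node."""
    return graph.get(node, [])


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest path as a list, or None."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours(graph, node):
            if nxt not in previous:
                previous[nxt] = node
                if nxt == goal:
                    # Walk back from the goal to the start.
                    path = [goal]
                    while previous[path[-1]] is not None:
                        path.append(previous[path[-1]])
                    return path[::-1]
                queue.append(nxt)
    return None


path = shortest_path(graph, "alice", "erin")
```

In a relational schema the same traversal would need one self-join per hop, which is where a graph-specific engine tends to pay off.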

It's a noisy space...
● Very many candidate technologies
● Relatively small number of real-world solutions
● Difference between the classifications above is one of emphasis...
  ○ column based and document based arrive at semi-structured sweet spot from opposite ends of spectrum
● ...although this results in different preferred use cases...
  ○ document based solution better for document problems, e.g. CMS

Common techniques used
● Hashing techniques used to map data to nodes in cluster
● Internode communication via gossip
● Common replication techniques
● Thrift is used in a few cases
● MapReduce often used to search over distributed system
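
One common form of the hashing technique in the first bullet is consistent hashing, sketched below in Python. This is an assumption about which variant is meant, since the slide does not name one. The `HashRing` class, the node names and the choice of MD5 with 100 virtual points per node are all illustrative, not taken from any specific product.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node is placed at `vnodes` points around the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # The owner is the first ring point clockwise from the key's hash.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:1001")

# Adding a node only moves keys onto the new node; other keys stay put.
bigger = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"user:{n}" for n in range(1000)]
moved = [k for k in keys if ring.node_for(k) != bigger.node_for(k)]
```

The point of the ring is that growing or shrinking the cluster relocates only a fraction of the keys, whereas a plain `hash(key) % number_of_nodes` scheme would reshuffle most of them.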
Comparison (oldish)...
Comparison (oldish)
Comparison (oldish)

Horses for courses...
● SQL is a perfectly good solution for many problems
  ○ tried and tested
● Some problems require an alternative solution
  ○ typically driven by scale and/or flexibility
● NoSQL offers (many) alternatives
  ○ although it is relatively easy to identify the realistic options
● Column based approaches good for mostly structured data with enhanced flexibility
● Document based approaches good for document oriented problems

...so let's dive into one NoSQL database...
● Cassandra...
