
Cachirulo valley Bigtable and friends

Quick overview of NoSQL and some BigTable flavours.


Juan Luis Belmonte

September 13, 2012


Transcript

  1. NoSQL: a DB engine that doesn't stick to the usual RDBMS parameters.

     Main differences with RDBMSs:
     - It might not give ACID guarantees
     - SQL is not available, not fully supported, or not the focus
     - No schema
     - Distributed
     - Designed to scale in some way
     - Designed to suit 21st-century needs

     9/13/12  BigTable & Friends
  2. SQL
     - Human language to query data.
     - Abuse of joins has a critical performance impact.
     - When one server is not enough.
  3. SQL teaches you to think about the queries. Which is good!
  4. RDBMS issues when you reach the limit
     - When they were designed, petabytes were science fiction.
     - Expensive: money, learning curve.
     - Sharding and replication: master-slave, single points of failure, each solution is homebrewed.
     - Some of the RDBMS magic is difficult to achieve when you break the one-server barrier.
  5. RDBMSs are ROW oriented
     Table: collection of rows written in the same file.
     Row: whole chunk of info (key + columns).

     Key   Name          Age
     foo   john doe      32
     bar   Chuck Norris  72
     baz   john doe      28
  6. Bigtable overview
     - Column oriented.
     - Ordered key-value map.
     - Column families: each row has a fixed set of column families but a non-fixed number of columns.
     - Write optimization: sequential writes.
     - Merge reads.
     - Index -> SSTables (Sorted String Tables).
     - Simple client API. As the Bigtable paper says: "Client applications can write or delete values in Bigtable, look up values from individual rows, or iterate over a subset of the data in a table."
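The "sequential writes / merge reads" idea from the slide can be sketched in a few lines: writes always go to the newest in-memory store, flushes freeze that store as an immutable "SSTable", and a read consults the stores from newest to oldest, returning the first hit. This is a toy model, not HBase or Cassandra API; all names are illustrative.

```java
import java.util.*;

// Toy sketch of LSM-style "merge reads": stores.get(0) is the memtable,
// later entries are older flushed "SSTables".
class MergeRead {
    private final List<NavigableMap<String, String>> stores = new ArrayList<>();

    MergeRead() { stores.add(new TreeMap<>()); }

    // Writes only ever touch the newest store (sequential write path).
    void put(String key, String value) { stores.get(0).put(key, value); }

    // Flush: the current memtable becomes the newest immutable SSTable,
    // and a fresh memtable takes its place at index 0.
    void flush() { stores.add(0, new TreeMap<>()); }

    // Merge read: scan stores newest-first; the first value found wins.
    String get(String key) {
        for (NavigableMap<String, String> s : stores) {
            String v = s.get(key);
            if (v != null) return v;
        }
        return null;
    }
}
```

A real implementation would also compact old SSTables together and use tombstones for deletes; this sketch only shows why a single read may need to consult several files.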
  7. BigTable schema
     Sorted map indexed by row key, column key and timestamp.
     - Row: arbitrary string that indexes data in tablets.
     - Column family: usually stored in the same file; contains several column keys.
     - Columns: indexed by column name.
     - Timestamps: each column value has a timestamp.

     rowK: foo -> name: Chuck @33, surname: Norris @33
     rowK: bar -> name: Eva @123, surname: foo @123, surname: bar @235, address: ….. @999

     Example: use data as column identifiers, e.g. dates. We can use column names as useful pieces of info instead of just a coordinate.
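The "sorted map indexed by row key, column key and timestamp" model can be sketched with a plain TreeMap whose composite key keeps all cells of a row adjacent and the newest version of each cell first. This is a didactic sketch, not the Bigtable API; the key encoding (string concatenation, unpadded numbers) is only safe for this small example.

```java
import java.util.*;

// Sketch: model a Bigtable-style table as one sorted map keyed by
// (row, column, inverted timestamp) -> value.
class SortedCellMap {
    private final TreeMap<String, String> cells = new TreeMap<>();

    void put(String row, String column, long ts, String value) {
        // Inverting the timestamp makes the newest version sort first
        // within a (row, column) prefix.
        cells.put(row + "/" + column + "/" + (Long.MAX_VALUE - ts), value);
    }

    // Latest value for (row, column): the first entry at or after the prefix.
    String get(String row, String column) {
        String prefix = row + "/" + column + "/";
        Map.Entry<String, String> e = cells.ceilingEntry(prefix);
        if (e == null || !e.getKey().startsWith(prefix)) return null;
        return e.getValue();
    }
}
```

Because the map is sorted, a range scan over one row (or over consecutive row keys) is just an in-order walk over adjacent entries, which is what makes tablet/SSTable layouts efficient for scans.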
  8. Design thinking on queries
     - Are we going to need just a row key?
     - Do we always know the key?
     - Do we want to query by some field value?
     - Do we always know the schema, or could it be variable?
  9. HBase
     - Based on BigTable.
     - Relies on HDFS, so the DB engine doesn't care about replication.
     - Region server farms and fileserver latency.
     - More than one process: Hadoop, HDFS, ZooKeeper, HBase.
  10. Cassandra
     - Based on Bigtable and Amazon's Dynamo.
     - Partitioning by consistent hashing: 1 key maps to a server.
     - Fully tunable replication factor and consistency level.
     - Hand-off servers (hinted handoff).
     - Ring, gossip protocol, server symmetry and p2p.
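The "1 key maps to a server" partitioning can be sketched as a consistent-hashing ring: each node owns a token, and a key belongs to the first node whose token is at or after the key's hash, wrapping around at the end. Illustrative only; real Cassandra uses MD5/Murmur3 tokens and (later) virtual nodes, not `hashCode`.

```java
import java.util.*;

// Toy consistent-hashing ring: node tokens live in a sorted map, and
// ownerOf walks clockwise from the key's hash to the next token.
class Ring {
    private final TreeMap<Integer, String> tokens = new TreeMap<>();

    void addNode(String name) {
        // Mask to keep the token non-negative.
        tokens.put(name.hashCode() & Integer.MAX_VALUE, name);
    }

    String ownerOf(String key) {
        int h = key.hashCode() & Integer.MAX_VALUE;
        Map.Entry<Integer, String> e = tokens.ceilingEntry(h);
        // Past the last token: wrap around to the first node on the ring.
        return (e != null ? e : tokens.firstEntry()).getValue();
    }
}
```

The useful property: adding or removing one node only remaps the keys on that node's arc, instead of reshuffling every key the way `hash(key) % nodeCount` would.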
  11. HBase
     - Column families
     - byte[] for key
     - byte[] for fields and column names
     - No fixed schema

     Cassandra
     - Column families
     - byte[] for key
     - byte[] for fields and column names
     - SuperColumn families: "groups of column families"
     - Can define schema for column families
  12. Put / Delete

     public void put() throws IOException {
         ProfileDenormalizer.denormalizeProfile(solrProfile);
         HTableInterface table = HBaseManager.getManager().getTable(TABLE);
         Put putCommand = new Put(getRowQualifier());
         putCommand.add(COLUMNF, COLUM_QUALIFIER, HBaseSerializers.objectToBytes(solrProfile));
         try {
             table.put(putCommand);
         } catch (IOException e) {
             table.flushCommits();
         } finally {
             table.close();
         }
     }

     public void delete() throws IOException {
         HTableInterface table = HBaseManager.getManager().getTable(TABLE);
         table.delete(new Delete().deleteColumn(COLUMNF, getRowQualifier()));
         if (!table.isAutoFlush())
             table.flushCommits();
         table.close();
     }

     * HBaseSerializers are just utils to serialize an Object to a byte[].
  13. Get / Scan

     public HBaseProfile get(byte[] rowQualifier) throws IOException {
         HTableInterface table = null;
         try {
             table = HBaseManager.getManager().getTable(TABLE);
             Get getCommand = new Get(rowQualifier);
             getCommand.addFamily(COLUMNF);
             Result resultSet = table.get(getCommand);
             if (resultSet.isEmpty())
                 return null;
             List<KeyValue> keyValues = resultSet.getColumn(COLUMNF, COLUM_QUALIFIER);
             this.solrProfile = (SolrProfile) …

     HBaseManager manager = HBaseManager.getManager();
     table = manager.getTable(tableName);
     Scan scanCommand = new Scan();
     scanCommand.setBatch(max);
     scanCommand.setCaching(max);
     scanCommand.setMaxVersions(1);
     table.getScanner(scanCommand);

     ResultScanner implements Iterable<Result>.
     Result encapsulates a collection of KeyValues.
  14. Secondary indexing
     We need to create new indexes (tables) that point to the rows we want to fetch.
     Denormalization.
     The normal forms (abbrev. NF) of relational database theory provide criteria for determining a table's degree of vulnerability to logical inconsistencies and anomalies.
     You (usually) have to take care of the inconsistencies yourself. The good point is that we hate relations!
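The denormalization pattern above can be sketched with two plain maps: a primary "table" keyed by row key, and a hand-maintained "index table" keyed by a field value, which the application must keep consistent on every write. Table names and fields here are made up for the example; in HBase these would be two real tables with the same update discipline.

```java
import java.util.*;

// Sketch of a manually maintained secondary index: the store keeps
// key -> name, and byName keeps name -> set of keys. Keeping the two
// in sync is the application's job, not the database's.
class ManualIndex {
    private final Map<String, String> users = new HashMap<>();
    private final Map<String, Set<String>> byName = new HashMap<>();

    void put(String key, String name) {
        String old = users.put(key, name);
        // On update, the stale index entry must be removed by hand --
        // forgetting this step is exactly the inconsistency the slide warns about.
        if (old != null) byName.get(old).remove(key);
        byName.computeIfAbsent(name, n -> new TreeSet<>()).add(key);
    }

    Set<String> findByName(String name) {
        return byName.getOrDefault(name, Collections.emptySet());
    }
}
```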
  15. Design thinking on queries
     "I have 700MM rows and 30 TB in my DB and I need to build a new secondary index…"
  16. HBase secondary indexing
     Just using contrib packages.
     After several hours looking for an easy secondary indexing solution… DIY.
  17. Cassandra
     - Column families and super column families.
     - Built-in secondary indexing since 0.7.
     - Non-atomic counters since 0.7.
     - Deletes aren't so obvious, due to replication and compaction.
     - Row resurrection.
     - Eventually persistent (writes go to memory first).
  18. Playing with schemas

     create column family commits
       with comparator = UTF8Type
       and key_validation_class = TimeUUIDType
       and column_metadata = [
         {column_name: forge, validation_class: UTF8Type},
         {column_name: revision, validation_class: UTF8Type, index_type: KEYS},
         {column_name: repoRevision, validation_class: UTF8Type, index_type: KEYS},
         {column_name: message, validation_class: UTF8Type},
         {column_name: addedFiles, validation_class: UTF8Type},
         {column_name: deletedFiles, validation_class: UTF8Type},
         {column_name: modifiedFiles, validation_class: UTF8Type},
         {column_name: replacedFiles, validation_class: UTF8Type},
         {column_name: renamedFiles, validation_class: UTF8Type},
         {column_name: repository, validation_class: LongType, index_type: KEYS},
         {column_name: user, validation_class: LongType},
         {column_name: date, validation_class: LongType}];

     create column family project_contributions_stats_scf with column_type = 'Super';

     create column family randoomStuff;
  19. Mutators

     counter:

     Mutator counter = HFactory.createMutator(keyspace, getStringSerializer());
     counter.insertCounter("counter", cassandra.CassandraConstants.CF_COMMIT_COUNTER,
         HFactory.createCounterColumn("commits", increment));
     counter.execute();

     put:

     Mutator mutator = HFactory.createMutator(keyspace, UUIDSerializer.get());
     mutator.addInsertion(key, CF, HFactory.createStringColumn("foo", foo))
            .addInsertion(key, CF, HFactory.createStringColumn("bar", bar))
            .addInsertion(key, CF, HFactory.createStringColumn("message", message));
     mutator.execute();

     delete:

     Mutator mutator = HFactory.createMutator(keyspace, UUIDSerializer.get());
     mutator.delete(key, CF, null, CassandraManager.getUUIDSerializer());
  20. Query / Get

     final String[] COLUMNS = {"foo", "bar", "baz", "murrico"}; // foo is indexed

     query:

     // We use ByteBuffer because each column is a different type.
     IndexedSlicesQuery<UUID, String, ByteBuffer> query =
         HFactory.createIndexedSlicesQuery(CassandraManager.getKeyspace(),
             CassandraManager.getUUIDSerializer(),
             StringSerializer.get(),
             ByteBufferSerializer.get());
     query.setColumnNames(Constants.COMMIT_COLUMNS);
     query.addEqualsExpression("foo", bs.fromByteBuffer(
         CassandraManager.getLongSerializer().toByteBuffer(id)));
     query.setColumnFamily(CF);

     get:

     get(UUID uuid) {
         SliceQuery<UUID, String, ByteBuffer> result =
             HFactory.createSliceQuery(CassandraManager.getKeyspace(),
                 UUIDSerializer.get(),        // key
                 StringSerializer.get(),      // column qualifier
                 ByteBufferSerializer.get()); // data
         result.setColumnFamily("commits");
         result.setKey(TimeUUIDUtils.toUUID(TimeUUIDUtils.asByteArray(uuid)));
         result.setColumnNames(Constants.COMMIT_COLUMNS);
         QueryResult<ColumnSlice<String, ByteBuffer>> columnSlice = result.execute();
  21. Cassandra
     - Better for write-heavy applications
     - Ease of replication (the user just doesn't touch anything)
     - Easy to set up
     - Hundreds of columns
     - Random partitioner instead of Bigtable's ordered tree
     - More complex schemas thanks to supercolumns
     - You can control everything

     HBase
     - Read-heavy applications
     - Row locking
     - MapReduce: the data doesn't travel
     - Better performance on range scans
     - NameNode is a single point of failure
     - Less verbose client