Lightning Talks . Cassandra London August 20th 2014

Ale
August 28, 2014

Phil Meredith
Paul Brown
Ashic Mahtab
Oleksii Mandrychenko
Alok Dwivedi

Transcript

  1. SHARE YOUR USE CASE!

    We are always looking for speakers to share their experience, and have created a Speakers Program full of benefits! If you are interested, please contact us for details. Need help, or want to contribute? Email [email protected]
  2. Meetup Agenda - August 20th 2014

    19:00 - Reception (with drinks and snacks)
    19:30 - Welcome
    19:35 - Phil Meredith
    19:40 - Paul Brown
    19:45 - Chris
    19:50 - Ashic Mahtab
    19:55 - Oleksii Mandrychenko
    20:00 - Alok Dwivedi
    20:05 - Christopher Batey
    20:45 - Q&A
    21:00 - Networking
  3. Phil Meredith

    19:35 - Phil Meredith, "Cassandra and the power of the Merkle tree". Joined Credit Suisse as a software engineer two years ago after graduating in computer science. Have developed an interest in big data problems and have been working on the bank's leading big data team for just over 6 months.

  4. [Diagram: 10 AM, $2 Million; 11 AM, $1 Million]

    Labels: Spot price, $/£ Foreign Exchange, Interest Rate
  5. Partition key | Column key | Column key | … | Column key

    10am | IBM: {price:$10, amount: 200K} | FXRate/UsdGbp: {1:1} | … | Interest rate: {UK:3}
    11am | IBM: {price:$10, amount: 200K} | FXRate/UsdGbp: {1:0.5} | … | Interest rate: {UK:3}
    Highlighted: FXRate/UsdGbp is {1:1} at 10am and {1:0.5} at 11am.
  6. Partition key | Column key | Column key | Column key

    10am | Stocks: /stocks/v1 | FxRates: /fx/v1 | InterestRate: /interest/v1
    11am | Stocks: /stocks/v1 | FxRates: /fx/v2 | InterestRate: /interest/v1

    Partition key | Column key | Column key
    /stocks/v1 | IBM: V1 | MS: V1
    /stocks/v2 | IBM: V1 | MS: V2
    /fx/v1 | USDtoGBP: V1
    /interest/v2 | UK Rate: V1
  7. Partition key | Column key | Column key

    IBM | V1: {nominal: £10, amnt:100}
    MS | V1: {nominal:£0.77, amnt: 1000} | V2: {nominal:£0.77, amnt:5000}
    USDtoGBP | V1: {rate: 0.5}

    Partition key | Column key | Column key
    /stocks/v1 | IBM: V1 | MS: V1
    /stocks/v2 | IBM: V1 | MS: V2
    /fx/v1 | USDtoGBP: V1
    /interest/v1 | UK Rate: V1
  8. Partition key | Column key | Column key

    MS | V1: 100 | V2: 60
    IBM | V1: 11

    /fx/v1 | USDtoGBP: V1:Hash:usd
    /stock/v1 | MS: V1:Hash-MS | IBM: V1:Hash-IBM

    Partition key | Hash
    /interest/v1 | Hash:1
    /fx/v1 | Hash:2
    /stock/v1 | Hash:3

    Partition key | Top level hash
    10am | Top:level:hash:1
  9. 10am: Top:level:hash:1 | 11am: Top:level:hash:2

    10am (Top:level:hash:1)
      Interest: Hash:1
      fx: Hash:2
        EUR/USD: V1:hash:eur
        USD/GBP: V1:hash:usd, value {1:1}
        USD/JPY: V1:hash:jpy
        …
      stock: Hash:3
      …

    11am (Top:level:hash:2)
      Interest: Hash:1
      fx: Hash:2a
        EUR/USD: V1:hash:eur
        USD/GBP: V2:hash:usd, value {1:0.5}
        USD/JPY: V1:hash:jpy
        …
      stock: Hash:3
      …
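The hash hierarchy on the slides above can be illustrated with a minimal in-memory sketch. It is not the speaker's implementation and not Cassandra's internal repair mechanism: the function names (`digest`, `category_hash`, `top_level_hash`, `changed_keys`), the choice of SHA-256, and the dictionary layout are all assumptions made for illustration. Only the 10am/11am values for IBM, USD/GBP and the UK interest rate come from the slides. The idea shown is that two snapshots are compared top-down, and only branches whose hashes differ are descended into.

```python
import hashlib


def digest(*parts: str) -> str:
    """Hash an ordered sequence of strings into a short hex digest."""
    h = hashlib.sha256()
    for part in parts:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()[:12]


def category_hash(entries: dict) -> str:
    """Hash one category (e.g. fx) from its sorted key/value leaves."""
    leaves = [digest(key, repr(value)) for key, value in sorted(entries.items())]
    return digest(*leaves)


def top_level_hash(snapshot: dict) -> str:
    """Hash a whole snapshot from its sorted category hashes."""
    parts = [f"{name}:{category_hash(entries)}" for name, entries in sorted(snapshot.items())]
    return digest(*parts)


def changed_keys(old: dict, new: dict) -> list:
    """Descend only into branches whose hashes differ."""
    if top_level_hash(old) == top_level_hash(new):
        return []  # identical snapshots, nothing to compare further
    changed = []
    for name in sorted(set(old) | set(new)):
        old_entries, new_entries = old.get(name, {}), new.get(name, {})
        if category_hash(old_entries) == category_hash(new_entries):
            continue  # identical subtree, skip it entirely
        for key in sorted(set(old_entries) | set(new_entries)):
            if old_entries.get(key) != new_entries.get(key):
                changed.append(f"{name}/{key}")
    return changed


# Values taken from the slides; the nesting is an illustrative assumption.
snapshot_10am = {
    "stock": {"IBM": "price:$10, amount:200K"},
    "fx": {"USD/GBP": "1:1"},
    "interest": {"UK": "3"},
}
snapshot_11am = {
    "stock": {"IBM": "price:$10, amount:200K"},
    "fx": {"USD/GBP": "1:0.5"},
    "interest": {"UK": "3"},
}

print(changed_keys(snapshot_10am, snapshot_11am))  # ['fx/USD/GBP']
```

Because the stock and interest hashes are unchanged between the two snapshots, only the fx branch is inspected, which is the property that makes this kind of tree useful when most data is shared between versions.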
  10. "NORMAL" APPS

    • Line of Business
    • Not petabyte scale
    • Learning apps (data fits on one box)
    • Track users
    • Track interactions
    • Track events
    • SQL Server / MySQL / etc
  11. COMMON ISSUES

    • Single data model for everything: reads, writes, reports, analytics.
    • Data oriented modelling
    • Performance issues
    • Maintainability issues
    • Coupling, cohesion and all that jazz.
    • Fault tolerance. (Headless chicken mode)
  12. COMMON SOLUTIONS

    • Add a caching layer: what can't an extra level of indirection solve?
    • Copy the whole database and do reporting / analytics off that.
    • Give up on consistency as a whole.
    • Start using /dev/null databases (!)
    • Build strange things to solve specific performance issues.
    • CQRS, Event Sourcing… but all those viewmodels… need a fast store for them.
  13. "BETTER" SOLUTIONS

    • Polyglot data… add more moving cogs to the machine.
    • Create a file based minimalistic version of something like Cassandra (and repent).
  14. WHY CASSANDRA

    • Fast. Really fast. As long as you model for queries.
    • Modelling for queries gets you back to analysing use cases.
    • Constraints as guides.
    • Relentless denormalisation… no use clutching onto one true schema to rule them all.
    • Gears towards service orientation.
  15. WHY CASSANDRA

    • Fault tolerance… sleep easy. Node went down? Ah well ☺
    • Tools and libraries: nowhere near SQL, but getting there… fast.
    • Query language: CQL.
    • If you're in .NET, the DataStax driver has EF-like contexts that you can use… nice and familiar to start with.
    • Dead easy to pick up from an app dev perspective. Progressive learning.
    • A little modelling goes a long way. Fast.
    • Easy install. Free, open source. Commercial version + support from DataStax.
  16. WHY CASSANDRA

    • Data analytics… Hadoop / Pig integration. Hive with DataStax Enterprise.
    • Spark connector is awesome… opens up access to Spark ML, Spark SQL and all the other loveliness built on top of Spark.
    • Import / Export via CSV.
  17. Oleksii Mandrychenko - All the way from Edinburgh for today!!

    19:55 - Oleksii Mandrychenko, "Using Cassandra to create write-heavy real-time security application". I am a senior software engineer at ZoneFox. I have been involved in IT for the last 10 years, doing coding, hacking, configuring, deploying, and testing. My area of expertise is real-time forensic and behavioural analysis. I am responsible for architecting a scalable, high-performance analytical system to detect insider threat attacks. I hold an MSc in computer science and am currently writing up a PhD thesis on a stack of computer science and health.
  18. What we store

    - Events: Date, Machine, User, App, File, Activity
    - Alerts: Reason, Events[]
  19. Volume

    Per day:
    - 600 active users
    - 300M events
    - 1k alerts
    - 10 GB disk space
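To put the daily figures above into perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes events arrive evenly over 24 hours, that "10 GB" means decimal gigabytes, and that all of the disk space is attributed to events (the 1k alerts are ignored). The variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures from the slide.
events_per_day = 300_000_000
active_users = 600
disk_bytes_per_day = 10 * 10**9  # assumption: decimal gigabytes

# Derived averages (assuming an even spread over the day).
avg_events_per_second = events_per_day / SECONDS_PER_DAY
events_per_user_per_day = events_per_day / active_users
bytes_per_event = disk_bytes_per_day / events_per_day

print(f"{avg_events_per_second:.0f} events/s on average")     # 3472 events/s on average
print(f"{events_per_user_per_day:.0f} events per user/day")   # 500000 events per user/day
print(f"{bytes_per_event:.1f} bytes on disk per event")       # 33.3 bytes on disk per event
```

Real traffic is unlikely to be evenly spread, so peak write rates would be higher than this average.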
  20. Alok Dwivedi - All the way from Reading! :)

    20:00 - Alok Dwivedi, "Quickly enumerating sub sets of very large C* tables using composite partition keys". I am a senior developer working for Symantec, with over 14 years of software development and design experience, mainly in RDBMS and various server-side technologies using Java/C++/C#. For the past year I have been working on NoSQL and Big Data technologies, mainly Cassandra.
  21. Quickly enumerating sub sets of very large C* tables using composite partition keys

    Alok Dwivedi
  22. Use Case In Brief

    - We collect metadata for billions of objects, on which customers need to query and get various stats interactively (slice and dice the data to get stats).

    Diagram labels:
    - Ingest: several billions of items
    - Primary C* Table
    - Transform and aggregate
    - Secondary C* Table: grouped into buckets containing 100's of millions of rows
    - Searchable index in Elastic Search: 10's of millions of items
  23. Table layout

    - Primary table stores one record for every object for which we collected metadata.

    ItemId | Item Container | Attrib_a | Attrib_b | Attrib_c | Size
    item1 | container1 | a_value1 | b_value1 | c_value1 | 100
    item2 | container1 | a_value1 | b_value1 | c_value2 | 80
    item3 | container2 | a_value2 | b_value2 | c_value3 | 500
    item4 | container2 | a_value2 | b_value2 | c_value4 | 510

    - Secondary table groups all the inserted items into various buckets on which we later want to produce statistics.

    Item Container | Bucket_a | Bucket_b | Bucket_c | Items<ItemId,Size>
    container1 | a_bucket1 | b_bucket1 | c_bucket1 | Item1,100;item2,80
    container2 | a_bucket2 | b_bucket2 | c_bucket2 | Item3,500;item4,510
  24. Initial solutions

    - The obvious solution is to use ContainerId as the partition key and all columns except the map of items as the primary key, i.e.
      - PRIMARY KEY(ContainerId,Bucket_a,Bucket_b,Bucket_c)
    - This doesn't work, as it creates wide rows: all the data for a container is stored in one physical C* row.

    ContainerId1
      Bucket_a1|Bucket_b1|bucket_c1: {itemId1,size1; itemid2,size2; …}
      Bucket_a1|Bucket_b2|bucket_c1: {itemId3,size3; itemid4,size4; …}
      Bucket_a1|Bucket_b1|bucket_c2: {itemId1,size1; itemid2,size2; …}
      All other combinations for ContainerId1

    ContainerId2
      Bucket_a2|Bucket_b1|bucket_c1: {itemId11,size11; itemid12,size2; …}
      Bucket_a2|Bucket_b2|bucket_c1: {itemId13,size13; itemid14,size14; …}
      Bucket_a2|Bucket_b1|bucket_c2: {itemId14,size21; itemid15,size22; …}
      All other combinations for ContainerId2
  25. Initial solutions

    - Create one table per container.
      - Advantage
        - During the transformation phase, when processing a container, we could pick up the relevant table for that container and iterate through it without any need for filtering.
      - Disadvantage
        - The number of tables can grow very large (maybe 20K per keyspace).
    - We gave it a try and found:
      - Issues related to the table schema not being propagated to all nodes quickly.
      - Performance was never on par with what we got with one table.
  26. Solution that worked

    - We know all the possible values of bucket-A and bucket-B. They have a fixed range of values.
    - By combining them with ContainerId and making it a composite partition key, we can reduce the physical row size.

    ContainerId1/Bucket_a1/Bucket_b1
      Bucket-c1: {itemId1,size1; itemid2,size2; …}
      Bucket-c2: {itemId3,size3; itemid4,size4; …}
      Bucket-c3: {itemId1,size1; itemid2,size2; …}
      All other combinations for container1 & bucket-a1 & bucket_b1

    ContainerId1/Bucket_a1/Bucket_b2
      Bucket-c4: {itemId11,size11; itemid12,size2; …}
      Bucket-c5: {itemId13,size13; itemid14,size14; …}
      Bucket-c6: {itemId14,size21; itemid15,size22; …}
      All other combinations for container1 & bucket-a1 & bucket_b2
  27. Solution that worked

    - Use a composite partition key that avoids very wide rows:
      - PRIMARY KEY((ContainerId,Bucket_a,Bucket_b),Bucket_c)
    - To enumerate all the rows for a container, we make queries using all the known combinations of bucket-A and bucket-B along with that container id.
    - E.g. if there are 50 values in bucket-A and 10 in bucket-B, then we make 500 queries for that container id, one by one:
      - For container Id1
        - For each bucketA
          - For each size bucketB
            - Get all rows from the secondary table by containerId1, bucket_ValueA1, Bucket_valueB1
    - The disadvantage is that some combinations of container and bucket values may not have any records, but since these are lookups by partition key, they are extremely quick when no data is present.
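The enumeration loop described above can be sketched in a few lines. This is an illustrative in-memory model, not the speaker's code and not a real Cassandra driver call: `secondary_table`, `fetch_partition`, `enumerate_container` and the bucket value lists are hypothetical names, and a dictionary lookup stands in for the single-partition query. The sample items and sizes reuse the values from the table layout slide.

```python
from itertools import product

# Hypothetical stand-in for the secondary table: maps a composite partition
# key (container_id, bucket_a, bucket_b) to its clustered rows
# {bucket_c: {item_id: size}}.
secondary_table = {
    ("container1", "a1", "b1"): {"c1": {"item1": 100}, "c2": {"item2": 80}},
    ("container1", "a1", "b2"): {"c1": {"item3": 500}},
    ("container2", "a2", "b1"): {"c3": {"item4": 510}},
}

# The fixed, known ranges of bucket values (2 x 2 here, 50 x 10 in the talk).
BUCKET_A_VALUES = ["a1", "a2"]
BUCKET_B_VALUES = ["b1", "b2"]


def fetch_partition(container_id, bucket_a, bucket_b):
    """Stand-in for one query that supplies the full partition key.

    An empty dict models a partition with no data.
    """
    return secondary_table.get((container_id, bucket_a, bucket_b), {})


def enumerate_container(container_id):
    """Yield (bucket_a, bucket_b, bucket_c, items) for every row of a container."""
    for bucket_a, bucket_b in product(BUCKET_A_VALUES, BUCKET_B_VALUES):
        rows = fetch_partition(container_id, bucket_a, bucket_b)
        for bucket_c, items in sorted(rows.items()):
            yield bucket_a, bucket_b, bucket_c, items


rows = list(enumerate_container("container1"))
total_size = sum(size for _a, _b, _c, items in rows for size in items.values())

print(len(rows))    # 3
print(total_size)   # 680
```

Here four lookups are issued for `container1` and two of them return nothing, which mirrors the trade-off on the slide: some queries are wasted, but each one targets exactly one partition.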
  28. Paul Brown

    19:40 - Paul Brown, "The view from the bottom of the cliff!" I am currently a development manager building retail POS and booking software for Neill technologies. I've been working with databases since Oracle v6, and since then have done training, consultancy, development, analysis and most things in between. I'm currently 3 months into my NoSQL adventure.
  29. How to Learn Cassandra?

    1. Possibly Don't
    2. Solve a real problem
    3. Seek simplicity
  30. Have a look at:

    • Patrick McFadin Data Modelling Webinars (3 part)
    • Cassandra summit sessions on YouTube
    • DataStax Docs
  31. How to Teach Cassandra?

    1. Avoid "sit and watch"
    2. Adults are goal-centered learners
    3. Never explain What or How before you've explained Why
    4. "If you don't know this then..."