Lightning Talks . Cassandra London August 20th 2014

WELCOME to the London CASSANDRA MEETUP
Need help, or want to contribute? Email [email protected] / [email protected] or tweet @PlanetCassandra

SHARE YOUR USE CASE!
We are always looking for speakers to share their experience and have created a Speakers Program full of benefits! If you are interested, please contact us for details.

EU Summit Call For Papers is OPEN!!

Meetup Agenda - August 20th 2014
19:00 – Reception (with drinks and snacks)
19:30 – Welcome
19:35 – Phil Meredith
19:40 – Paul Brown
19:45 – Chris
19:50 – Ashic Mahtab
19:55 – Oleksii Mandrychenko
20:00 – Alok Dwivedi
20:05 – Christopher Batey
20:45 – Q&A
21:00 – Networking

Phil Meredith
19:35 - Phil Meredith "Cassandra and the power of the Merkle tree"
Joined Credit Suisse as a software engineer two years ago after graduating in computer science. Have developed an interest in big data problems and have been working on the bank's leading big data team for just over 6 months.

And the Power of the Merkle Tree Phil Meredith

10 AM: $2 Million
11 AM: $1 Million
Spot price, $/£ Foreign Exchange, Interest Rate

Partition key | Column key | Column key | … | Column key
10am | IBM | FXRate/UsdGbp | … | Interest rate
 | {price:$10, amount: 200K} | {1:1} | | {UK:3}
FXRate/UsdGbp: {1:1}

Partition key | Column key | Column key | … | Column key
11am | IBM | FXRate/UsdGbp | … | Interest rate
 | {price:$10, amount: 200K} | {1:0.5} | | {UK:3}
FXRate/UsdGbp: {1:0.5}

20 Boxes
2TB Unique data
100K’s Columns
3 Prod Clusters

Partition key | Column key | Column key | Column key
10am | Stocks | FxRates | InterestRate
 | /stocks/v1 | /fx/v1 | /interest/v1
11am | Stocks | FxRates | InterestRate
 | /stocks/v1 | /fx/v2 | /interest/v1

Partition key | Column Key | Column Key
/stocks/v1 | IBM | MS
 | V1 | V1
/stocks/v2 | IBM | MS
 | V1 | V2
/fx/v1 | USDtoGBP |
 | V1 |
/interest/v2 | UK Rate |
 | V1 |

Partition Key | Column Key | Column Key
IBM | V1 |
 | {nominal: £10, amnt:100} |
MS | V1 | V2
 | {nominal:£0.77, amnt: 1000} | {nominal:£0.77, amnt:5000}
USDtoGBP | V1 |
 | {rate: 0.5} |

Partition key | Column Key | Column Key
/stocks/v1 | IBM | MS
 | V1 | V1
/stocks/v2 | IBM | MS
 | V1 | V2
/fx/v1 | USDtoGBP |
 | V1 |
/interest/v1 | UK Rate |
 | V1 |

Partition Key | V1 | V2
MS | 100 | 60

Partition Key | V1
IBM | 11

Partition Key | Column Key | Column Key
/fx/v1 | USDtoGBP |
 | V1:Hash:usd |
/stock/v1 | MS | IBM
 | V1:Hash-MS | V1:Hash-IBM

Partition Key | Hash
/interest/v1 | Hash:1
/fx/v1 | Hash:2
/stock/v1 | Hash:3

Partition Key | Top level hash
10am | Top:level:hash:1

Partition Key | Top level hash
10am | Top:level:hash:1
11am | Top:level:hash:2

10am Top:level:hash:1
  Interest: Hash:1 | fx: Hash:2 | stock: Hash:3 | …
    fx → EUR/USD: V1:hash:eur | USD/GBP: V1:hash:usd | USD/JPY: V1:hash:jpy | …
      USD/GBP {1:1}

11am Top:level:hash:2
  Interest: Hash:1 | fx: Hash:2a | stock: Hash:3 | …
    fx → EUR/USD: V1:hash:eur | USD/GBP: V2:hash:usd | USD/JPY: V1:hash:jpy | …
      USD/GBP {1:0.5}
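
The 10am / 11am comparison above can be sketched in a few lines of Python. This is only an illustration of the idea, not the speaker's implementation: the in-memory dictionaries stand in for Cassandra rows, SHA-256 is an assumed hash function (the slides do not name one), and `digest`, `build_tree` and `changed_keys` are hypothetical helper names. It shows how comparing hashes top-down finds the one changed value without reading the unchanged branches.

```python
import hashlib


def digest(text: str) -> str:
    """Short SHA-256 digest (the hash function is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def build_tree(snapshot: dict) -> dict:
    """Hash every value, then every category, then the whole snapshot."""
    leaf = {
        category: {key: digest(f"{key}={value}") for key, value in items.items()}
        for category, items in snapshot.items()
    }
    category_hash = {
        category: digest("|".join(f"{k}:{h}" for k, h in sorted(hashes.items())))
        for category, hashes in leaf.items()
    }
    top = digest("|".join(f"{c}:{h}" for c, h in sorted(category_hash.items())))
    return {"top": top, "category": category_hash, "leaf": leaf}


def changed_keys(old: dict, new: dict) -> list:
    """Descend only into branches whose hashes differ."""
    if old["top"] == new["top"]:
        return []
    changed = []
    for category in sorted(set(old["category"]) | set(new["category"])):
        if old["category"].get(category) == new["category"].get(category):
            continue  # identical subtree, skip it entirely
        old_leaves = old["leaf"].get(category, {})
        new_leaves = new["leaf"].get(category, {})
        for key in sorted(set(old_leaves) | set(new_leaves)):
            if old_leaves.get(key) != new_leaves.get(key):
                changed.append((category, key))
    return changed


# Values taken from the 10am / 11am slides; only USD/GBP changes.
ten_am = {
    "stock": {"IBM": "{price:$10, amount: 200K}"},
    "fx": {"USD/GBP": "{1:1}"},
    "interest": {"UK": "{UK:3}"},
}
eleven_am = {**ten_am, "fx": {"USD/GBP": "{1:0.5}"}}

tree_10 = build_tree(ten_am)
tree_11 = build_tree(eleven_am)
print(changed_keys(tree_10, tree_11))  # [('fx', 'USD/GBP')]
```

Because unchanged categories keep the same hash, the 11am snapshot can point at the rows already stored for 10am and only write the branch that changed.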

Thank You!

Ashic Mahtab
19:50 - Ashic Mahtab "Cassandra - awesome for non-petabyte scale (too)"

CASSANDRA - AWESOME FOR NON-PETABYTE SCALE (TOO) Ashic Mahtab

“NORMAL” APPS
• Line of Business
• Not petabyte scale
• Learning apps (data fits on one box)
• Track users
• Track interactions
• Track events
• SQL Server / MySQL / etc.

COMMON ISSUES
• Single data model for everything – reads, writes, reports, analytics.
• Data oriented modelling
• Performance issues
• Maintainability issues
• Coupling, cohesion and all that jazz.
• Fault tolerance. (Headless chicken mode)

COMMON SOLUTIONS
• Add a caching layer – what can’t an extra level of indirection solve?
• Copy whole database and do reporting / analytics off that.
• Give up on consistency as a whole.
• Start using /dev/null databases (!)
• Build strange things to solve specific performance issues.
• CQRS, Event Sourcing… but all those viewmodels… need a fast store for them.

“BETTER” SOLUTIONS
• Polyglot data … add more moving cogs to the machine.
• Create a file based minimalistic version of something like Cassandra (and repent).

WHY CASSANDRA
• Fast. Really fast. As long as you model for queries.
• Modelling for queries gets you back to analysing use cases.
• Constraints as guides.
• Relentless denormalisation… no use clutching onto one true schema to rule them all.
• Gears towards service orientation.

WHY CASSANDRA
• Fault tolerance… sleep easy. Node went down? Ah well ☺
• Tools and libraries – nowhere near SQL, but getting there… fast.
• Query language – CQL.
• If you’re in .NET, the DataStax driver has EF-like contexts that you can use… nice and familiar to start with.
• Dead easy to pick up from app dev perspective. Progressive learning.
• A little modelling goes a long way. Fast.
• Easy install. Free, open source. Commercial version + support from DataStax.

WHY CASSANDRA
• Data analytics… Hadoop / Pig integration. Hive with DataStax Enterprise.
• Spark connector is awesome… opens up access to Spark ML, Spark SQL and all other loveliness built on top of Spark.
• Import / Export via CSV.

THANK YOU
• Probably already out of time ☺
• @ashic | www.heartysoft.com | [email protected]

Oleksii Mandrychenko - All the way from Edinburgh for today!!
19:55 - Oleksii Mandrychenko "Using Cassandra to create write-heavy real-time security application"
I am a senior software engineer at ZoneFox. I have been involved in IT for the last 10 years, doing coding, hacking, configuring, deploying, and testing. My area of expertise is real-time forensic and behavioural analysis. I am responsible for architecting a scalable, high-performance analytical system to detect insider threat attacks. I hold an MSc in computer science and am currently writing up a PhD thesis on a stack of computer science and health.

Write-heavy security app using Cassandra
Oleksii Mandrychenko, Engineer [email protected]

ZoneFox
Founded in 2008, B2B
Protects intellectual property
On-premises and cloud offering

How ZoneFox Works The relevant part

Agents → Collection Servers → Cassandra

What we store
Events - Date, Machine, User, App, File, Activity
Alerts - Reason, Events[]

Volume
Per day
- 600 active users
- 300M events
- 1k alerts
- 10 GB disk space

What we tried Plain text files

What else
MS SQL
RavenDB
Redis
HBase

Cassandra
uid (timeuuid) | events (blob)
7065fae0-25d9-11e4-a2fc-8d5aad979e84 | ≈ 200 protobuf serialised [] events
… | …
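
A rough sketch of the row layout above: events are grouped into batches of about 200, each batch is serialised into one blob, and the blob is keyed by a time-based UUID. Everything here is illustrative rather than ZoneFox's code: JSON stands in for protobuf so the example needs no extra packages, and `to_rows` and the sample event fields are hypothetical.

```python
import json
import uuid

BATCH_SIZE = 200  # the slide says roughly 200 events per blob


def to_rows(events, batch_size=BATCH_SIZE):
    """Pack events into (timeuuid, blob) pairs, one pair per batch."""
    rows = []
    for start in range(0, len(events), batch_size):
        chunk = events[start:start + batch_size]
        # JSON is a stand-in for protobuf serialisation (assumption).
        blob = json.dumps(chunk).encode("utf-8")
        # uuid1() is a version 1 (time-based) UUID, the kind a timeuuid column holds.
        rows.append((uuid.uuid1(), blob))
    return rows


# Hypothetical events using the fields listed on the "What we store" slide.
events = [
    {"date": "2014-08-20", "machine": f"pc{i % 5}", "user": "alice",
     "app": "explorer", "file": f"doc{i}.txt", "activity": "read"}
    for i in range(450)
]

rows = to_rows(events)
print(len(rows))  # 3 rows: 200 + 200 + 50 events
```

Batching this way means one write per couple of hundred events instead of one write per event, at the cost of having to deserialise a whole blob to read a single event.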

Current speed
Cassandra: 3 nodes (RF=2, CL=1)
Writes: 600M/day
Reads: 1B events in 1.5 hours
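
For scale, the figures above can be converted into per-second rates. This is a back-of-envelope calculation that assumes the load is spread evenly over the period, which real traffic rarely is.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# 600M writes per day, averaged over the whole day.
writes_per_second = 600_000_000 / SECONDS_PER_DAY

# 1B events read in 1.5 hours.
reads_per_second = 1_000_000_000 / (1.5 * 60 * 60)

print(round(writes_per_second))  # 6944
print(round(reads_per_second))   # 185185
```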

Lessons learned
Sucks: HDD/SAS, Wide rows, Hadoop config
Rocks: SSD, Data design, TTL, Puppet

Things to watch
Spark
- MLlib
- GraphX
- Streaming
Shark

Alok Dwivedi - All the way from Reading! :)
20:00 - Alok Dwivedi "Quickly enumerating sub sets of very large C* tables using composite partition keys"
I am a senior developer working for Symantec, with over 14 years of software development and design experience, mainly in RDBMS and various server-side technologies using Java/C++/C#. For the last year I have been working on NoSQL and Big Data technologies, mainly Cassandra.

Quickly enumerating sub sets of very large C* tables using composite partition keys
Alok Dwivedi

Use Case In Brief
- We collect metadata for billions of objects, which customers need to query to get various stats interactively (slice and dice the data to get stats)
Diagram labels, in order: Several billions of items; Ingest; Grouped into buckets containing 100’s of millions of rows; Primary C* Table; Secondary C* Table; Transform and aggregate; 10’s of millions of items; Searchable index in Elastic Search

Table layout
- Primary table stores one record for every object for which we collected metadata

ItemId | Item Container | Attrib_a | Attrib_b | Attrib_c | Size
item1 | container1 | a_value1 | b_value1 | c_value1 | 100
item2 | container1 | a_value1 | b_value1 | c_value2 | 80
item3 | container2 | a_value2 | b_value2 | c_value3 | 500
item4 | container2 | a_value2 | b_value2 | c_value4 | 510

- Secondary table groups all the inserted items into various buckets on which we later want to produce statistics.

Item Container | Bucket_a | Bucket_b | Bucket_c | Items<ItemId,Size>
container1 | a_bucket1 | b_bucket1 | c_bucket1 | Item1,100;item2,80
container2 | a_bucket2 | b_bucket2 | c_bucket2 | Item3,500;item4,510
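
As an illustration of how the secondary table relates to the primary one, the sketch below groups the four example items into their buckets. The attribute-to-bucket mappings are read off the example rows on the slide; how the real system derives a bucket from an attribute is not stated, so `BUCKET_A`, `BUCKET_B`, `BUCKET_C` and `build_secondary` are hypothetical.

```python
from collections import defaultdict

# Primary table rows: (item_id, container, attrib_a, attrib_b, attrib_c, size)
PRIMARY = [
    ("item1", "container1", "a_value1", "b_value1", "c_value1", 100),
    ("item2", "container1", "a_value1", "b_value1", "c_value2", 80),
    ("item3", "container2", "a_value2", "b_value2", "c_value3", 500),
    ("item4", "container2", "a_value2", "b_value2", "c_value4", 510),
]

# Attribute value -> bucket, inferred from the slide's example (assumption).
BUCKET_A = {"a_value1": "a_bucket1", "a_value2": "a_bucket2"}
BUCKET_B = {"b_value1": "b_bucket1", "b_value2": "b_bucket2"}
BUCKET_C = {"c_value1": "c_bucket1", "c_value2": "c_bucket1",
            "c_value3": "c_bucket2", "c_value4": "c_bucket2"}


def build_secondary(rows):
    """Group items by (container, bucket_a, bucket_b, bucket_c) into an item->size map."""
    secondary = defaultdict(dict)
    for item_id, container, a, b, c, size in rows:
        key = (container, BUCKET_A[a], BUCKET_B[b], BUCKET_C[c])
        secondary[key][item_id] = size
    return dict(secondary)


secondary = build_secondary(PRIMARY)
for key, items in secondary.items():
    print(key, items)
```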

Initial solutions
- The obvious solution is to use ContainerId as the partition key and put all columns except the map of items in the primary key, i.e.
- PRIMARY KEY(ContainerId,Bucket_a,Bucket_b,Bucket_c)
- This doesn’t work, as it creates wide rows: all the data for a container is stored in one physical C* row.

ContainerId1
  Bucket_a1|Bucket_b1|bucket_c1: {itemId1,size1; itemid2,size2; …}
  Bucket_a1|Bucket_b2|bucket_c1: {itemId3,size3; itemid4,size4; …}
  Bucket_a1|Bucket_b1|bucket_c2: {itemId1,size1; itemid2,size2; …}
  All other combinations for ContainerId1
ContainerId2
  Bucket_a2|Bucket_b1|bucket_c1: {itemId11,size11; itemid12,size2; …}
  Bucket_a2|Bucket_b2|bucket_c1: {itemId13,size13; itemid14,size14; …}
  Bucket_a2|Bucket_b1|bucket_c2: {itemId14,size21; itemid15,size22; …}
  All other combinations for ContainerId2

Initial solutions
- Create one table per container.
- Advantage
  - During the transformation phase, when processing a container, we could pick up the relevant table for that container and iterate through it without any filtering.
- Disadvantage
  - The number of tables can grow very large (maybe 20K per keyspace).
- We gave it a try and found
  - Issues related to the table schema not being propagated to all nodes quickly
  - Performance was never on par with what we got with one table

Solution that worked
- We know all the possible values of bucket-A and bucket-B. They have a fixed range of values.
- By combining them with ContainerId into a composite partition key, we can reduce the physical row size.

ContainerId1/Bucket_a1/Bucket_b1
  Bucket-c1: {itemId1,size1; itemid2,size2; …}
  Bucket-c2: {itemId3,size3; itemid4,size4; …}
  Bucket-c3: {itemId1,size1; itemid2,size2; …}
  All other combinations for container1 & bucket-a1 & bucket_b1
ContainerId1/Bucket_a1/Bucket_b2
  Bucket-c4: {itemId11,size11; itemid12,size2; …}
  Bucket-c5: {itemId13,size13; itemid14,size14; …}
  Bucket-c6: {itemId14,size21; itemid15,size22; …}
  All other combinations for container1 & bucket-a1 & bucket_b2

Solution that worked
- Use a composite partition key to avoid very wide rows
- PRIMARY KEY((ContainerId,Bucket_a,Bucket_b),Bucket_c)
- To enumerate all the rows for a container, we make queries using all the known combinations of bucket-A and bucket-B along with that container id.
- E.g. if there are 50 values in bucket-A and 10 in bucket-B, then we will make 500 queries for that container id, one by one.
- For container Id1
  - For each bucketA
    - For each size bucketB
      - Get all rows from secondary table by containerId1, bucket_ValueA1, Bucket_valueB1
- The disadvantage is that some combinations of container and bucket values may not have any records, but since these are lookups by partition key, they are extremely quick when no data is present.
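
The enumeration loop above can be sketched as follows. A plain dictionary keyed by (container, bucket_a, bucket_b) stands in for the Cassandra table, so each `table.get` plays the role of one partition-key query; the bucket names, sample data and `enumerate_container` are hypothetical.

```python
from itertools import product


def enumerate_container(table, container_id, bucket_a_values, bucket_b_values):
    """Read every partition of one container by trying all known bucket combinations."""
    rows, queries = [], 0
    for a, b in product(bucket_a_values, bucket_b_values):
        queries += 1
        # One lookup by full partition key; a missing partition is an empty result.
        partition = table.get((container_id, a, b), {})
        for bucket_c, items in sorted(partition.items()):
            rows.append((a, b, bucket_c, items))
    return rows, queries


# 50 values for bucket-A and 10 for bucket-B, as in the slide's example.
BUCKET_A = [f"a{i}" for i in range(50)]
BUCKET_B = [f"b{i}" for i in range(10)]

# Partition key -> {bucket_c: {item_id: size}} (made-up sample data).
table = {
    ("container1", "a0", "b0"): {"c1": {"item1": 100, "item2": 80}},
    ("container1", "a0", "b1"): {"c4": {"item11": 11}},
    ("container2", "a1", "b0"): {"c2": {"item3": 500}},
}

rows, queries = enumerate_container(table, "container1", BUCKET_A, BUCKET_B)
print(queries, len(rows))  # 500 2
```

Most of the 500 lookups return nothing, which matches the trade-off on the slide: many cheap misses in exchange for partitions that stay small.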

Thank You!

Paul Brown
19:40 - Paul Brown "The view from the bottom of the cliff!"
I am currently a development manager building retail POS and booking software for Neill technologies. I've been working with databases since Oracle v6, and since then have done training, consultancy, development, analysis and most things in between. I'm currently 3 months into my NoSQL adventure.

How to Learn Cassandra?
1. Possibly Don't
2. Solve a real problem
3. Seek simplicity

Have a look at:
• Patrick McFadin Data Modelling Webinars (3 part)
• Cassandra Summit sessions on YouTube
• DataStax Docs

How to Teach Cassandra?
1. Avoid "sit and watch"
2. Adults are goal-centered learners
3. Never explain What or How before you've explained Why
4. "If you don't know this then..."
