Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Billion Records from SQL to Cassandra, lessons learned

Billion Records from SQL to Cassandra, lessons learned

30min talk at the #CassandraSummit London 2014 with @BriceDutheil and @DoanDuyHai

At Libon, we process a lot of data. A lot means billions of messages/calls and billions of backed up contacts data. All the messaging backup is done in Cassandra with read & write latency below 10ms. Since we are happy with the level of performance achieved with Cassandra, we decided to migrate the contacts backup as well into Cassandra. During this session, we'll see how to tackle a migration project, from the SQL and Hibernate couple to Cassandra and Achilles. We'll highlight some mistakes to avoid and some tricks to make the migration smoother and more resilient.

Presentation video : https://youtu.be/U6oa3Tsdtp4

Brice Dutheil

December 04, 2014
Tweet

More Decks by Brice Dutheil

Other Decks in Programming

Transcript

  1. Billion Records from SQL to Cassandra, lessons learned
    DuyHai Doan Brice Dutheil

    View Slide

  2. #CassandraSummit @doanduyhai @BriceDutheil
    Who are we ?
    Brice Dutheil
    Mockito
    Java Track Lead @ Devoxx France
    Independant contractor @ Libon
    (Orange-Vallée)
    DuyHai Doan
    Achilles
    Cassandra Technical Advocate
    Former Java Developer @ Libon
    2

    View Slide

  3. #CassandraSummit @doanduyhai @BriceDutheil
    Agenda
    •  Libon context
    •  Migration strategy
    •  Business code migration
    •  Data Modeling
    •  Take Away
    3

    View Slide

  4. #CassandraSummit @doanduyhai @BriceDutheil
    Libon Context

    View Slide

  5. #CassandraSummit @doanduyhai @BriceDutheil
    What is Libon ?
    •  Messaging app
    •  VOIP (out)
    •  Custom voicemail & greetings
    •  SMS/chat/file transfer
    •  Contacts matching
    5

    View Slide

  6. #CassandraSummit @doanduyhai @BriceDutheil
    Contact Matching
    6
    Libon User

    View Slide

  7. #CassandraSummit @doanduyhai @BriceDutheil
    Contact Matching
    7
    Libon User Friend

    View Slide

  8. #CassandraSummit @doanduyhai @BriceDutheil
    Contact Matching
    8
    Libon User Friend
    Contact matching

    View Slide

  9. #CassandraSummit @doanduyhai @BriceDutheil
    Contact Matching
    9
    Libon User Friend
    Accept link

    View Slide

  10. #CassandraSummit @doanduyhai @BriceDutheil
    Project Context
    •  Application grew over the years
    10

    View Slide

  11. #CassandraSummit @doanduyhai @BriceDutheil
    Project Context
    •  Application grew over the years
    •  Already using Cassandra to handle events
    •  messaging / file sharing / SMS / notifications
    •  Cassandra R/W latencies ≈ 0,4 ms
    •  server response time under 10 ms
    11

    View Slide

  12. #CassandraSummit @doanduyhai @BriceDutheil
    Project Context
    •  About contacts …
    12

    View Slide

  13. #CassandraSummit @doanduyhai @BriceDutheil
    Project Context
    •  About contacts …
    •  stored as relational model in RDBMS (Oracle)
    13

    View Slide

  14. #CassandraSummit @doanduyhai @BriceDutheil
    Project Context
    •  About contacts …
    •  stored as relational model in RDBMS (Oracle)
    •  1 user ≈ 300 contacts
    14

    View Slide

  15. #CassandraSummit @doanduyhai @BriceDutheil
    Project Context
    •  About contacts …
    •  stored as relational model in RDBMS (Oracle)
    •  1 user ≈ 300 contacts
    •  with millions users ‛ billions of contacts to handle
    15

    View Slide

  16. #CassandraSummit @doanduyhai @BriceDutheil
    Project Context
    •  About contacts …
    •  stored as relational model in RDBMS (Oracle)
    •  1 user ≈ 300 contacts
    •  with millions users ‛ billions of contacts to handle
    •  query latency unpredictable
    16

    View Slide

  17. #CassandraSummit @doanduyhai @BriceDutheil
    17

    View Slide

  18. #CassandraSummit @doanduyhai @BriceDutheil
    Fixing the problem
    •  Tune the RDBMS
    18

    View Slide

  19. #CassandraSummit @doanduyhai @BriceDutheil
    Fixing the problem
    •  Tune the RDBMS
    •  indices
    19

    View Slide

  20. #CassandraSummit @doanduyhai @BriceDutheil
    Fixing the problem
    •  Tune the RDBMS
    •  indices
    •  partitioning
    20

    View Slide

  21. #CassandraSummit @doanduyhai @BriceDutheil
    Fixing the problem
    •  Tune the RDBMS
    •  indices
    •  partitioning
    •  less joins, simplified relational model
    21

    View Slide

  22. #CassandraSummit @doanduyhai @BriceDutheil
    Fixing the problem
    •  Tune the RDBMS
    •  indices
    •  partitioning
    •  less joins, simplified relational model
    •  hardware capacity increased
    22

    View Slide

  23. #CassandraSummit @doanduyhai @BriceDutheil
    Fixing the problem
    •  Tune the RDBMS
    •  indices
    •  partitioning
    •  less joins, simplified relational model
    •  hardware capacity increased
    That worked
    23

    View Slide

  24. #CassandraSummit @doanduyhai @BriceDutheil
    Fixing the problem
    •  Tune the RDBMS
    •  indices
    •  partitioning
    •  less joins, simplified relational model
    •  hardware capacity increased
    That worked
    but …
    24

    View Slide

  25. #CassandraSummit @doanduyhai @BriceDutheil
    Back-end application
    RDBMS Cassandra
    25

    View Slide

  26. #CassandraSummit @doanduyhai @BriceDutheil
    Next Challenges
    •  High Availability (DB failure, site failure …)
    26

    View Slide

  27. #CassandraSummit @doanduyhai @BriceDutheil
    Next Challenges
    •  High Availability (DB failure, site failure …)
    •  Predictable performance at scale
    27

    View Slide

  28. #CassandraSummit @doanduyhai @BriceDutheil
    Next Challenges
    •  High Availability (DB failure, site failure …)
    •  Predictable performance at scale
    •  Going to multi data-centers
    28

    View Slide

  29. #CassandraSummit @doanduyhai @BriceDutheil
    Going for Cassandra
    •  Denormalize (if possible …)
    29

    View Slide

  30. #CassandraSummit @doanduyhai @BriceDutheil
    Going for Cassandra
    •  Denormalize (if possible …)
    •  Know your business ‛ know your queries
    30

    View Slide

  31. #CassandraSummit @doanduyhai @BriceDutheil
    Going for Cassandra
    •  Denormalize (if possible …)
    •  Know your business ‛ know your queries
    •  Linear scaling out
    31

    View Slide

  32. #CassandraSummit @doanduyhai @BriceDutheil
    Going for Cassandra
    •  Denormalize (if possible …)
    •  Know your business ‛ know your queries
    •  Linear scaling out
    •  Consistent performance
    32

    View Slide

  33. #CassandraSummit @doanduyhai @BriceDutheil
    Data Migration Strategy

    View Slide

  34. #CassandraSummit @doanduyhai @BriceDutheil
    Objectives
    •  No downtime
    34

    View Slide

  35. #CassandraSummit @doanduyhai @BriceDutheil
    Objectives
    •  No downtime
    •  No concurrency corner-cases
    35

    View Slide

  36. #CassandraSummit @doanduyhai @BriceDutheil
    Objectives
    •  No downtime
    •  No concurrency corner-cases
    •  Safe rollback possible
    36

    View Slide

  37. #CassandraSummit @doanduyhai @BriceDutheil
    Objectives
    •  No downtime
    •  No concurrency corner-cases
    •  Safe rollback possible
    •  Replay-ability & resume-ability
    37

    View Slide

  38. #CassandraSummit @doanduyhai @BriceDutheil
    Strategy
    •  3 phases
    38

    View Slide

  39. #CassandraSummit @doanduyhai @BriceDutheil
    Strategy
    •  3 phases
    •  Write contacts to both data stores
    39

    View Slide

  40. #CassandraSummit @doanduyhai @BriceDutheil
    Strategy
    •  3 phases
    •  Write contacts to both data stores
    •  Old contacts migration
    40

    View Slide

  41. #CassandraSummit @doanduyhai @BriceDutheil
    Strategy
    •  3 phases
    •  Write contacts to both data stores
    •  Old contacts migration
    •  Switch to Cassandra …
    •  … and deprecate SQL
    41

    View Slide

  42. #CassandraSummit @doanduyhai @BriceDutheil
    Migration Phase 1
    Back end server
    ·
    ·
    ·
    SQL
    SQL
    SQL
    C*
    C*
    C*
    C*
    C*
    Write
    contactUUID
    42
    contactId … contactUUID
    129363 123e4567-
    e89b-12d3…
    834849
    contacId(long) + contactUUID

    View Slide

  43. #CassandraSummit @doanduyhai @BriceDutheil
    Migration Phase 1
    Back end server
    ·
    ·
    ·
    SQL
    SQL
    SQL
    C*
    C*
    C*
    C*
    C*
    Read
    43

    View Slide

  44. #CassandraSummit @doanduyhai @BriceDutheil
    Migration Phase 2
    SQL
    SQL
    SQL
    C*
    C*
    C*
    C*
    C*
    For each batch of users
    SELECT * FROM contacts
    WHERE user_id = …
    AND contact_uuid IS NULL
    •  On live production, migrate old contacts
    44
    Old contacts created
    before phase 1

    View Slide

  45. #CassandraSummit @doanduyhai @BriceDutheil
    Migration Phase 2
    SQL
    SQL
    SQL
    C*
    C*
    C*
    C*
    C*
    For each batch of users
    SELECT * FROM contacts
    WHERE user_id = …
    AND contact_uuid IS NULL
    Logged batches of
    INSERT INTO contacts(..)
    VALUES(…)
    USING TIMESTAMP
    now() - 1 week
    •  On live production, migrate old contacts
    45
    Old contacts created
    before phase 1

    View Slide

  46. #CassandraSummit @doanduyhai @BriceDutheil
    Migration Phase 2
    USING TIMESTAMP now() - 1 week

    46

    View Slide

  47. #CassandraSummit @doanduyhai @BriceDutheil
    Migration Phase 2
    •  During data migration …
    47

    View Slide

  48. #CassandraSummit @doanduyhai @BriceDutheil
    Migration Phase 2
    •  During data migration …
    •  … concurrent writes from the migration batch …
    48

    View Slide

  49. #CassandraSummit @doanduyhai @BriceDutheil
    Migration Phase 2
    •  During data migration …
    •  … concurrent writes from the migration batch …
    •  … and updates from production for the same contact
    49

    View Slide

  50. #CassandraSummit @doanduyhai @BriceDutheil
    Migration Phase 2
    contact_uuid
    name (now -1 week)
    … name (now)

    Johny … Johnny …
    Insert from batch
    (to the past)
    Update from production
    50

    View Slide

  51. #CassandraSummit @doanduyhai @BriceDutheil
    Migration Phase 2
    contact_uuid
    name (now -1 week)
    … name (now)

    Johny … Johnny …
    Future reads pick the most up-to-date value
    51

    View Slide

  52. #CassandraSummit @doanduyhai @BriceDutheil
    Migration Phase 2
    "Write to the Past…
    to save the Future"
    Libon – 2014/10/08
    52

    View Slide

  53. #CassandraSummit @doanduyhai @BriceDutheil
    Migration Phase 3
    Back end server
    ·
    ·
    ·
    SQL
    SQL
    SQL
    C*
    C*
    C*
    C*
    C*
    Write

    53

    View Slide

  54. #CassandraSummit @doanduyhai @BriceDutheil
    Business Code Refactoring

    View Slide

  55. #CassandraSummit @doanduyhai @BriceDutheil
    Code Inventory
    •  Written for RDBMS
    55

    View Slide

  56. #CassandraSummit @doanduyhai @BriceDutheil
    Code Inventory
    •  Written for RDBMS
    •  Lots of joins (no surprise)
    56

    View Slide

  57. #CassandraSummit @doanduyhai @BriceDutheil
    Code Inventory
    •  Written for RDBMS
    •  Lots of joins (no surprise)
    •  Designed around transactions
    57

    View Slide

  58. #CassandraSummit @doanduyhai @BriceDutheil
    Code Inventory
    •  Written for RDBMS
    •  Lots of joins (no surprise)
    •  Designed around transactions
    •  Spring @Transactional everywhere
    58

    View Slide

  59. #CassandraSummit @doanduyhai @BriceDutheil
    Code Inventory cont.
    •  Entities go through Services & Repositories
    59
    Repositories
    Services
    ContactEntity

    View Slide

  60. #CassandraSummit @doanduyhai @BriceDutheil
    Code Inventory cont.
    •  Hibernate is auto-magic
    60

    View Slide

  61. #CassandraSummit @doanduyhai @BriceDutheil
    Code Inventory cont.
    •  Hibernate is auto-magic
    •  lazy loading
    •  1st level cache
    •  N+1 select
    61
    Repositories
    Services
    ContactEntity

    View Slide

  62. #CassandraSummit @doanduyhai @BriceDutheil
    Which options ?
    •  Throw existing code …
    •  … and re-design from scratch for Cassandra
    62

    View Slide

  63. #CassandraSummit @doanduyhai @BriceDutheil
    Which options ?
    •  Throw existing code …
    •  … and re-design from scratch for Cassandra
    No way !
    63

    View Slide

  64. #CassandraSummit @doanduyhai @BriceDutheil
    Code Quality
    •  Existing business code has…
    •  … ≈ 3500 unit tests
    64

    View Slide

  65. #CassandraSummit @doanduyhai @BriceDutheil
    Code Quality
    •  Existing business code has…
    •  … ≈ 3500 unit tests
    •  and ≈600+ integration tests
    65

    View Slide

  66. #CassandraSummit @doanduyhai @BriceDutheil
    Code Quality
    •  We are TDD aficionados …
    66

    View Slide

  67. #CassandraSummit @doanduyhai @BriceDutheil
    Code Quality
    •  We are TDD aficionados …
    •  … and we love our code coverage
    67

    View Slide

  68. #CassandraSummit @doanduyhai @BriceDutheil
    Code Quality
    "The code coverage
    is one of your most
    valuable technical asset"
    Libon – since beginning
    68

    View Slide

  69. #CassandraSummit @doanduyhai @BriceDutheil
    Repositories
    Services
    Refactoring Strategy
    ContactMatchingService
    ContactService
    ContactSync
    ContactEntity
    n
    1
    n n
    69

    View Slide

  70. #CassandraSummit @doanduyhai @BriceDutheil
    Repositories
    Services
    Refactoring Strategy
    ContactMatchingService
    ContactService
    ContactNoSQLEntity
    ContactSync
    ContactEntity
    n
    1
    n n
    70
    Proxy

    View Slide

  71. #CassandraSummit @doanduyhai @BriceDutheil
    Repositories
    Services
    Refactoring Strategy
    ContactMatchingService
    ContactService
    ContactNoSQLEntity
    ContactSync
    ContactEntity
    n
    1
    n n
    Denorm2
    … DenormN
    Denorm1
    71
    Proxy

    View Slide

  72. #CassandraSummit @doanduyhai @BriceDutheil
    Refactoring Strategy
    •  Use CQRS
    •  ContactReadRepository
    •  ContactWriteRepository
    •  ContactUpdateRepository
    •  ContactDeleteRepository
    72

    View Slide

  73. #CassandraSummit @doanduyhai @BriceDutheil
    Refactoring Strategy
    •  ContactReadRepository
    •  direct sequential read
    •  no joins
    •  1 read ≈ 1 SELECT
    73

    View Slide

  74. #CassandraSummit @doanduyhai @BriceDutheil
    Refactoring Strategy
    •  ContactWriteRepository
    •  write to all denormalized tables
    •  using CQL logged batches
    •  use TTLs
    74

    View Slide

  75. #CassandraSummit @doanduyhai @BriceDutheil
    Refactoring Strategy
    •  ContactUpdateRepository
    •  read-before-write most of the time
    •  rare updates ‛ acceptable perf penalty
    75

    View Slide

  76. #CassandraSummit @doanduyhai @BriceDutheil
    Refactoring Strategy
    •  ContactDeleteRepository
    •  delete
    •  update contact modification date
    76

    View Slide

  77. #CassandraSummit @doanduyhai @BriceDutheil
    Outcome
    •  5 months of 2 men work
    77

    View Slide

  78. #CassandraSummit @doanduyhai @BriceDutheil
    Outcome
    •  5 months of 2 men work
    •  Many iterations to fix bugs (thanks to IT)
    78

    View Slide

  79. #CassandraSummit @doanduyhai @BriceDutheil
    Outcome
    •  5 months of 2 men work
    •  Many iterations to fix bugs (thanks to IT)
    •  Lots of performance benchmarks using Gatling
    79

    View Slide

  80. #CassandraSummit @doanduyhai @BriceDutheil
    Gatling Output
    80

    View Slide

  81. #CassandraSummit @doanduyhai @BriceDutheil
    Outcome
    •  5 months of 2 men work
    •  Many iterations to fix bugs (thanks to IT)
    •  Lots of performance benchmarks using Gatling
    ‛ data model & code validation
    81

    View Slide

  82. #CassandraSummit @doanduyhai @BriceDutheil
    Outcome
    •  5 months of 2 men work
    •  Many iterations to fix bugs (thanks to IT)
    •  Lots of performance benchmarks using Gatling
    ‛ data model & code validation
    •  … we are almost there for production
    82

    View Slide

  83. #CassandraSummit @doanduyhai @BriceDutheil
    Data Model

    View Slide

  84. #CassandraSummit @doanduyhai @BriceDutheil
    Denormalization, the good
    •  Support fast reads
    •  1 read ≈ 1 SELECT
    •  Worthy because mostly read, few updates
    84

    View Slide

  85. #CassandraSummit @doanduyhai @BriceDutheil
    Denormalization, the bad
    •  Updating mutable data can be nightmare
    •  Data model bound by existing client-facing API
    •  Update paths very error-prone without tests
    85

    View Slide

  86. #CassandraSummit @doanduyhai @BriceDutheil
    Data model in detail
    Contacts_by_id
    Contacts_by_identifiers
    Contacts_in_profiles
    Contacts_by_modification_date
    Contacts_by_firstname_lastname
    Contacts_linked_user
    86

    View Slide

  87. #CassandraSummit @doanduyhai @BriceDutheil
    Data model in detail
    Contacts_by_id
    Contacts_by_identifiers
    Contacts_in_profiles
    Contacts_by_modification_date
    Contacts_by_firstname_lastname
    Contacts_linked_user
    87
    user_id always component
    of partition key

    View Slide

  88. #CassandraSummit @doanduyhai @BriceDutheil
    Scalable design
    88
    n1
    n2
    n3
    n4
    n5
    n6
    n7
    n8
    A
    B
    C
    D
    E
    F
    G
    H
    user_id1
    user_id2
    user_id3
    user_id4
    user_id5

    View Slide

  89. #CassandraSummit @doanduyhai @BriceDutheil
    Scalable design
    89
    n1
    n2
    n3
    n4
    n5
    n6
    n7
    n8
    A
    B
    C
    D
    E
    F
    G
    H
    user_id1
    user_id2
    user_id3
    user_id4
    user_id5

    View Slide

  90. #CassandraSummit @doanduyhai @BriceDutheil
    Bloom filters in action
    90
    •  For some tables, partition key = (user_id, contact_id)
    ‛ fast look-up, leverages Bloom filters
    ‛ touches 1 SSTable most of the time

    View Slide

  91. #CassandraSummit @doanduyhai @BriceDutheil
    Data model in detail
    Contacts_by_id
    Contacts_by_identifiers
    Contacts_in_profiles
    Contacts_by_modification_date
    Contacts_by_firstname_lastname
    Contacts_linked_user
    91
    Wide partition
    Bucketed

    View Slide

  92. #CassandraSummit @doanduyhai @BriceDutheil
    A "queue" story
    92
    •  contacts_by_modification_date
    •  queue-like pattern

    View Slide

  93. #CassandraSummit @doanduyhai @BriceDutheil
    A "queue" story
    93
    •  contacts_by_modification_date
    •  queue-like pattern
    ‛ buckets to the rescue
    user_id:2014-12
    date35
    date12
    … … date47
    … … … …
    user_id:2014-11
    date11
    date12
    … … date34
    … … … …

    View Slide

  94. #CassandraSummit @doanduyhai @BriceDutheil
    Data model summary
    •  7 tables for denormalization
    94

    View Slide

  95. #CassandraSummit @doanduyhai @BriceDutheil
    Data model summary
    •  7 tables for denormalization
    •  Normalize some tables because rare access
    95

    View Slide

  96. #CassandraSummit @doanduyhai @BriceDutheil
    Data model summary
    •  7 tables for denormalization
    •  Normalize some tables because rare access
    •  Read-before write in most update scenarios
    96

    View Slide

  97. #CassandraSummit @doanduyhai @BriceDutheil
    Notes on contact_id
    •  In SQL, auto-generated long using sequence
    •  In Cassandra, auto-generated timeuuid
    97

    View Slide

  98. #CassandraSummit @doanduyhai @BriceDutheil
    Notes on contact_id
    •  How to store both types ?
    98

    View Slide

  99. #CassandraSummit @doanduyhai @BriceDutheil
    Notes on contact_id
    •  How to store both types ?
    •  As text ? ‛ easy solution …
    99

    View Slide

  100. #CassandraSummit @doanduyhai @BriceDutheil
    Notes on contact_id
    •  How to store both types ?
    •  As text ? ‛ easy solution …
    •  … but waste of space !
    •  because encoded as UTF-8 or ASCII in Cassandra
    100

    View Slide

  101. #CassandraSummit @doanduyhai @BriceDutheil
    Notes on contact_id
    •  Long ‛ 8 bytes
    •  Long as text(UTF-8: 1 byte)
    ‛ "digits count" bytes
    101

    View Slide

  102. #CassandraSummit @doanduyhai @BriceDutheil
    Notes on contact_id
    •  UUID ‛ 16 bytes
    •  32 hex chars + 4 hyphens = 36 chars
    •  UUID as text(UTF-8: 1 byte)
    ‛ 36 bytes
    •  Bytes overhead = 36 – 16 = 20 bytes
    102

    View Slide

  103. #CassandraSummit @doanduyhai @BriceDutheil
    Notes on contact_id
    •  20 bytes wasted per contact uuid
    103

    View Slide

  104. #CassandraSummit @doanduyhai @BriceDutheil
    Notes on contact_id
    •  20 bytes wasted per contact uuid
    •  × 7 denormalizations = 140 bytes per contact uuid
    104

    View Slide

  105. #CassandraSummit @doanduyhai @BriceDutheil
    Notes on contact_id
    •  20 bytes wasted per contact uuid
    •  × 7 denormalizations = 140 bytes per contact uuid
    •  × 109 contacts = 140 GB wasted

    105
    not even counting replication factor …

    View Slide

  106. #CassandraSummit @doanduyhai @BriceDutheil
    Notes on contact_id
    •  ‛ just save contact id as byte[ ]
    106

    View Slide

  107. #CassandraSummit @doanduyhai @BriceDutheil
    Notes on contact_id
    •  ‛ just save contact id as byte[ ]
    •  Achilles @TypeTransformer for automatic conversion
    (see later)
    107

    View Slide

  108. #CassandraSummit @doanduyhai @BriceDutheil
    Notes on contact_id
    •  ‛ just save contact id as byte[ ]
    •  Achilles @TypeTransformer for automatic conversion
    (see later)
    •  Use blobAsBigInt( ) or blobAsUUID( ) to view data
    108

    View Slide

  109. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    •  Advanced "object mapper"
    •  Fluent API
    •  Tons of features
    •  TDD friendly
    109

    View Slide

  110. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    •  Dirty checking, why is it important ?
    110

    View Slide

  111. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    •  Dirty checking, why is it important ?
    •  1 contact ≈ 8 mutable fields
    111

    View Slide

  112. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    •  Dirty checking, why is it important ?
    •  1 contact ≈ 8 mutable fields
    •  × 7 denormalizations = 56 update combinations …
    112

    View Slide

  113. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    •  Dirty checking, why is it important ?
    •  1 contact ≈ 8 mutable fields
    •  × 7 denormalizations = 56 update combinations …
    •  and not even counting multiple fields updates …
    113

    View Slide

  114. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    •  Are you going to manually generate 56+ prepared
    statements for all possible updates ?
    114

    View Slide

  115. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    •  Are you going to manually generate 56+ prepared
    statements for all possible updates ?
    •  Or just use dynamic plain string statements and get
    some perf penalty ?
    115

    View Slide

  116. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    •  Dirty check in action
    //No read-before-write
    ContactEntity proxy = manager.forUpdate(ContactEntity.class, contactId);
    proxy.setFirstName(…);
    proxy.setLastName(…); //type-safe updates
    proxy.setAddress(…);
    manager.update(proxy);
    116

    View Slide

  117. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    117
    Empty
    Entity
    DirtyMap
    Proxy Setters interception
    PrimaryKey

    View Slide

  118. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    •  Dynamic statements generation
    UPDATE contacts SET firstname=?, lastname=?,address=?
    WHERE contact_id=?
    118
    prepared statements are cached, of course

    View Slide

  119. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    •  Insert strategy, what is it ?
    119

    View Slide

  120. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    •  Simple INSERT prepared statement
    INSERT INTO
    contacts(contact_id,name,age,address,gender,avatar,…)
    VALUES(?, ?, ?, ? … ?);
    120

    View Slide

  121. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    •  Runtime values binding
    •  some columns are optional
    preparedStatement.bind(49374,’John DOE’,33, null, null, …, null);
    121

    View Slide

  122. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    Wait … are you saying inserting null in CQL???

    122

    View Slide

  123. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    Inserting null 㲇 creating tombstones
    123

    View Slide

  124. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    Inserting null 㲇 creating tombstones
    × 7 denormalizations
    124

    View Slide

  125. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    Inserting null 㲇 creating tombstones
    × 7 denormalizations
    × billions of contacts created

    125
    not even counting replication factor …

    View Slide

  126. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    @Entity(table = "contacts_by_id »)
    @Strategy(insert = InsertStrategy.NOT_NULL_FIELDS)
    public class ContactById {
    }
    126
    •  Simple annotation

    View Slide

  127. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    •  Runtime dynamic INSERT statement
    INSERT INTO
    contacts(contact_id, name, age, address,)
    VALUES(:contact_id, :name, :age, :address);
    127
    prepared statements are cached, of course

    View Slide

  128. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    @PartitionKey
    @Column(name = "contact_id")
    @TypeTransformer(valueCodecClass = ContactIdToBytes.class)
    private ContactId contactId;
    128
    •  Remember the contactId ⁶ byte[ ] conversion ?
    BYOC ‛ Bring Your Own Codec

    View Slide

  129. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    public interface Codec {
    Class sourceType();
    Class targetType();
    TO encode(FROM fromJava)
    FROM decode(TO fromCassandra);
    }
    129

    View Slide

  130. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    130
    2014-12-01 14:25:20,554 Bound statement : [INSERT INTO
    contacts.contacts_by_modification_date(user_id,month_bucket,modification_date,...) VALUES
    (:user_id,:month_bucket,:modification_date,...) USING TTL :ttl;] with CONSISTENCY LEVEL [LOCAL_QUORUM]
    2014-12-01 14:25:20,554 bound values : [222130151, 2014-12, e13d0d50-7965-11e4-af38-90b11c2549e0, ...]
    2014-12-01 14:25:20,701 Bound statement : [SELECT birthday,middlename,avatar_size,... FROM
    contacts.contacts_by_modification_date WHERE user_id=:user_id AND month_bucket=:month_bucket AND
    (modification_date)>=(:modification_date) ORDER BY modification_date ASC;] with CONSISTENCY LEVEL
    [LOCAL_QUORUM]
    2014-12-01 14:25:20,701 bound values : [222130151, 2014-10, be6bc010-6109-11e4-b385-000038377ead]
    •  Dynamic logging in action

    View Slide

  131. #CassandraSummit @doanduyhai @BriceDutheil
    Achilles
    131
    •  Dynamic logging
    •  runtime activation
    •  no need to recompile/re-deploy
    •  save us hours of debugging
    •  TRACE log level ‛ query tracing

    View Slide

  132. #CassandraSummit @doanduyhai @BriceDutheil
    Take Away

    View Slide

  133. #CassandraSummit @doanduyhai @BriceDutheil
    Conditions for success
    •  Data modeling is crucial
    133

    View Slide

  134. #CassandraSummit @doanduyhai @BriceDutheil
    Conditions for success
    •  Data modeling is crucial
    •  Double-run strategy & timestamp trick FTW
    134

    View Slide

  135. #CassandraSummit @doanduyhai @BriceDutheil
    Conditions for success
    •  Data modeling is crucial
    •  Double-run strategy & timestamp trick FTW
    •  Data type conversion can be tricky
    135

    View Slide

  136. #CassandraSummit @doanduyhai @BriceDutheil
    Conditions for success
    •  Data modeling is crucial
    •  Double-run strategy & timestamp trick FTW
    •  Data type conversion can be tricky
    •  Benchmark !
    136

    View Slide

  137. #CassandraSummit @doanduyhai @BriceDutheil
    Conditions for success
    •  Data modeling is crucial
    •  Double-run strategy & timestamp trick FTW
    •  Data type conversion can be tricky
    •  Benchmark !
    •  Mindset shifts for the team
    137

    View Slide

  138. Thank You

    View Slide