Getting data out of databases: a surprisingly tricky problem

Slides from my talk at All Your Base, London, UK, 13 Nov 2015.
http://martin.kleppmann.com/2015/11/13/change-data-capture-at-all-your-base.html
http://allyourbaseconf.com/2015/speakers#martin-kleppmann

Abstract:

Writing to a database is easy, but getting the data out again is surprisingly hard.

Of course, if you just want to query the database and get some results, that’s fine. But what if you want a copy of your database contents in some other system — for example, to make it searchable in Elasticsearch, or to pre-fill caches so that they’re nice and fast, or to load it into a data warehouse for analytics, or if you want to migrate to a different database technology?

As the data is constantly changing, a one-off snapshot of the database is not enough: you need to tap into the ongoing stream of writes to the database. This technique is called Change Data Capture (CDC). At companies like LinkedIn and Facebook, this is how caches and indexes are kept up-to-date.
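
The idea of combining a snapshot with the ongoing stream of writes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ChangeEvent` format, the `apply_changes` helper, and the use of a plain dictionary as the "search index" are all hypothetical and are not taken from the talk or from any particular CDC tool.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One committed write, in the order the database applied it (hypothetical format)."""
    offset: int            # position of this write in the change stream
    key: str               # primary key of the affected row
    value: Optional[str]   # new row contents, or None for a delete


def apply_changes(replica: Dict[str, str],
                  events: Iterable[ChangeEvent],
                  last_offset: int = -1) -> int:
    """Apply every event newer than last_offset to the replica; return the new offset."""
    for event in events:
        if event.offset <= last_offset:
            continue  # already reflected in the replica, so replaying is harmless
        if event.value is None:
            replica.pop(event.key, None)
        else:
            replica[event.key] = event.value
        last_offset = event.offset
    return last_offset


# A one-off snapshot of the database, taken when the stream was at offset 1.
snapshot = {"user:1": "alice", "user:2": "bob"}
snapshot_offset = 1

# The stream of writes, including the two already contained in the snapshot.
stream = [
    ChangeEvent(0, "user:1", "alice"),
    ChangeEvent(1, "user:2", "bob"),
    ChangeEvent(2, "user:1", "alice cooper"),
    ChangeEvent(3, "user:2", None),
    ChangeEvent(4, "user:3", "carol"),
]

# The derived system starts from the snapshot and then follows the stream.
search_index = dict(snapshot)
offset = apply_changes(search_index, stream, snapshot_offset)
print(search_index, offset)
```

Recording the offset alongside the snapshot is the design choice that matters here: the consumer can resume from a known position, and events it has already seen are skipped rather than applied twice.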

This talk explains why change data capture is so useful, and how it prevents race conditions and other ugly problems. Martin will explore the practical details of implementing CDC with PostgreSQL and Apache Kafka, and discuss the approaches you can use to do the same with various other databases.
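
One kind of race condition that change data capture is commonly used to avoid arises when application code writes to the database and to a second system separately. The sketch below simulates that case in plain Python; the function names `dual_writes` and `change_capture` and the single-key workload are assumptions made for illustration, and no real PostgreSQL or Kafka API is involved.

```python
from typing import Dict, List, Tuple

Write = Tuple[str, str]  # (key, value); a hypothetical, simplified write


def dual_writes(db_order: List[Write],
                cache_order: List[Write]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Every client writes to both systems, so each system may see its own arrival order."""
    db: Dict[str, str] = {}
    cache: Dict[str, str] = {}
    for key, value in db_order:
        db[key] = value
    for key, value in cache_order:
        cache[key] = value
    return db, cache


def change_capture(db_order: List[Write]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Clients write only to the database; the cache replays the database's change log."""
    db: Dict[str, str] = {}
    change_log: List[Write] = []
    for key, value in db_order:
        db[key] = value
        change_log.append((key, value))  # log order equals the database's commit order
    cache: Dict[str, str] = {}
    for key, value in change_log:
        cache[key] = value
    return db, cache


write_a = ("x", "A")  # issued by client 1
write_b = ("x", "B")  # issued by client 2

# Dual writes: the database sees A then B, but the cache happens to see B then A.
db1, cache1 = dual_writes([write_a, write_b], [write_b, write_a])
print(db1, cache1)  # the two systems end up with different values for x

# Change capture: the cache applies writes in the order the database committed them.
db2, cache2 = change_capture([write_a, write_b])
print(db2, cache2)  # both systems agree on x
```

In the dual-write case neither system reports an error, so the disagreement persists until something overwrites the key. When the cache consumes the database's own ordered log, there is only one ordering of the writes, and the two systems converge.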

References:

1. Martin Kleppmann: “Bottled Water: Real-time integration of PostgreSQL and Kafka.” 23 April 2015. http://blog.confluent.io/2015/04/23/bottled-water-real-time-integration-of-postgresql-and-kafka/

2. Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “All Aboard the Databus!,” at ACM Symposium on Cloud Computing (SoCC), October 2012. http://www.socc2012.org/s18-das.pdf

3. Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, et al.: “Wormhole: Reliable Pub-Sub to Support Geo-replicated Internet Services,” at 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), May 2015. https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-sharma.pdf

4. Jay Kreps: “I ♥︎ Logs.” O'Reilly Media, September 2014. http://shop.oreilly.com/product/0636920034339.do

5. Martin Kleppmann: “Designing data-intensive applications.” O’Reilly Media, to appear. http://dataintensive.net

6. Martin Kleppmann: “Turning the database inside-out with Apache Samza.” 4 March 2015. http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/

7. Pat Helland: “Immutability Changes Everything,” at 7th Biennial Conference on Innovative Data Systems Research (CIDR), January 2015. http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf

Martin Kleppmann

November 13, 2015

Transcript

  24. 3rd ACM Symposium on Cloud Computing, San Jose, Oct 2012

    http://www.socc2012.org/s18-das.pdf
  69. Further reading

    1. Martin Kleppmann: “Bottled Water: Real-time integration of PostgreSQL and Kafka.” 23 April 2015. http://blog.confluent.io/2015/04/23/bottled-water-real-time-integration-of-postgresql-and-kafka/

    2. Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “All Aboard the Databus!,” at ACM Symposium on Cloud Computing (SoCC), October 2012. http://www.socc2012.org/s18-das.pdf

    3. Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, et al.: “Wormhole: Reliable Pub-Sub to Support Geo-replicated Internet Services,” at 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), May 2015. https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-sharma.pdf

    4. Jay Kreps: “I ♥︎ Logs.” O'Reilly Media, September 2014. http://shop.oreilly.com/product/0636920034339.do

    5. Martin Kleppmann: “Designing data-intensive applications.” O’Reilly Media, to appear. http://dataintensive.net

    6. Martin Kleppmann: “Turning the database inside-out with Apache Samza.” 4 March 2015. http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/

    7. Pat Helland: “Immutability Changes Everything,” at 7th Biennial Conference on Innovative Data Systems Research (CIDR), January 2015. http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf
  70. Free copies in the lunch break! Discount code: TS2015