Kafka, Samza, and the Unix philosophy of distributed data

Talk given at the UK Hadoop Meetup (HUGUK), London, UK on 5 August 2015. http://martin.kleppmann.com/2015/08/05/samza-unix-philosophy-at-huguk.html

Transcript: http://martinkl.com/unix

Abstract:

One of the big ideas in Unix was to allow small, simple command-line tools to be chained together with pipes. Each of those tools would do one thing and do it well. Even now, 40 years later, Unix tools are one of the most powerful ways of getting things done: a one-liner of grep | awk | sort | uniq is still one of the fastest ways of processing data and analysing logs.

Many modern data systems are monolithic, the very opposite of the Unix philosophy. But Apache Samza is different: it is, in some sense, an attempt to bring the Unix philosophy into 21st-century distributed systems. In this talk, we will explore the design decisions behind Samza, and see how the Unix philosophy can help us build modern systems that are robust, scalable and maintainable.

Martin Kleppmann

August 05, 2015

Transcript

  1. 216.58.210.78 - - [27/Feb/2015:17:55:11 +0000] "GET /css/typography.css HTTP/1.1" 200 3377 "http://martin.kleppmann.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36"
  2. cat access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -n 5

    216.58.210.78 - - [27/Feb/2015:17:55:11 +0000] "GET /css/typography.css HTTP/1.1" 200 3377 "http://martin.kleppmann.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36"
  3. cat access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -n 5

    /css/typography.css
    /index.html
    /favicon.ico
    /talks.html
    /favicon.ico
    /index.html
    /css/typography.css
    /favicon.ico
  4. cat access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -n 5

    /css/typography.css
    /css/typography.css
    /favicon.ico
    /favicon.ico
    /favicon.ico
    /index.html
    /index.html
    /talks.html
  5. cat access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -n 5

    2 /css/typography.css
    3 /favicon.ico
    2 /index.html
    1 /talks.html
  6. cat access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -n 5

    3 /favicon.ico
    2 /css/typography.css
    2 /index.html
    1 /talks.html
  7. cat access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -n 5

    3 /favicon.ico
    2 /css/typography.css
    2 /index.html
    1 /talks.html
  8. References

    • M D McIlroy, E N Pinson, and B A Tague: “UNIX Time-Sharing System: Foreword,” The Bell System Technical Journal, volume 57, number 6, pages 1899–1904, July 1978. https://archive.org/details/bstj57-6-1899
    • Rob Pike and Brian W Kernighan: “Program design in the UNIX environment,” AT&T Bell Laboratories Technical Journal, volume 63, number 8, pages 1595–1605, October 1984. doi: 10.1002/j.1538-7305.1984.tb00055.x, http://harmful.cat-v.org/cat-v/unix_prog_design.pdf
    • Dennis Ritchie: “Advice from Doug McIlroy.” http://cm.bell-labs.co/who/dmr/mdmpipe.html
    • Jay Kreps: “Putting Apache Kafka to use: A practical guide to building a stream data platform (part 2).” 24 February 2015. http://www.confluent.io/blog/stream-data-platform-2/
    • Jay Kreps: “I ♥︎ Logs.” O’Reilly Media, September 2014. http://shop.oreilly.com/product/0636920034339.do
    • Martin Kleppmann: “Bottled Water: Real-time integration of PostgreSQL and Kafka.” 23 April 2015. http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/
    • Martin Kleppmann: “Designing Data-Intensive Applications.” O’Reilly Media, to appear in 2015. http://dataintensive.net
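
The pipeline walked through in slides 2 to 7 can be tried without a real access.log. The following is a minimal sketch, not taken from the talk: the helper names make_sample_log and top_paths are hypothetical, and the generated log lines are shortened stand-ins that only keep the request path in the seventh whitespace-separated field, using the eight paths shown on slide 3.

```shell
#!/bin/sh
# Hypothetical sample data: eight requests whose paths match slide 3.
# Only field 7 (the request path) matters for this pipeline.
make_sample_log() {
  for path in /css/typography.css /index.html /favicon.ico /talks.html \
              /favicon.ico /index.html /css/typography.css /favicon.ico; do
    printf '216.58.210.78 - - [27/Feb/2015:17:55:11 +0000] "GET %s HTTP/1.1" 200 3377\n' "$path"
  done
}

# top_paths N: read an access log on stdin, print the N most requested paths.
# (Illustrative wrapper; the slides run the pipeline directly on access.log.)
top_paths() {
  awk '{print $7}' |   # extract the request path (7th field)
    sort |             # bring identical paths next to each other
    uniq -c |          # collapse adjacent duplicates, prefixing a count
    sort -rn |         # order numerically by count, highest first
    head -n "${1:-5}"  # keep the first N lines (default 5)
}

make_sample_log | top_paths 5
```

With this input, /favicon.ico should come first with a count of 3 and /talks.html last with a count of 1. The relative order of the two paths that tie at 2 depends on the sort implementation and locale, and uniq -c typically pads the counts with leading spaces, so the output may not match the slides character for character. Each stage only reads stdin and writes stdout, which is why any of them can be swapped out or inspected on its own by cutting the pipeline short.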