
Kontera: Applying Unix Philosophy Principles to Backend Architecture

Presented at DevConTLV, 2013-02-14.


Stas Krichevsky

February 14, 2013


Transcript

  1. Applying Unix Philosophy Principles to Backend Architecture Wednesday, February 20, 13

    There are several versions of the "Unix philosophy" (or "Unix design principles"). The one I chose was formulated by Mike Gancarz in 1994 (and published in the book "The UNIX Philosophy" in 1995).
  2. Applying (some) Unix Philosophy Principles to Backend Architecture

    We'll talk about a subset of the Unix philosophy principles: the ones most relevant for us.
  3. About Kontera

    In one sentence: "Kontera is a technology company that analyzes what people write and read about on the internet and uses it for various products, ranging from online advertisement to analytics."
  4. About Kontera

    • Over 400M API requests per day
    • Over 1B messages per day
    • ~1000 nodes in 4 geographic locations
    • Generate 100GB of data per day
  5. About Kontera (in terms of API calls)

    According to John Musser's presentation at GlueCon 2012, that puts us, in terms of API calls, somewhere between SalesForce and eBay. http://www.slideshare.net/jmusser/j-musser-apishotnotgluecon2012 (search for "API billionaires club 2012"). The reason I bring it up is not to brag; I want to show that we're dealing with some real and complex problems.
  6. Kontera backend

    • Services
    • Pipelines
    • Scheduled workers
    • Frontends

    Before we get to the Unix philosophy principles, I want to set up some context. We have four major types of applications in our backend: services, pipelines, scheduled workers, and frontends. Let's see what each type means.
  7. Services

    • Synchronous API (JSON over HTTP)
    • Scales horizontally
    • Accessible via load balancer (haproxy/nginx)

    Stateless applications with a synchronous API. Most are written in Clojure and Ruby (Sinatra), some in Scala, some in Java.
  8. Service

    Diagram: Service / Service / Model, JSON over HTTP. The model can be local files, a database, or a KV store.
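The deck doesn't show service code, so here is a minimal sketch of what such a stateless service handler might look like in Ruby. Everything here is illustrative: the `SENTIMENT_MODEL` lookup and the `classify_sentiment` function are made-up names, and the model in production would come from files, a database, or a KV store as the slide notes.

```ruby
require 'json'

# Hypothetical in-memory model; in a real service this could be backed
# by local files, a database, or a KV store.
SENTIMENT_MODEL = { "great" => :positive, "awful" => :negative }

# Pure request handler: JSON string in, JSON string out. Keeping it free
# of framework state makes it trivial to unit-test and to mount behind
# Sinatra, a load balancer, etc.
def classify_sentiment(json_request)
  content = JSON.parse(json_request).fetch("content", "")
  hits = content.downcase.split.map { |w| SENTIMENT_MODEL[w] }.compact
  label = hits.first || :neutral
  JSON.generate("sentiment" => label.to_s)
end
```

A Sinatra route would then be just plumbing, something like `post('/classify') { classify_sentiment(request.body.read) }`.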
  9. Example: the kontera-phrase service

    Input (the content we want to analyze plus the named-entity types we want to extract):

    {
      "content": "A key part of Apple's agency model was a \"most favored nation\" clause that prevented publishers from selling books to other retailers at prices lower than those offered to Apple. The clause was intended to prevent Amazon from striking deals to continue undercutting other retailers, but quickly drew criticism and the attention of regulators for potential price collusion effects.",
      "nf": [ "person", "date", "organization" ]
    }

    Output (excerpt for the 2nd sentence): sentences, noun phrases, and named entities plus metadata:

    { ...
      [[ 1,
         "The clause was intended to prevent Amazon from striking deals to continue undercutting other retailers, but quickly drew criticism and the attention of regulators for potential price collusion effects.",
         { "Amazon": [ { "pattern": [ "NNP" ], "type": [ "np", "organization" ], "span": [ 6, 7 ] } ],
           "striking deals": [ { "pattern": [ "JJ", "NNS" ], "type": [ "np" ], "span": [ 8, 10 ] } ],
           "other retailers": [ { "pattern": [ "JJ", "NNS" ], "type": [ "np" ], "span": [ 13, 15 ] } ],
           "potential price collusion effects": [ { "pattern": [ "JJ", "NN", "NN", "NNS" ], "type": [ "np" ], "span": [ 26, 30 ] } ] } ]] }
  10. Pipelines

    • Built from "pipeline workers" that talk to each other asynchronously
    • Distributed across the system

    Pipelines are built from tiny applications we call "pipeline workers". They talk to each other asynchronously via a message broker, using JSON messages. A pipeline worker is typically a single file, written in Ruby, using our home-grown framework. The pipeline is distributed across the system; each pipeline worker has multiple instances running on different machines.
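The home-grown worker framework isn't shown in the deck; the following is only a guess at the shape of a "transformer" worker, under the constraints the slide describes: consume one JSON message, do one job, emit a new JSON message with routing information. The class name, the `publisher` interface, and the routing key are all invented for the example.

```ruby
require 'json'

# Sketch of a transformer pipeline worker. In production the publisher
# would wrap the message broker connection; here it is anything that
# responds to #publish(routing_key, payload).
class ExtractPhrasesWorker
  def initialize(publisher)
    @publisher = publisher
  end

  def handle(raw_message)
    msg = JSON.parse(raw_message)
    # One thing, done (passably) well: pull out capitalized candidate
    # phrases. A real worker would call the phrase-extraction service.
    phrases = msg["content"].scan(/[A-Z][a-z]+(?: [A-Z][a-z]+)*/)
    @publisher.publish("tweets.phrases",
                       JSON.generate(msg.merge("phrases" => phrases)))
  end
end
```

Because the worker only sees its input message and a publisher, it can be tested with a fake publisher and replayed production messages, exactly the isolation property described later in the deck.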
  11. Pipeline example

    Diagram: Receive tweet → Extract phrases → Detect sentiment → Increment counts → KV store, with the phrase extractor and sentiment classifier services on the side.

    Here is a simplified pipeline example. This specific pipeline analyzes the Twitter stream (decahose). The goal is to be able to say in realtime what people think about certain products, in terms of positive/negative mentions. The pipeline consists of 4 workers: we receive a tweet, extract phrases from the text (using the phrase extractor service), detect the sentiment of the text (neutral/negative/positive, using the sentiment classifier service), and then increment the appropriate counts in a KV store. Blue: transformer workers. Green: aggregator workers. Yellow: services.
  17. Pipeline worker

    Diagram: Mailbox → Worker → Exchange (or KV store), with other mailboxes bound to the exchange.

    If we zoom in on a worker, we'll see the following. The worker receives a message and processes it. There can be any number of instances of the worker; they all share the same mailbox. Once processing is finished, the worker either stores results (in a KV store or database) or creates a new message with some routing information and sends it to an exchange. The worker doesn't care who sent the incoming message or who the recipients of the outgoing message are. A mailbox (queue) can be bound to multiple exchanges, and the binding configuration is external to the worker. This approach has a very nice side effect: since the worker doesn't have to know who sent it the message and doesn't know anything about recipients, it's much easier to test in isolation. We can configure only the relevant part of the pipeline on a local machine, record some real production data, and replay it locally by sending messages to the appropriate mailbox.
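To make the binding model concrete, here is a tiny in-memory stand-in for the broker: mailboxes (queues) are bound to exchanges, publishers see only the exchange, and consumers see only their own mailbox. The `Exchange` class is illustrative (a fanout-style broadcast), not the real broker API, but it shows why replaying recorded data and adding consumers are both trivial.

```ruby
# In-memory sketch of the exchange/mailbox binding model described above.
# Mailboxes are plain arrays; binding configuration lives outside workers.
class Exchange
  def initialize
    @mailboxes = []
  end

  def bind(mailbox)
    @mailboxes << mailbox
  end

  # Every bound mailbox gets a copy of the message (fanout-style).
  def publish(message)
    @mailboxes.each { |m| m << message }
  end
end

exchange = Exchange.new
sentiment_mailbox = []
exchange.bind(sentiment_mailbox)

# Replaying recorded production data for isolated testing is just
# publishing it again to the right exchange:
recorded = ['{"text":"tweet 1"}', '{"text":"tweet 2"}']
recorded.each { |msg| exchange.publish(msg) }
```

Note that a new consumer later is only a new `bind` call; no publisher or existing worker changes, which is exactly the property the fun/boring example below relies on.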
  18. Scheduled workers

    • Distributed "cron-like" tiny workers
    • Singletons
    • Used for: maintenance tasks, model updates for services, report generation

    Distributed singletons: only one instance can run at the same time. Examples: maintenance tasks (cleaning/archiving old data), service model updates, report generation.
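The deck doesn't say how the singleton property is enforced; one common way to sketch it on a single host is an advisory file lock, shown below. The function name and lock path are made up, and a real distributed deployment would need a shared coordination mechanism rather than a local file.

```ruby
# Sketch of the "singleton" property of a scheduled worker: an advisory
# file lock ensures only one instance runs at a time on this host.
def run_singleton(lock_path)
  File.open(lock_path, File::RDWR | File::CREAT) do |f|
    # Non-blocking exclusive lock: fails if another instance holds it.
    unless f.flock(File::LOCK_EX | File::LOCK_NB)
      return :already_running
    end
    yield  # the actual maintenance task / model update / report
    :done
  end
end
```

Usage: a cron entry invokes the worker every few minutes; overlapping invocations simply exit with `:already_running` instead of doing the work twice.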
  19. Frontends

    • Internal and external UI applications
    • External APIs
    • Diagnostic tools
    • Demo applications

    Examples for each type. Internal UI: devops, caretaker. External UI: content-dashboard, content-activation. Diagnostic tools: ??? Demo applications: superbowl page.
  20. Small is beautiful

    Choose and create tools that help make programs smaller. For example: choosing Clojure over Java (if we need the JVM or have to use some Java library), choosing Sinatra over Rails where appropriate, and picking smaller third-party libraries that have fewer dependencies.
  21. Small is beautiful

    • Less code ➙ fewer bugs
    • Easier to maintain
    • Easier to combine
    • Easier to replace
  22. Make each program do one thing well

    Well, I don't actually know whether this guy does what he does well. But I'm sure most people would not do it well.
  23. Make each program do one thing well

    • Focus during development
    • Simpler API
    • Fewer dependencies

    Doing one thing allows us to focus on it, which leads to faster development. A simpler API is easier to understand, test in isolation, and use in different contexts (for services it means I can create new workers that use the service; for workers it means I can re-use the worker itself in a different pipeline, or use the worker's output by binding a new mailbox to its exchange). Fewer dependencies make it easier to maintain, deploy, and scale.
  24. Build a prototype as soon as possible

    We want to create an environment that simplifies experimenting with new services and pipelines and makes it easy to adopt new technologies.
  25. Example

    Diagram: Receive tweet → Extract phrases → Detect sentiment → Increment counts → KV store, with the phrase extractor and sentiment classifier services, plus a frontend and a report generator.

    This is a more complete example. We have the pipeline (blue and green boxes), services (yellow boxes), a frontend (orange), and a scheduled worker (gray). Let's say we now want to add one more thing: we want to see whether people say (on Twitter) that certain entities are fun or boring.
  26. Example (continued)

    Diagram: Receive tweet → Extract phrases → Detect sentiment and fun/boring → Increment counts → KV store, now with a fun/boring classifier service alongside the phrase extractor and sentiment classifier, plus the frontend and report generator.

    So we'll add another service that can classify a tweet message into neutral/fun/boring classes. It's essentially the same code as the sentiment classifier, but it uses a different model. And we'll rename the "detect sentiment" worker to "detect sentiment and fun/boring". There is one problem: we're disrupting the existing pipeline. It also violates the "do one thing well" principle. There is a better way.
  30. Example (continued)

    Diagram: the original pipeline (Receive tweet → Extract phrases → Detect sentiment → Increment counts → KV store) untouched, with a new "Detect fun/boring" worker and its own "Increment counts" worker branching off, backed by the fun/boring classifier service, plus the frontend and report generator.

    We'll add a new worker whose mailbox is bound to the same exchange as the "detect sentiment" worker. Now we can change the frontend application whenever we want, and either change the report generator or add a new one to see the new data. We did it without disrupting the existing flow.
  32. Choose portability over efficiency

    This probably meant a different thing 20 years ago: it was more about being able to port your software to different hardware. Now it's more about software. We use JSON instead of binary formats for communication (talking to services, messages between workers), HTTP instead of proprietary TCP protocols, and compressed JSON files for archiving. Using text-based protocols lets us use standard Unix utilities like curl, tcpdump, and ngrep, which is very useful. Interoperability makes it easier to integrate with third parties. It's also easier to work with Amazon EMR, because we don't need to write our own deserializer.
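The "compressed JSON files for archiving" point can be sketched as a round trip over gzip-compressed, line-delimited JSON. The function names are invented; the point is that any consumer that speaks gzip and JSON (standard Unix tools, EMR jobs, ad-hoc scripts) can read the archive back without a custom deserializer.

```ruby
require 'json'
require 'zlib'
require 'stringio'

# Archive records as gzip-compressed, line-delimited JSON.
def archive(records)
  io = StringIO.new
  gz = Zlib::GzipWriter.new(io)
  records.each { |r| gz.puts(JSON.generate(r)) }
  gz.close
  io.string
end

# Restore the records: one JSON document per line.
def restore(bytes)
  Zlib::GzipReader.new(StringIO.new(bytes)).each_line.map { |l| JSON.parse(l) }
end
```

On disk the same format is readable with plain `zcat archive.json.gz | head`, which is the text-protocol debuggability the slide is arguing for.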
  33. Make every program a filter

    Services and pipelines are essentially filters; pipelines are filters built from other filters (workers). Keeping this in mind enables us to create new applications quickly by combining these components in new ways. It also enables us to extend and add new pipelines without disrupting existing flows.
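Taken literally, "every program a filter" means a worker body can be written as a pure stream transform, JSON lines in, JSON lines out. The filter below is a toy (the name and transformation are made up), but the shape is the point: wiring it to STDIN/STDOUT makes it a classic Unix filter, while wiring it to broker mailboxes makes it a pipeline worker.

```ruby
require 'json'

# A worker body as a pure stream transform: reads JSON lines from any
# IO-like input, uppercases the "content" field, writes JSON lines to
# any IO-like output. No knowledge of senders or recipients.
def uppercase_filter(input, output)
  input.each_line do |line|
    msg = JSON.parse(line)
    msg["content"] = msg["content"].upcase
    output.puts(JSON.generate(msg))
  end
end
```

As a command-line filter this is just `uppercase_filter($stdin, $stdout)`, so the same code is testable with in-memory IO and composable with shell pipes.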
  34. Summary

    Making small programs is not always easier: it makes you think harder, and you have to learn new things. It might look like we're trading one type of complexity for another. We have simpler components, but many of them, and more moving parts. If we want to remain sane, we have to standardize APIs, create conventions, centralize configuration, develop frameworks and tools, and use a CI tool (Jenkins in our case). We also have to automate a lot of things.
    Deployments: we developed a deployment framework and use it for all our applications. It can work with any build tool (assuming it can be scripted); we use Ant, SBT, Leiningen, and SCons. One tool from this framework is called Chuck (rumor has it that it was named after Chuck Norris because it isn't afraid of any build tool).
    Monitoring: we monitor a lot of things, from basic metrics like average load, CPU, memory, disk, and network utilization, to application-specific metrics like queue sizes, number of running worker instances, service response times, and error ratios. We have lots of automated alerts. Eran Levy gave a talk about our DevOps tools at the RailsIsrael conference last November; you can find it if you google "Eran Levi RailsIsrael 2012".
  38. Summary

    • Small is hard
    • More moving parts
    • Standards, conventions, frameworks, tools, CI
    • Automation of deployment, monitoring, alerts
  39. Summary of Summary

    • You will need it anyway
    • True agile

    So it looks like a lot of extra work. The thing is, you have to have most of it anyway once you have more than a few servers, and the rest of it is really nice too. To summarize the summary: the benefits of this environment exceed (by far) the costs. It allows us to create new things very quickly, and we create a lot of new things. It helps us to be agile.