Discovering Drugs with Kafka Streams

Ben Mabey
October 01, 2019

Recursion Pharmaceuticals is turning drug discovery into a data science problem. This entails producing and processing petabytes of microscopy images from carefully designed biological experiments. In early 2017 the data production effort in our laboratory scaled to a point where the existing naive batch processing system could no longer process the data reliably. The batch approach also introduced unwanted lag between image capture and analysis results, since an entire experiment, potentially 8TB+, would not begin processing until all of its images were available. This was particularly troublesome for our laboratory, which wanted real-time quality control metrics on the images. All of these reasons motivated us to replace the batch processing system with a streaming approach.

The original data pipeline was implemented as microservices with no central orchestrator; it relied instead on implicit flow between the services. The lack of visibility and robustness made the pipeline difficult and costly to operate. We wanted to address these concerns but also avoid rewriting the existing microservices. By building on top of Kafka Streams we created a flexible, highly available, and robust pipeline that leveraged our existing microservices, giving us a clear migration path.

This presentation will walk you through our thought process and explain the tradeoffs between using Kafka Streams and Spark for our specific use case. We’ll dive into the details of the workflow system we created on top of Kafka Streams that orchestrates these microservices. We’ve been operating with this system since mid-2017, and the additional scale and robustness have played a key role in enabling Recursion to succeed in its mission of discovering new treatments for various diseases. The messages flowing over our Kafka Streams have already led to clinical trials in humans and will hopefully translate into meaningful impact in patients’ lives one day.

Transcript

  1. Discovering Drugs with Kafka Streams. Ben Mabey, VP of Engineering (@bmabey); Scott Nielsen, Director of Data Engineering. Kafka Summit SF 2019
  2. [Chart: Moore’s Law. Transistor area (% of 1970 values), 1971-2015, log scale]
  3. [Chart: Moore’s Law. Transistor area (% of 1970 values), 1971-2015, alongside R&D spend per drug (% of 2007 values), 1971-2010]
  4. [Chart: Moore’s Law and Eroom’s Law. Transistor area (% of 1970 values), 1971-2015; R&D spend per drug (% of 2007 values), 1971-2010]
  5. [Chart: Number of drugs approved in the US (1993-2016)]
  6. Healthy child; Healthy cells; Child with rare genetic disease (Cornelia de Lange Syndrome); Genetic disease model cells (Cornelia de Lange Syndrome)
  7. Learn more… Public dataset: http://rxrx.ai. Nature article, "Machine learning brings cell imaging promises into focus": https://tinyurl.com/ml-cells
  8. Generate thumbnails; Image metrics; Fire and forget; Experiment A; Stream images to S3; Extract Features; On-Premise; Process experiments in batch
  9. Generate thumbnails; Image metrics; Fire and forget; Stream images to S3; Extract Features; On-Premise; Process experiments in batch
  10. Generate thumbnails; Image metrics; Fire and forget; Stream images to S3; Extract Features; metrics, models, reports, etc; On-Premise; Process experiments in batch
  13. [Chart] 100 / 6.9 TB; 300 / 20 TB; 700 / 48 TB; 1,300 / 90 TB; 1,700 / 118 TB; 1,900 / 132 TB. Annotation: Kafka Streams solution was launched
  14. Migration Goals: Move orchestration and processing to cloud. Faster feedback and less bursty workloads. Preserve existing micro-services logic.
  15. Migration Goals: Move orchestration and processing to cloud. Faster feedback and less bursty workloads. Preserve existing micro-services logic. Make cheaper.
  16. well level features; Images / channel level; site (all channels/images); thumbnails; site level features; image level metrics; site metrics
  17. well level features; Images / channel level; site (all channels/images); thumbnails; site level features; image level metrics; site metrics; metrics
  18. well level features; Images / channel level; site (all channels/images); thumbnails; site level features; image level metrics; site metrics; metrics; plate level features; metrics
  19. well level features; Images / channel level; site (all channels/images); thumbnails; site level features; experiment features; image level metrics; site metrics; metrics; plate level features; metrics; Experiment A
  20. well level features; Images / channel level; site (all channels/images); thumbnails; site level features; experiment features; image level metrics; site metrics; metrics; plate level features; metrics; metrics, models, reports, etc; Experiment A
  21. dagger: workflow library written on top of Kafka Streams that orchestrates microservices. Dagger, ya know, because it is all about the workflows represented as directed acyclic graphs, i.e. DAGs.
  22. Core logic in library is ~2800 LOC. New workflow system in 2017? Not Invented Here syndrome?
  23. Core logic in library is ~2800 LOC. All of our DAGs, including schema, task, and workflow definition: ~1700 LOC. New workflow system in 2017? Not Invented Here syndrome?
  25. well level features; Images / channel level; site (all channels/images); thumbnails; site level features; experiment features; image level metrics; site metrics; metrics; plate level features; metrics; metrics, models, reports, etc
  26. extract site features; images_channel topic; experiment_metadata topic table; extracted_features topic; images_site stream

    final KTable<String, ExperimentMetadata> experimentMetadata =
        builder.table(EXPERIMENT_METADATA_TOPIC);
    final KStream<String, ChannelLevel> images =
        builder.stream(CHANNEL_IMAGES_TOPIC);
    final KStream<String, Site> sites = images
        .groupBy((exp, channel) -> channel.site())
        .windowedBy(SessionWindows.with(Duration.ofHours(SESSION_WINDOW_HOURS)))
        .aggregate(
            () -> new AggState(),
            (site, channel, agg) -> agg.observe(channel.site(), channel.channel),
            (site, agg_a, agg_b) -> agg_a.merge(agg_b))
        .join(experimentMetadata,
            (agg, expMeta) -> agg.markCompleted(expMeta.numChannels))
        .filterValues(agg -> agg.isComplete())
        .mapValues(agg -> agg.site());
    sites.to(SITE_IMAGES_TOPIC);
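
The slide code leans on an AggState class that the deck never shows. As a rough illustration of what such an aggregate might do, here is a minimal plain-Java sketch with no Kafka dependency. The method names (observe, merge, markCompleted, isComplete, site) are taken from the slide; the field layout, the use of String for sites and channels, and the "distinct channels seen >= expected" completeness rule are all assumptions, not the authors' implementation.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the AggState referenced on the slide. */
final class AggState {
    private String site;                                   // site this aggregate describes (String is an assumption)
    private final Set<String> channels = new HashSet<>();  // distinct channels observed so far
    private boolean complete = false;

    /** Records that one channel image for the given site has arrived. */
    AggState observe(String site, String channel) {
        this.site = site;
        channels.add(channel);
        return this;
    }

    /** Combines two partial aggregates, as a session-window merger has to. */
    AggState merge(AggState other) {
        if (site == null) {
            site = other.site;
        }
        channels.addAll(other.channels);
        return this;
    }

    /** Marks the site complete once every expected channel has been seen (assumed rule). */
    AggState markCompleted(int numChannels) {
        complete = channels.size() >= numChannels;
        return this;
    }

    boolean isComplete() {
        return complete;
    }

    String site() {
        return site;
    }
}
```

Tracking distinct channels in a set makes the aggregate tolerant of duplicate deliveries, which seems a reasonable property for a stream that may replay records, though the deck does not say how duplicates are handled.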
  32. extract site features; images_channel topic; experiment_metadata topic table; extracted_features topic; images_site stream; Kafka Streams App; External Service; task input topic
  34. extract site features; images_channel topic; experiment_metadata topic table; extracted_features topic; images_site stream; Kafka Streams App; External Service; task input topic; task output topic
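
The diagram labels suggest a request/response pattern: the Kafka Streams app publishes work to a task input topic, an external service consumes it, and results come back on a task output topic. The sketch below imitates that round trip with in-memory queues standing in for the two topics. It is not Kafka code and not the authors' implementation; the class name, the Message type, and the idea of correlating results to requests by key are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;

/** In-memory imitation of a task input/output topic pair (hypothetical names). */
final class ExternalTaskSketch {
    /** A keyed message: a Kafka record reduced to key and value. */
    static final class Message {
        final String key;
        final String value;

        Message(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Queue<Message> taskInput = new ArrayDeque<>();   // stands in for the task input topic
    private final Queue<Message> taskOutput = new ArrayDeque<>();  // stands in for the task output topic

    /** Streams-app side: publish a request keyed by the unit of work. */
    void submit(String key, String payload) {
        taskInput.add(new Message(key, payload));
    }

    /** External-service side: consume each request and publish a result under the same key. */
    void runService(Function<String, String> service) {
        Message request;
        while ((request = taskInput.poll()) != null) {
            taskOutput.add(new Message(request.key, service.apply(request.value)));
        }
    }

    /** Streams-app side: drain the results, matched back to requests by key. */
    Map<String, String> collectResults() {
        Map<String, String> results = new LinkedHashMap<>();
        Message result;
        while ((result = taskOutput.poll()) != null) {
            results.put(result.key, result.value);
        }
        return results;
    }
}
```

Because the service only sees messages and never the workflow, an existing microservice could in principle be wrapped this way without changing its core logic, which matches the stated goal of preserving the existing services.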
  35. extract site features; images_channel topic; experiment_metadata topic table; extracted_features topic; images_site stream; Input topics & tables; Stream operations; Tasks
  36. extract site features; images_channel topic; experiment_metadata topic table; extracted_features topic; images_site stream; Input topics & tables; Stream operations; Tasks; Output topics
  37. extract site features; images_channel topic; experiment_metadata topic table; extracted_features topic; images_site stream

    {"name": "extract-site-level-features",
     "graph": {
       "images-channel": {"type": "topic-stream",
                          "topic-name": "images_channels"},
       "experiment-metadata": {"type": "topic-table",
                               "topic-name": "experiment_metadata"},
       "images-site": {"type": "stream-operation",
                       "key-schema": "long",
                       "value-schema": "job_site_level",
                       "inputs": ["images-channel", "experiment-metadata"],
                       "function": "aggregations/images-site-grouping"},
       "features-site": {"type": "external-task",
                         "stream": "images-site",
                         "task-name": "extract-features"},
       "features-output": {"type": "publish",
                           "topic-name": "extracted_features",
                           "stream": "features-site"}}}
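
A workflow definition like this one implies a dependency order between its nodes. As an illustration only, the sketch below orders the five nodes with Kahn's algorithm and rejects undefined references and cycles. Reading dependencies off the "inputs" and "stream" keys is an assumption about how the definition is interpreted, and WorkflowOrder is a hypothetical name; nothing here is taken from the dagger library itself.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that orders workflow nodes so dependencies come first. */
final class WorkflowOrder {
    /** Kahn's algorithm; the map goes from node name to the nodes it depends on. */
    static List<String> topologicalOrder(Map<String, List<String>> dependsOn) {
        Map<String, Integer> remaining = new LinkedHashMap<>();      // unmet dependency counts
        Map<String, List<String>> dependents = new LinkedHashMap<>();
        for (String node : dependsOn.keySet()) {
            remaining.put(node, 0);
            dependents.put(node, new ArrayList<>());
        }
        for (Map.Entry<String, List<String>> entry : dependsOn.entrySet()) {
            for (String dep : entry.getValue()) {
                if (!dependsOn.containsKey(dep)) {
                    throw new IllegalArgumentException("undefined node: " + dep);
                }
                dependents.get(dep).add(entry.getKey());
                remaining.merge(entry.getKey(), 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.poll();
            order.add(node);
            for (String dependent : dependents.get(node)) {
                if (remaining.merge(dependent, -1, Integer::sum) == 0) {
                    ready.add(dependent);
                }
            }
        }
        if (order.size() != dependsOn.size()) {
            throw new IllegalArgumentException("workflow graph contains a cycle");
        }
        return order;
    }

    /** Dependencies read off the slide's definition (via "inputs" and "stream", by assumption). */
    static Map<String, List<String>> slideWorkflow() {
        Map<String, List<String>> graph = new LinkedHashMap<>();
        graph.put("images-channel", List.of());
        graph.put("experiment-metadata", List.of());
        graph.put("images-site", List.of("images-channel", "experiment-metadata"));
        graph.put("features-site", List.of("images-site"));
        graph.put("features-output", List.of("features-site"));
        return graph;
    }
}
```

Calling WorkflowOrder.topologicalOrder(WorkflowOrder.slideWorkflow()) places the two input nodes first, then images-site, features-site, and features-output.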
  43. extract site features; images_channel topic; experiment_metadata topic table; extracted_features topic; images_site stream. Annotation: Specify function to be used

    {"name": "extract-site-level-features",
     "graph": {
       "images-channel": {"type": "topic-stream",
                          "topic-name": "images_channels"},
       "experiment-metadata": {"type": "topic-table",
                               "topic-name": "experiment_metadata"},
       "images-site": {"type": "stream-operation",
                       "key-schema": "long",
                       "value-schema": "job_site_level",
                       "inputs": ["images-channel", "experiment-metadata"],
                       "function": "aggregations/images-site-grouping"},
       "features-site": {"type": "external-task",
                         "stream": "images-site",
                         "task-name": "extract-features"},
       "features-output": {"type": "publish",
                           "topic-name": "extracted_features",
                           "stream": "features-site"}}}
  49. extract site features; images_channel topic; experiment_metadata topic table; extracted_features topic; images_site stream

    {:name "extract-site-level-features",
     :graph {:images-channel {:type :topic-stream,
                              :topic-name "images_channels"},
             :experiment-metadata {:type :topic-table,
                                   :topic-name "experiment_metadata"},
             :images-site {:type :stream-operation,
                           :key-schema :long,
                           :value-schema "job_site_level",
                           :inputs [:images-channel, :experiment-metadata],
                           :function (fn [images-channel experiment-metadata] …)},
             :features-site {:type :external-task,
                             :task-name "extract-features",
                             :stream :images-site},
             :features-output {:type :publish,
                               :stream :features-site,
                               :topic-name "extracted_features"}}}
  51. extract site features; images_channel topic; experiment_metadata topic table; extracted_features topic; images_site stream. Annotation: Inline function directly

    {:name "extract-site-level-features",
     :graph {:images-channel {:type :topic-stream,
                              :topic-name "images_channels"},
             :experiment-metadata {:type :topic-table,
                                   :topic-name "experiment_metadata"},
             :images-site {:type :stream-operation,
                           :key-schema :long,
                           :value-schema "job_site_level",
                           :inputs [:images-channel, :experiment-metadata],
                           :function (fn [images-channel experiment-metadata] …)},
             :features-site {:type :external-task,
                             :task-name "extract-features",
                             :stream :images-site},
             :features-output {:type :publish,
                               :stream :features-site,
                               :topic-name "extracted_features"}}}
  52. well level features; Images / channel level; site (all channels/images); thumbnails; site level features; experiment features; image level metrics; site metrics; metrics; plate level features; metrics; metrics, models, reports, etc
  53. Migration Goals: ✓ Move orchestration and processing to cloud. ✓ Faster feedback and less bursty workloads. ✓ Preserve existing micro-services logic.
  54. Migration Goals: ✓ Move orchestration and processing to cloud. ✓ Faster feedback and less bursty workloads. ✓ Preserve existing micro-services logic. ✓ Make cheaper.
  55. Migration Goals: ✓ Move orchestration and processing to cloud. ✓ Faster feedback and less bursty workloads. ✓ Preserve existing micro-services logic. ✓ Make cheaper. EC2 and Lambda -> Google Cloud preemptibles.