Discovering Drugs with Kafka Streams

Transcript

Ben Mabey, VP of Engineering (@bmabey)
Scott Nielsen, Director of Data Engineering
Kafka Summit SF 2019

Decoding Biology to Radically Improve Lives

1000s of untreated genetic diseases (photo of our wall). © 2017 Recursion Pharmaceuticals

[Chart: Moore's Law. Transistor area (% of 1970 values), 1971-2015, log scale.]

[Chart: Eroom's Law. R&D spend per drug (% of 2007 values), 1971-2010, log scale, shown against Moore's Law.]

[Chart: Number of drugs approved in the US (1993-2016).]

How can we fix this?

RecursionPharma.com

Over 7 million per week

hoechst (DNA)
concanavalin A (ER)
mitotracker (mitochondria)
WGA (golgi apparatus, cell membrane)
SYTO 14 (RNA, nucleoli)
phalloidin (actin fibers)
combined

How do these pretty pictures help?

Healthy child: healthy cells
Child with rare genetic disease (Cornelia de Lange Syndrome): genetic disease model cells (Cornelia de Lange Syndrome)

Healthy, Disease, Disease + Drug?

Learn more…
Public Dataset: http://rxrx.ai
Nature Article: Machine learning brings cell imaging promises into focus, https://tinyurl.com/ml-cells

How is this data produced?

308 wells/plate
4 sites/well
6 channels (images)/site
7,392 images per plate
~69GB per plate
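The per-plate figures above follow from simple multiplication. A minimal sketch in Java, using only the numbers on the slides; the class and method names are made up for illustration, and the per-image size assumes decimal units (1 GB = 1,000 MB), which the slides do not state.

```java
// Hypothetical helper: reproduces the per-plate arithmetic from the slides.
class PlateMath {
    static final int WELLS_PER_PLATE = 308;
    static final int SITES_PER_WELL = 4;
    static final int CHANNELS_PER_SITE = 6;   // one image per channel
    static final double GB_PER_PLATE = 69.0;  // approximate, from the slide

    /** Images per plate = wells x sites x channels. */
    static int imagesPerPlate() {
        return WELLS_PER_PLATE * SITES_PER_WELL * CHANNELS_PER_SITE;
    }

    /** Rough average image size in MB, assuming 1 GB = 1,000 MB. */
    static double megabytesPerImage() {
        return GB_PER_PLATE * 1000.0 / imagesPerPlate();
    }

    public static void main(String[] args) {
        System.out.println("images per plate: " + imagesPerPlate()); // 7392
        System.out.println("approx. MB per image: " + megabytesPerImage());
    }
}
```

Under these assumptions a single image averages a little over 9 MB.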

Experiment A Experiment B Experiment C Experiment D

Our “Series A” System

On-Premise
Stream images to S3
Generate thumbnails
Image metrics
Fire and forget
Process experiments in batch (Experiment A)
Extract Features
metrics, models, reports, etc

Traditional, low throughput, biology: ~6-12 plates per week, ~400-800GB

High-throughput experiments (robots photo)

Plates   Data
100      6.9TB
300      20TB
700      48TB
1,300    90TB
1,700    118TB
1,900    132TB
Today    280TB

Kafka Streams solution was launched

So what was wrong with the original system?

On-Premise
Generate thumbnails
Image metrics
Extract Features
metrics, models, reports, etc
Process experiments in batch

Experiment A, Experiment B, Experiment C, Experiment D: plates are not imaged in order

Migration Goals
Move orchestration and processing to cloud.
Faster feedback and less bursty workloads.
Preserve existing micro-services logic.
Make cheaper.

Let’s take a look at the logical pipeline that we needed to implement…

Images / channel level: image level metrics
site (all channels/images): thumbnails, site level features, site metrics
well level features: metrics
plate level features: metrics
experiment features (Experiment A): metrics, models, reports, etc

Kafka Streams was just released…

dagger: workflow library written on top of Kafka Streams that orchestrates microservices

Dagger, ya know, because it is all about the workflows represented as directed acyclic graphs, i.e. DAGs.

New workflow system in 2017? Not Invented Here syndrome?
Core logic in library is ~2800 LOC
All of our DAGs, including schema, task, and workflow definition: ~1700 LOC


Let’s look at a small workflow using Kafka Streams initially…

[Diagram: the images_channel topic and the experiment_metadata topic (table) feed the images_site stream; the "extract site features" step reads that stream and writes to the extracted_features topic.]

    final KTable<String, ExperimentMetadata> experimentMetadata =
        builder.table(EXPERIMENT_METADATA_TOPIC);
    final KStream<String, ChannelLevel> images =
        builder.stream(CHANNEL_IMAGES_TOPIC);

    final KStream<String, Site> sites = images
        // Group the per-channel images by the site they belong to.
        .groupBy((exp, channel) -> channel.site())
        // Collect a site's channels within one session window.
        .windowedBy(SessionWindows.with(Duration.ofHours(SESSION_WINDOW_HOURS)))
        .aggregate(
            () -> new AggState(),
            (site, channel, agg) -> agg.observe(channel.site(), channel.channel),
            (site, agg_a, agg_b) -> agg_a.merge(agg_b))
        // Look up how many channels the experiment expects.
        .join(experimentMetadata,
            (agg, expMeta) -> agg.markCompleted(expMeta.numChannels))
        // Keep a site only once all of its channels have arrived.
        .filter((site, agg) -> agg.isComplete())
        .mapValues(agg -> agg.site());

    sites.to(SITE_IMAGES_TOPIC);
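The slide does not show AggState itself. The following is a hypothetical sketch of what such a state object might look like, assuming a site is identified by a String and a channel by an int. Only the method names (observe, merge, markCompleted, isComplete, site) come from the slide; a real Kafka Streams application would also need a Serde for this class, which is omitted here.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical aggregation state: remembers which channels of a site have
// been seen and how many are expected. Not the speakers' implementation.
class AggState {
    private String site;                                  // assumed String id
    private final Set<Integer> channels = new TreeSet<>();
    private int expectedChannels = -1;                    // unknown until metadata is joined

    /** Record that one channel image of the given site has arrived. */
    AggState observe(String site, int channel) {
        this.site = site;
        channels.add(channel);
        return this;
    }

    /** Combine two partial states, e.g. when two session windows merge. */
    AggState merge(AggState other) {
        if (site == null) {
            site = other.site;
        }
        channels.addAll(other.channels);
        return this;
    }

    /** Store the expected channel count taken from the experiment metadata. */
    AggState markCompleted(int numChannels) {
        expectedChannels = numChannels;
        return this;
    }

    /** True once every expected channel has been observed. */
    boolean isComplete() {
        return expectedChannels > 0 && channels.size() >= expectedChannels;
    }

    String site() {
        return site;
    }

    public static void main(String[] args) {
        AggState agg = new AggState();
        for (int channel = 1; channel <= 5; channel++) {
            agg.observe("plate1/A01/site1", channel);
        }
        agg.markCompleted(6);
        System.out.println(agg.isComplete()); // false: channel 6 is missing

        agg.merge(new AggState().observe("plate1/A01/site1", 6));
        System.out.println(agg.isComplete()); // true: all 6 channels seen
    }
}
```

Tracking channels in a set means a re-delivered image does not count twice, which matters because the imaging side is described as fire and forget.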

[Diagram: the Kafka Streams App and the External Service are connected by a task input topic and a task output topic.]

How would you do the same workflow in dagger?

[Diagram: the same workflow, with its parts labelled: Input topics & tables, Stream operations, Tasks, Output topics.]

    {"name": "extract-site-level-features",
     "graph": {
       "images-channel": {"type": "topic-stream",
                          "topic-name": "images_channels"},
       "experiment-metadata": {"type": "topic-table",
                               "topic-name": "experiment_metadata"},
       "images-site": {"type": "stream-operation",
                       "key-schema": "long",
                       "value-schema": "job_site_level",
                       "inputs": ["images-channel", "experiment-metadata"],
                       "function": "aggregations/images-site-grouping"},
       "features-site": {"type": "external-task",
                         "stream": "images-site",
                         "task-name": "extract-features"},
       "features-output": {"type": "publish",
                           "topic-name": "extracted_features",
                           "stream": "features-site"}}}

Specify function to be used ("function": "aggregations/images-site-grouping")

    {:name "extract-site-level-features",
     :graph
     {:images-channel {:type :topic-stream,
                       :topic-name "images_channels"},
      :experiment-metadata {:type :topic-table,
                            :topic-name "experiment_metadata"},
      :images-site {:type :stream-operation,
                    :key-schema :long,
                    :value-schema "job_site_level",
                    :inputs [:images-channel, :experiment-metadata],
                    :function (fn [images-channel experiment-metadata] …)},
      :features-site {:type :external-task,
                      :task-name "extract-features",
                      :stream :images-site},
      :features-output {:type :publish,
                        :stream :features-site,
                        :topic-name "extracted_features"}}}

Inline function directly (:function (fn [images-channel experiment-metadata] …))
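Because each node in the workflow definition names its upstream nodes (through inputs or stream), a processing order can be derived with a topological sort. The sketch below applies Kahn's algorithm to the five nodes of the example. It illustrates the DAG idea only and is not taken from dagger's implementation; the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: orders workflow nodes so each comes after its inputs.
class DagOrder {

    /** Kahn's algorithm. Keys are node names, values are their upstream nodes. */
    static List<String> topologicalOrder(Map<String, List<String>> upstream) {
        Map<String, Integer> remaining = new LinkedHashMap<>();
        Map<String, List<String>> downstream = new HashMap<>();
        for (Map.Entry<String, List<String>> e : upstream.entrySet()) {
            remaining.put(e.getKey(), e.getValue().size());
            for (String dep : e.getValue()) {
                downstream.computeIfAbsent(dep, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        // Start with the nodes that have no upstream dependencies.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            for (String next : downstream.getOrDefault(node, Collections.<String>emptyList())) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != upstream.size()) {
            throw new IllegalArgumentException(
                "graph contains a cycle or references an undefined node");
        }
        return order;
    }

    /** The five nodes of the example workflow and their upstream references. */
    static Map<String, List<String>> exampleGraph() {
        Map<String, List<String>> g = new LinkedHashMap<>();
        g.put("images-channel", Collections.<String>emptyList());
        g.put("experiment-metadata", Collections.<String>emptyList());
        g.put("images-site", Arrays.asList("images-channel", "experiment-metadata"));
        g.put("features-site", Arrays.asList("images-site"));
        g.put("features-output", Arrays.asList("features-site"));
        return g;
    }

    public static void main(String[] args) {
        // Prints: [images-channel, experiment-metadata, images-site, features-site, features-output]
        System.out.println(topologicalOrder(exampleGraph()));
    }
}
```

A definition whose nodes referred to each other in a loop would be rejected by the final size check, which is one practical reason to insist that workflows are acyclic.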
