Phoenix Data Conference 2014 - Vishnu Vyas

Apixio’s Big Data Infrastructure Evolution and Lessons Learnt
@vishnuvyas

Apixio to our customers
Apixio was founded in 2009 to transform how providers and healthcare organizations access, analyze, and use clinical data for optimal care. Its premier product is the Apixio HCC Optimizer, a smart coding application with automated extraction and analysis of clinical text and coded data for accurate risk scores at a lower cost for Medicare Advantage (MA) and individual/small group plans.

Manual Chart Audit
Evaluate patient data for evidence of care for purposes of annual payment.
Takes 20 person-years for 200k patients.

Apixio Automates Chart Audits
[Figure: Encounter Notes feed NLP & Machine Learning, Pattern Analysis, and a Flexible Ontology to build a Knowledge Graph. Example labels: Glucose, Hemoglobin A1c, Retinal Eye Exam, Echo; Diabetes Type 1, Diabetes Type 2 (ICD 250.xx); Endocrine and metabolic disorders, DM w/o complication.]

Platform & Architecture

Apixio Platform Architecture - Initial
MySQL, Tomcat, HTML/JS

Apixio Platform Architecture - Current
[Figure labels: External Clients, End Users; Web Tier (Java/Python); Apixio REST API (a network of micro-services, 10+ services); Receiver (HTTP); Apixio Pipeline; Cassandra, Hive/HDFS, S3; Audit (Trace CF); Logging (Hive/Trace CF); Metrics (Graphite). Layers: Persistence, Compute, Job Control, Pipeline, Applications, Experimental Infrastructure, Logging.]

Apixio Technology Stack

Lessons Learnt
•  Logging is critical
•  Respect the Data
•  Winning Ugly is still winning!

Logging is critical

Logging is critical
•  Lets you solve mysterious problems
•  Lets you solve new problems you encounter
•  Lets you solve problems before they happen

Mysterious problem I – Upload Throughput
After hours of usage, throughput would be lower than predicted.
There’s your problem!

Mysterious problem II – Slow write throughput
Our write throughput was lower than predictions, with all systems in perfect health!
[Figure: What we store vs. What we process]

New Problem - Coder Accuracy
How good are our coders?
[Figure: Overlaps]

New Problem - Coder Accuracy
How good are our coders?
[Figure: overlapping work items from two coders (Endocrine & metabolic disorders, DM w/o complication, Encounter Note) → Check Agreements → LP → Coder Accuracy (Error Rates)]
We didn’t even set out to solve this problem with our logging!

New Problem - Coder Accuracy
How good are our coders?
Bonus Finding: Fast coders are also accurate coders!
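The agreement check on the coder accuracy slides can be sketched roughly as follows. This is only an illustration under stated assumptions: the record shape `(coder, work_item, accepted)` is hypothetical, and a simple majority vote stands in for whatever the "LP" step on the slide actually does.

```python
from collections import Counter, defaultdict

def coder_error_rates(decisions):
    """Estimate per-coder error rates from overlapping work items.

    decisions: iterable of (coder, work_item, accepted) tuples, a
    hypothetical shape for records pulled out of the activity logs.
    """
    votes_by_item = defaultdict(dict)
    for coder, item, accepted in decisions:
        votes_by_item[item][coder] = accepted

    errors, totals = Counter(), Counter()
    for votes in votes_by_item.values():
        if len(votes) < 2:
            continue  # no overlap, nothing to compare against
        yes = sum(votes.values())
        no = len(votes) - yes
        if yes == no:
            continue  # tie: no consensus to score against
        consensus = yes > no
        for coder, accepted in votes.items():
            totals[coder] += 1
            errors[coder] += accepted != consensus
    return {coder: errors[coder] / totals[coder] for coder in totals}
```

Only items seen by at least two coders contribute, which is why the overlaps matter.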

Logging lets you solve problems before they happen
Compound Documents! Multiple documents stitched into a single PDF, often for printing/scanning purposes.

Compound Documents!

Compound Documents!
Lets us manage customer expectations!

Respect the Data

What does our data look like?
Structured Data, Text Data, Scanned Documents

What does our data look like?
•  10 TB / 200K PTS over 5 years
•  Structured:
   –  13 M unique events
•  Narrative:
   –  338 M unique events
[Chart: % of data by type (Structured Data, Text, Scanned Documents)]

Healthcare Data Challenges
•  Immutable - mostly
•  Almost all of the data is either text or an image
   –  Impedance mismatch
•  Data does unexpected things all the time
   –  Data Quality is important

Taking advantage of Immutability
L0
•  Append Only Model in Cassandra
•  Document Based
L1
•  Event Based Append Only Model
•  Transient
•  Used for Inference
L2
•  Application Specific Data Model
•  Optimized for Quick Retrieval

L0 - Documents
•  Stored in Cassandra
•  2 Column Families / Customer
•  Append only
Documents Column Family: row key ApixioID; columns DOCID1, DOCID2, DOCID3, each holding a Partial Patient Object
Indices Column Family (2 types of data):
   DocID:<DOCID> → ApixioID: APIXIOID
   DocHash:<HASH> → ApixioID: APIXIOID
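As a rough illustration of the two column families, here is a small in-memory sketch. It is not Apixio's code: plain dicts stand in for Cassandra, the class and method names are hypothetical, SHA-256 is an assumed hash, and using the DocHash index to skip re-uploads of identical bytes is a guess at its purpose.

```python
import hashlib

class L0Store:
    """In-memory stand-in for the Documents and Indices column families."""

    def __init__(self):
        self.documents = {}  # ApixioID -> {DocID: partial patient object}
        self.indices = {}    # "DocID:<id>" or "DocHash:<hash>" -> ApixioID

    def append(self, apixio_id, doc_id, content, partial_patient):
        """Append a document; return False if the same bytes were already stored."""
        hash_key = "DocHash:" + hashlib.sha256(content).hexdigest()
        if hash_key in self.indices:
            return False  # assumed use of the hash index: duplicate detection
        # Append only: existing columns are never rewritten.
        self.documents.setdefault(apixio_id, {})[doc_id] = partial_patient
        self.indices["DocID:" + doc_id] = apixio_id
        self.indices[hash_key] = apixio_id
        return True

    def patient_for_doc(self, doc_id):
        """Look up the owning patient through the DocID index."""
        return self.indices.get("DocID:" + doc_id)
```

Both index types share one column family by prefixing the row key, which matches the "2 types of data" note on the slide.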

L1 - Events
An event is an assertion (fact) about a specific subject (patient) at a specific time.
[Figure: L0 → L1, with Storage and Pub-Sub]

L1 - Events
•  Stored in Cassandra in time buckets
•  Append only
•  Published to consumers using a Redis pub/sub queue
Indexed By Patients: row key ApixioID:TimeBucket; columns EventID 1, EventID 2, EventID 3, each holding an Event Object
Indexed By Generation Time: row key TimeBucket; columns EventID 1, EventID 2, EventID 3, each holding an Event Object
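The two row-key schemes can be sketched as below. The bucket width and the function names are assumptions made for illustration; the slides do not say how wide a time bucket is.

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 3600  # assumed bucket width (one hour); not stated in the talk

def time_bucket(generated_at):
    """Map a timestamp to an integer bucket number."""
    return int(generated_at.timestamp()) // BUCKET_SECONDS

def event_row_keys(apixio_id, generated_at):
    """Return (patient-indexed row key, generation-time row key) for one event."""
    bucket = time_bucket(generated_at)
    return f"{apixio_id}:{bucket}", str(bucket)

# Events from the same hour land in the same two rows.
t1 = datetime(2014, 1, 1, 0, 5, tzinfo=timezone.utc)
t2 = datetime(2014, 1, 1, 0, 55, tzinfo=timezone.utc)
same_rows = event_row_keys("patient-1", t1) == event_row_keys("patient-1", t2)
```

Bucketing keeps any single row from growing without bound while still letting a consumer scan one patient, or one time window, with a handful of reads.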

Impedance Matching
Text or Image?
•  Text: Parse → Identify Bucket → Persist. Relatively quick, ~ O(10ms)
•  Image: Parse → OCR → Identify Bucket → Persist. Relatively expensive, ~ O(10mins), 10000x more expensive
Maybe Solution: Add more nodes?

Impedance Matching
Text or Image?
•  Text: Parse → Identify Bucket → Persist. Relatively quick, ~ O(10ms)
•  Image: Parse → OCR → Identify Bucket → Persist. Relatively expensive, ~ O(10mins), 10000x more expensive
Storage Layer Backs Up

Impedance Matching
Text or Image?
•  Text: Parse → Identify Bucket → Choke → Persist. Relatively quick, ~ O(10ms)
•  Image: Parse → OCR → Identify Bucket → Choke → Persist. Relatively expensive, ~ O(10mins), 10000x more expensive
Solution: Manage Back Pressure using Scheduling & Reducers
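The idea of a choke point in front of the persist step can be illustrated with a bounded queue. This swaps in a different mechanism from the one in the talk (which uses scheduling and reducers in the Hadoop jobs); the function name and `max_pending` parameter are hypothetical.

```python
import queue
import threading

def run_pipeline(docs, max_pending=4):
    """Feed parsed docs to a single persister through a bounded queue."""
    pending = queue.Queue(maxsize=max_pending)  # the choke point
    persisted = []

    def persist_worker():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more work
                break
            persisted.append(item)  # stand-in for a storage write

    worker = threading.Thread(target=persist_worker)
    worker.start()
    for doc in docs:
        # put() blocks while the queue is full, so fast producers are
        # slowed to the rate the storage layer can absorb.
        pending.put(doc)
    pending.put(None)
    worker.join()
    return persisted
```

The point of the sketch is only that the upstream stage waits instead of piling writes onto the storage layer.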

Data Quality
•  All large systems fail
   –  Larger systems fail more often
   –  Check and recover automatically.
•  Data validation happens before inference
   –  Prevents garbage in, garbage out.
   –  Gives early warning about data issues.

Feedback Loops for Data Quality
•  Feedback loop automatically checks and recovers data
•  Logs from the feedback loop give an accurate account of what's going on with the system
•  A form of self-regulation

Early Data Validation
•  Check join rates before ETL
•  Assertions about data constraints before inference
•  Serves as early warning for bad data at the source.
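A join-rate check of the kind listed above might look like this minimal sketch. The function names and the 0.95 threshold are assumptions for illustration, not values from the talk.

```python
def join_rate(left_keys, right_keys):
    """Fraction of distinct left-side keys that find a match on the right."""
    left = set(left_keys)
    if not left:
        return 0.0
    return len(left & set(right_keys)) / len(left)

def check_join_rate(left_keys, right_keys, minimum=0.95):
    """Raise before the ETL step runs if too few records would join."""
    rate = join_rate(left_keys, right_keys)
    if rate < minimum:
        raise ValueError(f"join rate {rate:.2f} is below the minimum {minimum:.2f}")
    return rate
```

For example, `check_join_rate(claim_patient_ids, eligibility_patient_ids)` would fail fast when an eligibility file arrives with a different ID scheme, instead of silently dropping most claims downstream.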

Winning Ugly is still Winning!

Winning Ugly is still Winning!
•  You don’t know the answers ahead of time.
•  Sometimes, you don’t even know the questions ahead of time.
•  Quickly iterate and move in the right direction and you will get there.

The Sorting Hat

The Sorting Hat – The problem
[Figure: coders each requesting the "Next work item" from the same set (Endocrine and metabolic disorders)]
•  Collisions
•  Duplicate Work

Solution – Manually separate sets
[Figure: each coder requesting the "Next work item" from a separate set (Endocrine and metabolic disorders)]
•  Manually splitting is inefficient
•  Cannot service more than a handful of coders
•  Really hard to analyze activity by combining across sets
•  But – Customers were happy!

Solution II – Simple Rule Based Prototype
[Figure: SQLite DB, Queue, Regex Rules serving the "Next work item" to each coder (Endocrine and metabolic disorders)]
•  Limited flexibility on the rules, but still better than the manual process
•  Can service more coders, but not a lot of coders
•  But – Customers were thrilled!

Solution III – Complex Automated Workflow management system
[Figure: Event Pub/Sub, Automated Work Item generation, Complex DNF Rule Engine, Reporting, serving the "Next work item" (Endocrine and metabolic disorders)]
•  Fully flexible rule system.
•  Completely Automated
•  And – Customers are ecstatic!
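A rule in disjunctive normal form (DNF) is an OR of AND-clauses. A minimal sketch of how such rules could route work items to coder queues follows; the work-item fields, queue names, and helper functions are all hypothetical and do not come from the talk.

```python
def matches(rule, item):
    """rule is a list of clauses (OR); each clause is a list of predicates (AND)."""
    return any(all(pred(item) for pred in clause) for clause in rule)

def route(item, queue_rules):
    """Return the name of the first queue whose rule matches, else None."""
    for queue_name, rule in queue_rules:
        if matches(rule, item):
            return queue_name
    return None

# Hypothetical predicates over a work item represented as a dict.
is_diabetes = lambda item: item["category"] == "DM w/o complication"
is_scanned = lambda item: item["source"] == "scan"
is_urgent = lambda item: item["priority"] == "urgent"

queue_rules = [
    # (diabetes AND scanned) OR urgent
    ("senior-coders", [[is_diabetes, is_scanned], [is_urgent]]),
    # diabetes only
    ("diabetes-coders", [[is_diabetes]]),
]
```

Because each work item is routed to exactly one queue, two coders never pull the same item, which addresses the collisions and duplicate work from the original problem slide.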

Winning ugly is still winning
•  Sorting Hat
•  Coordinator
   –  Started with plain Map-Reduce: not enough utilization
   –  Played with fair scheduling and queues: either insufficient utilization or massive overcommit (causing blocks)
      •  Dynamic re-prioritizing didn’t work.
   –  Built our own
      •  Gets us near 100% utilization
      •  Reprioritize dynamically
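Dynamic re-prioritization of queued jobs can be sketched with a heap and lazy invalidation, the pattern described in the Python `heapq` documentation. This is not the Coordinator's actual design; the class and method names are hypothetical.

```python
import heapq
import itertools

class JobQueue:
    """Priority queue whose pending jobs can be re-prioritized at any time."""

    def __init__(self):
        self._heap = []
        self._live = {}                  # job -> its current heap entry
        self._order = itertools.count()  # tie-breaker keeps FIFO order

    def submit(self, job, priority):
        """Add a job, or move an already queued job to a new priority."""
        if job in self._live:
            self._live[job][2] = None    # mark the stale entry as dead
        entry = [priority, next(self._order), job]
        self._live[job] = entry
        heapq.heappush(self._heap, entry)

    def reprioritize(self, job, priority):
        self.submit(job, priority)

    def pop(self):
        """Return the lowest-priority-number live job, or None when empty."""
        while self._heap:
            _, _, job = heapq.heappop(self._heap)
            if job is not None:
                del self._live[job]
                return job
        return None
```

Lower numbers run first here; a job that suddenly matters more is resubmitted with a smaller number and overtakes everything already waiting.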

Outline
•  What is Apixio
   –  Intro Blurb
   –  What we do and why we do it
•  Apixio’s foray into big data
   –  Our first system
   –  Our second system
      •  De-normalized patient model
      •  One monster job doing everything
      •  Update model
      •  Impedance mismatches
   –  Our third system
      •  Separate I/O and computation jobs
      •  Move to an append-only model
      •  “Somewhat” denormalized data (lots of small denormalized chunks)
      •  Logging: log everything in JSON
      •  Coordinator: a custom job management system
•  Lessons Learnt
   –  Log everything (and if you can log as JSON, even better)
   –  Logging is data
   –  Impedance mismatches can kill the performance of distributed systems; make sure you don’t overdrive any single component.
   –  Cassandra likes an append-only model; if you are building systems using Cassandra, try to build them around append-only models.
   –  Lambda architecture - we had arrived here – if we had started … same place – and no-one else got there by planning to get there.
   –  If you trust that you are going in the right direction and quickly iterate, you will get there successfully, winning all the way.

Logging is Everything
–  Logging is everything - stories
   •  Gives you answers to mysterious problems (us dropping documents)
   •  Gives you answers to new problems (Coder performance, gumby – poky)
   •  Gives you answers to problems before they happen.
–  Feasibility studies
–  Validation, coder progress, managing the business
–  Takeaways
   •  Logging is for more than debugging.
   •  Log data, not messages (simple structured formats like JSON are worth more than detailed messages)
   •  Make it accessible to everyone (Hive/Impala)
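"Log data, not messages" can be illustrated with a tiny helper that writes one JSON object per line. The function name and the field names (`ts`, `event`, and the example fields) are hypothetical; the talk only says to log structured JSON.

```python
import io
import json
import time

def log_event(stream, event, **fields):
    """Write one structured record per line instead of a free-text message."""
    record = {"ts": time.time(), "event": event, **fields}
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record

# Usage: the same line a human reads can later be queried as data.
out = io.StringIO()
log_event(out, "document_uploaded", doc_id="d1", pages=12, seconds=0.8)
parsed = json.loads(out.getvalue())
```

Line-delimited JSON like this can be loaded into a query engine as a table, so a question such as "how many pages per upload?" becomes a query rather than a regex over message strings.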

Respect the data
•  Impedance matching
   –  Processing different types of data takes different times
   –  Different persistence systems have different write rates (Cassandra vs. MySQL)
   –  Using reducers to control input rates increases throughput.
•  Append-only architecture (patient objects, annotations, and user actions are all append only)
•  Understand your infrastructure pieces
•  Insert QA steps along your ETL pipeline, because data can have more surprises than you expect.

Build It when you need it
•  Fast-moving data throws more surprises than you expect
•  You don’t know all the answers ahead of time.
•  Invest in infrastructure
   –  Coordinator
   –  Sorting Hat

What problem does Apixio solve?
[Figure: Client sources (EHR coded data, EHR text documents, EHR scan documents, Claims, Eligibility, Provider files) → Ingest Pipeline (Parse, OCR, Norm., Load) → Patient Object Model → Event Streams (General, HCC, Quality, Referral, 3rd Party) → API → Applications (Clinical Knowledge Exchange, Care Optimizer, Quality Optimizer, HCC Optimizer, 3rd Party)]
