Phoenix Data Conference 2014 - Vishnu Vyas

Big Data At Apixio

teamclairvoyant

October 25, 2014

Transcript

  1. Apixio to our customers
     Apixio was founded in 2009 to transform how providers and healthcare organizations access, analyze, and use clinical data for optimal care. Its premier product is the Apixio HCC Optimizer, a smart coding application with automated extraction and analysis of clinical text and coded data for accurate risk scores at a lower cost for Medicare Advantage (MA) and individual/small group plans.
  2. Manual Chart Audit
     Evaluate patient data for evidence of care for purposes of annual payment.
     Takes 20 person-years for 200k patients.
  3. Apixio Automates Chart Audits
     [Diagram labels: Knowledge Graph (Glucose, Hemoglobin A1c, Retinal Eye Exam, Echo, Diabetes Type 1, Diabetes Type 2, ICD 250.xx); NLP & Machine Learning; Pattern Analysis; Flexible Ontology; Encounter Note; Endocrine and metabolic disorders; DM w/o complication]
  4. Apixio Platform Architecture - Current
     [Diagram labels: External Clients; End Users; Web Tier; Java/Python; Apixio REST API (a network of micro-services); Applications; Pipeline; Job Control; Compute; Persistence; Apixio Pipeline; Receiver (HTTP); Cassandra; Hive/HDFS; S3; Audit (Trace CF); Logging (Hive/Trace CF); Metrics (Graphite); Logging; Experimental; Infrastructure; 10+ Services]
  5. Lessons Learnt
     • Logging is critical
     • Respect the Data
     • Winning Ugly is still winning!
  6. Logging is critical
     • Lets you solve mysterious problems
     • Lets you solve new problems you encounter
     • Lets you solve problems before they happen
  7. Mysterious problem I – Upload Throughput
     After hours of usage, throughput would be lower than predicted.
     [Chart annotation: There's your problem!]
  8. Mysterious problem II – Slow write throughput
     Our write throughput was lower than predictions – all systems in perfect health!
     [Chart labels: What we store; What we process]
  9. New Problem - Coder Accuracy
     How good are our coders?
     [Diagram labels: Encounter Note; Endocrine & metabolic disorders; DM w/o complication; Check Agreements; Coder Accuracy (Error Rates); LP]
     We didn't even set out to solve this problem with our logging!
  10. New Problem - Coder Accuracy
     How good are our coders?
     Bonus finding: fast coders are also accurate coders!
  11. Logging lets you solve problems before they happen
     Compound Documents!
     Multiple documents stitched into a single PDF – often for printing/scanning purposes
  12. Logging lets you solve problems before they happen
     Compound Documents!
     Lets us manage customer expectations!
  13. What does our data look like?
     • 10 TB / 200K patients over 5 years
     • Structured:
       – 13 M unique events
     • Narrative:
       – 338 M unique events
     [Chart: % Data Types, on a 0 to 100 scale; series: Structured Data, Text, Scanned Documents]
  14. Healthcare Data Challenges
     • Immutable - mostly
     • Almost all of the data is either text or an image – Impedance mismatch
     • Data does unexpected things all the time – Data Quality is important
  15. Taking advantage of Immutability
     L0
     • Append Only Model in Cassandra
     • Document Based
     L1
     • Event Based Append Only Model
     • Transient
     • Used for Inference
     L2
     • Application Specific Data Model
     • Optimized for Quick Retrieval
  16. L0 - Documents
     • Stored in Cassandra
     • 2 Column Family / Customer
     • Append only
     [Diagram: Documents Column Family: row key ApixioID; columns DOCID1, DOCID2, DOCID3, each holding a Partial Patient Object]
     [Diagram: Indices Column Family (2 types of data): row key DocID:<DOCID>, column ApixioID, value APIXIOID; row key DocHash:<HASH>, column ApixioID, value APIXIOID]
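
The two column families on this slide can be mimicked, as a rough sketch, with plain Python dictionaries. Nothing here talks to Cassandra; the class and method names (`L0Store`, `append_document`, `lookup_by_doc_id`, `lookup_by_hash`) and the use of SHA-256 for the document hash are assumptions made for illustration, not details from the talk.

```python
import hashlib


class L0Store:
    """In-memory stand-in for the two per-customer column families (illustrative only)."""

    def __init__(self):
        # Documents CF: row key = ApixioID, column = DocID, value = partial patient object
        self.documents = {}
        # Indices CF: row key = "DocID:<id>" or "DocHash:<hash>", value = ApixioID
        self.indices = {}

    def append_document(self, apixio_id, doc_id, partial_patient, raw_bytes):
        row = self.documents.setdefault(apixio_id, {})
        if doc_id in row:
            # Append only: an existing column is never overwritten
            raise ValueError(f"document {doc_id} already stored")
        row[doc_id] = partial_patient
        # SHA-256 is an assumed hash function; the slide only says "DocHash"
        doc_hash = hashlib.sha256(raw_bytes).hexdigest()
        self.indices[f"DocID:{doc_id}"] = apixio_id
        self.indices[f"DocHash:{doc_hash}"] = apixio_id
        return doc_hash

    def lookup_by_doc_id(self, doc_id):
        return self.indices.get(f"DocID:{doc_id}")

    def lookup_by_hash(self, raw_bytes):
        doc_hash = hashlib.sha256(raw_bytes).hexdigest()
        return self.indices.get(f"DocHash:{doc_hash}")
```

A hash index of this kind lets an ingest step ask whether the same bytes were uploaded before, without scanning the documents themselves.
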
  17. L1 - Events
     An event is an assertion (fact) about a specific subject (patient) at a specific time.
     [Diagram labels: L0; L1; Storage; Pub-Sub]
  18. L1 - Events
     • Stored in Cassandra in time buckets
     • Append only
     • Published to consumers using a redis-pubsub queue
     [Diagram: Indexed By Patients: row key ApixioID:TimeBucket; columns EventID 1, EventID 2, EventID 3, each holding an Event Object]
     [Diagram: Indexed By Generation Time: row key TimeBucket; columns EventID 1, EventID 2, EventID 3, each holding an Event Object]
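
A minimal sketch of the time-bucketed, append-only event layout might look like the following. The one-hour bucket width, the `EventStore` class, and the `time_bucket` helper are all assumptions; the slide gives only the row-key shapes `ApixioID:TimeBucket` and `TimeBucket`.

```python
from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SECONDS = 3600  # assumed bucket width; the talk does not state one


def time_bucket(ts):
    """Map a timezone-aware datetime to the start of its bucket (integer epoch seconds)."""
    epoch = int(ts.timestamp())
    return epoch - (epoch % BUCKET_SECONDS)


class EventStore:
    """In-memory stand-in for the two event indexes (illustrative only)."""

    def __init__(self):
        self.by_patient = defaultdict(dict)  # "ApixioID:TimeBucket" -> {event_id: event}
        self.by_time = defaultdict(dict)     # "TimeBucket" -> {event_id: event}

    def append(self, apixio_id, event_id, event, generated_at):
        bucket = time_bucket(generated_at)
        # The same event object is written under both row keys
        self.by_patient[f"{apixio_id}:{bucket}"][event_id] = event
        self.by_time[str(bucket)][event_id] = event
        return bucket
```

Bucketing keeps any single row from growing without bound and lets a consumer read "everything generated in this window" with one row lookup.
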
  19. Impedance Matching
     [Diagram: Text or Image?
       Text path: Parse -> Identify Bucket -> Persist; Relatively Quick ~ O(10ms)
       Image path: Parse -> OCR -> Identify Bucket; Relatively Expensive ~ O(10mins); 10000x Expensive]
     Solution: Add more nodes? Maybe
  20. Impedance Matching
     [Diagram: Text or Image?
       Text path: Parse -> Identify Bucket -> Persist; Relatively Quick ~ O(10ms)
       Image path: Parse -> OCR -> Identify Bucket; Relatively Expensive ~ O(10mins); 10000x Expensive]
     Storage Layer Backs Up
  21. Impedance Matching
     [Diagram: Text or Image?
       Text path: Parse -> Identify Bucket; Relatively Quick ~ O(10ms)
       Image path: Parse -> OCR -> Identify Bucket; Relatively Expensive ~ O(10mins); 10000x Expensive
       Both paths: Choke -> Persist]
     Solution: Manage Back Pressure using Scheduling & Reducers
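
The talk manages back pressure with scheduling and reducers. The sketch below swaps in a different, simpler mechanism, a bounded in-process queue, only to show the same idea: a fixed-size choke point in front of the persist step makes fast producers wait instead of flooding storage. The function name `run_pipeline` and its parameters are hypothetical.

```python
import queue
import threading


def run_pipeline(items, max_in_flight=4, writers=2):
    """Push items through a bounded queue so the producer blocks when writers fall behind."""
    pending = queue.Queue(maxsize=max_in_flight)  # the "choke" point
    persisted = []
    lock = threading.Lock()

    def writer():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more work for this writer
                return
            with lock:
                persisted.append(item)  # stand-in for a storage write

    threads = [threading.Thread(target=writer) for _ in range(writers)]
    for t in threads:
        t.start()
    for item in items:
        pending.put(item)  # blocks while the queue is full
    for _ in threads:
        pending.put(None)  # one sentinel per writer
    for t in threads:
        t.join()
    return persisted
```

The number of writers plays the role that the reducer count plays in the talk: it caps how many concurrent writes the storage layer ever sees.
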
  22. Data Quality
     • All large systems fail
       – Larger systems fail more often
       – Check and recover automatically.
     • Data validation happens before inference
       – Prevents garbage in, garbage out.
       – Gives early warning about data issues.
  23. Feedback Loops for Data Quality
     • Feedback loop automatically checks and recovers data
     • Logs from the feedback loop give an accurate account of what's going on with the system
     • A form of self-regulation
  24. Early Data Validation
     • Check Join Rates before ETL
     • Assertions about data constraints before inference
     • Serves as early warning for bad data at source.
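
A join-rate check of the kind listed above could be sketched as follows. The definition used here (the fraction of distinct left-side keys that find a match on the right) and the 95% threshold are assumptions; the talk does not say how Apixio computes or thresholds join rates.

```python
def join_rate(left_keys, right_keys):
    """Fraction of distinct left keys that have a match among the right keys."""
    left = set(left_keys)
    if not left:
        return 0.0
    return len(left & set(right_keys)) / len(left)


def check_join_rate(left_keys, right_keys, minimum=0.95):
    """Raise before ETL starts if too few keys would join (threshold is an assumption)."""
    rate = join_rate(left_keys, right_keys)
    if rate < minimum:
        raise ValueError(f"join rate {rate:.2%} is below the {minimum:.0%} threshold")
    return rate
```

Running a check like this before the expensive steps turns a silent drop in matched records into an immediate, visible failure at the source.
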
  25. Winning Ugly is still Winning!
     • You don't know the answers ahead of time.
     • Sometimes, you don't even know the questions ahead of time.
     • Quickly iterate and move in the right direction and you will get there.
  26. The Sorting Hat – The problem
     [Diagram labels: Endocrine and metabolic disorders; Next work Item; Next work Item]
     • Collisions
     • Duplicate Work
  27. Solution – Manually separate sets
     [Diagram labels: Endocrine and metabolic disorders; Next work Item; Next work Item]
     • Manually splitting is inefficient
     • Cannot service more than a handful of coders
     • Really hard to analyze activity by combining across sets
     • But – Customers were happy!
  28. Solution II – Simple Rule Based Prototype
     [Diagram labels: Endocrine and metabolic disorders; Next work Item; Next work Item; SQLite DB; Queue; Regex Rules]
     • Limited flexibility on the rules – still better than the manual process
     • Can service more coders, but not a lot of coders
     • But – Customers were thrilled!
  29. Solution III – Complex Automated Workflow management system
     [Diagram labels: Endocrine and metabolic disorders; Event Pub/Sub; Automated Work Item generation; Complex DNF Rule Engine; Reporting; Next work Item]
     • Fully flexible rule system.
     • Completely Automated
     • And – Customers are ecstatic!
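
For readers unfamiliar with the term, a DNF (disjunctive normal form) rule is an OR of clauses, where each clause is an AND of simple conditions. The sketch below shows how such a rule could be evaluated against a work item. The rule encoding (tuples of field, expected value, and a negation flag) and the field names in the usage example are hypothetical; the talk does not describe the engine's actual rule format.

```python
def literal_holds(item, literal):
    """A literal is (field, expected_value, negated)."""
    field, expected, negated = literal
    matches = item.get(field) == expected
    return matches != negated  # flips the result when the literal is negated


def dnf_matches(item, rule):
    """rule is a list of clauses (OR); each clause is a list of literals (AND)."""
    return any(all(literal_holds(item, lit) for lit in clause) for clause in rule)


# Hypothetical rule: (category is diabetes AND source is not scan) OR (priority is high)
example_rule = [
    [("category", "diabetes", False), ("source", "scan", True)],
    [("priority", "high", False)],
]
```

Because any boolean condition can be rewritten in DNF, a rule engine built on this shape can stay a simple two-level loop while still expressing arbitrary routing rules.
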
  30. Winning ugly is still winning
     • Sorting Hat
     • Coordinator
       – Started with plain Map-Reduce - not enough utilization
       – Played with fair scheduling and queues – either insufficient utilization / massive overcommit (causing blocks)
         • Dynamic re-prioritizing didn't work.
       – Built our own
         • Gets us near 100% utilization
         • Reprioritize dynamically
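
The talk does not describe how the Coordinator is implemented. Purely as an illustration of what "reprioritize dynamically" can mean, here is a small priority queue that lets a waiting job's priority be changed after submission, using the standard lazy-invalidation pattern on top of `heapq`. The `JobQueue` class and its method names are hypothetical.

```python
import heapq
import itertools


class JobQueue:
    """Priority queue whose entries can be re-prioritized while they wait."""

    def __init__(self):
        self._heap = []
        self._live = {}  # job_id -> its current heap entry
        self._counter = itertools.count()  # tie-breaker keeps FIFO order per priority

    def submit(self, job_id, priority):
        """Add a job, or change its priority if it is already queued (lower runs first)."""
        if job_id in self._live:
            self._live[job_id][2] = None  # mark the old entry as stale
        entry = [priority, next(self._counter), job_id]
        self._live[job_id] = entry
        heapq.heappush(self._heap, entry)

    def next_job(self):
        """Pop the highest-priority live job, skipping stale entries; None if empty."""
        while self._heap:
            _, _, job_id = heapq.heappop(self._heap)
            if job_id is not None:
                del self._live[job_id]
                return job_id
        return None
```

For example, resubmitting a queued OCR job with a lower number moves it ahead of everything else without rebuilding the queue.
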
  31. Outline
     • What is Apixio
       – Intro Blurb
       – What we do and why we do it
     • Apixio's foray into big-data
       – Our first system
       – Our second system
         • De-normalized patient model
         • One monster job doing everything
         • Update model
         • Impedance mismatches
       – Our third system
         • Separate I/O and Computation Jobs
         • Move to an append only model
         • "Somewhat" denormalized data (lots of small denormalized chunks)
         • Logging – log everything in JSON
         • Coordinator – a custom job management system.
     • Lessons Learnt
       – Log Everything (and if you can log as JSON, even better)
       – Logging is Data
       – Impedance mismatches can kill performance of distributed systems – make sure you don't overdrive any single component.
       – Cassandra likes an append only model – if you are building systems using Cassandra, try to build them around append only models.
       – Lambda architecture - we had arrived here – if we had started
       – If you trust that you are going in the right direction and quickly iterate, you will get there successfully – winning all the way. same place – and no-one else got there by planning to get there.
  32. Logging is Everything
     – Logging is everything - stories
       • Gives you answers to mysterious problems (us dropping documents)
       • Gives you answers to new problems (Coder performance, gumby – poky)
       • Gives you answers to problems before they happen.
         – Feasibility studies
         – Validation, coder progress, managing the business
     – Take aways
       • Logging is for more than debugging.
       • Log data, not messages (simple structured formats like JSON are worth more than detailed messages)
       • Make it accessible to everyone (Hive/Impala)
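
"Log data, not messages" can be illustrated with a short sketch: each log line is one JSON object, so the log can later be queried like any other dataset. The helper names (`log_event`, `count_by`) and the field names in the usage example are hypothetical.

```python
import io
import json
import time
from collections import Counter


def log_event(stream, event, **fields):
    """Write one JSON object per line instead of a free-text message."""
    record = {"ts": time.time(), "event": event, **fields}
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def count_by(lines, key):
    """Treat the log as data: count records by the value of one field."""
    return Counter(json.loads(line).get(key) for line in lines if line.strip())


# Usage: write a few records to an in-memory stream, then aggregate them
buffer = io.StringIO()
log_event(buffer, "doc_uploaded", doc_id="D1", kind="text")
log_event(buffer, "doc_uploaded", doc_id="D2", kind="scan")
log_event(buffer, "doc_uploaded", doc_id="D3", kind="scan")
kinds = count_by(buffer.getvalue().splitlines(), "kind")
```

In the setup the talk describes, the same one-object-per-line files would be exposed through Hive or Impala so that anyone can run this sort of aggregation with SQL.
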
  33. Respect the data
     • Impedance matching
       – Processing different types of data takes different times
       – Different persistence systems have different write rates (Cassandra vs. MySQL)
       – Using reducers to control input rates increases throughput.
     • Append only architecture (append only patient objects, annotations and user actions are append only)
     • Understand your infrastructure pieces
     • Insert QA steps along your ETL pipeline, because data can have more surprises than you expect.
  34. Build It when you need it
     • Fast moving data throws more surprises than you expect
     • You don't know all the answers ahead of time.
     • Invest in infrastructure
       – Coordinator
       – Sorting Hat
  35. What problem does Apixio solve?
     [Diagram labels: EHR coded data; EHR text documents; EHR scan documents; Claims; Eligibility; Provider files; Client -> Ingest Pipeline (Parse, OCR, Norm., Load); Patient Object Model; General Event Stream; HCC Event Stream; Quality Event Stream; Referral Event Stream; 3rd Party Event Stream; API; Clinical Knowledge Exchange; Care Optimizer; Quality Optimizer; HCC Optimizer; 3rd Party Event Stream Application]