
Architecture Choices for Big and Tiny Data Problems

unnati_xyz
February 24, 2017


Transcript

  1. FinTech | What are we solving?
     ▧ Evaluate college students to determine their creditworthiness
     ▧ Lack of credit history
     ▧ Tiny data
     ▧ Enrich data with alternate data sources
     ▧ Statistical modelling to evaluate students initially (see the sketch below)
     ▧ As user activity increases, build machine learning models to predict creditworthiness
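A minimal sketch of that initial statistical-modelling step on tiny, enriched data. The feature names, figures and the use of scikit-learn are illustrative assumptions; the deck does not name a library or schema.

```python
# Minimal sketch of scoring thin-file applicants with a simple statistical model.
# Feature names and values are illustrative placeholders, not the deck's actual data.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# A handful of applicants: bureau data is missing, so alternate sources
# (education, bill-payment regularity) stand in for credit history.
applicants = pd.DataFrame({
    "gpa":               [3.2, 2.1, 3.8, 2.9, 3.5, 2.4],
    "months_employed":   [6,   0,   12,  3,   9,   1],
    "on_time_bills_pct": [0.9, 0.4, 1.0, 0.7, 0.95, 0.5],
    "defaulted":         [0,   1,   0,   0,   0,    1],   # known outcomes for training
})

X = applicants.drop(columns="defaulted")
y = applicants["defaulted"]

# With this little data, a heavily regularised, interpretable model is safer
# than anything more complex; C controls the regularisation strength.
model = LogisticRegression(C=0.5).fit(X, y)

new_applicant = pd.DataFrame([{"gpa": 3.0, "months_employed": 4, "on_time_bills_pct": 0.8}])
print("default probability:", model.predict_proba(new_applicant)[0, 1])
```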
  2. FinTech | Thought process for Infrastructure
     ▧ Data velocity estimation for the next 6 months (back-of-envelope sketch below)
     ▧ Complexity of data science algorithms
     ▧ No. of calls being serviced by the data science APIs
     ▧ Cost: AWS instance x1, 8 GB RAM, 4 cores
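A back-of-envelope version of the kind of data-velocity estimate the slide refers to. Every figure below is an assumed placeholder, not a number from the project.

```python
# Back-of-envelope capacity estimate; all figures are assumed placeholders.
events_per_day  = 5_000        # expected records / API calls per day (assumption)
avg_event_bytes = 2 * 1024     # ~2 KB per stored record (assumption)
growth_factor   = 1.5          # head-room for six months of growth (assumption)
days            = 183          # roughly six months

raw_bytes = events_per_day * avg_event_bytes * days * growth_factor
print(f"~{raw_bytes / 1024**3:.1f} GiB of raw data over 6 months")
# At tiny-data scale like this, a single modest instance (8 GB RAM, 4 cores)
# comfortably covers both storage and the data science APIs.
```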
  3. FinTech | Learnings
     ▧ Small data problems are tricky
     ▧ Go after the low-hanging fruit first
     ▧ Need clever techniques
     ▧ Beware of data sanity with NoSQL (see the validation sketch below)
     ▧ Embracing data science early helps the business grow taller, stronger & sharper
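A small sketch of what "data sanity with NoSQL" can mean in practice: schema-less stores happily accept inconsistent documents, so validate before anything reaches a model. The field names, rules and the mention of pymongo are illustrative assumptions.

```python
# Validate documents pulled from a schema-less store before modelling.
# Field names and rules are hypothetical examples.
REQUIRED_FIELDS = {"student_id": str, "gpa": float, "monthly_income": (int, float)}

def is_sane(doc: dict) -> bool:
    """Return True only if the document has every required field with a usable type."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in doc or not isinstance(doc[field], expected_type):
            return False
    return 0.0 <= doc["gpa"] <= 4.0          # simple range check as an example

# In real use the documents would come from a NoSQL client such as pymongo's
# collection.find(); a plain list stands in so the sketch runs on its own.
documents = [
    {"student_id": "s1", "gpa": 3.4, "monthly_income": 1200},
    {"student_id": "s2", "gpa": "N/A", "monthly_income": 800},   # bad type sneaks in easily
]
clean = [d for d in documents if is_sane(d)]
print(f"kept {len(clean)} of {len(documents)} documents")
```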
  4. Campaign Management | What are we solving?
     ▧ Predict user behavior
     ▧ Business has amassed data over 2-3 years
     ▧ Educate the team about data science & its benefits
     ▧ Ideate & prioritize problems that can be solved
     ▧ RoI, pricing for new plugins
  5. Campaign Management | Thought process for Infrastructure
     ▧ 200+ million rows
     ▧ Parallel analytics data warehouse
     ▧ Data pipelines, automated workflows
     ▧ Distributed machine learning models
     ▧ Prediction as a Service (see the sketch below)
     ▧ Cost: Dedicated bare-metal server, 32 GB RAM | 8 cores | 1 TB SSD
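A minimal "Prediction as a Service" sketch: a small HTTP endpoint in front of a trained model. The deck does not name a web framework; Flask, the model path and the feature names here are assumptions.

```python
# Minimal prediction endpoint serving a model trained offline by the pipeline.
# The model file and feature names are placeholders.
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as fh:          # model artefact produced by the training job
    model = pickle.load(fh)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    features = [[payload["recency"], payload["frequency"], payload["monetary"]]]
    return jsonify({"will_convert": int(model.predict(features)[0])})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```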
  6. Campaign Management | Learnings
     ▧ PostgreSQL read replicas pause long-running queries
     ▧ Understand PostgreSQL WALs
     ▧ Data pipelines break; exception handling, notifications and logging are of utmost importance
     ▧ We wired luigi exceptions to Slack for notifications (see the sketch below)
     ▧ Pandas transformations are slow for large datasets; PySpark to the rescue!
     ▧ Use monitoring tools like Munin for profiling
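One way to wire luigi exceptions to Slack is via luigi's event-handler hook, as sketched below; the deck does not say exactly how the team did it, and the webhook URL is a placeholder.

```python
# Register a global failure handler that posts task exceptions to a Slack
# incoming webhook. The URL is a placeholder; error handling is kept minimal.
import luigi
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

@luigi.Task.event_handler(luigi.Event.FAILURE)
def notify_slack_on_failure(task, exception):
    """Called by luigi whenever any task's run() raises."""
    message = f":rotating_light: {task} failed: {exception!r}"
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=5)
```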
  7. Healthcare | What are we solving?
     ▧ Analytics on healthcare spend
     ▧ Medical claims - many data providers - no standard format (see the normalisation sketch below)
     ▧ Data volume: 500 M rows to start + high velocity
     ▧ Robust data ingestion and data cleaning system
     ▧ Data security and HIPAA compliance
     ▧ The data pipeline is the heart of the platform
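A sketch of the "many providers, no standard" ingestion problem: map each provider's feed onto one canonical claims schema before anything else touches the data. The column names and mappings are invented for illustration.

```python
# Normalise provider-specific claim feeds into one canonical schema.
# Column names and mappings are illustrative only.
import pandas as pd

CANONICAL_COLUMNS = ["claim_id", "member_id", "service_date", "paid_amount"]

PROVIDER_MAPPINGS = {
    "provider_a": {"ClaimNo": "claim_id", "MemberID": "member_id",
                   "DOS": "service_date", "PaidAmt": "paid_amount"},
    "provider_b": {"claim_number": "claim_id", "subscriber": "member_id",
                   "date_of_service": "service_date", "amount_paid": "paid_amount"},
}

def normalise_claims(raw: pd.DataFrame, provider: str) -> pd.DataFrame:
    """Rename provider-specific columns and keep only the canonical set."""
    renamed = raw.rename(columns=PROVIDER_MAPPINGS[provider])
    missing = set(CANONICAL_COLUMNS) - set(renamed.columns)
    if missing:                      # fail loudly instead of silently ingesting junk
        raise ValueError(f"{provider} feed is missing {missing}")
    return renamed[CANONICAL_COLUMNS]
```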
  8. Healthcare | Thought process for Infrastructure
     ▧ Flexible schema calling out for NoSQL
     ▧ Massive ingestion & cleaning tasks
     ▧ Denormalize + wide format
     ▧ 100s of transformation & analytics tasks
     ▧ Luigi to the rescue (see the sketch below)
     ▧ Spark for transformation & analytics
     ▧ Database Instance: 32 GB RAM | 8 cores | 5 TB SSD
     ▧ Application Instance (API Server): 4 GB RAM | 2 cores | 500 GB SSD
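A sketch of a Luigi task wrapping one Spark transformation into the wide, denormalized format the slide mentions. The paths, join keys and task name are assumptions, not the project's actual pipeline.

```python
# One Luigi task running one Spark job; the output target lets downstream
# tasks depend on it. Paths and the transformation itself are hypothetical.
import datetime
import luigi
from pyspark.sql import SparkSession

class DenormaliseClaims(luigi.Task):
    run_date = luigi.DateParameter(default=datetime.date.today())

    def output(self):
        # Luigi checks this target to decide whether the task still needs to run.
        return luigi.LocalTarget(f"/data/wide/claims_{self.run_date}.parquet")

    def run(self):
        spark = SparkSession.builder.appName("denormalise-claims").getOrCreate()
        claims = spark.read.parquet("/data/clean/claims")
        members = spark.read.parquet("/data/clean/members")
        # Denormalise into the wide format mentioned on the slide.
        claims.join(members, on="member_id", how="left") \
              .write.mode("overwrite").parquet(self.output().path)
        spark.stop()

if __name__ == "__main__":
    luigi.build([DenormaliseClaims()], local_scheduler=True)
```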
  9. Healthcare | Learnings
     ▧ Authorizations for databases are very important
     ▧ Aim to parallelize ingestion tasks (see the sketch below)
     ▧ Data redundancy is totally fine for data science
     ▧ Polyglot of services - use the right tool for each job
     ▧ Understand business expectations & the landscape before jumping into architecture
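A sketch of parallelizing independent ingestion tasks with a process pool; the ingest function and file locations are placeholders.

```python
# Process many independent input files concurrently instead of one after another.
# The ingest function and file paths are placeholders.
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

def ingest_one(path: Path) -> int:
    """Parse, clean and load a single file; return the number of rows handled."""
    rows = path.read_text().splitlines()      # stand-in for real parsing/loading
    return len(rows)

def ingest_all(paths, workers: int = 8) -> int:
    total = 0
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(ingest_one, p): p for p in paths}
        for future in as_completed(futures):
            total += future.result()          # re-raises if a worker failed
    return total

if __name__ == "__main__":
    files = sorted(Path("/data/incoming").glob("*.csv"))
    print("rows ingested:", ingest_all(files))
```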