
Architecture Choices for Big and Tiny Data Problems

unnati_xyz
February 24, 2017


Transcript

  1. FinTech | What are we solving?
     ▧ Evaluate college students to determine their creditworthiness
     ▧ Lack of credit history
     ▧ Tiny data
     ▧ Enrich data with alternate data sources
     ▧ Statistical modelling to evaluate students initially (see the sketch below)
     ▧ As user activity increases, build machine learning models to predict creditworthiness
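A minimal sketch of that initial statistical-modelling step on tiny, enriched data. The feature names, figures and the use of scikit-learn are illustrative assumptions; the deck does not name a library or schema.

```python
# Minimal sketch of scoring thin-file applicants with a simple statistical model.
# Feature names and values are illustrative placeholders, not the deck's actual data.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# A handful of applicants: bureau data is missing, so alternate sources
# (education, bill-payment regularity) stand in for credit history.
applicants = pd.DataFrame({
    "gpa":               [3.2, 2.1, 3.8, 2.9, 3.5, 2.4],
    "months_employed":   [6,   0,   12,  3,   9,   1],
    "on_time_bills_pct": [0.9, 0.4, 1.0, 0.7, 0.95, 0.5],
    "defaulted":         [0,   1,   0,   0,   0,    1],   # known outcomes for training
})

X = applicants.drop(columns="defaulted")
y = applicants["defaulted"]

# With this little data, a heavily regularised, interpretable model is safer
# than anything more complex; C controls the regularisation strength.
model = LogisticRegression(C=0.5).fit(X, y)

new_applicant = pd.DataFrame([{"gpa": 3.0, "months_employed": 4, "on_time_bills_pct": 0.8}])
print("default probability:", model.predict_proba(new_applicant)[0, 1])
```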
  2. FinTech | Thought process for Infrastructure
     ▧ Data velocity estimation for the next 6 months (back-of-envelope sketch below)
     ▧ Complexity of data science algorithms
     ▧ No. of calls being serviced by the data science APIs
     ▧ Cost: AWS instance x1, 8 GB RAM, 4 cores
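A back-of-envelope version of the kind of data-velocity estimate the slide refers to. Every figure below is an assumed placeholder, not a number from the project.

```python
# Back-of-envelope capacity estimate; all figures are assumed placeholders.
events_per_day  = 5_000        # expected records / API calls per day (assumption)
avg_event_bytes = 2 * 1024     # ~2 KB per stored record (assumption)
growth_factor   = 1.5          # head-room for six months of growth (assumption)
days            = 183          # roughly six months

raw_bytes = events_per_day * avg_event_bytes * days * growth_factor
print(f"~{raw_bytes / 1024**3:.1f} GiB of raw data over 6 months")
# At tiny-data scale like this, a single modest instance (8 GB RAM, 4 cores)
# comfortably covers both storage and the data science APIs.
```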
  3. FinTech | Learnings
     ▧ Small data problems are tricky
     ▧ Go after the low-hanging fruit first
     ▧ Need clever techniques
     ▧ Beware of data sanity with NoSQL (see the validation sketch below)
     ▧ Embracing data science early helps the business grow taller, stronger & sharper
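A small sketch of what "data sanity with NoSQL" can mean in practice: schema-less stores happily accept inconsistent documents, so validate before anything reaches a model. The field names, rules and the mention of pymongo are illustrative assumptions.

```python
# Validate documents pulled from a schema-less store before modelling.
# Field names and rules are hypothetical examples.
REQUIRED_FIELDS = {"student_id": str, "gpa": float, "monthly_income": (int, float)}

def is_sane(doc: dict) -> bool:
    """Return True only if the document has every required field with a usable type."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in doc or not isinstance(doc[field], expected_type):
            return False
    return 0.0 <= doc["gpa"] <= 4.0          # simple range check as an example

# In real use the documents would come from a NoSQL client such as pymongo's
# collection.find(); a plain list stands in so the sketch runs on its own.
documents = [
    {"student_id": "s1", "gpa": 3.4, "monthly_income": 1200},
    {"student_id": "s2", "gpa": "N/A", "monthly_income": 800},   # bad type sneaks in easily
]
clean = [d for d in documents if is_sane(d)]
print(f"kept {len(clean)} of {len(documents)} documents")
```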
  4. Campaign Management | What are we solving?
     ▧ Predict user behavior
     ▧ Business has amassed data over 2-3 years
     ▧ Educate the team about data science & its benefits
     ▧ Ideate & prioritize problems that can be solved
     ▧ RoI, pricing for new plugins
  5. Campaign Management | Thought process for Infrastructure
     ▧ 200+ million rows
     ▧ Parallel analytics data warehouse
     ▧ Data pipelines, automated workflows
     ▧ Distributed machine learning models
     ▧ Prediction as a Service (see the sketch below)
     ▧ Cost: Dedicated bare-metal server, 32 GB RAM | 8 cores | 1 TB SSD
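A minimal "Prediction as a Service" sketch: a small HTTP endpoint in front of a trained model. The deck does not name a web framework; Flask, the model path and the feature names here are assumptions.

```python
# Minimal prediction endpoint serving a model trained offline by the pipeline.
# The model file and feature names are placeholders.
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as fh:          # model artefact produced by the training job
    model = pickle.load(fh)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    features = [[payload["recency"], payload["frequency"], payload["monetary"]]]
    return jsonify({"will_convert": int(model.predict(features)[0])})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```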
  6. Campaign Management | Learnings
     ▧ PostgreSQL read replicas pause long-running queries
     ▧ Understand PostgreSQL WALs
     ▧ Data pipelines break; exception handling, notifications and logging are of utmost importance
     ▧ We wired luigi exceptions to Slack for notifications (see the sketch below)
     ▧ Pandas transformations are slow for large datasets; PySpark to the rescue!
     ▧ Use monitoring tools like Munin for profiling
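One way to wire luigi exceptions to Slack is via luigi's event-handler hook, as sketched below; the deck does not say exactly how the team did it, and the webhook URL is a placeholder.

```python
# Register a global failure handler that posts task exceptions to a Slack
# incoming webhook. The URL is a placeholder; error handling is kept minimal.
import luigi
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

@luigi.Task.event_handler(luigi.Event.FAILURE)
def notify_slack_on_failure(task, exception):
    """Called by luigi whenever any task's run() raises."""
    message = f":rotating_light: {task} failed: {exception!r}"
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=5)
```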
  7. Healthcare | What are we solving?
     ▧ Analytics on healthcare spend
     ▧ Medical claims - many data providers - no standard format (see the normalisation sketch below)
     ▧ Data volume: 500 M rows to start + high velocity
     ▧ Robust data ingestion and data cleaning system
     ▧ Data security and HIPAA compliance
     ▧ The data pipeline is the heart of the platform
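A sketch of the "many providers, no standard" ingestion problem: map each provider's feed onto one canonical claims schema before anything else touches the data. The column names and mappings are invented for illustration.

```python
# Normalise provider-specific claim feeds into one canonical schema.
# Column names and mappings are illustrative only.
import pandas as pd

CANONICAL_COLUMNS = ["claim_id", "member_id", "service_date", "paid_amount"]

PROVIDER_MAPPINGS = {
    "provider_a": {"ClaimNo": "claim_id", "MemberID": "member_id",
                   "DOS": "service_date", "PaidAmt": "paid_amount"},
    "provider_b": {"claim_number": "claim_id", "subscriber": "member_id",
                   "date_of_service": "service_date", "amount_paid": "paid_amount"},
}

def normalise_claims(raw: pd.DataFrame, provider: str) -> pd.DataFrame:
    """Rename provider-specific columns and keep only the canonical set."""
    renamed = raw.rename(columns=PROVIDER_MAPPINGS[provider])
    missing = set(CANONICAL_COLUMNS) - set(renamed.columns)
    if missing:                      # fail loudly instead of silently ingesting junk
        raise ValueError(f"{provider} feed is missing {missing}")
    return renamed[CANONICAL_COLUMNS]
```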
  8. Healthcare | Thought process for Infrastructure
     ▧ Flexible schema calling out for NoSQL
     ▧ Massive ingestion & cleaning tasks
     ▧ Denormalize + wide format
     ▧ 100s of transformation & analytics tasks
     ▧ Luigi to the rescue (see the sketch below)
     ▧ Spark for transformation & analytics
     ▧ Database Instance: 32 GB RAM | 8 cores | 5 TB SSD
     ▧ Application Instance (API Server): 4 GB RAM | 2 cores | 500 GB SSD
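A sketch of a Luigi task wrapping one Spark transformation into the wide, denormalized format the slide mentions. The paths, join keys and task name are assumptions, not the project's actual pipeline.

```python
# One Luigi task running one Spark job; the output target lets downstream
# tasks depend on it. Paths and the transformation itself are hypothetical.
import datetime
import luigi
from pyspark.sql import SparkSession

class DenormaliseClaims(luigi.Task):
    run_date = luigi.DateParameter(default=datetime.date.today())

    def output(self):
        # Luigi checks this target to decide whether the task still needs to run.
        return luigi.LocalTarget(f"/data/wide/claims_{self.run_date}.parquet")

    def run(self):
        spark = SparkSession.builder.appName("denormalise-claims").getOrCreate()
        claims = spark.read.parquet("/data/clean/claims")
        members = spark.read.parquet("/data/clean/members")
        # Denormalise into the wide format mentioned on the slide.
        claims.join(members, on="member_id", how="left") \
              .write.mode("overwrite").parquet(self.output().path)
        spark.stop()

if __name__ == "__main__":
    luigi.build([DenormaliseClaims()], local_scheduler=True)
```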
  9. Healthcare | Learnings
     ▧ Authorizations for databases are very important
     ▧ Aim to parallelize ingestion tasks (see the sketch below)
     ▧ Data redundancy is totally fine for data science
     ▧ Polyglot of services - use the right tool for each job
     ▧ Understand business expectations & the landscape before jumping into architecture
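A sketch of parallelizing independent ingestion tasks with a process pool; the ingest function and file locations are placeholders.

```python
# Process many independent input files concurrently instead of one after another.
# The ingest function and file paths are placeholders.
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

def ingest_one(path: Path) -> int:
    """Parse, clean and load a single file; return the number of rows handled."""
    rows = path.read_text().splitlines()      # stand-in for real parsing/loading
    return len(rows)

def ingest_all(paths, workers: int = 8) -> int:
    total = 0
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(ingest_one, p): p for p in paths}
        for future in as_completed(futures):
            total += future.result()          # re-raises if a worker failed
    return total

if __name__ == "__main__":
    files = sorted(Path("/data/incoming").glob("*.csv"))
    print("rows ingested:", ingest_all(files))
```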