Roksolana Diachuk
• Big Data Developer at Captify
• Diversity & Inclusion ambassador for Captify Kyiv office
• Women Who Code Kyiv Data Engineering Lead and Mentor
• Speaker and traveller
Slide 3
Slide 3 text
Agenda
1. What is big data
2. Difference between big data and data science
3. Practical cases
Slide 4
Slide 4 text
BIG DATA
Slide 5
Slide 5 text
BIG DATA
Slide 6
Slide 6 text
BIG DATA
Slide 7
Slide 7 text
5 Vs
Slide 8
Slide 8 text
Volume
Gigabytes Terabytes
Petabytes Exabytes
Slide 9
Slide 9 text
Velocity
Batch data Streaming data
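The batch versus streaming distinction can be illustrated with a minimal sketch. The `batch_mean` function, the `StreamingMean` class, and the sample readings are hypothetical and are not taken from the talk; they only show that a batch job sees the whole dataset at once, while a streaming job updates its result as each record arrives.

```python
from statistics import mean


def batch_mean(values):
    """Batch: the whole dataset is available before processing starts."""
    return mean(values)


class StreamingMean:
    """Streaming: records arrive one at a time; the result is updated incrementally."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        # Only a running count and sum are kept, never the full history.
        self.count += 1
        self.total += value
        return self.total / self.count


readings = [10.0, 12.0, 11.0, 15.0]  # made-up sample data

stream = StreamingMean()
running = [stream.update(v) for v in readings]
```

Both approaches agree once every record has been seen, but the streaming version has an answer available after each record.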
Slide 10
Slide 10 text
Variety
Slide 11
Slide 11 text
Veracity
Slide 12
Slide 12 text
Value
Slide 13
Slide 13 text
Big data
Structured Unstructured
Semi-structured
Slide 14
Slide 14 text
BIG DATA DATA SCIENCE
Slide 15
Slide 15 text
Garbage in - garbage out
Slide 16
Slide 16 text
General ML workflow
Slide 17
Slide 17 text
RDBMS ML model Metrics
Slide 18
Slide 18 text
RDBMS ML model Metrics
BIG DATA DATA SCIENCE
Slide 19
Slide 19 text
Background
Slide 20
Slide 20 text
Data Scientist
Data Analyst
Data Engineer
Data
Communication
Math, Stats, Algorithms
Software Engineering
Slide 21
Slide 21 text
Software engineer
Slide 22
Slide 22 text
Data analysis
Slide 23
Slide 23 text
RESPONSIBILITIES
Slide 24
Slide 24 text
• Data processing and cleaning
• Developing data pipelines
• Storing data
• Infrastructure
• Data Pre-Processing
• Data Analysis
• Building ML models
• Tuning ML models
Slide 25
Slide 25 text
No content
Slide 26
Slide 26 text
No content
Slide 27
Slide 27 text
PRACTICAL CASES
Slide 28
Slide 28 text
No content
Slide 29
Slide 29 text
CASE 1
Slide 30
Slide 30 text
Anomalies detected
Streaming data
Batch data
Anomaly detection
Slide 31
Slide 31 text
Prophet
Tableau
Anomaly detection
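As a rough illustration of what anomaly detection over a metric series does, here is a minimal sketch. It does not use Prophet; it substitutes a simple rolling z-score so that it runs with the standard library only. The `detect_anomalies` function, its `window` and `threshold` parameters, and the sample traffic values are all assumptions made for this example.

```python
from statistics import mean, pstdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices of points that deviate from the mean of the preceding
    `window` points by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as an anomaly.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


traffic = [100, 102, 98, 101, 99, 100, 250, 101, 99, 100]  # made-up sample data
spikes = detect_anomalies(traffic)
```

With this sample, only the spike at index 6 is flagged. A forecasting model such as Prophet would replace the rolling mean with a prediction that accounts for trend and seasonality.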
Slide 32
Slide 32 text
CASE 2
Slide 33
Slide 33 text
Data Lake
Feature engineering
Credit score
Credit scoring
Slide 34
Slide 34 text
Psycopg
TensorFlow
Credit scoring
Slide 35
Slide 35 text
Psycopg
TensorFlow
Slide 36
Slide 36 text
CASE 3
Slide 37
Slide 37 text
Data Lake
Recommendations
User and item ratings
Recommender systems
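The step from user and item ratings to recommendations can be sketched as follows. The slides do not say which algorithm was used, so this example assumes user-based collaborative filtering with cosine similarity. The `ratings` data and the `cosine` and `recommend` functions are hypothetical.

```python
from math import sqrt

# Made-up ratings: user -> {item: rating}
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "D": 5},
    "eve": {"C": 5, "D": 1},
}


def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, top_n=1):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]
```

For example, `recommend("ann", ratings)` suggests item `D`, because the most similar user rated it highly.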
Slide 38
Slide 38 text
No content
Slide 39
Slide 39 text
CASE 4
Slide 40
Slide 40 text
ETL
ETL
ETL
Data Lake
Data lake
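An ETL job feeding a data lake can be sketched in three small functions. Everything here is an assumption made for illustration: the CSV columns, the cleaning rules, and the in-memory dictionary standing in for the lake (a real lake would be object storage or a distributed file system).

```python
import csv
import io
import json

# Made-up raw input with one missing country and one malformed amount.
RAW_CSV = """user_id,country,amount
1,UA,10.5
2,,7.0
3,PL,not_a_number
4,UA,3.25
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete or malformed rows and cast field types."""
    clean = []
    for row in rows:
        if not row["country"]:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        clean.append(
            {"user_id": int(row["user_id"]), "country": row["country"], "amount": amount}
        )
    return clean


def load(rows, lake):
    """Load: append JSON records to a partition keyed by country."""
    for row in rows:
        lake.setdefault(row["country"], []).append(json.dumps(row))
    return lake


lake = load(transform(extract(RAW_CSV)), {})
```

Only the two valid `UA` rows reach the lake; the cleaning step is where "garbage in, garbage out" is addressed.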
Slide 41
Slide 41 text
Tableau
Data lake
Slide 42
Slide 42 text
CONCLUSIONS
Slide 43
Slide 43 text
Differences
• Goals
• Responsibilities
• Background
• Results of work
Slide 44
Slide 44 text
dead_flowers22
roksolana-d
roksolanadiachuk
roksolanad
My contact info