Goals of this talk • What Big Data actually is or isn’t • Introduce predictive modeling concepts • Contrast predictive analytics vs descriptive analytics
“Big data is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.” – Wikipedia
Not Big Data • Data that fits on a spreadsheet • Data that can be analyzed in RAM (< 10 GB) • Data operations that can be performed quickly by a traditional database, e.g. single node PostgreSQL server
Canonical Big Data problem: indexing the Web • Inverted index on terabytes of text data • Process each HTML page as a URL + bag of words • For each word, aggregate the list of page URLs • 2 billion HTML pages: 100 TB, >10 days just to read sequentially
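As a minimal sketch of the indexing steps above: each page is reduced to a URL plus a bag of words, and the index maps each word to the URLs that contain it. The toy pages, the `build_inverted_index` helper, and the 100 MB/s sequential read rate are illustrative assumptions, not figures from the talk.

```python
import re
from collections import defaultdict

def build_inverted_index(pages):
    """Map each word to the sorted list of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages:
        # Bag of words: lowercase alphanumeric tokens, duplicates ignored.
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}

# Hypothetical toy corpus standing in for billions of HTML pages.
pages = [
    ("http://a.example/", "Big data is big"),
    ("http://b.example/", "Small data fits in RAM"),
]
index = build_inverted_index(pages)

# Back-of-the-envelope read time, assuming a single disk at 100 MB/s.
total_bytes = 100e12   # 100 TB
read_rate = 100e6      # bytes per second (assumed)
days_to_read = total_bytes / read_rate / 86400
```

Under that assumed rate the sequential read alone takes about 11.6 days, which is in line with the ">10 days" figure on the slide.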
Other Big Data examples • GSM location event log from telco • Transaction log of a big retail network • Raw traffic data on a large website • Activity records for a service with 10s of millions of users
Not Big Data • Poll data (~10K data points) • Census data (~10M data points) • Real estate transaction data (~10M data points) • Any dataset publicly available for download
Predictive Modeling • Automated predictions of outcome on new data • Alternative to hard-coded rules written by experts • Extract the structure of historical data • Statistical tools to summarize the training data into an executable predictive model
Predictive Modeling Data Flow • Training: text docs, images, sounds, transactions + labels → feature vectors → Machine Learning Algorithm → Model • Prediction: new text doc, image, sound, transaction → feature vector → Model → expected label
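The data flow can be sketched end to end with a deliberately tiny learner. This is a minimal illustration, assuming a hypothetical `extract_features` function and a nearest-centroid classifier chosen only for brevity; the talk does not prescribe a particular algorithm.

```python
from collections import defaultdict

def extract_features(text):
    """Hypothetical feature extraction: word count and number of '!'."""
    return [float(len(text.split())), float(text.count("!"))]

def train(feature_vectors, labels):
    """Summarize the training data as one centroid per label."""
    grouped = defaultdict(list)
    for vector, label in zip(feature_vectors, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }

def predict(model, vector):
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

# Training side: raw documents + labels -> feature vectors -> model.
training_docs = [
    "Win money now!!!",
    "Free prize!!",
    "Meeting moved to Monday at noon",
    "Please review the attached report today",
]
training_labels = ["spam", "spam", "ham", "ham"]
model = train([extract_features(doc) for doc in training_docs], training_labels)

# Prediction side: new document -> feature vector -> expected label.
expected_label = predict(model, extract_features("Claim your prize!!!"))
```

The model here is just a dictionary of per-label averages, which makes the "summarize the training data into an executable model" idea concrete.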
Descriptive Statistics • Ex: Sales by (day, month, year) x region • Graphical visualization: get insights, tell a story to explain what’s happening in the data • Realm of Business Intelligence: reports & dashboards for managers • A wrong decision can be very costly • Small number of important decisions made by a human
Predictive Statistics • Ex: Movie recommendations or targeted ads • Embedded in a user service to make it more useful / attractive / profitable • A wrong individual decision is not costly • Large number of small automated decisions • Humans would not be fast enough to make the predictions
Mixed models • Predictive modeling to identify interesting subsets of the data • Ex: Fraud detection, churn forecasting • Help human decision makers focus on important cases • Human expert feedback to improve predictive models
• Big Data ≠ Predictive Analytics • Predictive Analytics: automated decision making embedded in products (e.g. recommenders); individual bad decisions are typically not costly • Descriptive Analytics: Business Intelligence, human decision making; individual bad decisions can be very costly
Back to the Regionator • Use raw daily telco logs • Group By (phone, day) to extract daily trips • Join By GPS coordinates to “departement” names • Filter out small trips • Group By (home, work) “departements” • Count
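The pipeline steps above can be sketched on a handful of in-memory records. Everything concrete here is an assumption made for illustration: the event layout, the "first event is home, last event is work" rule, the 5 km threshold, and the nearest-center lookup that stands in for a real join against département boundaries.

```python
import math
from collections import Counter, defaultdict

# Hypothetical log events: (phone_id, day, hour, latitude, longitude).
events = [
    ("p1", "2014-03-03", 7, 48.86, 2.35),
    ("p1", "2014-03-03", 10, 48.80, 2.13),
    ("p2", "2014-03-03", 8, 48.86, 2.35),
    ("p2", "2014-03-03", 11, 48.80, 2.13),
    ("p3", "2014-03-03", 8, 48.86, 2.35),
    ("p3", "2014-03-03", 12, 48.87, 2.36),  # barely moved
]

# Hypothetical reference points standing in for real boundaries.
DEPARTEMENT_CENTERS = {"Paris": (48.86, 2.35), "Yvelines": (48.80, 2.13)}

def distance_km(a, b):
    """Rough planar distance, adequate for a sketch at this scale."""
    dlat = (a[0] - b[0]) * 111.0
    dlon = (a[1] - b[1]) * 111.0 * math.cos(math.radians(a[0]))
    return math.hypot(dlat, dlon)

def departement_of(point):
    """'Join' step: map GPS coordinates to the nearest center."""
    return min(
        DEPARTEMENT_CENTERS,
        key=lambda name: distance_km(point, DEPARTEMENT_CENTERS[name]),
    )

# Group by (phone, day).
by_phone_day = defaultdict(list)
for phone, day, hour, lat, lon in events:
    by_phone_day[(phone, day)].append((hour, (lat, lon)))

# Extract one trip per group (assumed: first event = home, last = work).
trips = []
for observations in by_phone_day.values():
    observations.sort()
    trips.append((observations[0][1], observations[-1][1]))

# Filter out small trips (the 5 km threshold is an assumption).
long_trips = [(h, w) for h, w in trips if distance_km(h, w) >= 5.0]

# Group by (home, work) départements and count.
commute_counts = Counter(
    (departement_of(h), departement_of(w)) for h, w in long_trips
)
```

On real logs each step would run on a distributed engine, but the output has the same shape: a small table of counts per (home, work) pair.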
Predictive Modeling Data Flow (same diagram as above, repeated per data scale) • Small data • Small / Medium data • Small / Medium data with [tool logo not preserved] • Big data with [tool logo not preserved]
From Big to Small • Feature extraction often shrinks data • Filter / Join / Group By / Count • Machine Learning performed on aggregates • Sampling for fast in-memory iterative modeling
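One way to draw the sample mentioned above from data too large to load at once is reservoir sampling, which keeps a fixed-size uniform sample in a single pass. The talk does not name a sampling method; this is one standard choice, and `reservoir_sample` is a hypothetical helper.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a kept item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# A generator stands in for records streamed from disk or a cluster.
records = (("user%d" % i, i % 7) for i in range(100_000))
sample = reservoir_sample(records, k=1000)
```

The sample fits in memory regardless of the stream length, so iterative modeling can then run on it locally.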
Type     # rooms  Surface (m²)  Floor  Public Transp.  School (km)  Flood plain
Apart.   3        65            2      Yes             1.0          No
House    5        110           NA     No              25.0         Yes
Duplex   4        95            4      Yes             0.5          No
(rows: samples, columns: features)
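A table like this has to become purely numeric feature vectors before most learning algorithms can use it. The sketch below shows one possible encoding; the choices made (one-hot for the type, 0/1 for Yes/No, a zero plus a "missing" indicator for the NA floor) and the field names are assumptions, not something the slide specifies.

```python
# Rows of the table above; None stands for the "NA" floor of the house.
samples = [
    {"type": "Apart.", "rooms": 3, "surface": 65, "floor": 2,
     "transport": "Yes", "school_km": 1.0, "flood": "No"},
    {"type": "House", "rooms": 5, "surface": 110, "floor": None,
     "transport": "No", "school_km": 25.0, "flood": "Yes"},
    {"type": "Duplex", "rooms": 4, "surface": 95, "floor": 4,
     "transport": "Yes", "school_km": 0.5, "flood": "No"},
]

TYPES = ["Apart.", "House", "Duplex"]

def to_feature_vector(sample):
    """Encode one sample as a list of floats (assumed encoding)."""
    # One-hot encoding of the categorical "type" column.
    one_hot_type = [1.0 if sample["type"] == t else 0.0 for t in TYPES]
    # Missing floor: store 0 and flag the gap in a separate indicator.
    floor_missing = 1.0 if sample["floor"] is None else 0.0
    floor = 0.0 if sample["floor"] is None else float(sample["floor"])
    return one_hot_type + [
        float(sample["rooms"]),
        float(sample["surface"]),
        floor,
        floor_missing,
        1.0 if sample["transport"] == "Yes" else 0.0,
        float(sample["school_km"]),
        1.0 if sample["flood"] == "Yes" else 0.0,
    ]

# One row per sample, one column per numeric feature.
feature_matrix = [to_feature_vector(s) for s in samples]
```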
• Parametric, e.g. linear models (traditional stats), vs non-parametric, e.g. Random Forests, Neural Networks (Machine Learning) • Understand a model with 10% accuracy vs blindly trust a model with 90% accuracy • Simple models, e.g. F = m a or F = - G m1 m2 / r^2, will not become false (or more false) because of big data • New problems can be tackled: computer vision, speech recognition, natural language understanding
• The (experimental) scientific method, as characterized by Karl Popper, is based on the falsifiability of formulated hypotheses • A theory is considered correct as long as its past predictions hold in new experiments • Machine learning train-validation-test splits and cross-validation are similar in spirit • An ML model is just a complex theory: correct as long as its predictions still hold
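The analogy with cross-validation can be made concrete: the model is fitted on part of the data, then its predictions are tested on held-out samples it has never seen. This is a minimal sketch with a hypothetical one-dimensional threshold model and made-up data; real projects would normally shuffle the samples and use a library implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples)
                 if i < start or i >= start + size]
        yield train, test
        start += size

def fit_threshold(xs, ys):
    """Toy 'theory': predict 1 when x exceeds the midpoint of the class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2

def cross_val_accuracy(xs, ys, k):
    """Fit on k-1 folds, then check the predictions on the held-out fold."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        threshold = fit_threshold([xs[i] for i in train],
                                  [ys[i] for i in train])
        hits = sum((xs[i] > threshold) == bool(ys[i]) for i in test)
        scores.append(hits / len(test))
    return scores

# Hypothetical data, interleaved so that every fold contains both classes.
xs = [1.0, 9.0, 2.0, 8.0, 1.5, 9.5, 2.5, 7.5, 0.5, 8.5]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
scores = cross_val_accuracy(xs, ys, k=5)
```

Each held-out fold plays the role of a new experiment: a low score on it would falsify the fitted model in the same way a failed prediction falsifies a theory.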