“A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”
• Data that can be analyzed in RAM (< 10 GB) • Data operations that can be performed quickly by a traditional database, e.g. a single-node PostgreSQL server
on terabytes of text data • Process each HTML page as a URL + a bag of words • For each word, aggregate the list of page URLs • 2 billion HTML pages: ~100 TB, >10 days just to read sequentially
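The per-word aggregation above is essentially the construction of an inverted index. A minimal sketch in pure Python (the URLs and word sets are invented placeholders; at web scale each step would run as a distributed job):

```python
from collections import defaultdict

# Toy corpus: each HTML page reduced to a URL + a bag of words.
pages = {
    "http://example.com/a": {"big", "data", "index"},
    "http://example.com/b": {"data", "search"},
    "http://example.com/c": {"index", "search", "data"},
}

def build_inverted_index(pages):
    """For each word, aggregate the set of page URLs containing it."""
    index = defaultdict(set)
    for url, words in pages.items():
        for word in words:
            index[word].add(url)
    return index

index = build_inverted_index(pages)
# index["data"] now maps to all three URLs
```

At 2 billion pages the same logic is expressed as a map step (emit `(word, url)` pairs) followed by a shuffle and reduce step grouping by word.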
• Alternative to hard-coded rules written by experts • Extract the structure of historical data • Statistical tools to summarize the training data into an executable predictive model
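As a toy illustration of summarizing training data into an executable predictive model, here is a minimal nearest-neighbor sketch in pure Python (the feature values and labels are invented for illustration; real systems would use a library such as scikit-learn):

```python
import math

# Hypothetical historical data: (feature vector, label) pairs.
training_data = [
    ((1.0, 1.0), "spam"),
    ((0.9, 1.2), "spam"),
    ((5.0, 5.5), "ham"),
    ((5.2, 4.8), "ham"),
]

def predict(x):
    """1-nearest-neighbor: the 'model' is just the training data itself,
    queried instead of a hand-written rule."""
    _, label = min(
        ((math.dist(x, xi), yi) for xi, yi in training_data),
        key=lambda t: t[0],
    )
    return label

predict((1.1, 0.9))  # nearest training points are "spam" examples
```

No expert wrote an `if` rule here: the decision boundary is implied by the historical examples alone.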
region • Graphical visualization: get insights, tell a story to explain what’s happening in the data • Realm of Business Intelligence: reports & dashboards for managers • A wrong decision can be very costly • Small number of important decisions made by a human
Embedded in a user service to make it more useful / attractive / profitable • A wrong individual decision is not costly • Large number of small automated decisions • Humans would not be fast enough to make the predictions
the data • Ex: Fraud detection, churn forecasting • Help human decision makers focus on important cases • Human expert feedback to improve predictive models
Predictive Analytics: automated decision making embedded in products (e.g. recommenders) • Individual bad decisions are typically not costly • Descriptive Analytics / Business Intelligence: human decision making • Individual bad decisions can be very costly
• Group by (phone, day) to extract daily trips • Join on GPS coordinates to get “departement” names • Filter out small trips • Group by (home, work) “departements” • Count
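The steps above can be sketched in pure Python (the phone IDs, coordinates and the `coord_to_departement` lookup are invented placeholders; a real pipeline would run these as distributed group-by/join stages):

```python
import math
from collections import Counter

# Hypothetical daily trips, already grouped by (phone, day):
# home and work GPS fixes for each phone and day.
daily_trips = [
    ("phone1", "2024-01-01", (48.8, 2.3), (48.9, 2.2)),
    ("phone1", "2024-01-02", (48.8, 2.3), (48.9, 2.2)),
    ("phone2", "2024-01-01", (45.7, 4.8), (45.8, 4.9)),
    ("phone3", "2024-01-01", (48.8, 2.3), (48.81, 2.31)),  # a small trip
]

# Invented lookup standing in for the geographic join on GPS coordinates.
def coord_to_departement(coords):
    lat, _ = coords
    return "75 - Paris" if lat > 48 else "69 - Rhone"

# Filter out small trips, join coordinates to departement names,
# group by the (home, work) departement pair, and count.
trip_counts = Counter(
    (coord_to_departement(home), coord_to_departement(work))
    for _, _, home, work in daily_trips
    if math.dist(home, work) > 0.05
)
```

The `Counter` over joined, filtered pairs plays the role of the final group-by-and-count stage.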
Random Forests, Neural Networks (Machine Learning) • Understand a model with 10% accuracy vs blindly trust a model with 90% accuracy • Simple models, e.g. F = m a or F = G m1 m2 / r^2, will not become false because of big data • New problems can be tackled: computer vision, speech recognition, natural language understanding
based on the falsifiability of formulated hypotheses • A theory is correct as long as its past predictions hold in new experiments • Machine learning train/validation/test splits and cross-validation are similar in spirit • An ML model is just a complex theory: correct as long as its predictions still hold
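The split-and-validate idea can be sketched in pure Python (the labels and the trivial majority-class "model" are invented for illustration): each held-out fold plays the role of a new experiment that could falsify the fitted model.

```python
from collections import Counter

labels = ["a", "a", "b", "a", "b", "a", "a", "b", "a", "a"]  # toy labeled data

def majority_class(train):
    """'Train' by summarizing the data into its most common label."""
    return Counter(train).most_common(1)[0][0]

def cross_val_accuracy(labels, k=5):
    """k-fold cross-validation: fit on k-1 folds, test on the held-out fold."""
    fold_size = len(labels) // k
    scores = []
    for i in range(k):
        test = labels[i * fold_size:(i + 1) * fold_size]
        train = labels[:i * fold_size] + labels[(i + 1) * fold_size:]
        model = majority_class(train)                    # fit on training folds
        acc = sum(y == model for y in test) / len(test)  # test on unseen data
        scores.append(acc)
    return sum(scores) / len(scores)

cross_val_accuracy(labels)  # → 0.7
```

Accuracy measured only on training data cannot falsify the model; the held-out folds can, which is what makes the analogy with testable scientific theories work.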