• Data that can be analyzed in RAM (< 10 GB)
• Data operations that can be performed quickly by a traditional database, e.g. a single-node PostgreSQL server
• Alternative to hard-coded rules written by experts
• Extract the structure of historical data
• Statistical tools to summarize the training data into an executable predictive model
Embedded in a user service to make it more useful / attractive / profitable
• A wrong individual decision is not costly
• Large number of small automated decisions
• Humans would not be fast enough to make the predictions
• Group by (phone, day) to extract daily trips
• Join by GPS coordinates to “departement” names
• Filter out small trips
• Group by (home, work) “departements”
• Count
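The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline actually used: the `PINGS` records, the flat kilometre grid, the `departement` lookup, the 5 km threshold, and the rule "first ping is home, farthest ping is work" are all hypothetical.

```python
from collections import Counter, defaultdict
from math import hypot

# Hypothetical pings: (phone_id, day, hour, x_km, y_km) on a flat toy grid.
PINGS = [
    ("a", "mon", 8, 1.0, 1.0),
    ("a", "mon", 13, 30.0, 2.0),
    ("a", "tue", 8, 1.0, 1.0),
    ("a", "tue", 13, 31.0, 2.0),
    ("b", "mon", 9, 2.0, 2.0),
    ("b", "mon", 14, 2.5, 2.0),  # short trip, expected to be filtered out
    ("c", "mon", 7, 29.0, 1.0),
    ("c", "mon", 12, 3.0, 1.0),
]


def departement(x_km, y_km):
    """Hypothetical spatial join: map coordinates to a 'departement' name."""
    return "West" if x_km < 15.0 else "East"


def commute_counts(pings, min_km=5.0):
    # Group by (phone, day) to extract daily trips.
    by_phone_day = defaultdict(list)
    for phone, day, hour, x, y in pings:
        by_phone_day[(phone, day)].append((hour, x, y))

    counts = Counter()
    for points in by_phone_day.values():
        points.sort()  # order the pings of one day by hour
        _, hx, hy = points[0]  # assumption: first ping of the day is home
        # Assumption: the ping farthest from home is the workplace.
        _, wx, wy = max(points, key=lambda p: hypot(p[1] - hx, p[2] - hy))

        # Join coordinates to departement names.
        home, work = departement(hx, hy), departement(wx, wy)

        # Filter out small trips.
        if hypot(wx - hx, wy - hy) < min_km:
            continue

        # Group by (home, work) and count.
        counts[(home, work)] += 1
    return counts


counts = commute_counts(PINGS)
```

Each step is a group, join, filter, or count, which is the kind of workload a single-node database or an in-memory script handles well when the data fits in RAM.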
Random Forests, Neural Networks (Machine Learning)
• Understand a model with 10% accuracy vs blindly trust a model with 90% accuracy
• Simple models, e.g. F = m a or F = - G m1 m2 / r^2, will not become false (or falser) because of big data
• New problems can be tackled: computer vision, speech recognition, natural language understanding
Based on the falsifiability of formulated hypotheses
• A theory is correct as long as its past predictions hold in new experiments
• Machine learning train-validation-test splits and cross-validation are similar in spirit
• An ML model is just a complex theory: correct as long as its predictions still hold
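The parallel with cross-validation can be made concrete with a small sketch. It fits the "theory" F = m a (a least-squares slope through the origin) on part of the data and checks its predictions on held-out measurements. The `DATA` values and the helper names `fit_slope` and `k_fold_errors` are hypothetical, chosen only for illustration.

```python
def fit_slope(pairs):
    """Least-squares slope through the origin for (x, y) pairs."""
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return num / den


def k_fold_errors(pairs, k=3):
    """Mean absolute error on each held-out fold of a k-fold split."""
    errors = []
    for i in range(k):
        # Every k-th sample starting at index i is held out as the "new experiment".
        test = pairs[i::k]
        train = [p for j, p in enumerate(pairs) if j % k != i]
        slope = fit_slope(train)  # the "theory" fitted on past data
        mae = sum(abs(y - slope * x) for x, y in test) / len(test)
        errors.append(mae)
    return errors


# Hypothetical (acceleration, force) measurements for a mass of about 2 kg.
DATA = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0), (4.0, 8.2), (5.0, 9.9), (6.0, 12.1)]

errors = k_fold_errors(DATA)
```

If the held-out errors stay small, the fitted model survives the test for now; a large error on new data would count against it, just as a failed experiment would for a physical theory.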