Understanding All Data Jargons

Pacmann AI
August 04, 2020

This talk explains the data jargon used in industry and academia.

Transcript

  1. About Us

    PACMANN AI is a research startup focusing on the application and development of machine learning algorithms. We have implemented several machine learning projects in different fields in Indonesia.

    Recent Projects
    • Crop Disease Prediction: In March 2017, we built a machine learning algorithm to detect crop disease using image recognition.
  2. About Us: Recent Projects

    • News Monitoring System: During the last quarter of 2017, we built a media monitoring and information extraction tool based on Natural Language Processing to identify sentiment towards specific topics.
    • Credit Scoring: We built a credit scoring model based on machine learning to minimize credit default and systematic risk for PT Permodalan Nasional Madani (PT PNM).
  3. About Us: Recent Projects

    • Logistics Optimization: In 2019-2020 we worked for one of Sinarmas’ startups, Bizzy Indonesia. We built several services to optimize their core logistics business: vehicle routing optimization “Truck Way”, salesman optimization “Field Force”, a credit scoring system and recommendation system “Tokosmart”, Product-Toko visual recognition, and an internal machine learning platform.
  4. ADITYO SANJAYA, RIYAD RIVANDI, BADARUDDIN R MOTIK

    • He is the CTO of ML startup Pacmann ai. He was a Senior Data Engineer at Bizzy. Relevant experience: built a marketing platform for Sampoerna, Qubicle; worked as a developer for Mivo, Broadcast Media TV; built a decision optimization platform for Bizzy, Truck Routing.
    • Currently the CEO of Pacmann ai. He is a former Research ML Scientist at Bizzy. Relevant experience: built a recommender system for Bizzy TokoSmart; built face recognition, person detection, and age and gender prediction for TokoSmart using computer vision.
    • He is the COO of ML startup Pacmann ai. He was an independent consultant and assisted in creating digital media solutions for SMEs. Relevant experience: co-founder of Kitabisa.com; worked as an Ads Content Specialist at Google via Adecco.
  5. Contents

    1. What is Data?
    2. What is Statistics?
    3. What is Machine Learning?
    4. What is Artificial Intelligence?
    5. What is Data Science?
    6. Data Science Workflow
    7. Do you need Statistics, ML, or Data Science?
    8. Machine Learning Cases in Central Bank
    9. Things you need to learn to apply good Statistics and Machine Learning in Central Bank
  6. What is Data?

    “Data are characteristics or information, usually numerical, that are collected through observation” -- OECD Statistical Definition

    Data Examples: All Numerical
    • Excel Sheet: The most common understanding of data is an Excel sheet. Most of the time, it is generated by business processes, surveys, or economic activities.
  7. What is Data?

    Data Examples
    • Time Series: It’s just another “Excel” data set, but with a time index.
    • Sound: High-frequency time series data, like a stock return plot.
  8. What is Data?

    Data Examples
    • Images: It’s the same grid of numbers as our Excel sheet, but with 3 channels: red, green and blue.

    Source: https://cs231n.github.io/classification/
  9. What is Data?

    Data Examples
    • Images: It’s the same grid of numbers as our Excel sheet, but with 3 channels: red, green and blue.

    Source: Bhupendra (2015)
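To make the "grid of numbers with three channels" idea concrete, here is a minimal sketch in plain Python. The 2x2 image and the helper name `split_channels` are made up for illustration; real images are usually loaded with an imaging library into arrays.

```python
# A hypothetical 2x2 RGB image: each pixel is (red, green, blue), values 0-255.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def split_channels(img):
    """Return three 2-D grids (red, green, blue), one per colour channel."""
    return tuple(
        [[pixel[channel] for pixel in row] for row in img]
        for channel in range(3)
    )


red, green, blue = split_channels(image)
# Each channel is now an ordinary "spreadsheet" of numbers.
print(red)
print(green)
print(blue)
```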
  10. What is Data?

    Data Examples
    • Videos: A sequence of images with a time index.

    Source: Bhupendra (2015)
  11. What is Data?

    Data Examples
    • Network Data: Data that shows relations between entities.

    Source: https://transportgeography.org/?page_id=6969
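Network data is often stored as a list of relations (edges) and turned into an adjacency structure for analysis. The sketch below uses a made-up edge list and a hypothetical helper `build_adjacency`; it treats relations as undirected, which is an assumption.

```python
from collections import defaultdict

# Hypothetical edge list: each pair is a relation between two entities.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]


def build_adjacency(edge_list):
    """Map every entity to the set of entities it is directly related to."""
    adjacency = defaultdict(set)
    for u, v in edge_list:
        adjacency[u].add(v)
        adjacency[v].add(u)  # undirected: the relation holds both ways
    return dict(adjacency)


graph = build_adjacency(edges)
# Degree = number of direct relations per entity.
degree = {node: len(neighbours) for node, neighbours in graph.items()}
print(degree)
```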
  12. Growth of Data

    Data keeps growing over time because our activities are increasing, storage is cheap, and the internet has boomed.

    Figure: Data per minute, 2012
  13. Growth of Data

    Large amounts of data are generated from:
    • Webpages (content, graph)
    • Clicks (ad, page, social)
    • Users (OpenID, FB Connect)
    • E-mails (Hotmail, Y!Mail, Gmail)
    • Photos, Movies (Flickr, YouTube, Vimeo ...)
    • Installed apps (Android market etc.)
    • Location (Latitude, Loopt, Foursquare)
    • User generated content (Wikipedia & co)
    • Ads (display, text, DoubleClick, Yahoo)
    • Comments (Disqus, Facebook)
    • Reviews (Yelp, Y!Local)
    • Third party features (e.g. Experian)
    • Social connections (LinkedIn, Facebook)
    • Purchase decisions (Netflix, Amazon)

    Figure: Data per minute, 2014
  14. Growth of Data

    Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.
    • Roughly 10 terabytes of data or more
  15. Growth of Data

    MapReduce is a programming model for efficiently processing more than 10 terabytes of data by splitting the work across many machines.

    Source: https://www.todaysoftmag.com/images/articles/tsm33/large/a11.png
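The MapReduce idea can be sketched on a single machine as three steps: map each record to key-value pairs, group the pairs by key, and reduce each group. The word-count example and the function names below are illustrative assumptions; a real framework runs the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


counts = word_count(["big data", "small data"])
print(counts)
```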
  16. Small Data

    Big RAM is eating big data – Size of datasets used for analytics
    1. Is your data bigger than 10 terabytes? No? Then don’t use big data tools.
    2. Is your data smaller than 16 GB? Yes? Just use your laptop.
    3. You can’t fit your data into your RAM? Buy more RAM; it’s cheaper.
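The three rules of thumb above can be written as a small decision function. This is only a sketch: the function name `suggest_setup`, the decimal conversion 10 TB = 10,000 GB, and the final "fits in RAM" branch are assumptions added for illustration.

```python
BIG_DATA_GB = 10_000  # 10 terabytes, the slide's threshold (decimal GB assumed)
LAPTOP_GB = 16        # the slide's laptop threshold


def suggest_setup(data_gb, ram_gb=16):
    """Return a rough tooling suggestion for a data set of `data_gb` gigabytes."""
    if data_gb > BIG_DATA_GB:
        return "big data tools"
    if data_gb < LAPTOP_GB:
        return "laptop"
    if data_gb > ram_gb:
        return "buy more RAM"
    # Assumption: anything that already fits in RAM stays on one machine.
    return "single machine, in memory"


for size in (8, 64, 20_000):
    print(size, "GB ->", suggest_setup(size, ram_gb=32))
```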
  17. Small Data

    We use statistics on small data, as in your day-to-day cases and research problems. Next, we will discuss statistics.
  18. What is Statistics?

    Statistics is a method for inferring unknown population-level parameters from samples.

    Figure labels: Population, Samples
  19. What is Statistics?

    Statistics Examples
    • Approximate Number of Fish: Use random sampling to infer the number of fish in the lake. If fish underpopulate or overpopulate the lake, the ecological equilibrium is distorted.
  20. What is Statistics?

    Statistics Examples
    • Approximate Number of Fish: Mark and recapture is a standard method to approximate the population of fish.
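One common way to turn mark-and-recapture counts into a population estimate is the Lincoln-Petersen estimator, N ≈ M × C / R, where M fish are marked in a first catch, C are caught in a second catch, and R of those carry a mark. The slides do not say which estimator they have in mind, so this choice and the numbers below are assumptions.

```python
def lincoln_petersen(marked_first, caught_second, recaptured):
    """Estimate population size from a two-visit mark-and-recapture study."""
    if recaptured <= 0:
        raise ValueError("need at least one recaptured animal to estimate")
    return marked_first * caught_second / recaptured


# Hypothetical study: mark 100 fish, later catch 80, of which 20 are marked.
estimate = lincoln_petersen(marked_first=100, caught_second=80, recaptured=20)
print(estimate)
```

The intuition: if a quarter of the second catch is marked, the 100 marked fish are probably about a quarter of the lake.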
  21. What is Statistics?

    Statistics Examples
    • Approximate Marginal Propensity to Consume (MPC): Use linear regression to estimate MPC as an indicator of economic health. A low MPC, i.e. a higher marginal propensity to save (MPS), might indicate uncertainty about the future.
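As a sketch of the regression idea, the MPC can be read off as the slope when consumption is regressed on income. The data below is made up (it follows consumption = 30 + 0.8 × income exactly), and `ols_fit` is a hypothetical helper implementing ordinary least squares for one explanatory variable.

```python
def ols_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up household data, not real observations.
income = [100.0, 200.0, 300.0, 400.0]
consumption = [110.0, 190.0, 270.0, 350.0]

intercept, mpc = ols_fit(income, consumption)
mps = 1.0 - mpc  # what is not consumed out of extra income is saved
print(round(mpc, 3), round(mps, 3))
```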
  22. What is Statistics?

    Statistics Examples
    • Approximate Growth Rate of Covid-19: Use Bayesian simulation to infer the growth rate of Covid-19 and the likelihood of a pandemic.
  23. What is Statistical Bias?

    An estimator of θ has high or low bias depending on whether its mean is far from or close to θ. It has high or low variance depending on whether its probability mass is spread out or concentrated.

    Figure: Examples
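A small simulation can illustrate bias. The sketch below compares two estimators of the variance of a standard normal (true value 1): one divides by n and one by n − 1. The setup (normal data, n = 5, the function name `average_estimates`) is an assumption chosen for illustration.

```python
import random


def average_estimates(n=5, trials=20_000, seed=0):
    """Average two variance estimators over many samples of size n."""
    rng = random.Random(seed)
    total_biased = 0.0
    total_unbiased = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        squared_dev = sum((x - mean) ** 2 for x in sample)
        total_biased += squared_dev / n          # divides by n
        total_unbiased += squared_dev / (n - 1)  # divides by n - 1
    return total_biased / trials, total_unbiased / trials


biased_mean, unbiased_mean = average_estimates()
# The true variance is 1; the divide-by-n estimator averages near (n - 1) / n.
print(round(biased_mean, 2), round(unbiased_mean, 2))
```

The mean of the divide-by-n estimator sits below the true value, which is exactly what "bias" means here; the n − 1 version centres on the truth.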
  24. What is Statistical Bias?

    Problems
    - High bias reduces our ability to learn the truth.
    - High bias makes our estimated parameter differ from the population parameter.
    - High bias leads us to draw wrong conclusions.
  25. What is Causality?

    Spurious Correlation
    Correlation does not imply causation, but if two variables are correlated, there might be a common factor.
  26. What is Causality?

    Spurious Correlation
    Correlation does not imply causation, but if two variables are correlated, there might be a common factor.
  27. What is Causality?

    Causality
    • We say that X causes Y if…
    • were we to intervene and change the value of X without changing anything else…
    • then Y would also change as a result.
  28. What is Causality?

    Case
    • Alcohol consumption is correlated with lung cancer. Is it a causal relationship?

    Diagram nodes: Smoking, Lung Cancer, Drink alcohol
  29. What is Causality?

    Case
    • Alcohol consumption is correlated with lung cancer. Is it a causal relationship?

    Diagram nodes: Smoking, Lung Cancer, Drink alcohol (label: Block)
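The alcohol and lung cancer case can be illustrated with a toy simulation in which smoking is a common factor: it raises both the chance of drinking and the chance of cancer, while drinking itself has no effect on cancer. All probabilities below are invented for illustration and are not epidemiological estimates. Comparing drinkers and non-drinkers within the same smoking group "blocks" the path through smoking.

```python
import random


def simulate(n=100_000, seed=1):
    """Generate (smokes, drinks, cancer) rows; cancer depends only on smoking."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        smokes = rng.random() < 0.3
        drinks = rng.random() < (0.7 if smokes else 0.3)
        cancer = rng.random() < (0.15 if smokes else 0.01)  # ignores `drinks`
        rows.append((smokes, drinks, cancer))
    return rows


def cancer_rate(rows, drinks, smokes=None):
    """Cancer rate among rows with the given drinking (and optional smoking) status."""
    subset = [r for r in rows if r[1] == drinks and (smokes is None or r[0] == smokes)]
    return sum(r[2] for r in subset) / len(subset)


rows = simulate()
# Raw comparison: drinkers look much worse off (spurious correlation).
raw_gap = cancer_rate(rows, drinks=True) - cancer_rate(rows, drinks=False)
# Blocked comparison: among smokers only, the gap nearly vanishes.
blocked_gap = cancer_rate(rows, True, smokes=True) - cancer_rate(rows, False, smokes=True)
print(round(raw_gap, 3), round(blocked_gap, 3))
```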
  30. What is Machine Learning?

    Machine Learning: “Field of study that gives computers the ability to learn without being explicitly programmed” -- Arthur Samuel (1959)

    Machine learning focuses on accuracy; it does not focus on inference or causality.
  31. What is Machine Learning?

    Examples
    KNN: Find the closest neighbours, then vote or average over them. It detects patterns in the data.

    Visit: http://vision.stanford.edu/teaching/cs231n-demos/knn/
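A minimal k-nearest-neighbours classifier, sketched in plain Python: sort the training points by distance to the query, take the k closest, and let them vote. The toy points, labels, and the function name `knn_predict` are assumptions for illustration; Euclidean distance is one common choice among several.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; points are coordinate tuples.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up training data: two well-separated clusters.
train = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
    ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"),
]

print(knn_predict(train, (0.5, 0.5)))
print(knn_predict(train, (5.5, 5.5)))
```

For regression, the vote would be replaced by an average of the neighbours' values.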
  32. What is Machine Learning?

    Examples
    Deep Learning: A subset of machine learning algorithms; it is neural networks under a new name.
  33. How to Improve Accuracy?

    We can improve model accuracy by adding more bias in order to reduce variance, or vice versa.
  34. How to Improve Accuracy?

    We can improve model accuracy by adding more bias in order to reduce variance, or vice versa.
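One way to see the trade-off is with a deliberately biased estimator. The sketch below estimates a mean from a small, noisy sample and compares the plain sample mean with a version shrunk towards zero. The shrunk version is biased but varies less, and in this made-up setting (true mean 1, standard deviation 3, n = 5) its mean squared error is lower. The setting and the name `mse_of_shrunk_mean` are assumptions; with different numbers the shrinkage can hurt instead.

```python
import random


def mse_of_shrunk_mean(shrink, true_mean=1.0, sigma=3.0, n=5, trials=20_000, seed=0):
    """Simulated mean squared error of the estimator shrink * sample_mean."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        estimate = shrink * sample_mean  # shrink < 1 adds bias, cuts variance
        total += (estimate - true_mean) ** 2
    return total / trials


mse_plain = mse_of_shrunk_mean(shrink=1.0)   # unbiased, high variance
mse_shrunk = mse_of_shrunk_mean(shrink=0.5)  # biased, lower variance
print(round(mse_plain, 2), round(mse_shrunk, 2))
```

In theory the error decomposes as bias² + variance: about 0 + 1.8 for the plain mean and 0.25 + 0.45 for the shrunk one.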
  35. Machine Learning

    • Machine learning focuses on accuracy; it is pattern recognition.
    • Machine learning can’t infer unknown parameters.
    • Machine learning can’t detect causality.
    • It is not that smart.
  36. "Can machines think?"... The new form of the problem can be described in terms of a game which we call the 'imitation game.' -- Alan Turing
  37. “Can machines think?"... The new form of the problem can be described in terms of a game which we call the 'imitation game.' It is played with three people, a man (A), a woman (B), and an interrogator (C) who may be of either sex. The interrogator stays in a room apart from the other two. The object of the game for the interrogator is to determine which of the other two is the man and which is the woman. He knows them by labels X and Y, and at the end of the game he says either "X is A and Y is B" or "X is B and Y is A." The interrogator is allowed to put questions to A and B... We now ask the question, "What will happen when a machine takes the part of A in this game?" Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman? These questions replace our original, "Can machines think?”
  38. What is Artificial Intelligence?

    Goal: Imitate human intelligence.
    How: Well, we don’t know how, but academia seems to hypothesize that machine learning/pattern recognition is the solution to AI problems.
  39. What is Artificial Shallow Intelligence?

    Goal: Imitate human intelligence.
    Current state:
    • Can only do pattern recognition.
    • Cannot think.
    • Does not infer causality.
    • “It is a shallow AI” -- Andrew Ng
  40. What is Data Science?

    “The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – it's going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data.” -- Hal Varian, Google Chief Economist
  41. What is Data Science?

    “A data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician.” -- Josh Blumenstock

    “Data Scientist = statistician + programmer + storyteller + artist” -- Shlomo Argamon

    +
    • Machine Learning
    • Subject Matter Expertise
  42. What do you need?

    Statistics                      | Machine Learning                        | Data Science
    Focus on parameter inference    | Focus on accuracy                       | We do both ML and statistics
    You need a valid conclusion     | You need accurate predictions           | We build separate models for inference and prediction
    Small and noisy data            | Unstructured data                       | We handle both noisy and unstructured data
    Needs subject matter expertise  | Does not need subject matter expertise  | Needs subject matter expertise
    One-time run only               | Predicts rapidly                        | Predicts rapidly, with inference from time to time
  43. Data Science Cases

    “Regular close scrutiny of banks’ balance sheets has become a standard for financial supervisors following the financial crises. However, the manual inspection of hundreds or thousands of firms’ records can be inefficient. Most firms will be sound and spotting complex relations between items for firms which are not, can be difficult.”

    “One approach which could make this process more efficient, but also more accurate, is to train a machine learning model on a set of validated supervisory alerts which indicate the need for closer scrutiny of a particular firm.”

    “Our first case study for supervised learning is the prediction of alerts associated with balance sheet items of financial institutions which could be reason for concerns.”

    Source: Machine Learning at Bank of England https://www.bankofengland.co.uk/-/media/boe/files/working-paper/2017/machine-learning-at-central-banks.pdf?la=en&hash=EF5C4AC6E7D7BDC1D68A4BD865EEF3D7EE5D7806
  44. Data Science Cases

    • Build a model that can predict the left or right position in the distribution from some variables.
    • Use it as an alert system.

    Source: Machine Learning at Bank of England https://www.bankofengland.co.uk/-/media/boe/files/working-paper/2017/machine-learning-at-central-banks.pdf?la=
  45. Our Training

    Pacmann AI has run several public and corporate training programs in the past. Our focus is to teach good practice in Statistics, Machine Learning, Optimization and Algorithms in industry.

    Last training
    • 500+ alumni
      ◦ 410 alumni with bachelor's degrees
      ◦ 80 alumni with master's degrees
      ◦ 30 alumni with doctoral degrees
    • 50+ institutions
  46. “The curriculum is very comprehensive, covering everything from basic skills to very advanced ones that are not even commonly taught at university.” -- Bimandra Djaafara, Researcher, Eijkman-Oxford Clinical Research Unit; PhD Student at Imperial College London
  47. “One of the best machine learning classes I have ever attended, far better even than the machine learning class at my own university (NUS, Computer Engineering Department).” -- Prasetya Dwicahya, Analyst, World Bank