Understanding All Data Jargons

Pacmann AI
August 04, 2020

This discussion explains the data "jargon" used in industry and academia.

Transcript

  1. Understanding All Data Jargon By Adityo Sanjaya

  2. About Us

  3. PACMANN AI is a research startup focusing on the application

    and development of machine learning algorithms. We have implemented several machine learning projects in different fields in Indonesia. About Us Recent Projects • Crop Disease Prediction: In March 2017, we built a machine learning algorithm to detect crop diseases using image recognition.
  4. About Us Recent Projects • News Monitoring System During the

    last quarter of 2017, we made a media monitoring and information extraction tool based on Natural Language Processing to identify sentiment towards specific topics. • Credit Scoring We made a Credit Scoring model based on Machine Learning Algorithm to minimize credit default and systematic risk for PT Permodalan Nasional Madani (PT PNM)
  5. Recent Projects • Logistics Optimization In 2019-2020 we worked for

    one of Sinarmas’ startups, Bizzy Indonesia. We built several services to optimize their core logistics business: vehicle routing optimization (“Truck Way”), salesman optimization (“Field Force”), a credit scoring system, a recommendation system (“Tokosmart”), Product-Toko visual recognition, and an internal machine learning platform. About Us
  6. He is the CTO of the ML startup Pacmann AI. He

    was a Senior Data Engineer at Bizzy. Relevant experience: Built a marketing platform for Sampoerna, Qubicle. Worked as a developer for Mivo, Broadcast Media TV. Built a decision optimization platform for Bizzy, Truck Routing. Currently the CEO of Pacmann AI. He is a former Research ML Scientist at Bizzy. Relevant experience: Built a recommender system for Bizzy TokoSmart. Built face recognition, person detection, and age and gender prediction for TokoSmart using computer vision. ADITYO SANJAYA RIYAD RIVANDI BADARUDDIN R MOTIK He is the COO of the ML startup Pacmann AI. He was an independent consultant and assisted in creating digital media solutions for SMEs. Relevant experience: Co-founder of Kitabisa.com. Worked as an Ads Content Specialist at Google via Adecco.
  7. Contents

    1. What is Data?
    2. What is Statistics?
    3. What is Machine Learning?
    4. What is Artificial Intelligence?
    5. What is Data Science?
    6. Data Science Workflow
    7. Do you need Statistics, ML, or Data Sciences?
    8. Machine Learning Cases in Central Bank
    9. Things you need to learn to apply good Statistics and Machine Learning in Central Bank
  8. What is Data?

  9. “Data are characteristics or information, usually numerical, that are collected

    through observation” -- OECD Statistical Definition Data Examples: All numerical • Excel Sheet: The most common understanding of data is an Excel sheet. Most of the time, it is generated by business processes, surveys, or economic activities. What is Data?
  10. Data Examples • Time Series: It’s just another “Excel” dataset,

    but with a time index. • Sound: High-frequency time series data; plotted, it looks like a stock return plot. What is Data?
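
To make the "sound is a high-frequency time series" point concrete, here is a minimal sketch assuming NumPy is available. The 440 Hz tone, the 8 kHz sampling rate, and the function name `make_tone` are illustrative choices, not something taken from the talk.

```python
import numpy as np

def make_tone(freq_hz=440.0, sample_rate=8000, seconds=1.0):
    """Return (time_index, samples) for a pure sine tone."""
    n = int(sample_rate * seconds)
    t = np.arange(n) / sample_rate              # the time index, in seconds
    samples = np.sin(2 * np.pi * freq_hz * t)   # one amplitude per time step
    return t, samples

t, samples = make_tone()
print(samples.shape)  # one second of sound is already thousands of rows
```

The only difference from an ordinary time series in a spreadsheet is the sampling frequency: thousands of observations per second instead of one per day or month.
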
  11. Data Examples • Images: It’s the same as our Excel

    sheet, but it has three levels: red, green, and blue. What is Data? Source: https://cs231n.github.io/classification/
  12. Data Examples • Images: It’s the same as our Excel

    sheet, but it has three levels: red, green, and blue. What is Data? Source: Bhupendra (2015)
  13. Data Examples • Videos A sequence of images with time

    index. What is Data? Source: Bhupendra (2015)
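
The image and video slides can be sketched as arrays, assuming NumPy. The tiny 2x2 picture below is made up for illustration; real images simply have more rows and columns.

```python
import numpy as np

# A tiny 2x2 "image": each pixel holds three levels (red, green, blue), 0-255.
image = np.array(
    [[[255, 0, 0], [0, 255, 0]],
     [[0, 0, 255], [255, 255, 255]]],
    dtype=np.uint8,
)

height, width, channels = image.shape
red_sheet = image[:, :, 0]   # one "Excel sheet" of numbers per colour channel

# A video is a stack of such images along a time (frame) index.
video = np.stack([image, image, image])   # (frames, height, width, channels)
```

Slicing out one channel gives back a plain two-dimensional table of numbers, which is why the spreadsheet analogy works.
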
  14. Data Examples • Network Data: Data that show relations

    between entities. What is Data? Source: https://transportgeography.org/?page_id=6969
  15. Source: https://gdcoder.com/nlp-transforming-tokens-into-features-tf-idf/ Data Examples • Text Data: Data that show linguistic

    meaning or understanding What is Data?
  16. Growth of Data and BigData

  17. Data keeps growing over time because our activities

    are increasing, storage is cheap, and the internet is booming. Data per minute, 2012 Growth of Data
  18. Growth of Data Large amount of data generated from: •

    Webpages (content, graph) • Clicks (ad, page, social) • Users (OpenID, FB Connect) • E-mails (Hotmail, Y!Mail, Gmail) • Photos, Movies (Flickr, YouTube, Vimeo ...) • Installed apps (Android market etc.) • Location (Latitude, Loopt, Foursquare) • User generated content (Wikipedia & co) • Ads (display, text, DoubleClick, Yahoo) • Comments (Disqus, Facebook) • Reviews (Yelp, Y!Local) • Third party features (e.g. Experian) • Social connections (LinkedIn, Facebook) • Purchase decisions (Netflix, Amazon) Data per minute, 2014
  19. Growth of Data Data from industries

  20. Growth of Data You need servers for all this BigData

  21. Big data is a field that treats ways to analyze,

    systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. Rule of thumb: 10 terabytes of data or more. Growth of Data
  22. MapReduce is a programming model for processing more than 10

    terabytes of data efficiently, in parallel across many machines. Growth of Data Source: https://www.todaysoftmag.com/images/articles/tsm33/large/a11.png
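
MapReduce is usually described as three steps: map, shuffle (group by key), and reduce. The following is a single-machine sketch of that idea using word counting; the function names and the two text chunks are hypothetical, and a real cluster framework would run the map step on many machines at once.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data big servers", "small data small laptop"]
# On a cluster, each chunk would be mapped on a different machine.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
```

Because each chunk is mapped independently, the work can be spread over as many servers as there are chunks, which is what makes the pattern attractive for very large datasets.
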
  23. Small Data

  24. But do you need big data and the MapReduce algorithm? Small Data

  25. Your Data

  27. Big RAM is eating big data – Size of datasets

    used for analytics
    1. Is your data bigger than 10 terabytes? No? Then don’t use big data.
    2. Is your data smaller than 16 GB? Yes? Just use your laptop.
    3. You can’t fit your data into your RAM? Yes? Buy more RAM, it’s cheaper.
    Small Data
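
The three questions on this slide can be read as a small decision rule. The sketch below is one possible encoding; the function name `choose_tooling`, the decimal terabyte conversion, and the final "fits in RAM" branch are assumptions added for completeness, not part of the slide.

```python
TB_IN_GB = 1000  # decimal units; an assumption for this sketch

def choose_tooling(data_gb, ram_gb=16):
    """Apply the slide's three questions in order and return the advice."""
    if data_gb > 10 * TB_IN_GB:      # question 1: bigger than 10 terabytes?
        return "big data stack"
    if data_gb < 16:                 # question 2: smaller than 16 GB?
        return "laptop"
    if data_gb > ram_gb:             # question 3: does not fit into RAM?
        return "buy more RAM"
    return "fits in RAM"             # not on the slide: nothing to change

print(choose_tooling(4))               # laptop
print(choose_tooling(200, ram_gb=64))  # buy more RAM
```
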
  28. Small Data We use statistics on small data, as in

    your daily cases and research problems. Next, we will discuss statistics.
  29. What is Statistics

  30. Statistics is a method for inferring unknown parameters from samples,

    in order to approximate those parameters at the population level. What is Statistics? Population Samples
  31. Statistics Examples • Approximate Number of Fish Random sampling to

    infer the number of fish in the lake. If fish underpopulate or overpopulate the lake, the ecological equilibrium is disturbed. What is Statistics?
  32. Statistics Examples • Approximate Number of Fish: Mark and recapture

    is a standard method for approximating the fish population. What is Statistics?
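
The slide does not say which estimator is used. One common choice for a two-visit study is the Lincoln-Petersen estimator, sketched below under that assumption: if the marked fraction of the second catch mirrors the marked fraction of the whole lake, then R / C ≈ M / N, so N ≈ M × C / R. The numbers are invented.

```python
def lincoln_petersen(marked_first, caught_second, recaptured):
    """Estimate population size N from a two-visit mark-and-recapture study.

    marked_first:  fish caught, marked and released on the first visit (M)
    caught_second: fish caught on the second visit (C)
    recaptured:    fish in the second catch that carry a mark (R)
    """
    if recaptured == 0:
        raise ValueError("no marked fish recaptured; estimate is undefined")
    return marked_first * caught_second / recaptured

estimate = lincoln_petersen(marked_first=100, caught_second=80, recaptured=16)
print(estimate)  # 500.0
```

Here 16 of the 80 fish in the second catch were marked (20%), so the 100 marked fish are taken to be about 20% of the lake.
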
  33. Statistics Examples • Approximate Marginal Propensity to Consume (MPC) Use

    linear regression to estimate the MPC as an indicator of economic health. A low MPC, i.e. a higher marginal propensity to save (MPS), might indicate uncertainty about the future. What is Statistics?
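
A minimal sketch of the regression idea, assuming NumPy and entirely synthetic household data: consumption is modelled as C = a + MPC × Y + noise, and the fitted slope is the MPC estimate. The true value 0.8, the intercept, and the noise level are made-up inputs to the simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic households (made up for illustration): C = a + MPC * Y + noise
true_mpc = 0.8
income = rng.uniform(1_000, 10_000, size=500)
consumption = 200 + true_mpc * income + rng.normal(0, 300, size=500)

# Ordinary least squares fit of a straight line; the slope is the MPC estimate.
mpc_hat, intercept_hat = np.polyfit(income, consumption, deg=1)
mps_hat = 1 - mpc_hat   # marginal propensity to save
```

The sample of 500 households stands in for the population, which is exactly the inference step the previous slides describe: the slope is an estimate of an unknown population parameter.
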
  34. Statistics Examples • Approximate Growth Rate of Covid19 Use Bayesian

    simulation to infer the Covid-19 growth rate and the possibility of a pandemic. What is Statistics?
  35. What is Statistical Bias?

  36. Definition What is Statistical Bias? Source: https://mathigon.org/course/intro-statistics/point-estimation

  37. Examples What is Statistical Bias? An estimator of theta has

    high or low bias depending on whether its mean is far from or close to theta. It has high or low variance depending on whether its mass is spread out or concentrated.
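
The definition above can be checked by simulation. The sketch below, assuming NumPy, takes theta to be a population variance and compares two estimators of it: dividing by n (biased) and dividing by n - 1 (unbiased). The sample size, the number of trials, and the true variance are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0          # theta: the population variance we want to estimate
n, trials = 5, 100_000

# Draw many small samples and apply two estimators of theta to each one.
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, n))
divide_by_n = samples.var(axis=1, ddof=0)           # biased estimator
divide_by_n_minus_1 = samples.var(axis=1, ddof=1)   # unbiased estimator

# Bias: how far the mean of the estimator is from theta.
bias_n = divide_by_n.mean() - true_var                   # about -true_var / n
bias_n_minus_1 = divide_by_n_minus_1.mean() - true_var   # about 0

# Variance: how spread out the estimates are across samples.
spread_n = divide_by_n.var()
spread_n_minus_1 = divide_by_n_minus_1.var()
```

In this setup the biased estimator is also the less spread-out one, which shows that bias and variance are separate properties of an estimator.
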
  38. Problems - High bias reduces our ability to know the

    truth. - High bias makes our estimated parameter differ from the population parameter. - High bias leads us to draw the wrong conclusion. What is Statistical Bias?
  39. What is Causality?

  40. Spurious Correlation Correlation does not imply causation, but if two

    variables are correlated, there might be a common factor driving both. What is Causality?
  42. Causality • We say that X causes Y if… •

    were we to intervene and change the value of X without changing anything else… • then Y would also change as a result What is Causality?
  43. Case • Alcohol consumption is correlated with lung cancer. Is it

    a causal relationship? What is Causality? Smoking Lung Cancer Drink alcohol
  44. Case • Alcohol consumption is correlated with lung cancer. Is it

    a causal relationship? What is Causality? Smoking Lung Cancer Drink alcohol Block
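
The diagram in this case can be simulated. The sketch below, assuming NumPy, hard-codes one hypothetical causal structure: smoking raises both the chance of drinking and the chance of lung cancer, while drinking has no effect on cancer at all. All probabilities are invented. Comparing the overall correlation with the correlation inside each smoking group shows what "blocking" the smoking path does.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Assumed causal structure: smoking -> drinking, smoking -> cancer.
# Drinking has NO direct effect on cancer in this simulation.
smokes = rng.random(n) < 0.3
drinks = rng.random(n) < np.where(smokes, 0.7, 0.2)
cancer = rng.random(n) < np.where(smokes, 0.15, 0.01)

def corr(x, y):
    """Pearson correlation between two boolean arrays."""
    return np.corrcoef(x.astype(float), y.astype(float))[0, 1]

overall = corr(drinks, cancer)                               # clearly positive
among_smokers = corr(drinks[smokes], cancer[smokes])         # close to zero
among_non_smokers = corr(drinks[~smokes], cancer[~smokes])   # close to zero
```

Drinking and cancer are correlated in the full data even though neither causes the other here; once smoking is held fixed, the correlation disappears.
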
  45. What is Machine Learning?

  46. What is Machine Learning? Machine Learning “Field of study that

    gives computers the ability to learn without being explicitly programmed” ▪ Arthur Samuel (1959) Machine learning focuses on predictive accuracy; it does not focus on inference or causality.
  47. What is Machine Learning? Visit: http://vision.stanford.edu/teaching/cs231n-demos/knn/ Examples KNN: Find the closest

    neighbours, then vote or average over those neighbours. It detects patterns in the data.
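
A minimal sketch of KNN classification by majority vote, assuming NumPy and Euclidean distance. The six training points and the function name `knn_predict` are made up for illustration.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = np.linalg.norm(train_x - query, axis=1)   # Euclidean distance
    nearest = np.argsort(distances)[:k]                   # indices of k closest
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]                      # most frequent label

# Toy data (made up): two clusters labelled 0 and 1.
train_x = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                    [1.0, 1.0], [0.9, 1.2], [1.1, 0.8]])
train_y = np.array([0, 0, 0, 1, 1, 1])

prediction_a = knn_predict(train_x, train_y, np.array([0.1, 0.1]))
prediction_b = knn_predict(train_x, train_y, np.array([1.0, 0.9]))
```

For a regression problem the vote would be replaced by the average of the neighbours' values, which is the "vote or average" on the slide.
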
  48. What is Machine Learning? Examples Deep Learning: A subset

    of machine learning algorithms; it is neural networks under a new name.
  49. How to improve accuracy?

  50. How to Improve Accuracy? We can improve our model’s accuracy

    by adding more bias in order to reduce variance, or vice versa.
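
One way to see the trade-off is a worked calculation rather than a full model. The sketch below uses a deliberately simple, hypothetical estimator: the sample mean multiplied by a shrink factor. Shrinking adds bias but cuts variance, and the mean squared error is bias² + variance. The values of `mu`, `sigma`, and `n` are illustrative assumptions.

```python
def mse_of_shrunk_mean(shrink, mu=1.0, sigma=3.0, n=5):
    """MSE of the estimator `shrink * sample_mean` for a true mean `mu`.

    MSE = bias^2 + variance, with
      bias     = (shrink - 1) * mu
      variance = shrink^2 * sigma^2 / n
    """
    bias = (shrink - 1.0) * mu
    variance = shrink ** 2 * sigma ** 2 / n
    return bias ** 2 + variance

unbiased = mse_of_shrunk_mean(1.0)   # no bias, all variance
shrunk = mse_of_shrunk_mean(0.5)     # some bias, much less variance
```

With these numbers the biased estimator has the lower total error. That is not guaranteed in general: with a large sample or little noise, the unbiased version wins, which is the "or vice versa" on the slide.
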
  52. Machine Learning • Machine learning focuses on accuracy; it is pattern

    recognition. • Machine learning can’t infer unknown parameters. • Machine learning can’t detect causality. • It is not that smart.
  53. What is Artificial Intelligence?

  55. Can a machine ‘Think’? Imitation Game, 2016

  56. "Can machines think?"... The new form of the problem can

    be described in terms of a game which we call the 'imitation game.' -- Alan Turing
  57. "Can machines think?"... The new form of the problem can

    be described in terms of a game which we call the 'imitation game.' It is played with three people, a man (A), a woman (B), and an interrogator (C) who may be of either sex. The interrogator stays in a room apart from the other two. The object of the game for the interrogator is to determine which of the other two is the man and which is the woman. He knows them by labels X and Y, and at the end of the game he says either "X is A and Y is B" or "X is B and Y is A." The interrogator is allowed to put questions to A and B... We now ask the question, "What will happen when a machine takes the part of A in this game?" Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman? These questions replace our original, "Can machines think?"
  58. What is Artificial Intelligence? Goal: Imitate human intelligence. How: Well,

    we don’t know how, but academia seems to hypothesize that machine learning / pattern recognition is the solution to AI problems.
  59. What is Artificial Shallow Intelligence?

  60. What is Artificial Shallow Intelligence? Goal: Imitate human intelligence. Current

    state: • Can only do pattern recognition. • Cannot think. • Does not infer causality. • “It is a shallow AI” -- Andrew Ng
  61. What is Data Science?

  62. “The ability to take data – to be able to

    understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data.” – Hal Varian, Google Chief Economist What is Data Science?
  63. What is Data Science? Nate Silver

  64. What is Data Science? “Nate Silver won the election” --

    Harvard Business Review
  65. What is Data Science?

  66. What is Data Science?

  67. What is Data Science? Source: Harvard 109 Data science course

  68. “A data scientist is someone who knows more statistics than

    a computer scientist and more computer science than a statistician.” - Josh Blumenstock What is Data Science? “Data Scientist = statistician + programmer + storyteller + artist” - Shlomo Argamon + • Machine Learning • Subject Matter Expertise
  69. How to do Data Sciences?

  70. Source: Data Science 109, Harvard How to do Data Sciences?

  71. Source: Data Science 109, Harvard DEMO

  72. Do you need Statistics, Machine Learning or Data Sciences?

  73. What do you need?

    • Statistics: Focus on parameter inference. You need a valid conclusion. Small and noisy data. Needs subject matter expertise. One-time run only.
    • Machine Learning: Focus on accuracy. You need accurate prediction. Unstructured data. Does not need subject matter expertise. Predicts rapidly.
    • Data Sciences: We do ML and Stats. We make separate models for inference and prediction. We handle both noisy data and unstructured data. Needs subject matter expertise. Rapid prediction and periodic inference.
  74. Data Science Cases in Central Bank

  75. “One approach which could make this process more efficient, but

    also more accurate, is to train a machine learning model on a set of validated supervisory alerts which indicate the need for closer scrutiny of a particular firm.”
    “Our first case study for supervised learning is the prediction of alerts associated with balance sheet items of financial institutions which could be reason for concerns.”
    “Regular close scrutiny of banks’ balance sheets has become a standard for financial supervisors following the financial crises. However, the manual inspection of hundreds or thousands of firms’ records can be inefficient. Most firms will be sound and spotting complex relations between items for firms which are not, can be difficult.”
    Source: Machine Learning at Bank of England https://www.bankofengland.co.uk/-/media/boe/files/working-paper/2017/machine-learning-at-central-banks.pdf?la=en&hash=EF5C4AC6E7D7BDC1D68A4BD865EEF3D7EE5D7806
    Data Science Cases
  76. Source: Machine Learning at Bank of England https://www.bankofengland.co.uk/-/media/boe/files/working-paper/2017/machine-learning-at-central-banks.pdf?la= • Build

    a model which can predict the left or right position in the distribution from some variables. • Use it as an alert system. Data Science Cases
  77. Source: Machine Learning at Bank of England https://www.bankofengland.co.uk/-/media/boe/files/working-paper/2017/machine-learning-at-central-banks.pdf?la= Data Science

    Cases • Predict inflation rate with machine learning • Use it as an alert system.
  78. Source: Machine Learning at Bank of England https://www.bankofengland.co.uk/-/media/boe/files/working-paper/2017/machine-learning-at-central-banks.pdf?la= Data Science

    Cases
  79. Source: Machine Learning at Bank of England https://www.bankofengland.co.uk/-/media/boe/files/working-paper/2017/machine-learning-at-central-banks.pdf?la= Data Science

    Cases
  80. Things you need to learn to apply ML and Statistics

    in Central Bank
  81. You need to learn: Python

  82. • Pandas Functionality You need to learn: Data Wrangling

  83. You need to learn: Data Wrangling • Data Cleansing Cases

  84. You need to learn: Visualization • Exploratory Data Analysis https://www.sciencedirect.com/science/article/pii/S2468502X18300561

  85. You need to learn: Statistics Source: Alex Smola 701 ML

    Intro course
  86. You need to learn: Machine Learning

  87. Our Training

  88. Our Training Pacmann AI has run several public and corporate

    trainings in the past. Our focus is to teach good practice in statistics, machine learning, optimization, and algorithms in industry. Last training • 500++ alumni ◦ 410 alumni with bachelor’s degrees ◦ 80 alumni with master’s degrees ◦ 30 alumni with doctoral degrees • 50++ institutions
  89. Our Training

  90. Our Training

  91. Current Public Trainings

  95. Current Public Training Generate your own curriculum

  96. 300 Students For 2 months

  97. “The curriculum offered is very comprehensive, covering basic skills through

    very advanced ones that are not even commonly taught at university.” -- Bimandra Djaafara, Researcher, Eijkman-Oxford Clinical Research Unit; PhD Student at Imperial College London
  98. “One of the best machine learning classes I have ever attended,

    far better even than the machine learning classes at my own campus (NUS, Computer Engineering Department).” -- Prasetya Dwicahya, Analyst, World Bank
  99. bit.ly/brosurpacmannai

  100. Adityo Sanjaya adit@pacmannai.com Thank You