charles-cai
November 14, 2015

Big Data / Data Science Infrastructure for TFL Urban Traffic Big Data Hack

DSLOGIX - Big Data / Data Science Platform for TFL Urban Traffic Big Data Hack #UTDH15
14-15 November 2015

Transcript

  1. Big Data / Data Science Infrastructure for TFL Urban Traffic Big Data Hack

    Charles Cai, CTO / Data Science London
    Richard Shaw, Chief Architect / Data Science London
    14 November 2015, Saturday
    #UTDH15
  2. DSLOGIX / BigStep Big Data / Data Science Platform Intro

    •  Bio
    •  #FO #FICC: Investment Banking Front Office: FX/Commodities
    •  #ETRM: Energy Trading & Risk Management
    •  #entrepreneur #innovator #disruptor
    •  Voted as one of the UK’s Top 50 Data Leaders & Influencers
    •  Twitter: @caidong
    •  #big-data #IoT #data-science #MOOC #Mobile #Cloud #UX
    •  LinkedIn: http://uk.linkedin.com/in/charlescai/en

    •  Bio
    •  Senior Director, Tech Ops EMEA, Leading Hadoop Distributor
    •  Head of Engineering EMEA, Leading Hadoop Distributor
    •  Specialities
    •  NoSQL, DevOps, Big Data Architecture, Software Design and Development, Hadoop, Python, Shell, Perl, Ansible, Puppet, Apache Drill, Ruby, OpenStack
    •  Twitter: @Aggress
    •  #Hadoop #DevOps @MapR #bicycles #cats #distributed-systems
    •  LinkedIn: https://uk.linkedin.com/in/richardshaw
  3. Where we are at with Big Data Analytics? By Thomas Davenport – Harvard Business Review
  4. Open Source Data Science Toolbox

    Hadoop / Mesos Distributed Storage + Scalable Computation
    Open Source Big Data / Data Science Platform
    COTS Apps (Excel, Tableau, Qlik...); Statistical Time Series Analysis
    Wider Big Data Analytics eco-systems
    •  Shell/APIs: HDFS, Hive, Spark, HBase, Sqoop, JDBC/ODBC
    •  Languages: Julia, Python, R, Scala
    Developed on: / Operated by:
    NLTK: Natural Language
    Distributed Time Series / Geospatial / Graph Databases
    GIT Repo; Data Products; WebSocket; Drag + Drop (CZML/GeoJSON); Web Browser (collaboration); Export to CSV/Excel
    Geospatial data; Time Series data; Public Data; Market data; Real-time Streaming; Open Gov Data
    JDBC via phoenix; HDFS; Hive/Pig w/ Geospatial
  5. Key Sub-systems in Modern Big Data Analytics Stack

    Data Analytics; Streaming; Graph Computing; Machine Learning …
  6. From Classic to Modern Architecture

    Full Text Search; Natural Language Processing; CCTV / Voice; Computer Vision + Q&A; Deep Learning (CNN/RNN)
    Relational Database; KV Store + Graph Database
    Business Intelligence; Big Data, Machine Learning
    Lightweight Container + Microservices + API Harvesting; n-tier architecture
    Semantic Search; Keyword Search; Named Entity Extraction; Q&A; N-Grams; Faceted Search; Geospatial Search
    Tables; Primary Keys; Foreign Keys
    Node / Vertex; Label; Edge / Relationship; Properties
    Colours; Shapes; Complex Shapes; Textiles; Accessories; Context
    What happened? What’s happening? Predictive Analytics; Prescriptive Analysis; “Make the trend!”
    Database; App Server; Web Front
    Cloud; Distributed and Fault Tolerant; “Data Centre as One Computer”
    Unstructured
  7. Big Data Technology is evolving so fast…

    •  Working with HR Training team
    •  VTA Training Sessions
    •  Big Data Bootcamp
    •  Lunch and Learn KT Sessions

    Here are the Hadoop-related sessions:
    •  Big data ELT with Apache Sqoop
    •  BI vs Data Science
    •  Data Scientist Career Path
    •  MOOC and Machine Learning
    •  Machine Learning with Apache Spark
    •  Map Reduce 101
    •  Big Data Security: Kerberos/Knox/Sentry
    •  Deep Learning and Use Cases
    •  Time Series and Geospatial
    •  Big Data Analytics with Impala
    •  HBase: Distributed Key-value BigTable
    •  Distributed Time Series DB: OpenTSDB
    •  Machine Learning with Hadoop and R
    •  Advanced Machine Bayesian Network