Big Data Analytics with Time-Series Database kdb+

Speaker: Louise McCluskey, kdb+ Engineer
Description: “Big Data Analytics with Time-Series Database kdb+” followed by a Kx for Dashboards Demo – watch cool visualizations of HUGE data sets.

Where: Dogpatch Labs, CHQ, IFSC, Dublin 1

https://www.meetup.com/PyLadiesDublin/events/dclgvlyxjbzb/

PyLadies Dublin

June 19, 2018

Transcript

  1. Big Data Analytics with Time-Series Database kdb+
    PyLadies Dublin Meetup
    Louise Totten & Aimi McConnell
    19 June 2018

  2. Louise Totten & Aimi McConnell
    June 2018

  3. Content
    • Introduction to FD/Kx
    • Core technology
    • Performance
    • Demos

  4. About us
    • Subsidiary of First Derivatives plc
    • 13 global offices, including NYC, Singapore, London & Tokyo
    • Large user community
    • Widely adopted in financial services over two decades
    • Now in hi-tech manufacturing, utilities, telco, energy, life sciences, earth observation
    • Software & industry solutions, consulting and implementation services
    [Map of offices: North America, Africa, Australia & NZ, Asia, UK & Europe]

  5. Core technology
    kdb+: column-based time-series database with built-in programming language q
    • World’s fastest time-series columnar database
    • Streaming, real-time and historical data in one platform
    • Runs on Linux, Windows, Solaris, and MacOS
    • Runs on commodity hardware, cloud, edge devices/appliances
    • Expressive query (qsql) and programming language (q)
    • In-memory compute engine for Complex Event Processing
    • Column-level compression
    • Integrates easily into legacy systems for performance augmentation
    • Multi-core / Multi-processor / Multi-thread / Multi-server

  6. Known for
    • Processing & analysis of large volumes of real-time and historical time series data
    • Extreme performance - low latency
    • Scalability without requiring significant infrastructure change
    • Provide the fastest, most efficient, most flexible tools and dashboards
    • Worldwide leader in high-volume, high-performance databases

  7. q programming language
    • Interpreted
    • Functional
    • Array/Vector
    • Query
    • Time-Series

  8. Sample q query
    select open: first price, high: max price, low: min price, close: last price
      from trade where date = 2013.05.01, sym=`VOD.L
    Result:
    open  high low   close
    83.85 85.9 83.28 85.45

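For readers coming from Python, the aggregation in the sample query can be sketched in plain Python. This is only an illustration: the `trades` rows are made-up data (not the values behind the slide's result), `ohlc` is a hypothetical helper name, and the sketch assumes rows are already in time order, which is what makes "first" and "last" mean open and close.

```python
from datetime import date

# Hypothetical sample rows: (date, sym, price), already in time order.
trades = [
    (date(2013, 5, 1), "VOD.L", 10.0),
    (date(2013, 5, 1), "BP.L", 7.0),
    (date(2013, 5, 1), "VOD.L", 12.5),
    (date(2013, 5, 1), "VOD.L", 9.5),
    (date(2013, 5, 1), "VOD.L", 11.0),
    (date(2013, 5, 2), "VOD.L", 20.0),
]

def ohlc(rows, day, sym):
    """Open/high/low/close of price for one date and one symbol."""
    prices = [p for d, s, p in rows if d == day and s == sym]
    if not prices:
        return None  # no trades matched the filter
    return {"open": prices[0], "high": max(prices),
            "low": min(prices), "close": prices[-1]}

print(ohlc(trades, date(2013, 5, 1), "VOD.L"))
```

The filter in the list comprehension plays the role of the `where` clause, and the four aggregates map one-to-one onto `first`, `max`, `min` and `last`.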

  9. q vs. SQL
    q:   select sum qty by p.color from sp
    SQL: select p.color, sum(sp.qty) from sp, p where sp.p=p.p group by p.color order by color

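Both statements on this slide follow a key from `sp` into `p` and then total `qty` per colour. A rough Python sketch of that logic is shown here; the table contents are made up, and `qty_by_color` is a hypothetical helper name, not part of any Kx interface.

```python
from collections import defaultdict

# Hypothetical parts table keyed by part id, and supplier-part rows.
p = {
    "p1": {"color": "red"},
    "p2": {"color": "green"},
    "p3": {"color": "blue"},
    "p4": {"color": "red"},
}
sp = [
    {"s": "s1", "p": "p1", "qty": 300},
    {"s": "s1", "p": "p2", "qty": 200},
    {"s": "s2", "p": "p3", "qty": 400},
    {"s": "s2", "p": "p4", "qty": 100},
    {"s": "s3", "p": "p2", "qty": 250},
]

def qty_by_color(sp_rows, parts):
    """Total qty per part colour, ordered by colour."""
    totals = defaultdict(int)
    for row in sp_rows:
        color = parts[row["p"]]["color"]  # follow the key from sp into p
        totals[color] += row["qty"]
    return dict(sorted(totals.items()))  # mirrors "order by color"

print(qty_by_color(sp, p))
```

The explicit lookup `parts[row["p"]]` is what the SQL expresses with `where sp.p=p.p` and what the q version expresses with the dotted `p.color`.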

  10. Parallelizing in q
    Single thread:
    {sum reverse sqrt log til x} each 1000000 * til 8
    Parallel:
    {sum reverse sqrt log til x} peach 1000000 * til 8

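The point of the slide is that the parallel version changes one word and nothing else. A loose Python analogue of that shape is a serial map next to a pool map. This is a sketch under several assumptions: the input sizes are scaled down, the range starts at 1 because Python's `math.log(0)` raises an error, and a thread pool is used only to keep the example self-contained. CPython threads will not speed up CPU-bound work like this; a process pool would be the closer match to what `peach` is for.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def work(x):
    """Sum of sqrt(log(i)) for i in 1..x-1, added in reverse order."""
    # Starts at 1 because math.log(0) raises ValueError in Python.
    return sum(math.sqrt(math.log(i)) for i in reversed(range(1, x)))

# Scaled down from the slide's 1000000 * til 8 so the sketch runs quickly.
sizes = [1000 * n for n in range(8)]

# Serial map over the inputs, in the spirit of `each`.
serial = [work(x) for x in sizes]

# Pool map over the same inputs, in the spirit of `peach`.
# pool.map returns results in input order, so the two lists line up.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(work, sizes))

print(serial == parallel)
```

In q itself, `peach` only runs in parallel when the process has been started with secondary threads (the `-s` command-line option); otherwise it behaves like `each`.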

  11. Kx Architecture
    [Architecture diagram; labels recovered from the slide:]
    • Sources: real-time sources; file & DB sources
    • Ingestion: feed handlers, batch loaders, stream engine (ticker plant), stream and ingestion engine
    • Core (kdb+): in-memory database, historical database, q language & qsql scripting
    • Complex Event Processing: queries, transforms, alerts, control signals, notifications, micro-services
    • Stream for Kx Application Framework: native Lambda/HTAP architecture; scalability, high availability, and fault tolerance
    • Control for Kx: develop, configure, deploy, and manage solutions
    • Monitor for Kx: scan, monitor and alerting of issues in software and hardware
    • Analyst for Kx: query, explore, transform, and import without programming
    • Dashboards for Kx: build real-time visualizations for multiple devices
    • Third-party interoperability: pub/sub, SOA, ODBC, JDBC, web sockets; R, Python, MatLab, Java, C#, C/C++
    • Vertical market solutions: Kx for Flow, Kx for Surveillance, Kx for Algo, Kx for DaaS, Kx for Utilities, Kx for Cyber, Kx for Pharma, Kx for Sensors, Kx for Telco

  12. Kx Clients

  13. Other Verticals
    Verticals: Health & Life Sciences, Automotive, Utilities, Space and Telco, Retail, Manufacturing
    Use cases: Earth Observation, Geospatial Data Analytics, Anomaly Detection (listed twice), Genomics Data Processing, Connected Health, Patient Record Analytics, Performance Analytics, Sensor Analytics, Smart Meters Data, Predictive Analytics, Retail Analytics, Marketing Optimization, Customer Journey, Edge Computing, Multivariate Analysis, Fault Detection

  14. Kx Performance Snippets
    • Trusted by 19/20 World’s Top Investment Banks
    • 500 KB profile (L1/L2 Cache)
    • Process & store 4.5 million events/second/core
    • Ingest data at 10 million records/second/core
    • Streaming 1.6 TB of Data Daily
    • Search in-memory tables at 4 billion records/second/core

  15. Transitive Comparisons
    • InfluxData published public benchmarks (data and software) against:
      • MongoDB
      • Cassandra
      • ElasticSearch
      • OpenTSDB
    • Kx applied identical methodologies and ran tests to generate its own performance measurements

  16. Server Configurations
    Platform | CPU | Memory | Storage | OS | Database
    Raspberry Pi | 1.2 GHz quad-core ARM Cortex-A53 | 1 GB DDR2-900 MHz | 32 GB Micro SDHC | Raspbian | Kdb+ (32-bit)
    MacBook Pro (mid-2014) | 3 GHz Intel Core i7 (2 cores) | 16 GB DDR3-1600 MHz | 500 GB SSD Flash | MacOS 10.13.2 | Kdb+ (64-bit)
    Kx Server* | 3.2 GHz quad-core E5-2667v3 Xeon (20 MB cache) | 64 GB DDR4-2133 MHz | 300 GB SAS 10K | CentOS 7.3.1611 | Kdb+ (64-bit)
    InfluxData Server* | 3.6 GHz quad-core E5-1271v3 Xeon (8 MB cache) | 32 GB DDR3-1600 MHz | 1.2 TB NVMe SSD | Ubuntu 16.04 LTS | InfluxDB
    * denotes similar server configurations for head-to-head comparisons

  17. Query Definitions
    # | Definition | Kdb+ vs InfluxDB vs… | Data spanning
    1 | Return maximum value, by minute, in a 1-hour time frame, for 1 host | Cassandra | 1 day
    2 | Return maximum value, by minute, in a 12-hour time frame, for 1 host | Cassandra | 1 day
    3 | Return maximum value, by minute, in a 12-hour time frame, for 8 hosts | Cassandra | 1 day
    4 | Return maximum value, by minute, in a 1-hour time frame, for 1 host (4 days) | ElasticSearch | 4 days
    5 | Return maximum value, by minute, in a 1-hour time frame, for 1 host | MongoDB | 6 hours
    6 | Return maximum value, by minute, in a 1-hour time frame, for 8 hosts | OpenTSDB | 4 hours

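All six benchmark queries share one shape: filter by host and time window, bucket by minute, take the maximum. A minimal Python sketch of that shape follows. The slide gives no schema, so the `(timestamp, host, value)` row layout, the host names and the helper name `max_by_minute` are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def max_by_minute(rows, host, start, hours=1):
    """Max value per minute for one host within [start, start + hours)."""
    end = start + timedelta(hours=hours)
    out = defaultdict(lambda: float("-inf"))
    for ts, h, value in rows:
        if h == host and start <= ts < end:
            minute = ts.replace(second=0, microsecond=0)  # bucket key
            out[minute] = max(out[minute], value)
    return dict(sorted(out.items()))

# Hypothetical readings: (timestamp, host, value).
t0 = datetime(2018, 1, 1, 0, 0, 0)
rows = [
    (t0 + timedelta(seconds=10), "host_0", 41.0),
    (t0 + timedelta(seconds=50), "host_0", 57.5),
    (t0 + timedelta(seconds=70), "host_0", 12.0),
    (t0 + timedelta(seconds=20), "host_1", 99.0),  # other host, ignored
    (t0 + timedelta(hours=2), "host_0", 88.0),     # outside the window
]

print(max_by_minute(rows, "host_0", t0))
```

Queries 2 and 3 would differ only in the window length (`hours=12`) and in accepting a set of hosts instead of one.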

  18. Kdb+ vs InfluxDB Summary
    Note: Units are queries per second
    Query | Kdb+ Raspberry Pi | Kdb+ MacBook Pro | Kdb+ Server 1-core | Kdb+ Server 4-cores | Kdb+ Server 8-cores | InfluxDB (InfluxData Server 4-cores) | How much faster?
    1 | 4,741 | 48,055 | 25,061 | 55,578 | 79,084 | 2,606 | 21.3×
    2 | 457 | 4,487 | 3,442 | 12,019 | 21,087 | 714 | 16.8×
    3 | 54 | 531 | 298 | 1,101 | 1,918 | 192 | 5.73×
    4 | 1,333 | 24,266 | 12,455 | 34,905 | 53,682 | 3,600 | 9.6×
    5 | 7,693 | 63,138 | 56,649 | 107,810 | 122,666 | 2,614 | 41.2×
    6 | 875 | 7,804 | 5,366 | 13,018 | 17,090 | 400 | 32.5×
    Similar configurations: Kdb+ Server 4-cores and InfluxData Server 4-cores

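The "How much faster?" column is not defined on the slide, but it appears to be the kdb+ 4-core server rate divided by the InfluxData 4-core server rate, the two similar configurations. That reading is an inference; the short check below recomputes the ratios from the numbers in the table.

```python
# Queries per second copied from the summary table:
# query number -> (kdb+ server 4 cores, InfluxData server 4 cores).
qps = {
    1: (55578, 2606),
    2: (12019, 714),
    3: (1101, 192),
    4: (34905, 3600),
    5: (107810, 2614),
    6: (13018, 400),
}

def speedup(kdb_qps, influx_qps):
    """How many times more queries per second kdb+ served."""
    return kdb_qps / influx_qps

for query, (kdb, influx) in qps.items():
    print(f"query {query}: {speedup(kdb, influx):.1f}x")
```

Five of the six ratios agree with the slide to the precision shown. Query 4 works out to roughly 9.7, a little above the printed 9.6×, which suggests the slide truncated rather than rounded.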

  19. Query Rate: Kdb+ vs InfluxDB vs MongoDB
    [Bar chart, queries per second; legend: Kdb+]
    Raspberry Pi: 7,693
    MacBook: 63,138
    Server 1-Core: 56,649
    Server 4-Cores: 107,810
    Server 8-Cores: 122,666
    InfluxDB: 2,614
    MongoDB: 2,850

  20. Query Rate: Kdb+ vs InfluxDB vs ElasticSearch
    [Bar chart, queries per second; legend: Kdb+]
    Raspberry Pi: 1,333
    MacBook: 24,266
    Server 1-Core: 12,455
    Server 4-Cores: 34,905
    Server 8-Cores: 53,682
    InfluxDB: 3,600
    ElasticSearch: 79

  21. Query Rate: Kdb+ vs InfluxDB vs Cassandra
    [Bar chart, queries per second; legend: Kdb+]
    Query | Rasp Pi | MacBook | Server 1-Core | Server 4-Cores | Server 8-Cores | InfluxDB | Cassandra
    Query 1 | 4,741 | 48,055 | 25,061 | 55,578 | 79,084 | 2,606 | 1,912
    Query 2 | 457 | 4,487 | 3,442 | 12,019 | 21,087 | 714 | 442
    Query 3 | 54 | 531 | 298 | 1,101 | 1,918 | 192 | 66

  22. Query Rate: Kdb+ vs InfluxDB vs OpenTSDB
    [Bar chart, queries per second; legend: Kdb+]
    Raspberry Pi: 875
    MacBook: 7,804
    Server 1-Core: 5,366
    Server 4-Cores: 13,018
    Server 8-Cores: 17,090
    AWS 1-Core: 4,957
    AWS 2-Cores: 9,788
    InfluxDB: 400
    OpenTSDB (6 nodes): 106

  23. Demo 1 – Geographical Data
    • The NASA MERRA-2 data is available at: https://disc.sci.gsfc.nasa.gov/uui/datasets?keywords=%22MERRA-2%22
    • Roughly 250TB of data
    • Data divided across 100 datasets
    • Measurements from world-wide gridpoints from 1980-2016
    • .nc4 (NetCDF-4, Network Common Data Form) file type

  24. Example Data Sets
    • inst1_2d_lfo_Nx – Land Surface Forcings:
      • 9 variables, measured at almost 5 million grid points
      • 1 year's (2005) worth of daily data
      • This takes up 97GB on disk and is over 1.82 billion rows in table form

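The figures on this slide can be sanity-checked with a few lines of arithmetic. The sketch assumes one table row per grid point per day (with the 9 variables as columns) and takes 1 GB as 10^9 bytes; neither assumption is stated on the slide.

```python
grid_points = 5_000_000   # "almost 5 million" on the slide, so an upper bound
days = 365                # 2005 is not a leap year

rows_per_year = grid_points * days   # assumes one row per grid point per day
rows_per_month = rows_per_year / 12  # average month
bytes_per_row = 97e9 / 1.82e9        # 97 GB over 1.82 billion rows

print(rows_per_year)                 # upper bound on rows for the year
print(round(rows_per_month / 1e6))   # millions of rows in an average month
print(round(bytes_per_row, 1))       # average bytes per row on disk
```

Under those assumptions the upper bound is 1.825 billion rows, which is consistent with the stated "over 1.82 billion", and the average row costs a little over 53 bytes on disk.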

  25. Land Surface Forcings Demo
    latitude=30
    longitude=-90
    Source: https://www.latlong.net/c/?lat=35.047691&long=-90.026049

  26. Land Surface Forcings Demo
    Let's run a big query across all 1.82 billion rows:
    For a given location, find the change in pressure at every point in time for the year,
    and extract the rows where the magnitude of that change exceeds a specified threshold (500 here).
    There are roughly 150 million points per month.
    select from
      (select month,time,PS,SPEEDLML,delta:{0,1_deltas x}PS from lfo1 where
        month within(2005.01m;2005.12m),lat=30,lon=-90)
      where not delta within(-500;500)
    (this query takes around 8 seconds)

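The core of the query is the `delta` column: the difference between consecutive pressure readings, with the first row set to 0, followed by a filter that keeps rows outside the closed range -500 to 500. A small Python sketch of just that step is shown here; the readings are made up and the helper names are hypothetical.

```python
def deltas_from_zero(values):
    """Differences between consecutive values, with 0 for the first row."""
    return [0] + [b - a for a, b in zip(values, values[1:])]

def large_changes(rows, threshold=500):
    """Rows whose pressure change lies outside [-threshold, threshold]."""
    deltas = deltas_from_zero([ps for _, ps in rows])
    return [(t, ps, d) for (t, ps), d in zip(rows, deltas)
            if not (-threshold <= d <= threshold)]

# Hypothetical (time, PS) readings for one location; values are made up.
rows = [(0, 101300), (60, 101250), (120, 100600), (180, 100650), (240, 101400)]

print(large_changes(rows))
```

Setting the first delta to 0 matters: without it the first row would be compared against nothing and its full pressure value would look like a huge change.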

  27. Land Surface Forcings Demo
    select time,PS,SPEEDLML from lfo1 where
      month=2005.08m,lat=30,lon=-90
    [Line chart: pressure over time, from Aug 1st 00:00 to Aug 31st 00:00]

  28. Land Surface Forcings Demo
    Using colour to identify the intensity of SPEEDLML, you can see
    that when the pressure drops dramatically, the wind speed
    really picks up.

  29. Land Surface Forcings Demo
    By clicking on this trough of the graph, we can see the time is
    41160f (offset from the start of the month)

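The slide does not state the unit of the offset. If it is minutes, which is an assumption and not something the deck confirms, the trough can be converted to a calendar timestamp as sketched below, taking August 2005 from the month filter in the preceding query.

```python
from datetime import datetime, timedelta

month_start = datetime(2005, 8, 1)  # the demo query selects month=2005.08m
offset_minutes = 41160              # ASSUMPTION: the offset is in minutes

trough_time = month_start + timedelta(minutes=offset_minutes)
print(trough_time)
```

If the unit were something else (seconds or hours, say), only the `timedelta` keyword would change.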

  30. Land Surface Forcings Demo

  31. Land Surface Forcings Demo
    Colour represents time and the size of the dot represents wind speed.

  32. Demo 2 – NYC Motor Vehicle Collisions
    • Data is available at: https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95
    • Over 1 million rows of historical data between 2012 and 2017
    • Weather data available at: https://www7.ncdc.noaa.gov/CDO/dataproduct (up to 2013)
    • Daily summary data for a station located in Central Park

  33. NYC Motor Vehicle Collisions
    Day with the highest number of accidents: 2014.01.21

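Finding the busiest day is a count per date followed by a maximum. A minimal Python sketch of that step is shown here. The dates are made-up illustration data, not rows from the NYC dataset, and the real table's column names are not shown in the deck.

```python
from collections import Counter
from datetime import date

# Hypothetical collision records, one date per collision (made-up data).
collision_dates = [
    date(2014, 1, 20), date(2014, 1, 21), date(2014, 1, 21),
    date(2014, 1, 21), date(2014, 1, 22), date(2014, 1, 22),
]

counts = Counter(collision_dates)                 # collisions per day
busiest_day, n_collisions = counts.most_common(1)[0]

print(busiest_day, n_collisions)
```

`Counter.most_common(1)` returns the single highest count; with ties it returns whichever tied date was seen first, so a real analysis would want an explicit tie-breaking rule.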

  34. NYC Motor Vehicle Collisions
    Accidents involving injuries and fatalities in 2014

  35. NYC Motor Vehicle Collisions
    Pavement slippery

  36. Enterprise Interfaces
    • ODBC/JDBC
    • Web Services
    • Python & Perl
    • R & Matlab
    • TCP Sockets & Web Sockets
    • C#/.NET
    • Java/Scala
    • C/C++
    • CDC

  37. Resources
    Benchmarks
    • STAC benchmarks: https://stacresearch.com/kx; includes independently verified benchmarks of the technology using common capital markets use cases.
    • Intel solution brief: http://www.intel.com/content/www/us/en/processors/xeon/real-time-financial-analysis-with-kx-systems-brief.html
    • Gartner paper on Kx technology: https://kx.com/gartner-download.php
    Community
    • Kx Wiki: http://code.kx.com/wiki/Main_Page
    • Kx Community: http://kxcommunity.com/
    • Kx Github: http://kxsystems.github.io/