Big Data Analytics with Time-Series Database kdb+

Speaker: Louise McCluskey, kdb+ Engineer
Description: "Big Data Analytics with Time-Series Database kdb+", followed by a Kx for Dashboards demo: watch cool visualizations of huge data sets.

Where: Dogpatch Labs, CHQ, IFSC, Dublin 1

https://www.meetup.com/PyLadiesDublin/events/dclgvlyxjbzb/

PyLadies Dublin

June 19, 2018
Transcript

  1. About us
     • Subsidiary of First Derivatives plc
     • 13 global offices, incl. NYC, Singapore, London & Tokyo
     • Large user community
     • Widely adopted in financial services over two decades
     • Now in hi-tech manufacturing, utilities, telco, energy, life sciences, earth observation
     • Software & industry solutions, consulting and implementation services
     Map regions: North America, Africa, Australia & NZ, Asia, UK & Europe
  2. Core technology: kdb+, a column-based time-series database with in-built programming language q
     • World's fastest time-series columnar database
     • Streaming, real-time and historical data in one platform
     • Runs on Linux, Windows, Solaris, and MacOS
     • Runs on commodity hardware, cloud, edge devices/appliances
     • Expressive query (qsql) and programming language (q)
     • In-memory compute engine for Complex Event Processing
     • Column-level compression
     • Integrates easily into legacy systems for performance augmentation
     • Multi-core / multi-processor / multi-thread / multi-server
  3. Known for
     • Processing & analysis of large volumes of real-time and historical time-series data
     • Extreme performance, low latency
     • Scalability without requiring significant infrastructure change
     • Provide the fastest, most efficient, most flexible tools and dashboards
     • Worldwide leader in high-volume, high-performance databases
  4. Sample q query
     select open: first price, high: max price, low: min price, close: last price from trade where date = 2013.05.01, sym=`VOD.L
     Result:
     open   high  low    close
     83.85  85.9  83.28  85.45
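For readers coming from Python, the aggregation in the query above can be sketched in plain Python. The trade rows are invented for illustration (only the four result values come from the slide), and `ohlc` is a hypothetical helper name, not part of kdb+.

```python
from datetime import date

# Hypothetical trade rows in time order; prices are invented so that the
# result matches the output shown on the slide.
trades = [
    {"date": date(2013, 5, 1), "sym": "VOD.L", "price": 83.85},
    {"date": date(2013, 5, 1), "sym": "VOD.L", "price": 85.9},
    {"date": date(2013, 5, 1), "sym": "BP.L", "price": 460.0},
    {"date": date(2013, 5, 1), "sym": "VOD.L", "price": 83.28},
    {"date": date(2013, 5, 1), "sym": "VOD.L", "price": 85.45},
    {"date": date(2013, 5, 2), "sym": "VOD.L", "price": 86.0},
]

def ohlc(rows, day, sym):
    """Open/high/low/close for one symbol on one day (rows must be time-ordered)."""
    prices = [r["price"] for r in rows if r["date"] == day and r["sym"] == sym]
    return {
        "open": prices[0],     # first price
        "high": max(prices),   # max price
        "low": min(prices),    # min price
        "close": prices[-1],   # last price
    }

result = ohlc(trades, date(2013, 5, 1), "VOD.L")
print(result)
```

The q version expresses the same filter and the four aggregations in a single line; the sketch only spells out what each aggregate means.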
  5. q vs. SQL
     q:   select sum qty by p.color from sp
     SQL: select p.color, sum(sp.qty) from sp, p where sp.p=p.p group by p.color order by color
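The q form is shorter because the `p` column of `sp` acts as a link into the parts table, so `p.color` follows that link without an explicit join. A minimal Python sketch of the same grouping, using made-up `parts` and `sp` data and a hypothetical helper `qty_by_color`:

```python
from collections import defaultdict

# Hypothetical parts table keyed by part id, and shipment (sp) rows
# whose "p" field refers to a part id.
parts = {"p1": {"color": "red"}, "p2": {"color": "green"}, "p3": {"color": "blue"}}
sp = [
    {"s": "s1", "p": "p1", "qty": 300},
    {"s": "s1", "p": "p2", "qty": 200},
    {"s": "s2", "p": "p1", "qty": 300},
    {"s": "s2", "p": "p3", "qty": 400},
]

def qty_by_color(sp_rows, parts_table):
    """Total quantity per part color."""
    totals = defaultdict(int)
    for row in sp_rows:
        color = parts_table[row["p"]]["color"]  # follow the link from sp to p
        totals[color] += row["qty"]
    return dict(sorted(totals.items()))  # ordered by color, like ORDER BY

totals = qty_by_color(sp, parts)
print(totals)
```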
  6. Parallelizing in q
     Single thread: {sum reverse sqrt log til x} each 1000000 * til 8
     Parallel:      {sum reverse sqrt log til x} peach 1000000 * til 8
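The only change between the two lines is `each` versus `peach`: the same function is applied to every input, either one after another or spread across workers. A rough Python analogue follows, with smaller inputs than the slide and a simplified `work` function (it skips 0, where the logarithm is undefined, and omits the `reverse` step). Threads are used here only to show the shape of the call; for CPU-bound work in CPython, `ProcessPoolExecutor` would be the usual choice for an actual speed-up.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def work(n):
    """Sum of sqrt(log(i)) for i in 1..n-1 (0 is skipped because log(0) is undefined)."""
    return sum(math.sqrt(math.log(i)) for i in range(1, n))

inputs = [10_000 * k for k in range(8)]  # smaller than the slide's 1000000 * til 8

# Analogous to `each`: apply the function to the inputs one at a time.
serial = [work(n) for n in inputs]

# Analogous to `peach`: hand the same inputs to a pool of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(work, inputs))

print(serial == parallel)
```

As in q, the parallel form returns results in input order, so swapping one for the other does not change the answer.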
  7. Kx Architecture (diagram)
     • Sources: file & DB sources, real-time sources
     • Ingestion: batch loader, stream feed handler, stream engine, ticker plant
     • Stream for Kx: stream and ingestion engine; complex event processing (queries, transforms, alerts, control signals, notifications, micro-services)
     • Core (kdb+): in-memory database, historical database, q language & qsql scripting
     • Native Lambda/HTAP architecture; scalability, high availability, and fault tolerance
     • Control for Kx: develop, configure, deploy, and manage solutions
     • Monitor for Kx: scan, monitor and alerting of issues in software and hardware
     • Analyst for Kx: query, explore, transform, and import without programming
     • Dashboards for Kx: build real-time visualizations for multiple devices
     • Third-party interoperability: pub/sub, SOA, ODBC, JDBC, web sockets; R, Python, MatLab, Java, C#, C/C++
     • Application framework and vertical market solutions: Kx for Flow, Kx for Surveillance, Kx for Algo, Kx for DaaS, Kx for Utilities, Kx for Cyber, Kx for Pharma, Kx for Sensors, Kx for Telco
  8. Other Verticals
     Sectors: Health & Life Sciences, Automotive, Utilities, Space and Telco, Retail, Manufacturing
     Use cases: Earth Observation, Geospatial Data Analytics, Anomaly Detection, Genomics Data Processing, Connected Health, Patient Record Analytics, Performance Analytics, Sensor Analytics, Smart Meters Data, Predictive Analytics, Retail Analytics, Marketing Optimization, Customer Journey, Edge Computing, Multivariate Analysis, Fault Detection
  9. Kx Performance Snippets
     • Trusted by 19 of the world's top 20 investment banks
     • 500 KB profile (L1/L2 cache)
     • Process & store 4.5 million events/second/core
     • Ingest data at 10 million records/second/core
     • Streaming 1.6 TB of data daily
     • Search in-memory tables at 4 billion records/second/core
  10. Transitive Comparisons
      • InfluxData published public benchmarks (data and software) against:
        • MongoDB
        • Cassandra
        • ElasticSearch
        • OpenTSDB
      • Kx applied identical methodologies and ran the tests to generate its own performance measurements
  11. Server Configurations
      | Platform | CPU | Memory | Storage | OS | Database |
      |---|---|---|---|---|---|
      | Raspberry Pi | 1.2GHz quad-core ARM Cortex-A53 | 1GB DDR2-900 MHz | 32GB Micro SDHC | Raspbian | Kdb+ (32 bit) |
      | MacBook Pro (mid-2014) | 3GHz Intel Core i7 (2 cores) | 16GB DDR3-1600 MHz | 500GB SSD Flash | MacOS 10.13.2 | Kdb+ (64 bit) |
      | Kx Server* | 3.2GHz quad-core E5-2667v3 Xeon (20MB cache) | 64GB DDR4-2133 MHz | 300GB SAS 10K | CentOS 7.3.1611 | Kdb+ (64 bit) |
      | InfluxData Server* | 3.6GHz quad-core E5-1271v3 Xeon (8MB cache) | 32GB DDR3-1600 MHz | 1.2TB NVMe SSD | Ubuntu 16.04 LTS | InfluxDB |
      * denotes similar server configurations for head-to-head comparisons
  12. Query Definitions
      | # | Definition | Kdb+ vs InfluxDB vs… | Data spanning |
      |---|---|---|---|
      | 1 | Return maximum value, by minute, in a 1-hour time frame, for 1 host | Cassandra | 1 day |
      | 2 | Return maximum value, by minute, in a 12-hour time frame, for 1 host | Cassandra | 1 day |
      | 3 | Return maximum value, by minute, in a 12-hour time frame, for 8 hosts | Cassandra | 1 day |
      | 4 | Return maximum value, by minute, in a 1-hour time frame, for 1 host (4 days) | ElasticSearch | 4 days |
      | 5 | Return maximum value, by minute, in a 1-hour time frame, for 1 host | MongoDB | 6 hours |
      | 6 | Return maximum value, by minute, in a 1-hour time frame, for 8 hosts | OpenTSDB | 4 hours |
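All six benchmark queries share one shape: filter by host and time window, bucket by minute, take the maximum. A minimal Python sketch of that shape follows. The readings, host names, and the helper `max_by_minute` are all hypothetical and are not taken from the InfluxData benchmark suite.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical readings: (timestamp, host, value), one every 20 seconds.
start = datetime(2018, 1, 1, 0, 0, 0)
readings = [
    (start + timedelta(seconds=20 * i), f"host_{i % 2}", float(i % 7))
    for i in range(720)  # 4 hours of data across 2 hosts
]

def max_by_minute(rows, host, t0, t1):
    """Maximum value per minute for one host within the window [t0, t1)."""
    best = defaultdict(lambda: float("-inf"))
    for ts, h, value in rows:
        if h == host and t0 <= ts < t1:
            minute = ts.replace(second=0, microsecond=0)  # bucket by minute
            best[minute] = max(best[minute], value)
    return dict(sorted(best.items()))

# Shape of query 1: one host, a 1-hour time frame.
result = max_by_minute(readings, "host_0", start, start + timedelta(hours=1))
print(len(result), "minutes")
```

The 12-hour and 8-host variants change only the window length and the host filter.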
  13. Kdb+ vs InfluxDB Summary
      | Query | Kdb+ Raspberry Pi | Kdb+ MacBook Pro | Kdb+ Server 1-core | Kdb+ Server 4-cores | Kdb+ Server 8-cores | InfluxDB InfluxData Server 4-cores | How much faster? |
      |---|---|---|---|---|---|---|---|
      | 1 | 4,741 | 48,055 | 25,061 | 55,578 | 79,084 | 2,606 | 21.3× |
      | 2 | 457 | 4,487 | 3,442 | 12,019 | 21,087 | 714 | 16.8× |
      | 3 | 54 | 531 | 298 | 1,101 | 1,918 | 192 | 5.73× |
      | 4 | 1,333 | 24,266 | 12,455 | 34,905 | 53,682 | 3,600 | 9.6× |
      | 5 | 7,693 | 63,138 | 56,649 | 107,810 | 122,666 | 2,614 | 41.2× |
      | 6 | 875 | 7,804 | 5,366 | 13,018 | 17,090 | 400 | 32.5× |
      Note: units are queries per second. Slide label: Similar Configurations.
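The "How much faster?" column appears to be the kdb+ Server 4-cores rate divided by the InfluxData Server 4-cores rate, the two quad-core machines that the server-configuration slide marks with * as similar. That reading is inferred from the numbers rather than stated on the slide. A quick check in Python:

```python
# Queries per second copied from the summary table:
# (kdb+ Server 4-cores, InfluxData Server 4-cores) for queries 1 to 6.
rates = {
    1: (55_578, 2_606),
    2: (12_019, 714),
    3: (1_101, 192),
    4: (34_905, 3_600),
    5: (107_810, 2_614),
    6: (13_018, 400),
}

# Ratio of the two rates for each query.
speedup = {q: kdb / influx for q, (kdb, influx) in rates.items()}

for q, ratio in speedup.items():
    print(f"Query {q}: {ratio:.2f}x")
```

The ratios agree with the slide's column to within rounding; query 4 comes out at about 9.70 against the slide's 9.6×, which suggests the slide truncated rather than rounded that figure.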
  14. Query Rate: Kdb+ vs InfluxDB vs MongoDB
      [Bar chart, queries per second]
      Kdb+: Raspberry Pi 7,693; MacBook 63,138; Server 1-core 56,649; Server 4-cores 107,810; Server 8-cores 122,666
      InfluxDB: 2,614
      MongoDB: 2,850
  15. Query Rate: Kdb+ vs InfluxDB vs ElasticSearch
      [Bar chart, queries per second]
      Kdb+: Raspberry Pi 1,333; MacBook 24,266; Server 1-core 12,455; Server 4-cores 34,905; Server 8-cores 53,682
      InfluxDB: 3,600
      ElasticSearch: 79
  16. Query Rate: Kdb+ vs InfluxDB vs Cassandra
      [Bar chart, queries per second, for queries 1 to 3]
      | Query | Kdb+ Rasp Pi | Kdb+ MacBook | Kdb+ Server 1-core | Kdb+ Server 4-cores | Kdb+ Server 8-cores | InfluxDB | Cassandra |
      |---|---|---|---|---|---|---|---|
      | 1 | 4,741 | 48,055 | 25,061 | 55,578 | 79,084 | 2,606 | 1,912 |
      | 2 | 457 | 4,487 | 3,442 | 12,019 | 21,087 | 714 | 442 |
      | 3 | 54 | 531 | 298 | 1,101 | 1,918 | 192 | 66 |
  17. Query Rate: Kdb+ vs InfluxDB vs OpenTSDB
      [Bar chart, queries per second]
      Kdb+: Raspberry Pi 875; MacBook 7,804; Server 1-core 5,366; Server 4-cores 13,018; Server 8-cores 17,090; AWS 1-core 4,957; AWS 2-cores 9,788
      InfluxDB: 400
      OpenTSDB (6 nodes): 106
  18. Demo 1 – Geographical Data
      • The NASA MERRA-2 data is available at: https://disc.sci.gsfc.nasa.gov/uui/datasets?keywords=%22MERRA-2%22
      • Roughly 250TB of data
      • Data divided across 100 datasets
      • Measurements from world-wide gridpoints from 1980-2016
      • .nc4 (Network Common Data Form, NetCDF-4) file type
  19. Example Data Sets
      • inst1_2d_lfo_Nx – Land Surface Forcings:
        • 9 variables, measured at almost 5 million grid points
        • 1 year's (2005) worth of daily data
        • This takes up 97GB on disk and is over 1.82 billion rows in table form
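The row count is consistent with one row per point per day. A back-of-envelope check, assuming (this is not stated on the slide) that each row holds all 9 variables for one point on one day, and taking GB as 10^9 bytes:

```python
points_per_day = 5_000_000   # "almost 5 million", so this is an upper bound
days_in_2005 = 365           # 2005 was not a leap year

rows_upper_bound = points_per_day * days_in_2005
print(f"{rows_upper_bound / 1e9:.3f} billion rows at most")

# 97GB spread over 1.82 billion rows, taking 1GB as 10**9 bytes.
bytes_per_row = 97e9 / 1.82e9
print(f"about {bytes_per_row:.0f} bytes per row")
```

The upper bound of 1.825 billion sits just above the 1.82 billion rows quoted, which fits "almost 5 million" points.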
  20. Land Surface Forcings Demo
      Let's run a big query across all 1.82 billion rows: for a given location, find the change in pressure at every point in time for the year, and extract the rows where the change in pressure exceeds a specified threshold. There are roughly 150 million points per month.
      select from (select month,time,PS,SPEEDLML,delta:{0,1_deltas x}PS from lfo1 where month within(2005.01m;2005.12m),lat=30,lon=-90) where not delta within(-500;500)
      (This query takes around 8 seconds.)
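The core of the query is `{0,1_deltas x}`, which replaces the first element of the pairwise differences with 0, followed by a filter that keeps rows whose delta falls outside the inclusive range -500 to 500. A minimal Python sketch of that logic, using invented pressure readings and hypothetical helper names:

```python
# Hypothetical surface-pressure readings (PS) at one location, in time order.
ps = [101300.0, 101250.0, 100600.0, 100550.0, 101200.0, 101210.0]

def deltas_from_zero(xs):
    """Change from the previous reading, with 0 for the first one."""
    return [0.0] + [b - a for a, b in zip(xs, xs[1:])]

def big_changes(xs, threshold=500.0):
    """(index, value, delta) where the delta lies outside [-threshold, threshold]."""
    return [
        (i, x, d)
        for i, (x, d) in enumerate(zip(xs, deltas_from_zero(xs)))
        if not -threshold <= d <= threshold
    ]

hits = big_changes(ps)
print(hits)
```

Forcing the first delta to 0 keeps the first reading from being flagged simply because it has no predecessor.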
  21. Land Surface Forcings Demo
      select time,PS,SPEEDLML from lfo1 where month=2005.08m,lat=30,lon=-90
      [Chart: pressure against time, Aug 1st 00:00 to Aug 31st 00:00]
  22. Land Surface Forcings Demo
      Using colour to show the intensity of SPEEDLML, you can see that when the pressure drops dramatically, the wind speed really picks up.
  23. Land Surface Forcings Demo
      By clicking on this trough of the graph, we can see the time is 41160f (offset from the start of the month).
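The slide does not give the unit of the offset. As a sketch, assuming it is in minutes (an assumption, chosen because the value then falls inside the month shown on the chart), it can be turned into a timestamp like this:

```python
from datetime import datetime, timedelta

month_start = datetime(2005, 8, 1)   # the chart covers August 2005
offset = 41160.0                     # value read off the chart; ASSUMED to be minutes

moment = month_start + timedelta(minutes=offset)
print(moment)
```

Under that assumption the trough falls on 29 August 2005 at 14:00.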
  24. Demo 2 – NYC Motor Vehicle Collisions
      • Data is available at: https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95
      • Over 1 million rows of historical data between 2012 and 2017
      • Weather data available at: https://www7.ncdc.noaa.gov/CDO/dataproduct (up to 2013)
      • Daily summary data for a station located in Central Park
  25. Enterprise Interfaces
      • ODBC/JDBC
      • Web Services
      • Python & Perl
      • R & Matlab
      • TCP Sockets & Web Sockets
      • C#/.NET
      • Java/Scala
      • C/C++
      • CDC
  26. Resources
      Benchmarks
      • STAC benchmarks: https://stacresearch.com/kx; includes independently verified benchmarks of the technology using common capital markets use cases.
      • Intel solution brief: http://www.intel.com/content/www/us/en/processors/xeon/real-time-financial-analysis-with-kx-systems-brief.html
      • Gartner paper on Kx technology: https://kx.com/gartner-download.php
      Community
      • Kx Wiki: http://code.kx.com/wiki/Main_Page
      • Kx Community: http://kxcommunity.com/
      • Kx Github: http://kxsystems.github.io/