CakePHP and Spatial Big Data - Visualizing 70 Billion Rows of data

D178520f4807af80972739cf4978bac7?s=47 Dan Voyce
November 09, 2019

CakePHP and Spatial Big Data - Visualizing 70 Billion Rows of data

LOCALLY is a global leader in location data intelligence and real-time consumer engagement. Our innovative and highly precise location technology merges physical places and digital devices with powerful AI and ML models, along with the transparency of blockchain, to deliver insights that are actionable at scale.

We have been working with CakePHP 3.x since the inception of the company (and personally as developers over 8 years), where we were processing around 500K records per month through a standard MQ workflow, over the past year we have scaled up to be ingesting over 20B records each month.

This talk will discuss how we interface our Data Warehouse with CakePHP to produce sub-second aggregation counts over Billions of rows over a variety of metrics, we talk about how we integrate with Presto (A Facebook open-source Big Data Query Engine) to provide an SQL like interface to our data-warehouse that is accessible by CakePHP and how we use HyperLogLog++ Sketches to allow for dynamic up-to-the second unique aggregation counts of both standard metrics and Geospatial Joins to be achieved in sub-second times.

D178520f4807af80972739cf4978bac7?s=128

Dan Voyce

November 09, 2019
Tweet

Transcript

  1. 3.

    Who am I? 3 My name is Dan Voyce –

    I’m the CTO at LOCALLY A bit about me: • Reformed Brit • CakePHP Community member since 2006/2007 (Version 0.9) • Have been using CakePHP commercially since 1.0 / 1.2
  2. 4.

    Background: LOCALLY Tech & Data LOCALLY is a leader in

    location data intelligence and real-time consumer engagement. 4 An Analytics and Processing Platform Architected for Massive Scale 800 TB Data Warehouse – Designed for Petabyte Scale 250,000 POI Resolutions per second – Plans for 10M p/s in future 300M (Post Cleansing) Daily Location Signals 70M (Post Cleansing) Daily Active Users 160+ Unique Static Audiences Unlimited filter combinations to produce custom audiences All user facing systems are built with CakePHP 3x
  3. 5.

    Background: Geospatial Processing Our specialty is human movement data, we

    collect anonymized telemetry over millions of devices each day. This consists of some device information (OS, Brand, Model, Wireless capabilities etc) Also consists of Location data (Latitude, Longitude, Altitude etc). Why? By understanding human movement and intersecting this data with other datasets we can help businesses understand their customers – we can provide insights to allow a business to accurately predict customer intent / requirements. 5 POI Database We maintain a database of Millions of US POI’s, both shape data and ‘point’ data. By spatially joining the human movement data with this POI data we can get context as to the locations people visit and use this to determine intent.
  4. 6.

    Where does CakePHP fit into this? We have developed 3

    platforms from the ground up that are complimentary to each other. We started building this on the release of Cake 3. We have 2 platforms – Engage and Discover. Engage is a consumer engagement platform and consists of an API and a CMS / Visualisation Dashboard. Discover is our interface to our big data platform and allows the building of custom audiences from our data. The Discover platform is where the discussion today will center around. 6
  5. 8.

    Cardinality Estimation using Statistical Theory & HyperLogLog The following section

    describes how we chose the cardinality estimation tools based on various factors (And MUCH trial and error) – ill briefly describe the 2 main ones we ended up using: HyperLogLog++ - Essentially HLL is an algorithm that buckets hashes of distinct data (stored as a HLL Sketch which can be stored in memory as it is small) It is then a probabilistic counter of the approximate number of distinct elements in a set. Google tweaked it a bit to get HyperLogLog++ (HLL++) It has a selectable “error-margin” with higher margins producing faster counts. For our purposes up to 5 or 6% error margin is fine. Facebook did a great job articulating this: https://engineering.fb.com/data-infrastructure/hyperloglog/ 8 - Probability / Statistic theory VerdictDB works with the idea that with any sufficiently randomised sample of data, the properties of this sample are similar to the main dataset (Hoeffding's inequality). It has the ability to select an error margin with the compromise being the size of the scramble vs the error %.
  6. 9.

    Deprecated Options for fetching Cardinality Estimates Given multiple huge datasets

    – how do you count distinct values over Billions of rows efficiently? These were some of our early attempts that we soon discounted 9 Method Pro’s Cons SQL: SELECT count(distinct(id)) FROM `table` WHERE something… ~???minutes (unlikely to complete easily on reasonable hardware) • Easy to write for most people • Can use cardinality values in DB if it supports it • Not good with complicated predicates • Not good with joining multiple datasets • Very slow with Geospatial (PostGIS) Spark or Map Reduce: distinctValues = rdd.map(lambda x : x[i]).distinct(numPartitions = 15).collect() ~10 minutes (10 Node Reasonable hardware • Great for scaling • More programmatic than SQL • Geospatial on Spark is painful • Not ideal for interactive querying due to cluster overheads • Data Skew can kill this easily Presto approx_distinct(x): SELECT approx_distinct(id) FROM `table` WHERE something… ~3s-15 minutes (10 Node Reasonable hardware) • Very simple to write • Great for interactive queries • Amazingly easy to integrate with CakePHP (Fairly standard SQL dialect) • Only good for very simple predicates • Performance is variable • Requires specific setup to maintain efficiency. Lots of assumptions made here, there are a million ways to hack each one of these but bear with me….
  7. 10.

    Better Options for fetching Cardinality Estimates Following continuous improvement and

    better knowledge we landed with the following solutions: 10 Method Pro’s Cons HLL++: SELECT cardinality(merge(cast(hll AS HyperLogLog))) AS id_unique FROM `table` WHERE something… ~5-30 seconds (10 Node Reasonable hardware) / 1-3 second (BigQuery BI) • Native to Presto & BigQuery (Plugins for Postgres and others too) • Can return very fast cardinality estimates over many predicates • Each predicate needs to be setup and sketches created for every combination. • Requires specific setup to maintain efficiency. • BigQuery not so easy to integrate with CakePHP (but Faster!) VerdictDB on Presto: SELECT count(distinct(id)) FROM `table` WHERE something… ~2-20 seconds (10 Node Reasonable hardware) • No need to change queries - Verdict sits as “Middleware” and automatically translates the queries. • Incredibly fast and reliable counts over many predicates Including Geospatial • Need to pre-generate “Scrambles” of the data – can be expensive as its very computationally taxing. • Requires specific setup to maintain efficiency. VerdictDB on BigQuery / BQ BI: SELECT count(distinct(id)) FROM `table` WHERE something… ~1-7 seconds • No need to change queries • Incredibly fast and reliable counts when using BigQuery BI (3s overhead when just using BQ). • Supports Geospatial Predicates • No need to manage infrastructure • Static Pricing based on data amount • Need to pre-generate “Scrambles” of the data, however cheaper than Presto. • BigQuery BI is currently limited to 10GB (will increase in future!) – should be fine for datasets ~1TB (using a 1% sample)
  8. 12.

    Basic System Diagram 12 Discover Dashboard ML Cleansing Spatial Processing

    Unified Output Verdict Scramble Creation BI Engine VerdictDB & Verdict Proxy Data Lakes (S3 / GCS) Verdict Proxy is a JDBC bridge between PHP and VerdictDB written in Node.js. Allows us to connect JDBC databases to PHP and handle some custom commands.
  9. 13.
  10. 14.
  11. 15.

    Query Examples 15 1:#Scramble creation for table 2: CREATE TABLE

    `dans-personal-gcloud.large_data.scramble_ts_date_2019_07_20` 3: PARTITION BY DATE(_COPY_PARTITIONTIME) 4: AS 5: SELECT partner, src, ts, id, id_type, lat, lon, alt, ha, va, speed, bearing, signal_type, context, conn_type, ip, carrier, ssid, device_make, device_model, device_ua, device_ver, device_lang, device_country, device_charging, user_id, user_id_type, meta_k, meta_v, city, state, zipcode, county, msa, country, place_provider, place_id, place_name, place_cat, place_dist, place_attr_type, _col_partition as _COPY_PARTITIONTIME 6: FROM ( 7: SELECT partner, src, ts, id, id_type, lat, lon, alt, ha, va, speed, bearing, signal_type, context, conn_type, ip, carrier, ssid, device_make, device_model, device_ua, device_ver, device_lang, device_country, device_charging, user_id, user_id_type, meta_k, meta_v, city, state, zipcode, county, msa, country, place_provider, place_id, place_name, place_cat, place_dist, place_attr_type, _PARTITIONTIME as _col_partition, MOD(CAST(CONCAT('0x', substr(TO_HEX(md5(id)), 1, 15)) as INT64), 1000000) / 1000000.0 as _col_hashed 8: FROM `dans-personal-gcloud.large_data.ts_date_2019_07_20` 9: WHERE _PARTITIONTIME >= TIMESTAMP('1970-01-01') 10: ) 11: WHERE CASE WHEN _col_partition = TIMESTAMP('2019-08-20') THEN _col_hashed < 0.013998209069131694 END 12: voycey@cloudshell:~/keebobq$ python3 keebo-test.py [INFO] 2019-08-20 18:39:55,364 - Total bytes processed: 0 320684 1.48236083984375 Note to self: To prepare for the demonstration re-load the data into BQ