Slide 1

Slide 1 text

CakePHP & Spatial Big Data: Linking up large data warehouses with CakePHP

Slide 2

Slide 2 text

Introduction: Who / What and Why

Slide 3

Slide 3 text

Who am I? My name is Dan Voyce – I’m the CTO at LOCALLY. A bit about me:
• Reformed Brit
• CakePHP community member since 2006/2007 (version 0.9)
• Have been using CakePHP commercially since 1.0 / 1.2

Slide 4

Slide 4 text

Background: LOCALLY Tech & Data
LOCALLY is a leader in location data intelligence and real-time consumer engagement: an analytics and processing platform architected for massive scale.
• 800 TB data warehouse – designed for petabyte scale
• 250,000 POI resolutions per second – plans for 10M per second in future
• 300M daily location signals (post cleansing)
• 70M daily active users (post cleansing)
• 160+ unique static audiences
• Unlimited filter combinations to produce custom audiences
All user-facing systems are built with CakePHP 3.x.

Slide 5

Slide 5 text

Background: Geospatial Processing
Our specialty is human movement data: we collect anonymized telemetry from millions of devices each day. This consists of some device information (OS, brand, model, wireless capabilities, etc.) and location data (latitude, longitude, altitude, etc.).
Why? By understanding human movement and intersecting this data with other datasets, we can help businesses understand their customers – we can provide insights that allow a business to accurately predict customer intent and requirements.
POI Database: We maintain a database of millions of US POIs, both shape data and ‘point’ data. By spatially joining the human movement data with this POI data, we get context on the locations people visit and use that to determine intent.
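As a rough illustration of the spatial-join idea, here is a minimal point-in-polygon check in Python using the shapely library. This is only a sketch: the coordinates and POI footprint are made up, and the production joins run at warehouse scale rather than in Python.

    # Minimal sketch: attribute a single location signal to a POI footprint.
    # Coordinates below are hypothetical placeholders.
    from shapely.geometry import Point, Polygon

    # A made-up store footprint as (longitude, latitude) pairs
    store_footprint = Polygon([
        (-118.2440, 34.0520), (-118.2435, 34.0520),
        (-118.2435, 34.0525), (-118.2440, 34.0525),
    ])

    # A single anonymized location signal
    signal = Point(-118.2437, 34.0522)

    if store_footprint.contains(signal):
        print("Signal attributed to this POI")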

Slide 6

Slide 6 text

Where does CakePHP fit into this? We have developed three platforms from the ground up that are complementary to each other; we started building these on the release of Cake 3. The two I’ll talk about here are Engage and Discover. Engage is a consumer engagement platform and consists of an API and a CMS / visualisation dashboard. Discover is our interface to our big data platform and allows the building of custom audiences from our data. The Discover platform is where today’s discussion will centre.

Slide 7

Slide 7 text

Number Theory & Cardinality Estimation: Getting fast counts of data efficiently

Slide 8

Slide 8 text

Cardinality Estimation using Statistical Theory & HyperLogLog
The following section describes how we chose our cardinality estimation tools based on various factors (and MUCH trial and error). I’ll briefly describe the two main ones we ended up using:
HyperLogLog++ – Essentially, HLL is an algorithm that buckets hashes of distinct data, stored as an HLL sketch that is small enough to keep in memory. The sketch acts as a probabilistic counter of the approximate number of distinct elements in a set. Google tweaked it a bit to get HyperLogLog++ (HLL++). It has a selectable error margin, with higher margins producing faster counts; for our purposes an error margin of up to 5 or 6% is fine. Facebook did a great job articulating this: https://engineering.fb.com/data-infrastructure/hyperloglog/
VerdictDB (probability / statistical theory) – VerdictDB works on the idea that, for any sufficiently randomised sample of data, the properties of the sample are similar to those of the main dataset (Hoeffding's inequality). It lets you select an error margin, with the compromise being the size of the scramble vs the error percentage.
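To make the bucketing idea concrete, here is a hedged Python sketch using the third-party datasketch package. This is an assumption on my part for illustration only; our production sketches are built and merged inside Presto / BigQuery, not in Python.

    # Hedged sketch of the HLL++ idea using the datasketch package (assumed here;
    # the real sketches live in the warehouse). Only distinct values move the estimate.
    from datasketch import HyperLogLogPlusPlus

    hll = HyperLogLogPlusPlus(p=14)  # higher p = lower error margin, larger sketch

    for device_id in ["id-001", "id-002", "id-001", "id-003"]:
        hll.update(device_id.encode("utf8"))

    print(int(hll.count()))  # approximate number of distinct IDs (3 here)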

Slide 9

Slide 9 text

Deprecated Options for Fetching Cardinality Estimates
Given multiple huge datasets, how do you count distinct values over billions of rows efficiently? These were some of our early attempts that we soon discounted.

Method: SQL – SELECT count(distinct(id)) FROM `table` WHERE something… (~??? minutes; unlikely to complete easily on reasonable hardware)
Pros: • Easy to write for most people • Can use cardinality values in the DB if it supports them
Cons: • Not good with complicated predicates • Not good with joining multiple datasets • Very slow with geospatial (PostGIS)

Method: Spark or MapReduce – distinctValues = rdd.map(lambda x: x[i]).distinct(numPartitions=15).collect() (~10 minutes; 10-node reasonable hardware)
Pros: • Great for scaling • More programmatic than SQL
Cons: • Geospatial on Spark is painful • Not ideal for interactive querying due to cluster overheads • Data skew can kill this easily

Method: Presto approx_distinct(x) – SELECT approx_distinct(id) FROM `table` WHERE something… (~3 seconds to 15 minutes; 10-node reasonable hardware)
Pros: • Very simple to write • Great for interactive queries • Amazingly easy to integrate with CakePHP (fairly standard SQL dialect)
Cons: • Only good for very simple predicates • Performance is variable • Requires specific setup to maintain efficiency

Lots of assumptions are made here – there are a million ways to hack each one of these, but bear with me…
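For completeness, here is a self-contained PySpark sketch of the “distinct then count” approach from the Spark row above. The Parquet path and the "id" column are hypothetical placeholders, and count() is used instead of collect() so the distinct values are not all pulled back to the driver.

    # Hedged PySpark sketch of the Spark approach in the table above.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("distinct-id-count").getOrCreate()
    df = spark.read.parquet("s3://example-bucket/location_signals/")  # hypothetical path

    # RDD version, mirroring the snippet in the table
    distinct_ids = df.rdd.map(lambda row: row["id"]).distinct(numPartitions=15).count()

    # DataFrame equivalent, which usually optimises better
    distinct_ids_df = df.select("id").distinct().count()

    print(distinct_ids, distinct_ids_df)
    spark.stop()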

Slide 10

Slide 10 text

Better Options for Fetching Cardinality Estimates
Following continuous improvement and better knowledge, we landed on the following solutions:

Method: HLL++ – SELECT cardinality(merge(cast(hll AS HyperLogLog))) AS id_unique FROM `table` WHERE something… (~5–30 seconds on 10-node reasonable hardware / 1–3 seconds on BigQuery BI)
Pros: • Native to Presto & BigQuery (plugins for Postgres and others too) • Can return very fast cardinality estimates over many predicates
Cons: • Each predicate needs to be set up and sketches created for every combination • Requires specific setup to maintain efficiency • BigQuery is not so easy to integrate with CakePHP (but faster!)

Method: VerdictDB on Presto – SELECT count(distinct(id)) FROM `table` WHERE something… (~2–20 seconds on 10-node reasonable hardware)
Pros: • No need to change queries – Verdict sits as “middleware” and automatically translates them • Incredibly fast and reliable counts over many predicates, including geospatial
Cons: • Need to pre-generate “scrambles” of the data, which can be expensive as it is very computationally taxing • Requires specific setup to maintain efficiency

Method: VerdictDB on BigQuery / BQ BI – SELECT count(distinct(id)) FROM `table` WHERE something… (~1–7 seconds)
Pros: • No need to change queries • Incredibly fast and reliable counts when using BigQuery BI (3 s overhead when just using BQ) • Supports geospatial predicates • No need to manage infrastructure • Static pricing based on data amount
Cons: • Need to pre-generate “scrambles” of the data, though cheaper than on Presto • BigQuery BI is currently limited to 10 GB (will increase in future!) – should be fine for datasets around 1 TB (using a 1% sample)
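As an illustration of running an HLL-based count from application code, here is a hedged sketch using the official google-cloud-bigquery Python client. The project, dataset, table, date column, and sketch column names are hypothetical placeholders, and the sketch column is assumed to have been pre-built with HLL_COUNT.INIT(id).

    # Hedged sketch: merging pre-built HLL sketches in BigQuery from Python.
    # Project / dataset / table / column names below are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
        SELECT HLL_COUNT.MERGE(id_sketch) AS id_unique
        FROM `example-project.example_dataset.daily_id_sketches`
        WHERE signal_date BETWEEN '2019-07-01' AND '2019-07-31'
    """

    for row in client.query(sql).result():
        print(row.id_unique)  # approximate distinct-ID count over the month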

Slide 11

Slide 11 text

Plumbing it all together: Getting CakePHP to talk to everything!

Slide 12

Slide 12 text

Basic System Diagram
[Diagram components: Discover Dashboard, ML Cleansing, Spatial Processing, Unified Output, Verdict Scramble Creation, BI Engine, VerdictDB & Verdict Proxy, Data Lakes (S3 / GCS)]
Verdict Proxy is a JDBC bridge between PHP and VerdictDB, written in Node.js. It allows us to connect JDBC databases to PHP and handle some custom commands.
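The proxy itself is Node.js and is not reproduced here. As a rough, hedged illustration of the problem it solves (talking to a JDBC driver from a non-JVM language), here is a Python sketch using the jaydebeapi package; the driver class name, JDBC URL format, jar path, and table name are assumptions for illustration, not the actual proxy code.

    # Hedged illustration of bridging to a JDBC driver from a non-JVM language
    # (Python here; the real proxy is Node.js). Connection details are assumed.
    import jaydebeapi

    conn = jaydebeapi.connect(
        "org.verdictdb.jdbc41.Driver",                  # assumed driver class
        "jdbc:verdict:presto://presto-host:8080/hive",  # assumed JDBC URL format
        jars="/opt/jars/verdictdb-jdbc.jar",            # assumed jar location
    )
    cursor = conn.cursor()
    cursor.execute("SELECT count(distinct(id)) FROM signals WHERE state = 'CA'")
    print(cursor.fetchall())
    conn.close()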

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Query Examples

# Scramble creation for table
CREATE TABLE `dans-personal-gcloud.large_data.scramble_ts_date_2019_07_20`
PARTITION BY DATE(_COPY_PARTITIONTIME)
AS
SELECT partner, src, ts, id, id_type, lat, lon, alt, ha, va, speed, bearing, signal_type, context, conn_type, ip, carrier, ssid, device_make, device_model, device_ua, device_ver, device_lang, device_country, device_charging, user_id, user_id_type, meta_k, meta_v, city, state, zipcode, county, msa, country, place_provider, place_id, place_name, place_cat, place_dist, place_attr_type, _col_partition AS _COPY_PARTITIONTIME
FROM (
  SELECT partner, src, ts, id, id_type, lat, lon, alt, ha, va, speed, bearing, signal_type, context, conn_type, ip, carrier, ssid, device_make, device_model, device_ua, device_ver, device_lang, device_country, device_charging, user_id, user_id_type, meta_k, meta_v, city, state, zipcode, county, msa, country, place_provider, place_id, place_name, place_cat, place_dist, place_attr_type, _PARTITIONTIME AS _col_partition, MOD(CAST(CONCAT('0x', substr(TO_HEX(md5(id)), 1, 15)) AS INT64), 1000000) / 1000000.0 AS _col_hashed
  FROM `dans-personal-gcloud.large_data.ts_date_2019_07_20`
  WHERE _PARTITIONTIME >= TIMESTAMP('1970-01-01')
)
WHERE CASE WHEN _col_partition = TIMESTAMP('2019-08-20') THEN _col_hashed < 0.013998209069131694 END

voycey@cloudshell:~/keebobq$ python3 keebo-test.py
[INFO] 2019-08-20 18:39:55,364 - Total bytes processed: 0
320684
1.48236083984375
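The MOD(... md5(id) ...) expression in the scramble query above is just a deterministic way of mapping each id to a pseudo-uniform value in [0, 1) so that roughly 1.4% of ids are kept. Here is a hedged Python restatement of that expression; the helper name is mine, not part of the pipeline.

    # Hedged Python restatement of the hash-based sampling in the scramble query:
    # md5(id) -> first 15 hex chars -> integer -> value in [0, 1).
    import hashlib

    def sample_bucket(id_str: str) -> float:
        h = hashlib.md5(id_str.encode("utf8")).hexdigest()[:15]
        return (int(h, 16) % 1_000_000) / 1_000_000.0

    # A row is kept in the scramble when its bucket falls below the sampling
    # ratio (~0.014 above), i.e. roughly a 1.4% deterministic sample.
    print(sample_bucket("example-device-id") < 0.013998209069131694)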

Slide 16

Slide 16 text

Thank you!