Big data in the trenches

Every minute the Internet produces a crazy amount of data. Whether you work on a social network, an Internet of Things application, an e-commerce site or an online casino, you'll always end up with various sources and formats of data. Having data alone won't give you any advantage. Knowing what to do with it will.
Structured data (even in a well-known format like JSON), whether in SQL or NoSQL storage, is a dream for everyone who takes care of data. But oftentimes you'll end up with very dynamic content that is poorly structured or contains hard-to-parse items. You still need to live with this and act quickly in order to find the insight that is vital for your operations.
The purpose of this presentation is not to praise Big Data or to promote its advantages and amazing business opportunities. We'll get our hands dirty, touching upon the entire path: from producing and capturing the data, through its intermediate aggregation, real-time data stores and stream processing, to a distributed file system and batch processing over millions of records.
To support this with facts, we will talk about real-life examples using PHP as the front-facing layer, NoSQL databases, message queues and powerful map-reduce tools. You will hear many times about Couchbase, MongoDB, RabbitMQ, Kafka and Apache Spark, to name just a few. We'll also prove that SQL is not dead, and that it is entering a whole new era!

Wojciech Sznapka

October 02, 2016

Transcript

  1. Big data in the trenches Wojciech Sznapka, PHPConPL 2016

  2. Definition
    Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy. [1]
    [1] https://en.wikipedia.org/wiki/Big_data
  3. Agenda
    ➔ Producing: where data is born and how to capture it
    ➔ Processing: on-line, asynchronous, streaming and batch processing
    ➔ Lessons learned: accommodating traffic growth and data volume increase
  4. I’m Wojtek. Head Of Development @ XCaliber (part of Cherry AB) @sznapka @wowo
  5. Data production

  6. We can think about two ways of data creation: ACTIVE or PASSIVE
    Fact: Google processes 3.5 billion requests per day and stores 10 exabytes of data (10 billion gigabytes!). Facebook alone has 2.5 billion pieces of content, 2.7 billion ‘likes’ and 300 million photos, which adds up to more than 500 terabytes of data.
  7. Active data generation
    Data can be produced actively by:
    ➔ Specialized applications: activity tracker, search engine, gaming platform
    ➔ Sensors: all kinds of Internet of Things sensors
    ➔ User content: “likes”, comments, photos in social networks
  8. Data can also originate from passive sources, like logs, which also carry a lot of value
  9. Data generation application example
    Affiliate tracking and event tracking applications used in the iGaming industry
  10. Considerations
    To perform well in a high-throughput environment, think about:
    ➔ Tracker as light as possible: drop frameworks like Zend Framework or Symfony; you won’t need them in a tracking endpoint
    ➔ Strip down libraries: use and include only those you absolutely need (e.g. consider error_log/syslog instead of Monolog)
    ➔ Save data to fast storage: optimize for writes and use NoSQL
  11. Simple tracking example

        <?php
        /** @var TrackingService $trackingService */
        $trackingService = $this->getServiceLocator()->get('TrackingService');

        /** @var Media $media */
        $media = $trackingService->findMedia($mediaId, $affiliateId, $landingPageId, $request->getQuery());

        // A click: store it and redirect the visitor to the landing page.
        if (0 === strpos($uri->getQuery(), 'tracking_code')) {
            $trackingId = $trackingService->storeClick(
                $media,
                $channelId,
                $request->getServer('REMOTE_ADDR')
            );

            return $this->redirect()->toUrl(
                $trackingService->getLandingPageUrl($media, $channelId, $redirectUrl, $this->getAllowedParams(), $trackingId)
            );
        }

        // Anything else is counted as an impression.
        $trackingService->storeImpression($media, $channelId, $request->getServer('REMOTE_ADDR'));
  12. Load only necessary libs

        <?php
        // Tracking requests load a minimal module set; every other request gets the full stack.
        if (isset($_SERVER['REQUEST_URI']) && 0 === strrpos($_SERVER['REQUEST_URI'], '/tracking.php')) {
            $modules = [
                'DoctrineModule',
                'DoctrineORMModule',
                'Tracking',
            ];
        } else {
            $modules = [
                'Application',
                'DoctrineModule',
                'DoctrineORMModule',
                'ZF\\Apigility',
                'ZF\\ApiProblem',
                'ZF\\MvcAuth',
                'ZF\\OAuth2',
                'ZF\\Hal',
                'ZF\\Rest',
                'ZF\\Rpc',
                'ZF\\Versioning',
                'ZfrCors',
            ];
        }
  13. Things worth mentioning in terms of storing data
    - Be prepared for fast writes
    - Horizontal scalability is a must
    - Ensure failover and replicas
  14. Keep the CAP theorem in mind when choosing data storage
    Decide if you need strong consistency (CP) or whether eventual consistency is enough (AP).
    The CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
    - Consistency (every read receives the most recent write or an error)
    - Availability (every request receives a response, without guarantee that it contains the most recent version of the information)
    - Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)
  15. Data processing

  16. Process on-line only the tasks that are absolutely required.

  17. Strategies
    There are pretty much three ways of processing on-line:
    ➔ Materialized views: fetch data from a materialized view stored under some well-known key and process according to it
    ➔ Range of keys: fetch a range of keys and process based on the underlying values
    ➔ Map-reduce: fetch data from a map-reduce view (Couchbase/Mongo/Riak). Might be only eventually consistent
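
A minimal sketch of the first two strategies, assuming a plain Python dict as a stand-in for the key-value store. The key names (`summary::affiliate::42`, `clicks::<date>`) and both helper functions are hypothetical, made up only for illustration; a real store would offer its own lookup and range-scan calls.

```python
from bisect import bisect_left, bisect_right

# Hypothetical in-memory stand-in for a key-value store.
store = {
    "clicks::2016-07-02": 120,
    "clicks::2016-07-03": 95,
    "clicks::2016-07-04": 143,
    "impressions::2016-07-02": 2300,
    "summary::affiliate::42": {"clicks": 358, "impressions": 2300},
}


def read_materialized_view(key):
    """Strategy 1: a single lookup of a precomputed document under a well-known key."""
    return store[key]


def read_key_range(start_key, end_key):
    """Strategy 2: fetch every key in [start_key, end_key] and return the values."""
    keys = sorted(store)
    lo = bisect_left(keys, start_key)
    hi = bisect_right(keys, end_key)
    return [store[key] for key in keys[lo:hi]]


summary = read_materialized_view("summary::affiliate::42")
clicks_in_range = sum(read_key_range("clicks::2016-07-02", "clicks::2016-07-04"))
```

The materialized view costs one read but has to be kept up to date by whoever writes the data; the key range is always computed from the raw values, at the price of reading more keys per request.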
  18. Asynchronous processing

  19. Process data in CLI tasks consuming queues (RabbitMQ/Kafka).
    - Suitable for heavy/time-consuming tasks
    - Requires proper error handling
    - Makes it possible to extend the flow later (stream processing)
    Tip: make sure you have some sort of process management in place, e.g. supervisord. It will be needed for process crashes, deployments and restarts.
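
A rough sketch of the error-handling point, assuming Python's standard `queue.Queue` as a stand-in for a RabbitMQ or Kafka consumer. The message format, the `handle` function and the `dead_letters` list are hypothetical; a real worker would block on the broker instead of draining a local queue, and would acknowledge or reject messages through the broker's client library.

```python
import json
import queue


def handle(message):
    """Hypothetical heavy task; raises on malformed input."""
    event = json.loads(message)
    return event["type"], event["media_id"]


def consume(source, dead_letters):
    """Drain the queue; failed messages are parked instead of crashing the worker."""
    processed = []
    while True:
        try:
            message = source.get_nowait()
        except queue.Empty:
            break
        try:
            processed.append(handle(message))
        except (ValueError, KeyError) as error:
            # Invalid JSON raises ValueError, a missing field raises KeyError.
            dead_letters.append((message, repr(error)))
    return processed


source = queue.Queue()
for raw in ['{"type": "click", "media_id": 7}', "not json", '{"type": "impression"}']:
    source.put(raw)

dead_letters = []
processed = consume(source, dead_letters)
```

Parking bad messages keeps one malformed event from blocking the rest of the queue, and leaves something to inspect or replay later.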
  20. Data aggregation
    Aggregate your NoSQL data and save it into an easily accessible place (e.g. MySQL for reporting).
    Run the aggregation quite often, based on map-reduce (Couchbase/Riak) or search criteria (Mongo).
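
A small sketch of such an aggregation job, assuming a list of dicts as the raw NoSQL documents and an in-memory SQLite database as a stand-in for the MySQL reporting database. The table name `daily_events` and the document fields are hypothetical. `INSERT OR REPLACE` is SQLite syntax; MySQL would need `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE` instead.

```python
import sqlite3
from collections import Counter

# Hypothetical raw documents, as they might come out of a NoSQL bucket.
documents = [
    {"event": "click", "day": "2016-07-02"},
    {"event": "click", "day": "2016-07-02"},
    {"event": "impression", "day": "2016-07-02"},
    {"event": "click", "day": "2016-07-03"},
]


def aggregate(docs):
    """Count documents per (day, event) pair."""
    return Counter((doc["day"], doc["event"]) for doc in docs)


def save(connection, counts):
    """Upsert the aggregates so that re-running the job does not duplicate rows."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS daily_events ("
        "day TEXT, event TEXT, total INTEGER, PRIMARY KEY (day, event))"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO daily_events (day, event, total) VALUES (?, ?, ?)",
        [(day, event, total) for (day, event), total in counts.items()],
    )
    connection.commit()


connection = sqlite3.connect(":memory:")
save(connection, aggregate(documents))
save(connection, aggregate(documents))  # a second run overwrites rows, it does not append
rows = connection.execute(
    "SELECT day, event, total FROM daily_events ORDER BY day, event"
).fetchall()
```

Because the job runs often, making it idempotent (same input, same rows) matters more than making it fast.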
  21. Our story
    ➔ N1QL: the result of a Couchbase N1QL (SQL-like engine) query saved to a database every 5 minutes. N1QL doesn’t work very well for larger volumes of data
    ➔ Map-reduce: Couchbase’s map-reduce view. Views are pre-fetched / indexed, meaning every change to the bucket triggers the view’s reindexing, hence they are eventually consistent. Thanks to a composite map-reduce key we are able to limit the view’s result to a given day
    ➔ Spark streaming: an Apache Spark streaming job. It is fed with data from a queue, aggregates it on some interval basis and then saves the aggregated data to a database
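
The composite-key idea can be illustrated in plain Python. This is only a sketch under assumptions: it does not use Couchbase at all, and the event shape and the helper names (`map_event`, `build_view`, `query_day`) are hypothetical. The point is that putting the date parts at the front of the key lets a query select one day by key prefix.

```python
from collections import defaultdict

# Hypothetical events; "ts" holds (year, month, day).
events = [
    {"type": "click", "ts": (2016, 7, 2)},
    {"type": "click", "ts": (2016, 7, 2)},
    {"type": "impression", "ts": (2016, 7, 2)},
    {"type": "click", "ts": (2016, 7, 3)},
]


def map_event(event):
    """Map step: emit a composite key (year, month, day, type) with the value 1."""
    year, month, day = event["ts"]
    yield (year, month, day, event["type"]), 1


def build_view(docs):
    """Reduce step: sum the emitted values per composite key."""
    view = defaultdict(int)
    for doc in docs:
        for key, value in map_event(doc):
            view[key] += value
    return dict(view)


def query_day(view, year, month, day):
    """Limit the result to one day by matching the leading part of the key."""
    prefix = (year, month, day)
    return {key[3]: total for key, total in sorted(view.items()) if key[:3] == prefix}


view = build_view(events)
july_2 = query_day(view, 2016, 7, 2)
```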
  22. Stream processing

  23. Very scalable and efficient approach to data processing
    Allows running map-reduce jobs on batches of data arriving at defined intervals.
    Source: can be a queue (RabbitMQ or Kafka, which is widely used and supported), file resources, a TCP socket, a memory buffer etc.
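
To make "batches of data arriving at defined intervals" concrete, here is a plain-Python sketch of micro-batching. It is not Spark code: the `stream` list stands in for messages read from a queue, and `micro_batches` and `count_by_type` are hypothetical helpers. It assumes events arrive ordered by timestamp.

```python
from collections import Counter
from itertools import groupby

# Hypothetical (timestamp_seconds, event_type) pairs, ordered by time.
stream = [
    (0, "click"), (2, "impression"), (4, "click"),
    (5, "click"), (9, "impression"),
    (11, "click"),
]


def micro_batches(events, interval):
    """Group time-ordered events into consecutive batches of `interval` seconds."""
    for window, batch in groupby(events, key=lambda event: event[0] // interval):
        yield window * interval, list(batch)


def count_by_type(batch):
    """Map each event to its type and reduce by counting."""
    return dict(Counter(event_type for _, event_type in batch))


results = [(start, count_by_type(batch)) for start, batch in micro_batches(stream, 5)]
```

Each batch is small and independent, so the per-batch job can be the same map-reduce code that would run over a full data set.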
  24. There are multiple tools for stream processing. Apache Spark is one of the most extensively adopted in the market.
    Spark: the power of Spark lies in its libraries. You can mix streaming, SQL and machine learning without any problems.
  25. Batch processing.

  26. Batch processing
    ➔ Why: to process data for all kinds of reporting, analytics, machine learning models etc.
    ➔ How: using well-known tools, like Hadoop or Apache Spark. Often it is more convenient to use some big data SQL implementation.
    ➔ What: data usually stored in “cold” storage, like HDFS or the Amazon S3 service, in the form of JSON or log files, usually gzipped.
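
A minimal sketch of a batch job over gzipped JSON lines, using only Python's standard library. The archive is built in memory as a stand-in for a file fetched from S3 or HDFS, and the record fields and function names are hypothetical.

```python
import gzip
import io
import json
from collections import Counter


def write_archive(records):
    """Build a gzipped JSON-lines archive in memory (stand-in for a cold-storage file)."""
    buffer = io.BytesIO()
    with gzip.open(buffer, "wt", encoding="utf-8") as archive:
        for record in records:
            archive.write(json.dumps(record) + "\n")
    buffer.seek(0)
    return buffer


def count_events(archive):
    """Stream the archive line by line, so the whole file never sits in memory."""
    totals = Counter()
    with gzip.open(archive, "rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                totals[json.loads(line)["event"]] += 1
    return totals


archive = write_archive([
    {"event": "click", "host": "a.example"},
    {"event": "impression", "host": "a.example"},
    {"event": "click", "host": "b.example"},
])
totals = count_events(archive)
```

A single process like this stops scaling at some point, which is where tools such as Hadoop or Spark take over the same read, parse and count steps across many files and machines.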
  27. SQL is not DEAD!
    Usually it’s way more convenient, readable and maintainable to create batch jobs using an SQL implementation.
    Implementations: there are numerous so-called “SQL-on-Hadoop” engines like Hive, Apache Spark SQL, Presto, Apache Drill, HAWQ. All of them are SQL-92 compat
  28. Apache Spark provides SQL engine
    - You can seamlessly mix SQL and other Spark code (including streaming or machine learning)
    - Variety of sources including Avro, Hive, Parquet, JSON etc.
    - Connectivity via JDBC or ODBC
  29. Spark SQL step by step
    - Example running on the Databricks cloud platform
    - Register data using dbutils.fs.mount (S3 mountpoint)
    - Read and transform/filter data
    - Run an SQL query and enjoy the result (Databricks allows the %sql shorthand)
  30. Spark SQL step by step
    - Example running on the Databricks cloud platform
    - Register data using dbutils.fs.mount (S3 mountpoint)
    - Read and transform/filter data
    - Run an SQL query and enjoy the result (Databricks allows the %sql shorthand)

        # Mount the S3 bucket; credentials and names are defined elsewhere in the notebook.
        dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), MOUNT_NAME)

        def get_value_from_json(line):
            # Strip surrounding commas, parse the line and keep only its 'value' part.
            import json
            document = json.loads(line.strip(','))
            return json.dumps(document['value'])

        # Keep only lines that mention 'value', then register the result as a SQL table.
        tracking_rdd = sc.textFile(tracking_path).filter(lambda x: 'value' in x).map(get_value_from_json)
        tracking_json = sqlContext.jsonRDD(tracking_rdd)
        tracking_json.registerTempTable("tracking")

        %sql
        SELECT event, host, day, count
        FROM tracking
        WHERE year = 2016 AND month = 07 AND day BETWEEN 2 AND 5
        ORDER BY day, event;
  31. Lessons learned

  32. Stateless system = horizontally scalable

  33. Skim all boilerplate from your producers

  34. Segregate responsibilities of servers, create single-responsibility servers

  35. Store on-line data in horizontally scalable storage

  36. Aggregate data for immediate on-line access

  37. On-line storage is expensive; archive historic data to cold storage (S3 / HDFS)
  38. Offload as much as possible to background processors (queues or streaming jobs)
  39. Use Apache Spark notebooks (spark-notebook.io, Apache Zeppelin or Databricks) for interactive data analysis
  40. Thank you! Ask me anything: wojciech.sznapka@xcaliber.com twitter.com/sznapka