Big data in the trenches

Every minute the Internet produces a staggering amount of data. Whether you work on a social network, an Internet of Things application, an e-commerce site or an online casino, you'll always end up with various sources and formats of data. Having data alone won't give you any advantage. Knowing what to do with it will.
Structured data (whether SQL or NoSQL, even in a well-known format such as JSON) is a dream for everyone who takes care of data. But oftentimes you'll end up with very dynamic content that is poorly structured or contains hard-to-parse items. You still need to live with this and act quickly in order to find the insights that are vital for your operations.
This presentation's purpose is not to praise Big Data or to promote its advantages and amazing business opportunities. We'll get our hands dirty along the entire path: from producing and capturing the data, through its intermediate aggregation, real-time data storage and stream processing, to distributed file systems and batch processing over millions of records.
To support this with facts, we will talk about real-life examples using PHP as the front-facing layer, NoSQL databases, messaging queues and powerful map-reduce tools. You will hear many times about Couchbase, MongoDB, RabbitMQ, Kafka and Apache Spark, to name just a few. We'll also prove that SQL is not dead, and that it is entering a whole new era!

Wojciech Sznapka

October 02, 2016

Transcript

  1. Big data in the
    trenches
    Wojciech Sznapka, PHPConPL 2016

  2. Definition
    Big data is a term for data sets that are so
    large or complex that traditional data
    processing applications are inadequate to
    deal with them. Challenges include analysis,
    capture, data curation, search, sharing, storage,
    transfer, visualization, querying, updating and
    information privacy. [1]
    [1] https://en.wikipedia.org/wiki/Big_data

  3. Agenda
    ➔ Producing
    Where data is born and how to capture it
    ➔ Processing
    On-line, asynchronous, streaming and batch processing
    ➔ Lessons learned
    Accommodating traffic growth and data volume increase

  4. I’m Wojtek.
    Head Of Development @ XCaliber (part of Cherry AB)
    @sznapka
    @wowo

  5. Data
    production

  6. We can think about two ways of data creation:
    ACTIVE or PASSIVE
    Fact: Google processes 3.5 billion requests per day and stores 10 exabytes of data (10 billion gigabytes!). Facebook alone has 2.5 billion pieces of content, 2.7 billion ‘likes’ and 300 million photos – which adds up to more than 500 terabytes of data.

  7. Active data generation
    Data can be produced actively by:
    ➔ Specialized application
    Activity tracker, search engine, gaming
    platform
    ➔ Sensors
    All kinds of Internet of Things sensor
    ➔ User content
    “Likes”, comments, photos in social
    networks

  8. Data can also originate from passive sources, like logs, which also carry a lot of value

  9. Data generation application example
    Affiliate tracking and event tracking applications used in the iGaming industry

  10. Considerations
    In order to perform in a high-throughput environment, think about:
    ➔ Tracker as light as possible
    Drop frameworks like Zend Framework or Symfony - you won’t need them in the tracking endpoint
    ➔ Strip down libraries
    Use and include only those you absolutely need (e.g. consider error_log/syslog instead of Monolog)
    ➔ Save data to fast storage
    Optimize for writes and use NoSQL
  11. Simple tracking example

    /** @var TrackingService $trackingService */
    $trackingService = $this->getServiceLocator()->get('TrackingService');

    /** @var Media $media */
    $media = $trackingService->findMedia($mediaId, $affiliateId, $landingPageId, $request->getQuery());

    if (0 === strpos($uri->getQuery(), 'tracking_code')) {
        $trackingId = $trackingService->storeClick(
            $media,
            $channelId,
            $request->getServer('REMOTE_ADDR')
        );

        return $this->redirect()->toUrl(
            $trackingService->getLandingPageUrl($media, $channelId, $redirectUrl, $this->getAllowedParams(), $trackingId)
        );
    }

    $trackingService->storeImpression($media, $channelId, $request->getServer('REMOTE_ADDR'));

  12. Load only necessary libs

    if (isset($_SERVER['REQUEST_URI']) && 0 === strrpos($_SERVER['REQUEST_URI'], '/tracking.php')) {
        $modules = [
            'DoctrineModule',
            'DoctrineORMModule',
            'Tracking',
        ];
    } else {
        $modules = [
            'Application',
            'DoctrineModule',
            'DoctrineORMModule',
            'ZF\\Apigility',
            'ZF\\ApiProblem',
            'ZF\\MvcAuth',
            'ZF\\OAuth2',
            'ZF\\Hal',
            'ZF\\Rest',
            'ZF\\Rpc',
            'ZF\\Versioning',
            'ZfrCors',
        ];
    }

  13. Things worth mentioning in terms of storing data
    - Be prepared for fast writes
    - Horizontal scalability is a must
    - Ensure failover and replicas

  14. Keep in mind the CAP theorem when choosing data storage
    Decide whether you need strong consistency (CP) or whether eventual consistency is enough (AP)
    CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
    - Consistency (every read receives the most recent write or an error)
    - Availability (every request receives a response, without guarantee that it contains the most recent version of the information)
    - Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)

  15. Data processing

  16. Process online only the tasks that are absolutely required.

  17. Strategies
    There are pretty much three ways of processing online (see the sketch below):
    ➔ Materialized views
    Fetch data from a materialized view stored under some well-known key and process based on it
    ➔ Range of keys
    Fetch a range of keys and process based on the underlying values
    ➔ Map-reduce
    Fetch data from a map-reduce view (Couchbase/Mongo/Riak). Might be only eventually consistent
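    To make these three patterns concrete, here is a minimal Python sketch using a plain dict as a stand-in for the key-value store; the key layout and document fields are made up for illustration, not the production schema:

    # Stand-in for a NoSQL bucket: key -> JSON-like document (hypothetical key layout).
    kv = {
        "stats::2016-10-02": {"clicks": 1200, "impressions": 54000},  # materialized view
        "click::2016-10-02::0001": {"channel": 5, "amount": 1},
        "click::2016-10-02::0002": {"channel": 7, "amount": 1},
    }

    # 1) Materialized view: fetch a pre-aggregated document under a well-known key.
    daily_stats = kv["stats::2016-10-02"]

    # 2) Range of keys: fetch all keys for a given day and process the underlying values.
    day_clicks = [doc for key, doc in kv.items() if key.startswith("click::2016-10-02")]

    # 3) Map-reduce: emit (key, value) pairs, then reduce per key; this is what a
    #    Couchbase/Mongo/Riak view does server-side, hence the eventual consistency.
    mapped = [(doc["channel"], doc["amount"]) for doc in day_clicks]
    clicks_per_channel = {}
    for channel, amount in mapped:
        clicks_per_channel[channel] = clicks_per_channel.get(channel, 0) + amount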

  18. Asynchronous
    processing

  19. Process data in CLI tasks consuming queues (RabbitMQ/Kafka) - see the consumer sketch below.
    - Suitable for heavy/time-consuming tasks
    - Requires proper error handling
    - Creates the possibility to extend the flow later (stream processing)
    Tip: make sure you have some sort of process management in place, e.g. supervisord; it will be needed to handle process crashes/deployments/restarts
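    A minimal consumer sketch for this flow, assuming RabbitMQ with the pika 1.x client; the queue name, message format and process_event() helper are hypothetical:

    import json
    import pika

    def process_event(event):
        # Hypothetical heavy/time-consuming work, e.g. enriching and persisting the event.
        print("processed", event.get("event"))

    def on_message(channel, method, properties, body):
        try:
            process_event(json.loads(body))
            channel.basic_ack(delivery_tag=method.delivery_tag)
        except Exception:
            # Proper error handling: reject and requeue (or route to a dead-letter queue).
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="tracking_events", durable=True)
    channel.basic_qos(prefetch_count=10)
    channel.basic_consume(queue="tracking_events", on_message_callback=on_message)
    channel.start_consuming()  # run under supervisord so crashes/deploys/restarts are handled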

  20. Data aggregation
    Aggregate your NoSQL data and save it into an easily accessible place (e.g. MySQL for reporting) - a sketch follows below.
    Run the aggregation quite often, based on map-reduce views (Couchbase/Riak) or search criteria (Mongo)
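    A sketch of the aggregation step, assuming aggregated per-channel counts are upserted into a MySQL reporting table via PyMySQL; the table name and columns are assumptions:

    import pymysql

    # Aggregated rows produced by the map-reduce / search-criteria step: (day, channel, clicks).
    rows = [("2016-10-02", 5, 812), ("2016-10-02", 7, 388)]

    connection = pymysql.connect(host="localhost", user="report", password="secret", db="reporting")
    try:
        with connection.cursor() as cursor:
            cursor.executemany(
                "INSERT INTO daily_clicks (day, channel, clicks) VALUES (%s, %s, %s) "
                "ON DUPLICATE KEY UPDATE clicks = VALUES(clicks)",  # idempotent, safe to re-run often
                rows,
            )
        connection.commit()
    finally:
        connection.close()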

  21. Our story
    ➔ N1QL
    Couchbase’s N1QL (SQL-like engine) result saved to the database every 5 minutes. N1QL doesn’t work very well for larger volumes of data
    ➔ Map-reduce
    Couchbase’s map-reduce view. Views are pre-fetched / indexed, meaning every change to the bucket triggers the view’s reindexing, hence they are eventually consistent. Thanks to a composite map-reduce key we are able to limit the view’s result to a given day
    ➔ Spark streaming
    Apache Spark streaming job. Feeds on data from the queue, aggregates it at a fixed interval and then saves the aggregated data to the database

  22. Stream processing

  23. Very scalable and efficient approach to data processing
    Allows you to run map-reduce jobs on batches of data arriving at defined intervals (see the sketch below)
    Source: the source can be a queue (RabbitMQ or Kafka, which is widely used and supported), a file resource, a TCP socket, a memory buffer, etc.
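    A minimal Spark Streaming sketch (using the Spark 1.x/2.x era Kafka direct stream API); the topic name, broker address and JSON payload shape are assumptions:

    import json
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="tracking-stream")
    ssc = StreamingContext(sc, batchDuration=60)  # micro-batches every 60 seconds

    # Direct stream from Kafka: each record is a (key, value) pair, the value being the JSON payload.
    stream = KafkaUtils.createDirectStream(ssc, ["tracking"], {"metadata.broker.list": "kafka:9092"})

    # Map-reduce over every micro-batch: count events per (event, channel).
    counts = (stream.map(lambda kv: json.loads(kv[1]))
                    .map(lambda doc: ((doc["event"], doc["channel"]), 1))
                    .reduceByKey(lambda a, b: a + b))
    counts.pprint()  # in production, write each batch to the reporting database instead

    ssc.start()
    ssc.awaitTermination()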

  24. There are multiple tools for stream processing.
    Apache Spark is one of the most widely adopted in the market.
    Spark: the power of Spark lies in its libraries. You can mix streaming, SQL and machine learning without any problems.

  25. Batch processing.

  26. Batch processing
    ➔ Why
    To process data for all kinds of reporting, analytics, machine learning models etc.
    ➔ How
    Using well-known tools like Hadoop or Apache Spark. Often it is more convenient to use a big data SQL implementation.
    ➔ What
    Data usually stored in “cold” storage, like HDFS or the Amazon S3 service, in the form of JSON or log files, usually gzipped (see the sketch below).
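    A sketch of such a batch job in PySpark, reading gzipped JSON logs from cold storage; the bucket path and document fields are assumptions (textFile decompresses .gz files transparently):

    import json
    from pyspark import SparkContext

    sc = SparkContext(appName="tracking-batch")

    # Gzipped JSON logs archived in cold storage (S3 or HDFS).
    lines = sc.textFile("s3a://my-bucket/tracking/2016/07/*.gz")

    # Parse each line and count clicks per (day, event) over millions of records.
    daily_counts = (lines.map(json.loads)
                         .map(lambda doc: ((doc["day"], doc["event"]), 1))
                         .reduceByKey(lambda a, b: a + b))

    for (day, event), count in daily_counts.collect():
        print(day, event, count)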

  27. SQL is not DEAD!
    Usually it’s way more convenient, readable and maintainable to create batch jobs using an SQL implementation
    Implementations: there are numerous so-called “SQL-on-Hadoop” engines like Hive, Apache Spark SQL, Presto, Apache Drill and HAWQ, all of them largely SQL-92 compatible

  28. Apache Spark provides an SQL engine
    - You can seamlessly mix SQL and other Spark code (including streaming or machine learning)
    - Variety of sources including Avro, Hive, Parquet, JSON etc.
    - Connectivity via JDBC or ODBC

  29. Spark SQL step by step
    Example running on the Databricks cloud platform
    Register data using dbutils.fs.mount (S3 mountpoint)
    Read and transform/filter data
    Run the SQL query and enjoy the result (Databricks allows you to use the %sql shorthand)

  30. Spark SQL step by step
    - Example running on the Databricks cloud platform
    - Register data using dbutils.fs.mount (S3 mountpoint)
    - Read and transform/filter data
    - Run the SQL query and enjoy the result (Databricks allows you to use the %sql shorthand)

    dbutils.fs.mount(
        "s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME),
        MOUNT_NAME
    )

    def get_value_from_json(line):
        import json
        document = json.loads(line.strip(','))
        return json.dumps(document['value'])

    tracking_rdd = sc.textFile(tracking_path) \
        .filter(lambda x: 'value' in x) \
        .map(get_value_from_json)
    tracking_json = sqlContext.jsonRDD(tracking_rdd)
    tracking_json.registerTempTable("tracking")

    %sql
    SELECT event, host, day, count
    FROM tracking
    WHERE
        year = 2016 AND
        month = 07 AND
        day BETWEEN 2 AND 5
    ORDER BY day, event;
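    The same query can also be issued straight from Python instead of a %sql cell, which makes it easy to mix with the rest of the job; a small sketch against the tracking table registered above:

    # Equivalent of the %sql cell, returned as a DataFrame.
    result = sqlContext.sql("""
        SELECT event, host, day, count
        FROM tracking
        WHERE year = 2016 AND month = 7 AND day BETWEEN 2 AND 5
        ORDER BY day, event
    """)
    result.show()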

  31. Lessons learned

  32. Stateless system =
    horizontally scalable

  33. Strip all boilerplate from your producers

  34. Segregate the responsibilities of servers; give each server a single responsibility

  35. Store on-line data in
    horizontally scalable
    storage

  36. Aggregate data for immediate on-line access

  37. Online storage is expensive; archive historical data to cold storage (S3 / HDFS)

  38. Offload as much as possible
    to background processors
    (queues or streaming jobs)

  39. Use Apache Spark notebooks (spark-notebook.io, Apache Zeppelin or Databricks) for interactive data analysis

  40. Thank you!
    Ask me anything:
    [email protected]
    twitter.com/sznapka
