Big data in the trenches

Every minute the Internet produces a staggering amount of data. Whether you work on a social network, an Internet of Things application, an e-commerce site or an online casino, you'll always end up with various sources and formats of data. Having data alone won't give you any advantage. Knowing what to do with it will.
Structured data (whether SQL or NoSQL, even in a well-known format such as JSON) is a dream for everyone who takes care of data. But oftentimes you'll end up with very dynamic content that is poorly structured or contains hard-to-parse items. You still need to live with this and act quickly in order to find the insights that are vital for your operations.
This presentation's purpose is not to praise Big Data or to promote its advantages and amazing business opportunities. We'll get our hands dirty along the entire path: from producing and capturing the data, through its intermediate aggregation, real-time data storage and stream processing, to distributed file systems and batch processing over millions of records.
To support this with facts, we will talk about real-life examples using PHP as the front-facing layer, NoSQL databases, messaging queues and powerful map-reduce tools. You will hear many times about Couchbase, MongoDB, RabbitMQ, Kafka and Apache Spark, to name just a few. We'll also prove that SQL is not dead, and that it is entering a whole new era!

Wojciech Sznapka

October 02, 2016

Transcript

  1. Big data in the
    trenches
    Wojciech Sznapka, PHPConPL 2016

  2. Definition
    Big data is a term for data sets that are so
    large or complex that traditional data
    processing applications are inadequate to
    deal with them. Challenges include analysis,
    capture, data curation, search, sharing, storage,
    transfer, visualization, querying, updating and
    information privacy. [1]
    [1] https://en.wikipedia.org/wiki/Big_data

  3. Agenda
    ➔ Producing
    Where data is born and how to capture it
    ➔ Processing
    On-line, asynchronous, streaming and batch processing
    ➔ Lessons learned
    Accommodating traffic growth and data volume increase

  4. I’m Wojtek.
    Head Of Development @ XCaliber (part of Cherry AB)
    @sznapka
    @wowo

  5. Data
    production

  6. We can think about two ways of data creation:
    ACTIVE or PASSIVE
    Fact: Google processes 3.5 billion requests per day and stores 10 exabytes of data (10 billion gigabytes!). Facebook alone has 2.5 billion pieces of content, 2.7 billion ‘likes’ and 300 million photos – which adds up to more than 500 terabytes of data.

  7. Active data generation
    Data can be produced actively by:
    ➔ Specialized application
    Activity tracker, search engine, gaming
    platform
    ➔ Sensors
    All kinds of Internet of Things sensor
    ➔ User content
    “Likes”, comments, photos in social
    networks

  8. Data can also originate from passive sources, like logs, which also carry a lot of value

  9. Data generation application example
    Affiliate tracking and event tracking applications used in the iGaming industry

  10. Considerations
    In order to perform in a high-throughput environment, think about:
    ➔ Tracker as light as possible
    Drop frameworks like Zend Framework or Symfony - you won’t need them in the tracking endpoint
    ➔ Strip down libraries
    Use and include only those you absolutely need (e.g. consider error_log/syslog instead of Monolog)
    ➔ Save data to fast storage
    Optimize for writes and use NoSQL
  11. Simple tracking example

    /** @var TrackingService $trackingService */
    $trackingService = $this->getServiceLocator()->get('TrackingService');

    /** @var Media $media */
    $media = $trackingService->findMedia($mediaId, $affiliateId, $landingPageId, $request->getQuery());

    if (0 === strpos($uri->getQuery(), 'tracking_code')) {
        $trackingId = $trackingService->storeClick(
            $media,
            $channelId,
            $request->getServer('REMOTE_ADDR')
        );

        return $this->redirect()->toUrl(
            $trackingService->getLandingPageUrl($media, $channelId, $redirectUrl, $this->getAllowedParams(), $trackingId)
        );
    }

    $trackingService->storeImpression($media, $channelId, $request->getServer('REMOTE_ADDR'));

  12. Load only necessary libs

    if (isset($_SERVER['REQUEST_URI']) && 0 === strrpos($_SERVER['REQUEST_URI'], '/tracking.php')) {
        $modules = [
            'DoctrineModule',
            'DoctrineORMModule',
            'Tracking',
        ];
    } else {
        $modules = [
            'Application',
            'DoctrineModule',
            'DoctrineORMModule',
            'ZF\\Apigility',
            'ZF\\ApiProblem',
            'ZF\\MvcAuth',
            'ZF\\OAuth2',
            'ZF\\Hal',
            'ZF\\Rest',
            'ZF\\Rpc',
            'ZF\\Versioning',
            'ZfrCors',
        ];
    }

  13. Things worth mentioning in terms of storing data
    - Be prepared for fast writes
    - Horizontal scalability is a must
    - Ensure failover and replicas

  14. Keep in mind the CAP theorem when choosing data storage
    Decide whether you need strong consistency (CP) or whether eventual consistency is enough (AP)
    CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
    - Consistency (every read receives the most recent write or an error)
    - Availability (every request receives a response, without guarantee that it contains the most recent version of the information)
    - Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)

  15. Data processing

  16. Process online only the tasks that are absolutely required.

  17. Strategies
    There are pretty much three ways of processing online (see the sketch below):
    ➔ Materialized views
    Fetch data from a materialized view stored under some well-known key and process based on it
    ➔ Range of keys
    Fetch a range of keys and process based on the underlying values
    ➔ Map-reduce
    Fetch data from a map-reduce view (Couchbase/Mongo/Riak). Might be only eventually consistent
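    To make these three patterns concrete, here is a minimal Python sketch using a plain dict as a stand-in for the key-value store; the key layout and document fields are made up for illustration, not the production schema:

    # Stand-in for a NoSQL bucket: key -> JSON-like document (hypothetical key layout).
    kv = {
        "stats::2016-10-02": {"clicks": 1200, "impressions": 54000},  # materialized view
        "click::2016-10-02::0001": {"channel": 5, "amount": 1},
        "click::2016-10-02::0002": {"channel": 7, "amount": 1},
    }

    # 1) Materialized view: fetch a pre-aggregated document under a well-known key.
    daily_stats = kv["stats::2016-10-02"]

    # 2) Range of keys: fetch all keys for a given day and process the underlying values.
    day_clicks = [doc for key, doc in kv.items() if key.startswith("click::2016-10-02")]

    # 3) Map-reduce: emit (key, value) pairs, then reduce per key; this is what a
    #    Couchbase/Mongo/Riak view does server-side, hence the eventual consistency.
    mapped = [(doc["channel"], doc["amount"]) for doc in day_clicks]
    clicks_per_channel = {}
    for channel, amount in mapped:
        clicks_per_channel[channel] = clicks_per_channel.get(channel, 0) + amount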

  18. Asynchronous
    processing

  19. Process data in CLI tasks consuming queues (RabbitMQ/Kafka) - see the consumer sketch below.
    - Suitable for heavy/time-consuming tasks
    - Requires proper error handling
    - Creates the possibility to extend the flow later (stream processing)
    Tip: make sure you have some sort of process management in place, e.g. supervisord; it will be needed to handle process crashes/deployments/restarts
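    A minimal consumer sketch for this flow, assuming RabbitMQ with the pika 1.x client; the queue name, message format and process_event() helper are hypothetical:

    import json
    import pika

    def process_event(event):
        # Hypothetical heavy/time-consuming work, e.g. enriching and persisting the event.
        print("processed", event.get("event"))

    def on_message(channel, method, properties, body):
        try:
            process_event(json.loads(body))
            channel.basic_ack(delivery_tag=method.delivery_tag)
        except Exception:
            # Proper error handling: reject and requeue (or route to a dead-letter queue).
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="tracking_events", durable=True)
    channel.basic_qos(prefetch_count=10)
    channel.basic_consume(queue="tracking_events", on_message_callback=on_message)
    channel.start_consuming()  # run under supervisord so crashes/deploys/restarts are handled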

  20. Data aggregation
    Aggregate your NoSQL data and save it into an easily accessible place (e.g. MySQL for reporting) - a sketch follows below.
    Run the aggregation quite often, based on map-reduce views (Couchbase/Riak) or search criteria (Mongo)
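    A sketch of the aggregation step, assuming aggregated per-channel counts are upserted into a MySQL reporting table via PyMySQL; the table name and columns are assumptions:

    import pymysql

    # Aggregated rows produced by the map-reduce / search-criteria step: (day, channel, clicks).
    rows = [("2016-10-02", 5, 812), ("2016-10-02", 7, 388)]

    connection = pymysql.connect(host="localhost", user="report", password="secret", db="reporting")
    try:
        with connection.cursor() as cursor:
            cursor.executemany(
                "INSERT INTO daily_clicks (day, channel, clicks) VALUES (%s, %s, %s) "
                "ON DUPLICATE KEY UPDATE clicks = VALUES(clicks)",  # idempotent, safe to re-run often
                rows,
            )
        connection.commit()
    finally:
        connection.close()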

  21. Our story
    ➔ N1QL
    Couchbase’s N1QL (SQL-like engine) result saved to the database every 5 minutes. N1QL doesn’t work very well for larger volumes of data
    ➔ Map-reduce
    Couchbase’s map-reduce view. Views are pre-fetched / indexed, meaning every change to the bucket triggers the view’s reindexing, hence they are eventually consistent. Thanks to a composite map-reduce key we are able to limit the view’s result to a given day
    ➔ Spark streaming
    Apache Spark streaming job. Feeds on data from the queue, aggregates it at a fixed interval and then saves the aggregated data to the database

  22. Stream processing

  23. Very scalable and efficient approach to data processing
    Allows you to run map-reduce jobs on batches of data arriving at defined intervals (see the sketch below)
    Source: the source can be a queue (RabbitMQ or Kafka, which is widely used and supported), a file resource, a TCP socket, a memory buffer, etc.
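    A minimal Spark Streaming sketch (using the Spark 1.x/2.x era Kafka direct stream API); the topic name, broker address and JSON payload shape are assumptions:

    import json
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="tracking-stream")
    ssc = StreamingContext(sc, batchDuration=60)  # micro-batches every 60 seconds

    # Direct stream from Kafka: each record is a (key, value) pair, the value being the JSON payload.
    stream = KafkaUtils.createDirectStream(ssc, ["tracking"], {"metadata.broker.list": "kafka:9092"})

    # Map-reduce over every micro-batch: count events per (event, channel).
    counts = (stream.map(lambda kv: json.loads(kv[1]))
                    .map(lambda doc: ((doc["event"], doc["channel"]), 1))
                    .reduceByKey(lambda a, b: a + b))
    counts.pprint()  # in production, write each batch to the reporting database instead

    ssc.start()
    ssc.awaitTermination()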

  24. There are multiple tools for stream processing.
    Apache Spark is one of the most widely adopted in the market.
    Spark: the power of Spark lies in its libraries. You can mix streaming, SQL and machine learning without any problems.

  25. Batch processing.

  26. Batch processing
    ➔ Why
    To process data for all kinds of reporting, analytics, machine learning models etc.
    ➔ How
    Using well-known tools like Hadoop or Apache Spark. Often it is more convenient to use a big data SQL implementation.
    ➔ What
    Data usually stored in “cold” storage, like HDFS or the Amazon S3 service, in the form of JSON or log files, usually gzipped (see the sketch below).
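    A sketch of such a batch job in PySpark, reading gzipped JSON logs from cold storage; the bucket path and document fields are assumptions (textFile decompresses .gz files transparently):

    import json
    from pyspark import SparkContext

    sc = SparkContext(appName="tracking-batch")

    # Gzipped JSON logs archived in cold storage (S3 or HDFS).
    lines = sc.textFile("s3a://my-bucket/tracking/2016/07/*.gz")

    # Parse each line and count clicks per (day, event) over millions of records.
    daily_counts = (lines.map(json.loads)
                         .map(lambda doc: ((doc["day"], doc["event"]), 1))
                         .reduceByKey(lambda a, b: a + b))

    for (day, event), count in daily_counts.collect():
        print(day, event, count)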

  27. SQL is not DEAD!
    Usually it’s way more convenient, readable and maintainable to create batch jobs using an SQL implementation
    Implementations: there are numerous so-called “SQL-on-Hadoop” engines like Hive, Apache Spark SQL, Presto, Apache Drill and HAWQ, all of them largely SQL-92 compatible

  28. Apache Spark provides an SQL engine
    - You can seamlessly mix SQL and other Spark code (including streaming or machine learning)
    - Variety of sources including Avro, Hive, Parquet, JSON etc.
    - Connectivity via JDBC or ODBC

  29. Spark SQL step by step
    Example running on the Databricks cloud platform
    Register data using dbutils.fs.mount (S3 mountpoint)
    Read and transform/filter data
    Run the SQL query and enjoy the result (Databricks allows you to use the %sql shorthand)

  30. Spark SQL step by step
    - Example running on the Databricks cloud platform
    - Register data using dbutils.fs.mount (S3 mountpoint)
    - Read and transform/filter data
    - Run the SQL query and enjoy the result (Databricks allows you to use the %sql shorthand)

    dbutils.fs.mount(
        "s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME),
        MOUNT_NAME
    )

    def get_value_from_json(line):
        import json
        document = json.loads(line.strip(','))
        return json.dumps(document['value'])

    tracking_rdd = sc.textFile(tracking_path) \
        .filter(lambda x: 'value' in x) \
        .map(get_value_from_json)
    tracking_json = sqlContext.jsonRDD(tracking_rdd)
    tracking_json.registerTempTable("tracking")

    %sql
    SELECT event, host, day, count
    FROM tracking
    WHERE
        year = 2016 AND
        month = 07 AND
        day BETWEEN 2 AND 5
    ORDER BY day, event;
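    The same query can also be issued straight from Python instead of a %sql cell, which makes it easy to mix with the rest of the job; a small sketch against the tracking table registered above:

    # Equivalent of the %sql cell, returned as a DataFrame.
    result = sqlContext.sql("""
        SELECT event, host, day, count
        FROM tracking
        WHERE year = 2016 AND month = 7 AND day BETWEEN 2 AND 5
        ORDER BY day, event
    """)
    result.show()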

  31. Lessons learned

  32. Stateless system =
    horizontally scalable

  33. Strip all boilerplate from your producers

  34. Segregate the responsibilities of servers; give each server a single responsibility

  35. Store on-line data in
    horizontally scalable
    storage

  36. Aggregate data for immediate on-line access

  37. Online storage is expensive; archive historical data to cold storage (S3 / HDFS)

  38. Offload as much as possible
    to background processors
    (queues or streaming jobs)

  39. Use Apache Spark notebooks (spark-notebook.io, Apache Zeppelin or Databricks) for interactive data analysis

  40. Thank you!
    Ask me anything:
    [email protected]
    twitter.com/sznapka
