BigData PHP

Mariusz Gil

October 27, 2013

Transcript

  1. bigdata.php
    Mariusz Gil

  2. PHP / Scalability and performance / Big Data

  3. PHP and memcached, advanced use-cases / PHPCon PL 2010
    Aspect oriented programming in PHP / PHPCon PL 2010

  4. 3V
    Volume, Velocity, Variety

  7. 95%
    of knowledge you already have

  8. 2004

  10. (K1, V1) → list(K2, V2)
    map step
    (K2, list(V2)) → list(K3, V3)
    reduce step

  11. Input, one line per split:
        php is good
        php is simply
        php is popular
    Map:
        php, 1   is, 1   good, 1
        php, 1   is, 1   simply, 1
        php, 1   is, 1   popular, 1
    Shuffle:
        good, 1
        is, 1   is, 1   is, 1
        php, 1   php, 1   php, 1
        simply, 1
        popular, 1
    Reduce and final result:
        php, 3
        is, 3
        good, 1
        simply, 1
        popular, 1
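
The map and reduce signatures from the previous slide can be tried out in plain PHP, without Hadoop. The following is a minimal in-memory sketch, not how any framework implements it; `mapReduce` and the two closures are hypothetical names introduced here for illustration.

```php
<?php
/**
 * In-memory model of map -> shuffle -> reduce.
 * $map:    (K1, V1)       -> list of [K2, V2] pairs
 * $reduce: (K2, list(V2)) -> list of [K3, V3] pairs
 */
function mapReduce(iterable $input, callable $map, callable $reduce): array
{
    // Map step, with the shuffle folded in: group every V2 under its K2
    $groups = [];
    foreach ($input as $k1 => $v1) {
        foreach ($map($k1, $v1) as [$k2, $v2]) {
            $groups[$k2][] = $v2;
        }
    }
    // Frameworks hand keys to reducers in sorted order; mimic that here
    ksort($groups);

    // Reduce step: one call per intermediate key
    $output = [];
    foreach ($groups as $k2 => $values) {
        foreach ($reduce($k2, $values) as [$k3, $v3]) {
            $output[$k3] = $v3;
        }
    }
    return $output;
}

$lines = ['php is good', 'php is simply', 'php is popular'];

$counts = mapReduce(
    $lines,
    // Map: emit (word, 1) for every word in the line
    function ($lineNumber, $line) {
        $pairs = [];
        foreach (explode(' ', $line) as $word) {
            $pairs[] = [$word, 1];
        }
        return $pairs;
    },
    // Reduce: sum the ones collected for a word
    function ($word, array $ones) {
        return [[$word, array_sum($ones)]];
    }
);
```

After the call, `$counts` holds the same totals as the slide: `php` and `is` map to 3, and `good`, `simply` and `popular` map to 1.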

  12. HDFS + YARN + MapReduce

  13. Java oriented
    But with support for external programs via the Streaming API

  14. NodeManager
    YARNChild
    MapTask
    ReduceTask
    node manager node
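    // Mapper for Hadoop Streaming: read lines from STDIN, emit "word<TAB>1"
    while (($line = fgets(STDIN)) !== false) {
        // Split on any whitespace and skip empty tokens
        $words = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            // Streaming treats the text before the first tab as the key
            echo $word . "\t" . 1 . PHP_EOL;
        }
    }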
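
The slide shows only the mapper. A matching reducer might look like the sketch below; it is an assumption about the missing half, not code from the deck. It expects one word and one count per line, separated by a tab or a space, and `reduceWordCounts` is a hypothetical name.

```php
<?php
/**
 * Sum the counts per word from mapper output lines such as "php<TAB>1".
 * Lines that do not end in a whitespace-separated integer are skipped.
 */
function reduceWordCounts(iterable $lines): array
{
    $totals = [];
    foreach ($lines as $line) {
        $line = rtrim($line, "\r\n");
        // Key is everything before the last whitespace run, value is the count
        if (!preg_match('/^(.*\S)\s+(\d+)$/', $line, $m)) {
            continue;
        }
        $totals[$m[1]] = ($totals[$m[1]] ?? 0) + (int) $m[2];
    }
    return $totals;
}
```

In a streaming job the function would be fed lines read with `fgets(STDIN)` and the totals echoed one per line. Holding all totals in an array keeps the sketch short; because Hadoop Streaming delivers lines sorted by key, a production reducer could instead emit each total as soon as the key changes.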

  15. MongoDB

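  16. $mongo = new MongoClient();
    // $app is assumed to be an application container (e.g. Silex)
    $app['mongo'] = $mongo->selectDB('db');
    // Map: emit one (key, value) pair per document
    $map = new MongoCode('function() {
        emit(this.key, this.value);
    }');
    // Reduce: sum all values emitted for the same key
    $reduce = new MongoCode('function(key, values) {
        return Array.sum(values);
    }');
    // Run map-reduce on "collection" and return the results inline
    $result = $app['mongo']->command(array(
        'mapreduce' => 'collection',
        'map' => $map,
        'reduce' => $reduce,
        'out' => array(
            'inline' => 1,
        ),
    ));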
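
With inline output, the command reply carries an `ok` flag and a `results` list of `{_id, value}` documents. A small sketch for turning that reply into a plain key-to-value array follows; `inlineResultsToMap` is a hypothetical helper, and it assumes every `_id` is a scalar that can serve as a PHP array key.

```php
<?php
/**
 * Convert an inline mapreduce command reply into [key => reduced value].
 * Assumes scalar _id values; throws if the server reported a failure.
 */
function inlineResultsToMap(array $commandResult): array
{
    if (empty($commandResult['ok'])) {
        $reason = $commandResult['errmsg'] ?? 'unknown error';
        throw new RuntimeException('mapreduce command failed: ' . $reason);
    }
    $map = [];
    foreach ($commandResult['results'] ?? [] as $row) {
        $map[$row['_id']] = $row['value'];
    }
    return $map;
}
```

Called as `inlineResultsToMap($result)` on the reply from the slide, it would give one entry per emitted key.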

  17. Apache Zookeeper
    Apache HBase
    Apache Hive
    Apache Oozie
    Apache Pig
    Apache Avro
    Apache Ambari
    Apache Chukwa
    Apache Flume
    Scribe (Facebook)
    Apache Whirr
    Apache Mahout
    Apache Sqoop

  19. region servers
    HDFS nodes
    php

  20. // Classes come from the Thrift PHP library and the generated HBase bindings
    $socket = new TSocket('localhost', 9090);
    $socket->setSendTimeout(2000); // milliseconds
    $socket->setRecvTimeout(4000); // milliseconds
    $transport = new TBufferedTransport($socket);
    $protocol = new TBinaryProtocol($transport);
    $client = new HbaseClient($protocol);
    $transport->open();
    $table = 'test';
    // Column family metadata for the table
    $descriptors = $client->getColumnDescriptors($table);
    // Fetch the row stored under the key "php"
    $result = $client->getRow($table, "php");
    foreach ($descriptors as $col) {
        echo "Column: {$col->name}, maxVer: {$col->maxVersions}" . PHP_EOL;
    }
    $transport->close();

  22. CREATE TABLE page_views (
        user_id INT,
        page_id INT,
        date DATE,
        user_agent STRING
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    SELECT page_views.*
    FROM page_views
    WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31';
    SELECT page_views.*
    FROM page_views
    JOIN dim_users ON (page_views.user_id = dim_users.id)
    WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31';
    SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10;

  23. CREATE TABLE www_logs_raw (
        line STRING
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    CREATE TABLE www_logs (
        ip STRING,
        method STRING,
        url STRING,
        http_code SMALLINT,
        referrer STRING
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    ADD FILE www_logs_mapper.php;
    INSERT OVERWRITE TABLE www_logs
    SELECT
        TRANSFORM (line)
        USING 'php www_logs_mapper.php'
        AS (ip, method, url, http_code, referrer)
    FROM www_logs_raw;
    SELECT user_agent, COUNT(*)
    FROM www_logs
    GROUP BY user_agent;
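
The deck does not show `www_logs_mapper.php`. Hive's TRANSFORM hands each input row to the script on STDIN and reads tab-separated columns back from STDOUT, so a mapper could be sketched as below. The Apache combined log format is an assumption made here (the slides do not say what the raw lines look like), and `parseLogLine` and `transformLines` are hypothetical names.

```php
<?php
/**
 * Parse one access-log line (assumed: Apache combined log format) into
 * [ip, method, url, http_code, referrer], or null if it does not match.
 */
function parseLogLine(string $line): ?array
{
    $pattern = '/^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3}) \S+(?: "([^"]*)")?/';
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    // The referrer is optional in the pattern, so default it to an empty string
    return [$m[1], $m[2], $m[3], (int) $m[4], $m[5] ?? ''];
}

/**
 * Turn raw log lines into tab-separated rows in the column order of the
 * AS (...) clause; lines that fail to parse are dropped.
 */
function transformLines(iterable $lines): array
{
    $rows = [];
    foreach ($lines as $line) {
        $fields = parseLogLine(rtrim($line, "\r\n"));
        if ($fields !== null) {
            $rows[] = implode("\t", $fields);
        }
    }
    return $rows;
}
```

The actual script would loop over `fgets(STDIN)` and echo each row followed by `PHP_EOL`. Dropping unparseable lines is one possible policy; emitting them with empty columns would be another.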

  24. mapinka-reducinka.pl

  27. Storm
    Free and open source
    Distributed system
    Realtime processing
    Language agnostic

  28. Topology: TextSpout → [sentence] → SplitSentenceBolt → [word] → WordCountBolt → [word, count]
    Second diagram: TextSpout → [sentence] → SplitSentenceBolt and xyzBolt, with components labeled php

  29. class RandomSentenceSpout extends ShellSpout {
        protected $sentences = array(
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away",
        );
        protected function nextTuple() {
            // Pause 0.1 s; sleep() only accepts whole seconds
            usleep(100000);
            $sentence = $this->sentences[mt_rand(0, count($this->sentences) - 1)];
            $this->emit(array($sentence));
        }
        // Nothing to do on ack or fail in this example
        protected function ack($tuple_id) {
            return;
        }
        protected function fail($tuple_id) {
            return;
        }
    }
    $SentenceSpout = new RandomSentenceSpout();
    $SentenceSpout->run();

  30. class SplitSentenceBolt extends BasicBolt {
        public function process(Tuple $tuple) {
            $words = explode(" ", $tuple->values[0]);
            foreach ($words as $word) {
                $this->emit(array($word));
            }
        }
    }
    $splitsentence = new SplitSentenceBolt();
    $splitsentence->run();
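
Storm is language agnostic because non-JVM components talk to it over STDIN and STDOUT using JSON messages, each terminated by a line containing `end`. The base classes on these slides hide that exchange. The sketch below shows roughly what a sentence-splitting bolt sends for one incoming tuple; it is written from my reading of the multilang protocol and should be checked against the Storm documentation, and `frameMessage` and `splitSentenceMessages` are hypothetical names.

```php
<?php
/** Frame one message: JSON body followed by a line containing "end". */
function frameMessage(array $message): string
{
    return json_encode($message) . "\nend\n";
}

/**
 * Build the messages for one input tuple: one "emit" per word, anchored
 * to the input tuple id, then an "ack" for the input tuple itself.
 */
function splitSentenceMessages(array $input): array
{
    $messages = [];
    $words = preg_split('/\s+/', trim($input['tuple'][0]), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $messages[] = [
            'command' => 'emit',
            'anchors' => [$input['id']],
            'tuple'   => [$word],
        ];
    }
    $messages[] = ['command' => 'ack', 'id' => $input['id']];
    return $messages;
}
```

A bolt process would decode each incoming message with `json_decode`, pass it to `splitSentenceMessages`, and write every result through `frameMessage` to STDOUT.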

  33. http://hadoop.apache.org/
    http://hive.apache.org/
    http://hbase.apache.org/
    http://mahout.apache.org/
    http://zookeeper.apache.org/
    http://www.mongodb.org/
    http://storm-project.net/
    http://incubator.apache.org/drill/
    http://www.bigdatafestival.co/

  36. THANKS!
    joind.in/9775
    @mariuszgil
