Hadoop & PHP - PHP South Africa

Michael C.
September 27, 2018

Transcript

  1. HADOOP & PHP
    PHP CRAFT SOUTH AFRICA 2018
    @MICHAELCULLUMUK

  3. @MICHAELCULLUMUK
    ME?

  4. MICHAEL CULLUM
    @MICHAELCULLUMUK

  5. @MICHAELCULLUMUK

  6. @MICHAELCULLUMUK
    APACHE HADOOP
    SUITE OF APPLICATIONS

  7. @MICHAELCULLUMUK
    HADOOP

  8. @MICHAELCULLUMUK
    BIG DATA

  9. @MICHAELCULLUMUK

  10. @MICHAELCULLUMUK
    HADOOP
    2003: GOOGLE FILE SYSTEM

    2004: MAPREDUCE: SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS

    2006: HADOOP DEVELOPMENT BEGINS (AND NAMED AFTER A TOY ELEPHANT)

  11. @MICHAELCULLUMUK
    HADOOP
    2006: 1.8 TB IN 47.9 HOURS

    2007: 1PB IN 12.13 HOURS, 1.37 TB/MIN

    2008: 1PB IN 6.03 HOURS, 2.76 TB/MIN

    2010: 1PB IN 2.95 HOURS, 5.65 TB/MIN

    2011: 1PB IN 0.55 HOURS, 30.3 TB/MIN

    2012: 50PB IN 23 HOURS, 36.2 TB/MIN

    2014: SPARK SORTS 5 TIMES AS FAST ON 10 TIMES FEWER NODES
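
The TB/MIN figures on this slide follow directly from the data size and the elapsed time. A minimal sketch of that arithmetic, assuming decimal units (1 PB = 1000 TB); the function name `sortThroughputTbPerMin` is made up for illustration:

```php
<?php

/**
 * Sort throughput in TB per minute.
 * Assumes decimal units, so 1 PB is passed in as 1000 TB.
 */
function sortThroughputTbPerMin(float $terabytes, float $hours): float
{
    if ($hours <= 0) {
        throw new InvalidArgumentException('Duration must be positive.');
    }

    return $terabytes / ($hours * 60);
}

// 1 PB in 12.13 hours and 50 PB in 23 hours, as quoted on the slide
printf("%.2f TB/min\n", sortThroughputTbPerMin(1000, 12.13));
printf("%.1f TB/min\n", sortThroughputTbPerMin(50000, 23));
```

Run against the other rows, the same calculation reproduces the rates quoted for 2008, 2010 and 2011 as well.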

  12. HADOOP COMPONENTS
    Hadoop Core
    Hadoop YARN
    Hadoop HDFS
    Hadoop MapReduce

  13. @MICHAELCULLUMUK
    Hadoop YARN: resource management
    Hadoop MapReduce: method of executing large tasks

  14. OTHER TOOLS
    ▸ Hive - Relational-style database
    ▸ Bigtop - Quickly set up a test cluster
    ▸ Pig - High-level programming language for MapReduce jobs
    ▸ Sqoop - For importing from/reading MySQL and other RDBMSs
    ▸ Spark - Alternative to MapReduce designed for fast analytics
    ▸ Flume - Streaming data collection / aggregation manager
    ▸ Oozie - MapReduce workflow manager & scheduler
    ▸ Whirr - Deployment of clusters to AWS
    ▸ HBase - Low-latency, distributed, non-relational database
    ▸ ZooKeeper - Distributed application HA management
    ▸ HCatalog - Interop between Pig and Hive

  15. @MICHAELCULLUMUK
    CONTENTS
    ▸ Hadoop HDFS: How data is stored, accessed and distributed under the hood
    ▸ Hive: Using Hadoop as an RDBMS; writing
    ▸ Presto: A distributed SQL engine from Facebook that we can use to query the cluster; reading
    ▸ Phresto: A library to make Presto accessible from PHP userland
    ▸ Phresto & Doctrine

  16. @MICHAELCULLUMUK
    HADOOP HDFS

  17. @MICHAELCULLUMUK
    ARCHITECTURE
    Namenode
    Datanodes

  18. @MICHAELCULLUMUK
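    READING A FILE FROM HDFS
    ▸ Client Library asks the Namenode where the blocks of file.txt live
    ▸ Namenode answers with the block locations:
      1: 192.168.0.2, 192.168.0.5
      2: 192.168.0.3, 192.168.0.12
    ▸ Client Library -> Datanode 192.168.0.5: Read block 1 -> {content}
    ▸ Client Library -> Datanode 192.168.0.3: Read block 2 -> {content}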
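
The read path on this slide can be mimicked with plain arrays. The sketch below is an in-memory illustration only, not the HDFS client API: `$blockMap`, `$datanodes`, `readFromCluster` and the block contents are all hypothetical, and a real client picks the closest replica rather than the first one that answers.

```php
<?php

// Hypothetical stand-in for the Namenode: file -> block id -> replica addresses
$blockMap = [
    'file.txt' => [
        1 => ['192.168.0.2', '192.168.0.5'],
        2 => ['192.168.0.3', '192.168.0.12'],
    ],
];

// Hypothetical stand-in for the Datanodes: address -> block id -> content
$datanodes = [
    '192.168.0.2'  => [1 => 'Hello, '],
    '192.168.0.5'  => [1 => 'Hello, '],
    '192.168.0.3'  => [2 => 'HDFS!'],
    '192.168.0.12' => [2 => 'HDFS!'],
];

/**
 * Looks up where each block of the file lives, then fetches every block
 * from the first replica that holds it and concatenates the results.
 */
function readFromCluster(string $path, array $blockMap, array $datanodes): string
{
    if (!isset($blockMap[$path])) {
        throw new RuntimeException("Unknown file: $path");
    }

    $content = '';
    foreach ($blockMap[$path] as $blockId => $replicas) {
        $block = null;
        foreach ($replicas as $address) {
            if (isset($datanodes[$address][$blockId])) {
                $block = $datanodes[$address][$blockId];
                break;
            }
        }
        if ($block === null) {
            throw new RuntimeException("No replica available for block $blockId");
        }
        $content .= $block;
    }

    return $content;
}

echo readFromCluster('file.txt', $blockMap, $datanodes), "\n";
```

Because every block is listed with more than one address, removing a single Datanode from `$datanodes` still lets the read succeed from the other replica.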

  19. @MICHAELCULLUMUK
    CONTENTS
    ▸ ✓ Hadoop HDFS: How data is stored, accessed and distributed under the hood
    ▸ Hive: Using Hadoop as an RDBMS; importing files
    ▸ Presto: A distributed SQL engine from Facebook that we can use to query the cluster
    ▸ Phresto: A library to make Presto accessible from PHP userland
    ▸ Phresto & Doctrine

  20. @MICHAELCULLUMUK
    HIVE

  21. @MICHAELCULLUMUK
    METASTORE

  22. @MICHAELCULLUMUK
    TSV -> ORC

  23. IMPORTING TO HIVE IS EASY
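    # Reads one file name per line from standard input
    while IFS= read -r filename; do
        echo "$filename"
        # Copy the local file into HDFS
        hadoop fs -put "/home/michael/for_hadoop_import/$filename" /tmp/
        # Load into the staging table, copy into the ORC table, empty the staging table
        echo "LOAD DATA INPATH '/tmp/$filename'
            INTO TABLE temp_csv;
            INSERT INTO TABLE temp_orc
            SELECT * FROM temp_csv;
            TRUNCATE TABLE temp_csv;" | hive
    done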
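
If the import were driven from PHP rather than a shell loop, the same three statements could be generated per file. This is a sketch under that assumption: `buildImportStatements` and the sample file name are hypothetical, the table names are the ones on the slide, and the file name check is an added precaution because the name ends up inside a quoted path.

```php
<?php

/**
 * Builds the HiveQL batch for one staged file: load it into the staging
 * table, copy the rows into the ORC table, then empty the staging table.
 */
function buildImportStatements(
    string $filename,
    string $stagingTable = 'temp_csv',
    string $orcTable = 'temp_orc'
): string {
    // Accept plain file names only, so the value is safe inside the quoted path
    if (!preg_match('/^[A-Za-z0-9._-]+\z/', $filename)) {
        throw new InvalidArgumentException("Unexpected file name: $filename");
    }

    return implode("\n", [
        "LOAD DATA INPATH '/tmp/$filename' INTO TABLE $stagingTable;",
        "INSERT INTO TABLE $orcTable SELECT * FROM $stagingTable;",
        "TRUNCATE TABLE $stagingTable;",
    ]);
}

// Hypothetical file name, for illustration
echo buildImportStatements('results_2018-09-27.tsv'), "\n";
```

The resulting string can be piped to `hive` exactly as the shell version does.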

  24. @MICHAELCULLUMUK
    LIVE DEMO
    (IF WIFI WORKS)

  25. @MICHAELCULLUMUK
    CONTENTS
    ▸ ✓ Hadoop HDFS: How data is stored, accessed and distributed under the hood
    ▸ ✓ Hive: Using Hadoop as an RDBMS; importing files
    ▸ Presto: A distributed SQL engine from Facebook that we can use to query the cluster
    ▸ Phresto: A library to make Presto accessible from PHP userland
    ▸ Phresto & Doctrine

  26. @MICHAELCULLUMUK
    PRESTO

  27. @MICHAELCULLUMUK
    “DISTRIBUTED SQL ENGINE”

  28. @MICHAELCULLUMUK
    CONNECTORS
    ▸ Hive
    ▸ MySQL
    ▸ Cassandra
    ▸ MongoDB
    ▸ PostgreSQL
    ▸ Redis
    ▸ SQL Server
    ▸ JMX
    ▸ REST API
    ▸ Local files
    ▸ Memory

  29. @MICHAELCULLUMUK
    Your (PHP) application -> Presto Coordinator -> Presto Workers
    HDFS
    Hive Metastore

  30. @MICHAELCULLUMUK
    IT HAS A UI TOO ;)

  31. @MICHAELCULLUMUK
    LIVE DEMO #2

  32. @MICHAELCULLUMUK
    CONTENTS
    ▸ ✓ Hadoop HDFS: How data is stored, accessed and distributed under the hood
    ▸ ✓ Hive: Using Hadoop as an RDBMS; importing files
    ▸ ✓ Presto: A distributed SQL engine from Facebook that we can use to query the cluster
    ▸ Phresto: A library to make Presto accessible from PHP userland
    ▸ Phresto & Doctrine

  33. @MICHAELCULLUMUK
    PHRESTO

  34. @MICHAELCULLUMUK
    PHP

  35. @MICHAELCULLUMUK
    REST API
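
Presto's coordinator accepts a query as an HTTP POST to `/v1/statement` and returns JSON pages, each pointing at the next one through `nextUri` until the query finishes. The sketch below shows only that paging loop and is not necessarily how Phresto implements it: `runPrestoQuery` and the injected `$send` transport are hypothetical, and the canned pages stand in for a real coordinator.

```php
<?php

/**
 * Runs a query against a Presto-style statement endpoint and collects all rows.
 *
 * $send is a hypothetical transport with the shape
 * fn(string $method, string $url, array $headers, ?string $body): array
 * that returns the decoded JSON response.
 */
function runPrestoQuery(
    callable $send,
    string $baseUrl,
    string $sql,
    string $user,
    string $catalog,
    string $schema
): array {
    $headers = [
        'X-Presto-User'    => $user,
        'X-Presto-Catalog' => $catalog,
        'X-Presto-Schema'  => $schema,
    ];

    // The query text is sent as the request body
    $response = $send('POST', rtrim($baseUrl, '/') . '/v1/statement', $headers, $sql);

    $rows = [];
    while (true) {
        if (isset($response['error'])) {
            throw new RuntimeException($response['error']['message'] ?? 'Query failed');
        }
        // A page may or may not carry a batch of rows
        foreach ($response['data'] ?? [] as $row) {
            $rows[] = $row;
        }
        // No nextUri means the query has finished
        if (!isset($response['nextUri'])) {
            break;
        }
        $response = $send('GET', $response['nextUri'], $headers, null);
    }

    return $rows;
}

// Canned responses standing in for a real coordinator
$pages = [
    ['nextUri' => 'http://coordinator.hostname.com:8080/v1/statement/q1/1'],
    ['data' => [[1, 'a']], 'nextUri' => 'http://coordinator.hostname.com:8080/v1/statement/q1/2'],
    ['data' => [[2, 'b']]],
];
$calls = [];
$fakeSend = function (string $method, string $url, array $headers, ?string $body) use (&$pages, &$calls): array {
    $calls[] = "$method $url";
    return array_shift($pages);
};

$rows = runPrestoQuery(
    $fakeSend,
    'http://coordinator.hostname.com:8080',
    'SELECT 1',
    'Michael',
    'hive',
    'database_name'
);
var_dump($rows);
```

Passing the transport in as a callable keeps the loop testable without a cluster; in a real application it would wrap whichever HTTP client is already in use.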

  36. @MICHAELCULLUMUK

  37. @MICHAELCULLUMUK
    SIMPLE PHP CLIENT
    use Psr\Log\NullLogger; // assuming the PSR-3 NullLogger

    $socket = new \SamKnows\Phresto\Client\RemoteHost('http', 'coordinator.hostname.com', '8080');
    $connection = new \SamKnows\Phresto\Client\HttpConnection($socket, new NullLogger(),
        'Michael\'s Macbook', 'Michael', '', 0, 0, 0, false);
    $result = $connection->executeQuery('SELECT * FROM table WHERE id=5', 'hive', 'database_name');
    $resultClass = $result->getResult();

  38. @MICHAELCULLUMUK
    TAP INTO PRESTO FUNCTIONALITY TOO
    $connection->getNodeList();
    $connection->getClusterStatus();
    $connection->getServerInfo();
    $connection->getQueriesStatus();
    $connection->getQueryStatus('20170508_163921_00807_v2znp');

  39. @MICHAELCULLUMUK
    SAMKNOWS/PHRESTO

  40. @MICHAELCULLUMUK
    CONTENTS
    ▸ ✓ Hadoop HDFS: How data is stored, accessed and distributed under the hood
    ▸ ✓ Hive: Using Hadoop as an RDBMS; importing files
    ▸ ✓ Presto: A distributed SQL engine from Facebook that we can use to query the cluster
    ▸ ✓ Phresto: A library to make Presto accessible from PHP userland
    ▸ Phresto & Doctrine

  41. @MICHAELCULLUMUK
    DOCTRINE & SYMFONY

  42. @MICHAELCULLUMUK
    DOCTRINE DBAL / PDO-STYLE INTERFACE
    use Doctrine\DBAL\Configuration;
    use Doctrine\DBAL\DriverManager;
    use SamKnows\Phresto\Doctrine\DBAL\Driver\Phresto\Driver;

    $configuration = new Configuration();
    $params = [
        'host' => 'coordinator.hostname.com',
        'port' => '8080',
        'user' => 'Michael',
        'password' => '',
        'source' => __FILE__,
        'protocol' => 'http',
        'catalog' => 'hive',
        'schema' => 'database_name',
        'driverClass' => Driver::class,
    ];
    $connection = DriverManager::getConnection($params, $configuration);
    $result = $connection->executeQuery('SELECT * FROM messages LIMIT 1');
    // fetchAll() returns every row, so iterate over its result
    foreach ($result->fetchAll() as $row) {
        var_dump($row);
    }

  43. @MICHAELCULLUMUK
    SYMFONY
    # Doctrine Configuration
    doctrine:
        dbal:
            default_connection: presto
            connections:
                presto:
                    driver_class: '\SamKnows\Phresto\Doctrine\DBAL\Driver\Phresto\Driver'
                    # Use parameters for these configuration values
                    host: 'coordinator.hostname.com'
                    port: '8080'
                    user: 'Michael'
                    password: 'password'
                    charset: UTF8
                    options:
                        source: 'Michaels MBP'
                        catalog: 'hive'
                        schema: 'database_name'
                        protocol: 'http'

  44. @MICHAELCULLUMUK
    REPOSITORY
    use Doctrine\DBAL\Connection;

    class AnalyticsRepository
    {
        private $connection;

        public function __construct(Connection $connection)
        {
            $this->connection = $connection;
        }

        public function getAveragesRttByPool(): array
        {
            // table_name is a placeholder for the table that holds rtt and pool
            return $this->connection
                ->query('SELECT avg(rtt), pool FROM table_name GROUP BY pool')
                ->fetchAll();
        }
    }

  45. @MICHAELCULLUMUK
    CONTENTS
    ▸ ✓ Hadoop HDFS: How data is stored, accessed and distributed under the hood
    ▸ ✓ Hive: Using Hadoop as an RDBMS; importing files
    ▸ ✓ Presto: A distributed SQL engine from Facebook that we can use to query the cluster
    ▸ ✓ Phresto: A library to make Presto accessible from PHP userland
    ▸ ✓ Phresto & Doctrine

  46. @MICHAELCULLUMUK
    ANY QUESTIONS?

  47. THANKS
    @MICHAELCULLUMUK
    DANKIE
    NGIYABONGA
    NGIYATHOKOZA
    ENKOSI
    KE A LEBOGA
    KE A LEBOHA
    NDI A LIVHUWA
    NDZA KHENSA

  48. HADOOP & PHP
    PHP CRAFT SOUTH AFRICA 2018
    @MICHAELCULLUMUK
