Hadoop & PHP - Speaker Deck

Slide 1

Slide 1 text

HADOOP & PHP PHPNW 2017 @MICHAELCULLUMUK

Slide 3

Slide 3 text

@MICHAELCULLUMUK ME?

Slide 4

Slide 4 text

MICHAEL CULLUM @MICHAELCULLUMUK

Slide 5

Slide 5 text

@MICHAELCULLUMUK

Slide 6

Slide 6 text

@MICHAELCULLUMUK APACHE HADOOP SUITE OF APPLICATIONS

Slide 7

Slide 7 text

@MICHAELCULLUMUK HADOOP

Slide 8

Slide 8 text

@MICHAELCULLUMUK BIG DATA

Slide 9

Slide 9 text

@MICHAELCULLUMUK

Slide 10

Slide 10 text

@MICHAELCULLUMUK HADOOP
▸ 2003: GOOGLE FILE SYSTEM
▸ 2004: MAPREDUCE: SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS
▸ 2006: HADOOP DEVELOPMENT BEGINS (NAMED AFTER A TOY ELEPHANT)

Slide 11

Slide 11 text

@MICHAELCULLUMUK HADOOP
▸ 2006: 1.8 TB IN 47.9 HOURS
▸ 2007: 1PB IN 12.13 HOURS, 1.37 TB/MIN
▸ 2008: 1PB IN 6.03 HOURS, 2.76 TB/MIN
▸ 2010: 1PB IN 2.95 HOURS, 5.65 TB/MIN
▸ 2011: 1PB IN 0.55 HOURS, 30.3 TB/MIN
▸ 2012: 50PB IN 23 HOURS, 36.2 TB/MIN
▸ 2014: SPARK ALLOWED - SORT 5 TIMES AS FAST, 10 TIMES FEWER NODES
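
The TB/MIN figures on this slide follow from the data size and the elapsed time. A minimal sketch of that arithmetic, assuming 1 PB = 1000 TB; the function name is illustrative and not from the talk:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: sort throughput in TB per minute, rounded to two decimals.
function terabytesPerMinute(float $terabytes, float $hours): float
{
    return round($terabytes / ($hours * 60), 2);
}

// year => [data sorted in TB, elapsed hours], taken from the slide (1 PB assumed to be 1000 TB).
$runs = [
    2007 => [1000.0, 12.13],
    2008 => [1000.0, 6.03],
    2010 => [1000.0, 2.95],
    2011 => [1000.0, 0.55],
    2012 => [50000.0, 23.0],
];

foreach ($runs as $year => [$terabytes, $hours]) {
    printf("%d: %.2f TB/min\n", $year, terabytesPerMinute($terabytes, $hours));
}
```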

Slide 12

Slide 12 text

HADOOP COMPONENTS
▸ Hadoop Core
▸ Hadoop YARN
▸ Hadoop HDFS
▸ Hadoop Map-Reduce

Slide 13

Slide 13 text

@MICHAELCULLUMUK
▸ Hadoop YARN: Resource management
▸ Hadoop Map-Reduce: Method of executing large tasks

Slide 14

Slide 14 text

OTHER TOOLS
▸ Hive - Relational-style database
▸ Bigtop - Quickly set up a test cluster
▸ Pig - High-level programming language for MapReduce jobs
▸ Sqoop - For importing from/reading MySQL and other RDBMSs
▸ Spark - Alternative to MapReduce designed for fast analytics
▸ Flume - Streaming data collection / aggregation manager
▸ Oozie - MapReduce Workflow Manager & Scheduler
▸ Whirr - Deployment of clusters to AWS
▸ HBase - Low-latency distributed, non-relational database
▸ Zookeeper - Distributed application HA management
▸ HCatalog - Interop between Pig and Hive

Slide 15

Slide 15 text

@MICHAELCULLUMUK CONTENTS
▸ Hadoop HDFS: How data is stored, accessed and distributed under the hood
▸ Hive: Using Hadoop as an RDBMS; writing
▸ Presto: A Facebook library we can use to query the cluster; reading
▸ Phresto: A library to make Presto accessible from PHP userland
▸ Phresto & Doctrine

Slide 16

Slide 16 text

@MICHAELCULLUMUK HADOOP HDFS

Slide 17

Slide 17 text

@MICHAELCULLUMUK ARCHITECTURE
▸ Namenode
▸ Datanodes

Slide 18

Slide 18 text

@MICHAELCULLUMUK READING A FILE FROM HDFS
▸ Namenode: file.txt, block 1 on 192.168.0.2 and 192.168.0.5; block 2 on 192.168.0.3 and 192.168.0.12
▸ Client library to Datanode 192.168.0.5: read block 1, returns {content}
▸ Client library to Datanode 192.168.0.3: read block 2, returns {content}
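
The read path on this slide can be sketched in PHP as a lookup followed by per-block reads. This is a simulation under stated assumptions, not the HDFS client API: the arrays stand in for the Namenode and Datanodes, the block contents are made up, and the function names are hypothetical. It tries replicas in the order the Namenode lists them and falls back to the next one if a Datanode is missing.

```php
<?php
declare(strict_types=1);

// Hypothetical in-memory stand-ins; a real client talks to the cluster over the network.
// Namenode metadata: file => [block id => Datanodes holding a replica of that block].
$namenode = [
    'file.txt' => [
        1 => ['192.168.0.2', '192.168.0.5'],
        2 => ['192.168.0.3', '192.168.0.12'],
    ],
];

// Datanode storage: address => [block id => bytes held on that node] (made-up contents).
$datanodes = [
    '192.168.0.2'  => [1 => 'Hello, '],
    '192.168.0.5'  => [1 => 'Hello, '],
    '192.168.0.3'  => [2 => 'HDFS!'],
    '192.168.0.12' => [2 => 'HDFS!'],
];

// Read one block from the first listed replica that actually holds it.
function readBlock(int $blockId, array $replicas, array $datanodes): string
{
    foreach ($replicas as $address) {
        if (isset($datanodes[$address][$blockId])) {
            return $datanodes[$address][$blockId];
        }
    }
    throw new RuntimeException("No live replica for block $blockId");
}

// Ask the Namenode where the blocks live, then fetch each block in order and join them.
function readFileFromCluster(string $path, array $namenode, array $datanodes): string
{
    if (!isset($namenode[$path])) {
        throw new RuntimeException("Unknown file: $path");
    }
    $content = '';
    foreach ($namenode[$path] as $blockId => $replicas) {
        $content .= readBlock($blockId, $replicas, $datanodes);
    }
    return $content;
}

echo readFileFromCluster('file.txt', $namenode, $datanodes), PHP_EOL;
```

The Namenode only hands out block locations in this sketch; the file contents come from the Datanodes.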

Slide 19

Slide 19 text

@MICHAELCULLUMUK CONTENTS
▸ ✓ Hadoop HDFS: How data is stored, accessed and distributed under the hood
▸ Hive: Using Hadoop as an RDBMS; importing files
▸ Presto: A Facebook library we can use to query the cluster
▸ Phresto: A library to make Presto accessible from PHP userland
▸ Phresto & Doctrine

Slide 20

Slide 20 text

@MICHAELCULLUMUK HIVE

Slide 21

Slide 21 text

@MICHAELCULLUMUK METASTORE

Slide 22

Slide 22 text

@MICHAELCULLUMUK IMPORTING OF FILES

Slide 23

Slide 23 text

@MICHAELCULLUMUK TSV -> ORC

Slide 24

Slide 24 text

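IMPORTING TO HIVE IS EASY
# Reads one filename per line from stdin, uploads each file to HDFS, then loads it into Hive.
while read -r filename; do
    echo "$filename"
    hadoop fs -put "/home/michael/for_hadoop_import/$filename" /tmp/
    # Load the file into the staging table, copy it into the ORC table, then empty the staging table.
    echo "LOAD DATA INPATH '/tmp/$filename' INTO TABLE temp_csv;
          INSERT INTO TABLE temp_orc SELECT * FROM temp_csv;
          TRUNCATE TABLE temp_csv;" | hive
done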

Slide 25

Slide 25 text

@MICHAELCULLUMUK LIVE DEMO (IF WIFI WORKS)

Slide 26

Slide 26 text

@MICHAELCULLUMUK CONTENTS
▸ ✓ Hadoop HDFS: How data is stored, accessed and distributed under the hood
▸ ✓ Hive: Using Hadoop as an RDBMS; importing files
▸ Presto: A Facebook library we can use to query the cluster
▸ Phresto: A library to make Presto accessible from PHP userland
▸ Phresto & Doctrine

Slide 27

Slide 27 text

@MICHAELCULLUMUK PRESTO

Slide 28

Slide 28 text

@MICHAELCULLUMUK “DISTRIBUTED SQL ENGINE”

Slide 29

Slide 29 text

@MICHAELCULLUMUK CONNECTORS
▸ Hive
▸ MySQL
▸ Cassandra
▸ MongoDB
▸ PostgreSQL
▸ Redis
▸ SQL Server
▸ JMX
▸ REST API
▸ Local files
▸ Memory

Slide 30

Slide 30 text

@MICHAELCULLUMUK
▸ Your (PHP) application
▸ Presto Coordinator
▸ Presto Workers
▸ HDFS
▸ Hive Metastore

Slide 31

Slide 31 text

@MICHAELCULLUMUK IT HAS A UI TOO ;)

Slide 32

Slide 32 text

@MICHAELCULLUMUK LIVE DEMO #2

Slide 33

Slide 33 text

@MICHAELCULLUMUK CONTENTS
▸ ✓ Hadoop HDFS: How data is stored, accessed and distributed under the hood
▸ ✓ Hive: Using Hadoop as an RDBMS; importing files
▸ ✓ Presto: A Facebook library we can use to query the cluster
▸ Phresto: A library to make Presto accessible from PHP userland
▸ Phresto & Doctrine

Slide 34

Slide 34 text

@MICHAELCULLUMUK PHRESTO

Slide 35

Slide 35 text

@MICHAELCULLUMUK PHP

Slide 36

Slide 36 text

@MICHAELCULLUMUK REST API
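
Presto's coordinator accepts queries over HTTP: the client POSTs the SQL text to /v1/statement and then keeps requesting the nextUri from each response until none is returned, collecting any data rows along the way. The sketch below shows only that polling loop. It is not Phresto's implementation; the $send transport, the function name, and the canned responses and URLs are hypothetical stand-ins so the example runs without a cluster.

```php
<?php
declare(strict_types=1);

/**
 * Runs a query by following the coordinator's nextUri links.
 *
 * $send is a hypothetical transport with the shape
 * fn(string $method, string $url, array $headers, ?string $body): array
 * that returns the decoded JSON response.
 */
function runPrestoQuery(callable $send, string $baseUrl, string $sql, array $session): array
{
    $headers = [
        'X-Presto-User'    => $session['user'],
        'X-Presto-Catalog' => $session['catalog'],
        'X-Presto-Schema'  => $session['schema'],
    ];

    // Submit the statement, then poll until the coordinator stops handing out nextUri.
    $response = $send('POST', $baseUrl . '/v1/statement', $headers, $sql);
    $rows = [];
    while (true) {
        if (isset($response['error'])) {
            throw new RuntimeException($response['error']['message'] ?? 'Query failed');
        }
        foreach ($response['data'] ?? [] as $row) {
            $rows[] = $row;
        }
        if (!isset($response['nextUri'])) {
            return $rows;
        }
        $response = $send('GET', $response['nextUri'], $headers, null);
    }
}

// Fake transport returning canned pages (illustrative values only).
$pages = [
    ['id' => 'q1', 'nextUri' => 'http://coordinator.hostname.com:8080/v1/statement/q1/1'],
    ['id' => 'q1', 'data' => [[5, 'first']], 'nextUri' => 'http://coordinator.hostname.com:8080/v1/statement/q1/2'],
    ['id' => 'q1', 'data' => [[6, 'second']]],
];
$fakeSend = function (string $method, string $url, array $headers, ?string $body) use (&$pages): array {
    return array_shift($pages);
};

$rows = runPrestoQuery(
    $fakeSend,
    'http://coordinator.hostname.com:8080',
    'SELECT * FROM messages',
    ['user' => 'Michael', 'catalog' => 'hive', 'schema' => 'database_name']
);
print_r($rows);
```

Injecting the transport keeps the loop testable; a real client would replace $fakeSend with an HTTP call.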

Slide 37

Slide 37 text

@MICHAELCULLUMUK

Slide 38

Slide 38 text

@MICHAELCULLUMUK SIMPLE PHP CLIENT
use Psr\Log\NullLogger; // assumption: the PSR-3 null logger
$socket = new \SamKnows\Phresto\Client\RemoteHost('http', 'coordinator.hostname.com', '8080');
$connection = new \SamKnows\Phresto\Client\HttpConnection($socket, new NullLogger(), 'Michael\'s Macbook', 'Michael', '', 0, 0, 0, false);
// Run the query against the 'hive' catalog and the 'database_name' schema.
$result = $connection->executeQuery('SELECT * FROM table WHERE id = 5', 'hive', 'database_name');
$resultClass = $result->getResult();

Slide 39

Slide 39 text

@MICHAELCULLUMUK TAP INTO PRESTO FUNCTIONALITY TOO
$connection->getNodeList();
$connection->getClusterStatus();
$connection->getServerInfo();
$connection->getQueriesStatus();
$connection->getQueryStatus('20170508_163921_00807_v2znp');

Slide 40

Slide 40 text

@MICHAELCULLUMUK SAMKNOWS/PHRESTO

Slide 41

Slide 41 text

@MICHAELCULLUMUK CONTENTS
▸ ✓ Hadoop HDFS: How data is stored, accessed and distributed under the hood
▸ ✓ Hive: Using Hadoop as an RDBMS; importing files
▸ ✓ Presto: A Facebook library we can use to query the cluster
▸ ✓ Phresto: A library to make Presto accessible from PHP userland
▸ Phresto & Doctrine

Slide 42

Slide 42 text

@MICHAELCULLUMUK DOCTRINE & SYMFONY

Slide 43

Slide 43 text

@MICHAELCULLUMUK DOCTRINE DBAL / PDO-STYLE INTERFACE
use Doctrine\DBAL\Configuration;
use Doctrine\DBAL\DriverManager;
use SamKnows\Phresto\Doctrine\DBAL\Driver\Phresto\Driver;
$configuration = new Configuration();
$params = [
    'host' => 'coordinator.hostname.com',
    'port' => '8080',
    'user' => 'Michael',
    'password' => '',
    'source' => __FILE__,
    'protocol' => 'http',
    'catalog' => 'hive',
    'schema' => 'database_name',
    'driverClass' => Driver::class,
];
$connection = DriverManager::getConnection($params, $configuration);
$result = $connection->executeQuery('SELECT * FROM messages LIMIT 1');
// fetchAll() returns every row; iterate over the rows rather than over a single fetch().
foreach ($result->fetchAll() as $row) {
    var_dump($row);
}

Slide 44

Slide 44 text

@MICHAELCULLUMUK SYMFONY
# Doctrine Configuration
# Use parameters for these configuration values
doctrine:
    dbal:
        default_connection: presto
        connections:
            presto:
                driver_class: '\SamKnows\Phresto\Doctrine\DBAL\Driver\Phresto\Driver'
                host: 'coordinator.hostname.com'
                port: '8080'
                user: 'Michael'
                password: 'password'
                charset: UTF8
                options:
                    source: 'Michaels MBP'
                    catalog: 'hive'
                    schema: 'table_name'
                    protocol: 'http'

Slide 45

Slide 45 text

@MICHAELCULLUMUK CONTENTS
▸ ✓ Hadoop HDFS: How data is stored, accessed and distributed under the hood
▸ ✓ Hive: Using Hadoop as an RDBMS; importing files
▸ ✓ Presto: A Facebook library we can use to query the cluster
▸ ✓ Phresto: A library to make Presto accessible from PHP userland
▸ ✓ Phresto & Doctrine