Introduction to Apache Hadoop HDFS

Apache Hadoop HDFS
• What is it?
• What is it for?
• Architecture
• Resilience
• Administration
• Data access
• Future changes?

HDFS – What is it?
• HDFS = Hadoop Distributed File System
• It is a distributed file system
• Runs on low-cost hardware
• It is open source
• Written in Java
• Fault tolerant
• Designed for very large data sets
• Tuned for high throughput

HDFS – What is it for?
• Designed for batch processing
• Streaming access to data
• Large data sizes, i.e. terabytes
• Highly reliable using data replication
• Supports very large node clusters
• Supports large files
• Supports file numbers into the millions

HDFS – Architecture
• Has a master / slave architecture
• A master NameNode
  – Controls file system operations
  – Maps data blocks to DataNodes
  – Logs all changes
• Slave DataNodes
  – Store file blocks
  – Store replicated data
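The division of labour on this slide can be sketched as a toy model in Python. This is not the HDFS API: `ToyNameNode`, `read_file`, the 4-byte `BLOCK_SIZE`, the replication factor of 3 and the round-robin placement are all assumptions chosen for illustration. The point is only that the NameNode holds metadata (the block map and a change log) while the DataNodes hold the bytes.

```python
BLOCK_SIZE = 4     # bytes per block; tiny on purpose, illustration only
REPLICATION = 3    # assumed replication factor


class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where replicas live."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.block_map = {}   # (path, block_index) -> list of DataNode names
        self.edit_log = []    # append-only record of namespace changes
        self._next = 0        # round-robin cursor for replica placement

    def add_file(self, path, data, stores):
        """Split data into blocks and place each replica on a distinct DataNode."""
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        for index, block in enumerate(blocks):
            targets = [
                self.datanodes[(self._next + k) % len(self.datanodes)]
                for k in range(REPLICATION)
            ]
            self._next += 1
            for node in targets:
                stores[node][(path, index)] = block   # DataNodes store the bytes
            self.block_map[(path, index)] = targets   # NameNode stores the mapping
        self.edit_log.append(("add", path, len(blocks)))
        return len(blocks)


def read_file(namenode, path, stores):
    """Reassemble a file by asking the NameNode where each block lives."""
    out = b""
    index = 0
    while (path, index) in namenode.block_map:
        node = namenode.block_map[(path, index)][0]
        out += stores[node][(path, index)]
        index += 1
    return out


nodes = ["dn1", "dn2", "dn3", "dn4"]
stores = {name: {} for name in nodes}   # each DataNode's local block storage
namenode = ToyNameNode(nodes)
block_count = namenode.add_file("/demo.txt", b"hello hdfs!", stores)
```

With four DataNodes and three replicas per block, every block lands on three different nodes, so `read_file` could fall back to another entry in the block map if the first node were unavailable.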

HDFS – Resilience
• Data is replicated across DataNodes
• Nodes may fail but data is still available
• DataNodes indicate state via heartbeat report
• Single point of failure in master NameNode
• Data integrity via checksums
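A minimal sketch of how heartbeats and checksums combine to keep data available, again as a toy model rather than HDFS code. The 30-second `HEARTBEAT_TIMEOUT`, the function names and the use of `zlib.crc32` as the checksum are assumptions made for the example.

```python
import zlib

HEARTBEAT_TIMEOUT = 30.0   # seconds; assumed value for illustration


def live_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the DataNodes whose last heartbeat is recent enough."""
    return {node for node, seen in last_heartbeat.items() if now - seen <= timeout}


def read_block(replicas, expected_crc, alive):
    """Return the first replica that sits on a live node and passes its checksum."""
    for node, payload in replicas.items():
        if node in alive and zlib.crc32(payload) == expected_crc:
            return node, payload
    raise IOError("no healthy replica available")


block = b"block-0 contents"
expected_crc = zlib.crc32(block)   # checksum recorded when the block was written

replicas = {
    "dn1": block,                  # good copy, but the node has gone silent
    "dn2": b"block-0 c0ntents",    # node is alive, copy is corrupted
    "dn3": block,                  # node is alive, copy is intact
}
last_heartbeat = {"dn1": 0.0, "dn2": 95.0, "dn3": 90.0}

alive = live_nodes(last_heartbeat, now=100.0)
node, payload = read_block(replicas, expected_crc, alive)
```

Here `dn1` is skipped because its heartbeat is stale, `dn2` is skipped because its copy fails the checksum, and the read is served from `dn3`. The sketch models DataNode failures only; it does nothing about the NameNode being a single point of failure.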

HDFS – Administration
• Access via Java API
• FS Shell command language
• HTTP browser
• C wrapper for Java API
• Space reclamation
  – Via control of replication factor
  – Deleted files sent to trash folder
  – Trash folder cleaned after configurable time
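The two space reclamation levers above can be illustrated with a small Python model. `raw_usage` and `ToyTrash` are hypothetical names, and the one-hour retention is an arbitrary choice. In a real cluster the equivalent controls are the `-setrep` FS Shell command and the `fs.trash.interval` setting.

```python
def raw_usage(file_sizes, replication):
    """Raw cluster bytes consumed: every file is stored `replication` times."""
    return sum(file_sizes.values()) * replication


class ToyTrash:
    """Deleted paths wait here until the retention period has passed."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self.items = {}   # path -> time the file was moved to trash

    def delete(self, path, now):
        self.items[path] = now

    def purge(self, now):
        """Permanently remove entries older than the retention period."""
        expired = sorted(p for p, t in self.items.items() if now - t >= self.retention)
        for path in expired:
            del self.items[path]
        return expired


sizes = {"/logs/a": 100, "/logs/b": 50}     # logical file sizes in bytes
usage_at_3 = raw_usage(sizes, 3)            # 150 logical bytes stored three times
usage_at_2 = raw_usage(sizes, 2)            # lowering the factor frees raw space

trash = ToyTrash(retention_seconds=3600)
trash.delete("/logs/a", now=0)
trash.delete("/logs/b", now=3000)
purged = trash.purge(now=3600)              # only /logs/a is old enough
```

Lowering the replication factor from 3 to 2 frees a third of the raw space these files occupy, while files in the trash keep consuming space until the purge removes them.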

HDFS – Future changes
Things they might consider for HDFS
• File append
• User quotas
• File links
• Standby nodes

Other Areas
• Want to know about?
  – Big Data
  – Nutch
  – Solr
• See my other presentations

Contact Us
• Feel free to contact us at
  – www.semtech-solutions.co.nz
  – [email protected]
• We offer IT project consultancy
• We are happy to hear about your problems
• You can just pay for those hours that you need to solve your problems

Mike Frampton
