On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

How is Big Data moved around? How are you planning to move it?
This session will focus on familiar and not so familiar tools you can use today
for moving and integrating Big Data. It will also outline the underlying technologies and platform (introduction to Big Data, Hadoop, HDInsight and tools).

We will compare and outline options,
discuss how they can work with your existing Hadoop and Windows Azure
environment, and provide some guidance on when and how to use each of these
tools.

Stéphane Fréchette

February 13, 2014

Transcript

  1. On the move with Big Data
     Hadoop, Pig, Sqoop, SSIS…
     Stéphane Fréchette
     Thursday, February 13, 2014
  2. Who am I?
     My name is Stéphane Fréchette
     SQL Server MVP - I’m a Database & Business Intelligence Professional and Founder | CEO of
     I have a passion for architecting, designing and building solutions that matter.
     Self-proclaimed Open Data Hacker/Advocate. I founded Gatineau Ouverte, a citizen-led initiative which aims to promote open access to civic data of the city of Gatineau.
     Twitter: @sfrechette
     Blog: stephanefrechette.com
     Email: [email protected]
  3. Session Outline
     • What is Big Data?
     • Apache Hadoop
     • Hadoop Ecosystem
     • Windows Azure HDInsight
     • On the move…
     • SSIS, Sqoop, Pig
     • Demos
     • Resources
  4. Apache Hadoop
     • Open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
     • Designed to scale up from single servers to thousands of machines, each offering local computation and storage
  5. Hadoop Ecosystem
     • Core components:
       • HDFS (Hadoop Distributed File System) -> Storage
       • MapReduce -> Processing
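The MapReduce processing model named on slide 5 can be illustrated with a small, single-process sketch. This is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made up for illustration. It only shows the three conceptual steps: map each record to key/value pairs, group the pairs by key, then reduce each group.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for lines of a file in HDFS.
    sample = ["big data on the move", "move big data"]
    print(word_count(sample))
```

On a real cluster the mappers and reducers would run on many machines next to the data blocks; here everything runs in one process so the data flow is easy to follow.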
  6. What is Pig?
     • Write complex MapReduce jobs using a simple script language (Pig Latin)
     • A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
     • Pig translates and compiles complex MapReduce jobs on the fly
     http://pig.apache.org
  7. What is Sqoop?
     • Command-line interface application to transfer bulk data between Hadoop and relational datastores
     http://sqoop.apache.org
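The idea behind a bulk transfer from a relational store into delimited files can be sketched as follows. This is not Sqoop and does not call it; it is a rough stand-in using Python's standard `sqlite3` and `csv` modules, under stated assumptions. The table `customers`, the function `export_table`, the file naming and the round-robin split are all hypothetical choices for illustration, loosely mimicking the way a parallel import writes several part files.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def export_table(conn, table, out_dir, num_splits=2):
    """Write every row of `table` to `num_splits` comma-delimited part files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Table names cannot be bound as SQL parameters; `table` is trusted here.
    rows = conn.execute(f"SELECT * FROM {table} ORDER BY 1").fetchall()
    paths = []
    for split in range(num_splits):
        path = out_dir / f"part-{split:05d}.csv"
        with path.open("w", newline="") as handle:
            # Round-robin split, standing in for parallel transfer tasks.
            csv.writer(handle).writerows(rows[split::num_splits])
        paths.append(path)
    return paths


def demo():
    """Build a tiny in-memory table, export it, and return the file contents."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace"), (3, "Linus")],
    )
    with tempfile.TemporaryDirectory() as tmp:
        paths = export_table(conn, "customers", tmp)
        return [p.read_text().splitlines() for p in paths]


if __name__ == "__main__":
    print(demo())
```

A real transfer tool would stream rows instead of loading the whole table into memory, and would write to HDFS, not to a local temporary directory.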
  8. What is Hive?
     • A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
     • Provides an SQL-like language called HiveQL to query data
     • Integration between Hadoop and BI and visualization tools
     http://hive.apache.org
  9. What is SSIS?
     • SQL Server Integration Services is a platform for data integration and workflow applications. A fast and flexible tool used for data extraction, transformation, and loading (ETL).
     • Contains a rich set of built-in tasks and transformations; tools for constructing packages…
     • Used to solve complex business problems
  10. Windows Azure HDInsight
     • HDInsight is a Hadoop-based service from Microsoft that brings a 100 percent Apache Hadoop solution to the cloud
     • Based on the Hortonworks Data Platform
     • Scalable, on-demand service
  11. Resources
     • Apache Projects (list with links) http://bit.ly/MfpLtE
     • Windows Azure HDInsight http://bit.ly/1dnlAX1
     • HDInsight Tutorials and Guide http://bit.ly/LWRYol
     • Hortonworks Sandbox 2.0 http://bit.ly/1gkkCte
     • Hortonworks Tutorial Gallery http://bit.ly/1nvMAEX
     • Microsoft JDBC Driver 4.0 for SQL Server http://bit.ly/1kEgJ7O
     • Microsoft Hive ODBC Driver http://bit.ly/NFkhcH
     • GitHub: WindowsAzure / azure-content http://bit.ly/1hfthlF
     • SSIS Custom Task – Disorderly Data (Ken Ross) http://bit.ly/1nvIH2G
     • GitHub https://github.com/kzhen/SSISHDFS