Connecting the data infrastructure with the DataFlow (Apache NiFi)

The need to integrate a swarm of systems has always been present in the history of IT; however, with the advent of big data and the internet of things, it has simply exploded. Through several real-life use cases in companies of all sizes, this talk will introduce you to Apache NiFi, a powerful and scalable system to process, transform and distribute data.

NiFi is an open source project from the Apache Software Foundation that works well as mediation logic between systems and can cover most of your ETL requirements. This talk will show you how people in BI, Data Science, Development and Operations teams can use NiFi to easily fulfill their data movement requirements.

After this talk you will know where you can leverage NiFi, but also where you should not use it. In a nutshell, you will add another tool to your belt for working on data integration problems.

Pere Urbón

June 13, 2018

Transcript

  1. @purbon
    Connecting the data
    infrastructure with the
    DataFlow

  2. @purbon
    Pere Urbon-Bayes
    Software Architect
    [email protected]{gmail.com, acm.org}

  3. Topics for Today
    • Integration patterns for the enterprise startup.
    • What is Apache NiFi?
    • Examples
    • NiFi in operation (best practices).

  4. @purbon
    Integrate all the
    things!

  5. @purbon
    Enterprise integration is the task of making
    separate applications work together to produce
    a unified set of functionality.
    The applications probably run on multiple
    computers, which may be geographically
    dispersed.

  6. @purbon
    Some applications might need to be integrated
    even though they were not designed for
    integration and cannot be changed.
    These issues, and others, are what make
    application integration difficult.

  7. @purbon
    Each integration faces different needs and
    criteria; we can group them as:
    • Application coupling
    • Integration simplicity
    • Data formats and timeliness
    • Data or functionality
    • Communication

  8. @purbon
    There is only a limited set of integration
    options

  9. @purbon
    File transfer

  10. @purbon
    Shared database

  11. @purbon
    Remote procedure invocation (RPC)

  12. @purbon
    Messaging

  13. @purbon
    Enterprise Integration Patterns

  14. @purbon
    What is Apache NiFi?

  15. @purbon
    An easy to use, powerful, and reliable system to
    process and distribute data.
    • Web-based interface
    • Highly configurable
    • Data Provenance
    • Designed for extension
    • Secure

  16. @purbon
    NiFi was built to automate the flow of data
    between systems.
    But what is dataflow?
    An automated and managed flow of information
    between systems.

  17. @purbon
    What Apache NiFi looks like

  18. @purbon
    Concepts behind Apache NiFi

  19. @purbon
    A FlowFile

  20. @purbon
    The FlowFile Processor

  21. @purbon
    A Connection

  22. @purbon
    A Process Group
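
The four concepts named on the preceding slides (FlowFile, Processor, Connection, Process Group) can be pictured with a small toy model. This is only a sketch for intuition: it is not NiFi's API (real NiFi processors are written in Java against NiFi's own interfaces), and every class and method name below is hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List, Optional


@dataclass
class FlowFile:
    """A unit of data: content bytes plus key/value attributes."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


class Connection:
    """A queue that links the output of one processor to the next one."""

    def __init__(self) -> None:
        self.queue: Deque[FlowFile] = deque()

    def put(self, flow_file: FlowFile) -> None:
        self.queue.append(flow_file)

    def take(self) -> Optional[FlowFile]:
        return self.queue.popleft() if self.queue else None


@dataclass
class Processor:
    """Takes one FlowFile from `inbound`, transforms it, emits to `outbound`."""
    name: str
    transform: Callable[[FlowFile], FlowFile]
    inbound: Connection
    outbound: Connection

    def run_once(self) -> bool:
        flow_file = self.inbound.take()
        if flow_file is None:
            return False  # nothing queued, no work done
        self.outbound.put(self.transform(flow_file))
        return True


class ProcessGroup:
    """A named set of processors that is managed as one unit."""

    def __init__(self, name: str, processors: List[Processor]) -> None:
        self.name = name
        self.processors = processors

    def run_until_idle(self) -> None:
        progressed = True
        while progressed:
            # The list (not a generator) makes every processor run each round.
            progressed = any([p.run_once() for p in self.processors])


# Two processors chained by three connections: source -> middle -> sink.
source, middle, sink = Connection(), Connection(), Connection()
upper = Processor("upper", lambda f: FlowFile(f.content.upper(), f.attributes),
                  source, middle)
tag = Processor("tag", lambda f: FlowFile(f.content, {**f.attributes, "tagged": "true"}),
                middle, sink)
group = ProcessGroup("demo", [upper, tag])

source.put(FlowFile(b"hello", {"filename": "a.txt"}))
group.run_until_idle()
print(sink.take())
```

The point of the model is the separation of roles: data and its metadata travel together, processors never call each other directly, and the queues between them are where back-pressure and buffering would live in a real system.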

  23. @purbon
    Apache NiFi Architecture
    Distributed using Apache ZooKeeper

  24. @purbon
    Let’s take a closer
    look…

  25. @purbon
    Apache NiFi
    Operations

  26. @purbon
    Maximum file handles
    hard nofile 50000
    soft nofile 50000
    /etc/security/limits.conf

  27. @purbon
    Maximum forked processes
    /etc/security/limits.conf:
    * hard nproc 10000
    * soft nproc 10000
    /etc/security/limits.d/90-nproc.conf (may also need the soft nproc entry on some distributions)
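
To confirm that the file-handle and process limits have actually taken effect, the limits of a running process can be read back. The sketch below uses Python's standard `resource` module and compares the values against the 50000 and 10000 figures from the slides. The helper names `satisfies` and `check_limits` are made up for this example, and it assumes a Linux host where `RLIMIT_NPROC` is available.

```python
import resource

# Minimums taken from the slides; adjust them for your own deployment.
RECOMMENDED = {
    "nofile": (resource.RLIMIT_NOFILE, 50000),
    "nproc": (resource.RLIMIT_NPROC, 10000),
}


def satisfies(value, minimum):
    """True when a limit is unlimited or at least `minimum`."""
    return value == resource.RLIM_INFINITY or value >= minimum


def check_limits():
    """Return {name: (soft, hard, ok)} for the current process."""
    results = {}
    for name, (rlimit, minimum) in RECOMMENDED.items():
        soft, hard = resource.getrlimit(rlimit)
        ok = satisfies(soft, minimum) and satisfies(hard, minimum)
        results[name] = (soft, hard, ok)
    return results


if __name__ == "__main__":
    for name, (soft, hard, ok) in check_limits().items():
        status = "OK" if ok else "below recommendation"
        print(f"{name}: soft={soft} hard={hard} -> {status}")
```

Limits are per user and per session, so the check is only meaningful when run as the same user, and in the same kind of session, as the NiFi service itself.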

  28. @purbon
    Increase number of TCP sockets
    sudo sysctl -w net.ipv4.ip_local_port_range="10000 65000"
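
To see what this setting buys, it helps to count the ports in a range. The sketch below parses a "low high" string, which is the format of the value passed to `sysctl` above and of `/proc/sys/net/ipv4/ip_local_port_range` on Linux. The function name `port_count` is made up for this example.

```python
def port_count(range_text):
    """Number of local ports in a 'low high' range (both ends inclusive)."""
    low, high = (int(part) for part in range_text.split())
    if low > high:
        raise ValueError(f"invalid port range: {range_text!r}")
    return high - low + 1


# The range from the slide.
print(port_count("10000 65000"))  # 55001
```

On a live Linux host, `port_count(open("/proc/sys/net/ipv4/ip_local_port_range").read())` gives the figure for the current setting, which can be compared with the value above before and after the change.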

  29. @purbon
    Time out sockets in TIME_WAIT state
    sudo sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait="1"
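
Whether this tuning matters on a given host depends on how many sockets are lingering in TIME_WAIT. As a rough sketch, the state of each IPv4 TCP socket can be counted from the text of `/proc/net/tcp`, where the fourth column holds a hexadecimal state code (`06` is TIME_WAIT). The function name and the sample data below are invented for illustration.

```python
from collections import Counter

# A few of the state codes used in the "st" column of /proc/net/tcp.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}


def count_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


# Made-up sample in the /proc/net/tcp layout: one listener, two TIME_WAIT.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:1F90 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1001
   1: 0100007F:1F90 0100007F:C350 06 00000000:00000000 03:00000A5B 00000000     0        0 0
   2: 0100007F:1F90 0100007F:C351 06 00000000:00000000 03:00000A5C 00000000     0        0 0
"""

print(count_states(SAMPLE)["TIME_WAIT"])  # 2
```

On a real Linux host the input would be `open("/proc/net/tcp").read()`; IPv6 sockets are listed separately in `/proc/net/tcp6`.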

  30. @purbon
    Never SWAP
    /etc/sysctl.conf:
    vm.swappiness = 0
    /etc/fstab:
    /dev/sda7 /chroot ext2 defaults,noatime 1 2
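
Both settings on this slide live in plain-text files, so they are easy to verify with a small script. The sketch below reads `vm.swappiness` from sysctl.conf-style text and extracts the mount options from one `/etc/fstab` entry. The function names are made up for this example, and the parsing is deliberately minimal.

```python
def swappiness(sysctl_conf_text):
    """Return vm.swappiness from sysctl.conf-style text, or None if unset."""
    for line in sysctl_conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" in line:
            key, value = line.split("=", 1)
            if key.strip() == "vm.swappiness":
                return int(value.strip())
    return None


def mount_options(fstab_line):
    """Return the option list (fourth field) of one /etc/fstab entry."""
    fields = fstab_line.split()
    # An entry has four to six whitespace-separated fields. Options must be
    # comma-separated with no spaces, otherwise the field count is wrong.
    if not 4 <= len(fields) <= 6:
        raise ValueError(f"unexpected number of fields: {fstab_line!r}")
    return fields[3].split(",")


print(swappiness("vm.swappiness = 0\n"))
print(mount_options("/dev/sda7 /chroot ext2 defaults,noatime 1 2"))
```

The field-count check is the useful part in practice: a stray space inside the options column shifts every later field and is easy to miss by eye.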

  31. @purbon
    Thanks a lot!
    Questions?
    disagreements? threads?
    Pere Urbon-Bayes
    Data Wrangler
    [email protected]{gmail.com, acm.org}
