Scrum Master + Data Engineer in a manufacturing company
Working with
• data and people
Focus on
• Agile/Engineering culture
• Streaming process
• IoT applications
• Data visualization
Shuhsi Lin
sucitw gmail.com
https://medium.com/@suci/
REST API
• Highly configurable
• Data Provenance
• Designed for extension
• Secure
Created in 2006 at the NSA, donated to the ASF in 2014
Similar in concept to Flow-Based Programming (FBP)
gives you tools to process this data
• Deals with a great variety of data sources and formats
• Takes data from one source, transforms it, and pushes it to a different data sink
https://medium.com/free-code-camp/nifi-surf-on-your-dataflow-4f3343c50aa2
data source and target
◦ Resilient when updating components (300+ processors, data sources, transformation logic, job scheduling, ...)
◦ Access control (groups, users, LDAP)
• Scalable (clustering)
• Reliability
◦ Guaranteed delivery (no data loss)
◦ Data provenance/lineage (auditing the pipeline)
• Fast develop/deploy cycle
• Better data flow visibility (across multidisciplinary teams and peers)
• Data processing without programming
The four Vs of Big Data
https://medium.com/free-code-camp/nifi-surf-on-your-dataflow-4f3343c50aa2
to build a flow
• Start, stop, and configure components in real time
• View errors and their corresponding error messages
• View statistics and the health of the data flow
• Create templates of common processors and connections
• Two processors linked together by one connection (queue)
Processor 1 - Connection - Processor 2
Simple data flow
[Figure: Processor 1 (GenerateFlowFile, generates content) -> connection -> Processor 2 (PutFile, stores content); the FlowFile is labeled "I am flowfile"]
https://medium.com/@suci/hello-world-nifi-dcafcba0fdb0
memory space (swappable)
◦ E.g., the file name, file path, and a unique identifier are standard attributes
• Content
◦ A reference to the stream of bytes that composes the FlowFile content
The FlowFile object is immutable
How Apache Nifi works: surf on your dataflow, don't drown in it
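The attribute/content split and the immutability rule can be sketched in plain Python. This is a toy model, not NiFi's implementation: the class, the `content_claim` field, and the helper names are hypothetical, and only the attribute keys `filename`, `path`, and `uuid` mirror NiFi's standard attributes.

```python
from dataclasses import dataclass, replace
from types import MappingProxyType
from typing import Mapping
import uuid


@dataclass(frozen=True)
class FlowFile:
    """Toy model: attributes (metadata) plus a reference to the content."""
    attributes: Mapping[str, str]
    content_claim: str  # hypothetical pointer to the stored bytes, not the bytes themselves


def new_flowfile(filename, path, content_claim):
    """Create a FlowFile carrying the three standard attributes."""
    attrs = {"filename": filename, "path": path, "uuid": str(uuid.uuid4())}
    return FlowFile(MappingProxyType(attrs), content_claim)


def put_attribute(flowfile, key, value):
    """Return a NEW FlowFile with one more attribute; the original is untouched."""
    attrs = dict(flowfile.attributes)
    attrs[key] = value
    return replace(flowfile, attributes=MappingProxyType(attrs))


original = new_flowfile("hello.txt", "./", "claim-0001")
updated = put_attribute(original, "mime.type", "text/plain")
```

Because the object is frozen, "changing" an attribute produces a second object that still points at the same content reference, so no content bytes are copied.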
its threshold.
• Two processors linked by a connection whose limit is respected: the number of FlowFiles is below the threshold.
• The Flow Controller schedules P1 for execution again.
Backpressure
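The scheduling rule behind backpressure can be illustrated with a small queue model. This is a simplified sketch under stated assumptions: the `Connection` class and both `run_*` helpers are hypothetical names, and only an object-count threshold is modeled.

```python
from collections import deque


class Connection:
    """A queue between two processors with an object-count threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.queue = deque()

    def is_full(self):
        return len(self.queue) >= self.threshold


def run_upstream(connection, produce):
    """Run P1 once, unless backpressure is applied. Returns True if it ran."""
    if connection.is_full():
        return False  # threshold reached: P1 is not scheduled
    connection.queue.append(produce())
    return True


def run_downstream(connection, consume):
    """Run P2 once if there is something queued. Returns True if it ran."""
    if not connection.queue:
        return False
    consume(connection.queue.popleft())
    return True


conn = Connection(threshold=2)
consumed = []
first_three = [run_upstream(conn, lambda: "I am flowfile") for _ in range(3)]
run_downstream(conn, consumed.append)  # P2 drains one FlowFile
scheduled_again = run_upstream(conn, lambda: "I am flowfile")
```

With a threshold of 2, the third attempt to run P1 is skipped. Once P2 drains one FlowFile, the queue is back below the threshold and P1 runs again.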
of all the FlowFiles in the flow
• The metadata and context information of each FlowFile
• Replay the data from any point in time
How Apache Nifi works: surf on your dataflow, don't drown in it
and choose what you need
• Do you need an enterprise dataflow platform?
• Do you need to integrate with many different big data solutions?
• You should be able to build the same pipeline without NiFi
• You may need to enable administration functions for large-scale usage
• Detailed documentation of the full SDK at all levels
• CRUD wrappers for common task areas like Processor Groups, Processors, Templates, Registry Clients, Registry Buckets, Registry Flows, etc.
• Convenience functions for inventory tasks, such as recursively retrieving the entire canvas, or a flat list of all Process Groups
• Support for scheduling and purging flows, controller services, and connections
• Support for fetching and updating Variable Registries
• Support for import/export of Versioned Flows from NiFi-Registry
• Docker Compose configurations for testing and deployment
• A scripted deployment of an interactive environment, and a secured configuration, for testing and demonstration purposes
By Dan Chaffelson
NiPyApi
• Config: a set of defaults and parameters
• Security: secure connectivity management
• System: for system and cluster level functions in NiFi
• Templates: for managing flow deployments
• Utils: convenience utility functions for NiPyApi (not intended for external use)
• Versioning: for interactions with the NiFi Registry Service and related functions
https://github.com/sucitw/python-script-in-NiFi/blob/master/RESTAPI/nipyapi_demo.ipynb
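NiPyApi is a Python client for the NiFi REST API. To show the kind of call such a client wraps, here is a minimal sketch that uses only the Python standard library rather than NiPyApi itself. It assumes an unsecured NiFi instance at `http://localhost:8080`; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: an unsecured NiFi instance listening on localhost:8080.
NIFI_API = "http://localhost:8080/nifi-api"


def build_request(path, base_url=NIFI_API):
    """Build a GET request for a NiFi REST API path without sending it."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(
        url, headers={"Accept": "application/json"}, method="GET"
    )


def get_json(path, base_url=NIFI_API, timeout=10):
    """Send the request and decode the JSON response body."""
    request = build_request(path, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def root_process_group_flow(base_url=NIFI_API):
    """Fetch the flow of the root process group (the top-level canvas)."""
    return get_json("flow/process-groups/root", base_url)
```

Calling `root_process_group_flow()` needs a running NiFi instance. A secured cluster would also need authentication, which this sketch leaves out.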
to a FlowFile
• No incoming FlowFile (no upstream connections)
• Operates on the contents of a FlowFile
◦ Pipes FlowFile content to STDIN
• Creates a new FlowFile with the results of the command
◦ Populates outgoing FlowFile content from STDOUT
◦ Can also direct output to a named attribute
Executes an external command
• E.g., an external Python/shell script
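The STDIN/STDOUT contract can be tried outside NiFi. The sketch below plays the processor's role with `subprocess`: it pipes some bytes into an external command and takes whatever the command prints as the outgoing content. The inline `SCRIPT` is a hypothetical stand-in for a real external Python script, and `stream_through_command` is an illustrative helper, not a NiFi API.

```python
import subprocess
import sys

# Stand-in for the external script: upper-cases whatever arrives on STDIN.
SCRIPT = "import sys; sys.stdout.write(sys.stdin.read().upper())"


def stream_through_command(content, command):
    """Pipe content (bytes) to the command's STDIN and return its STDOUT bytes."""
    result = subprocess.run(
        command, input=content, stdout=subprocess.PIPE, check=True
    )
    return result.stdout


# Incoming FlowFile content goes in, outgoing FlowFile content comes out.
outgoing = stream_through_command(b"I am flowfile", [sys.executable, "-c", SCRIPT])
# outgoing == b"I AM FLOWFILE"
```

Any executable that reads STDIN and writes STDOUT fits this contract, so the same script can be tested from a shell before it is wired into a flow.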
C libs), Lua, JavaScript
◦ JSR-223 compatible languages
• Easy to use, fast development
• Source is inline or on the file system
• ExecuteScript => the script is evaluated in “onTrigger()”
• Only two relationships available (REL_SUCCESS and REL_FAILURE)
• Poor performance
https://nifi.apache.org/docs.html (ExecuteScript)
https://github.com/sucitw/python-script-in-NiFi
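A minimal ExecuteScript body in Python (Jython) might look like the sketch below. It relies on the `session`, `REL_SUCCESS`, and `REL_FAILURE` variables that ExecuteScript binds for the script. The `process` function and the `greeting` attribute are hypothetical, chosen so the routing logic can also be exercised outside NiFi with a stand-in session.

```python
def process(session, rel_success, rel_failure):
    """Tag one FlowFile with an attribute and route it to success or failure."""
    flow_file = session.get()
    if flow_file is None:
        return  # nothing queued upstream
    try:
        flow_file = session.putAttribute(flow_file, "greeting", "hello from Jython")
        session.transfer(flow_file, rel_success)
    except Exception:
        session.transfer(flow_file, rel_failure)


# Inside ExecuteScript, NiFi provides `session`, `REL_SUCCESS` and `REL_FAILURE`.
# Outside NiFi these names are undefined, so the call below is skipped.
try:
    session
except NameError:
    pass
else:
    process(session, REL_SUCCESS, REL_FAILURE)
```

Keeping the logic in a function that receives the session as a parameter is a design choice for testability; a script that uses the bound variables directly at top level works as well.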
◦ Supports custom properties and relationships
◦ ExecuteScript (ES) handles session.commit() for you, but InvokeScriptedProcessor (ISP) does not
◦ ES provides a "session" variable, whereas the ISP onTrigger() method must call sessionFactory.createSession()
https://nifi.apache.org/docs.html (InvokeScriptedProcessor)
https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-scripting-bundle/nifi-scripting-processors/src/test/resources/jython
NiFi
• Do it like software development
◦ Focus on business logic first
◦ Do it after design
◦ Keep it simple
◦ Separate environments: Dev, Test, Production
◦ Versioning
◦ Build once, deploy many
• Extract - Load - Transform (with external tools)
• Design for failure (fault tolerance)
◦ Good retry/error handling mechanism
◦ Monitoring
◦ Clustering, at least in Production
• Use process groups (PGs)
• PGs => Ingestion, test, and monitoring
• Naming conventions
• Put labels/comments
• Use variables, use funnels
• Scheduling -> Automation
• Define SLAs with your clients
• Do monitoring and alerting
• Use NiFi for handling routing and formatting
• Try not to queue data
• Control the FlowFile stream rate
• ...
How NiFi may help you
• Hello-World NiFi
• Key concepts and terms of NiFi (FlowFile, Processors, Connections)
• REST API with examples (NiPyApi)
• ExecuteProcess/ExecuteStreamCommand
• ExecuteScript/InvokeScriptedProcessor with Python
• Suggestions on NiFi
your dataflow, don't drown in it, by François Paupier
2. Pipeline Builder: Micron's Journey Automating the Global Data Warehouse, DataWorks Summit 2019
3. Best practices for running Apache NiFi in production - 3 takeaways from real world projects / "Best practices and lessons learnt from Running Apache NiFi", DataWorks Summit 2018
4. BYOP: Custom Processor Development in Apache NiFi, DataWorks Summit Barcelona 2019
5. Hello-World in Apache NiFi
6. Nifi 開發小技巧(Funnel) [NiFi development tips (Funnel)]
with more processors
◦ Choose what you need and do it right
• Learn from dataflow templates
• NiFi Expression Language
• Build Your Own Processor
◦ BYOP: Custom Processor Development with Apache NiFi
• MiNiFi
• NiFi Registry