From the course: Data Management with Apache NiFi

Adding and configuring processors - Apache NiFi Tutorial

- [Instructor] We are now ready to create our first data flow, and for that we need a processor. A processor is a NiFi component that can be used for listening for incoming data, pulling data from external sources, publishing data, and routing or transforming data. Select the processor icon and drag it onto your canvas. This will bring up a dialog where you can choose the processor that you want to add to your data flow. I've typed in generate here because I want to select the GenerateFlowFile processor, which creates FlowFiles with random or custom content. As you can see, this is great for load testing, configuration, and simulation. Add this to your canvas. We have our very first processor set up. Now this processor has to be configured, and the way you do this is by double-clicking on the processor, which will bring up the properties associated with it. Every processor has a name, an identifier, a type, and a bunch of other information. These are not properties you can configure. However, you can configure the scheduling of a processor. For example, you can set the scheduling strategy, which can be either timer driven or CRON driven. We'll stick with the timer driven scheduling strategy. With the CRON option, you can specify a CRON expression indicating when you want this processor to execute. Concurrent tasks allows you to configure the concurrency of this processor, that is, how many tasks it can run at the same time. Run schedule tells you how often the processor will run; by default, this processor will run every minute. Every processor is associated with a bunch of properties, and these are things that you can configure. For example, I'm going to set the file size to a hundred bytes to specify the size of the files produced by this processor. These properties can be read by downstream processors in your flow in order to perform certain actions. We won't be using this property, but now you know how it can be set. 
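In the video these settings are entered through the configuration dialog, but the same configuration can be expressed as the JSON body that NiFi's REST API accepts when creating a processor (POST to /nifi-api/process-groups/{group-id}/processors). This is only a sketch: the field names follow NiFi's REST conventions, and the custom text value is a placeholder since the transcript doesn't show the exact text used.

```python
import json

# Sketch of the GenerateFlowFile configuration described above, shaped like
# the payload NiFi's REST API expects when creating a processor. The
# "Custom Text" value is a placeholder, not the text from the video.
generate_flowfile = {
    "revision": {"version": 0},
    "component": {
        "type": "org.apache.nifi.processors.standard.GenerateFlowFile",
        "name": "GenerateFlowFile",
        "config": {
            "schedulingStrategy": "TIMER_DRIVEN",   # the other option is CRON_DRIVEN
            "schedulingPeriod": "1 min",            # run schedule: once every minute
            "concurrentlySchedulableTaskCount": 1,  # concurrent tasks
            "properties": {
                "File Size": "100 B",               # size of the generated content
                "Custom Text": "hello from NiFi",   # placeholder custom text
            },
        },
    },
}

print(json.dumps(generate_flowfile, indent=2))
```

A REST client such as curl, or a library like nipyapi, could POST this payload to a running NiFi instance; the configuration dialog in the video is simply the UI front end for the same settings.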
Next, I specify the custom text property. This is the custom text that will be generated every minute by this GenerateFlowFile processor. Every minute, this processor will produce a FlowFile, that is, an information packet, with the contents that I have specified here under custom text. On the Relationships tab, you can specify how this processor is connected to other processors in your data flow. We'll just leave this empty for now. You can also specify additional comments, allowing other team members to understand what this processor does. Go ahead and click on Apply. You have now configured your first processor. However, you can see a little warning icon indicating that something is wrong. That's because we haven't yet set up the success relationship, that is, what this processor should do once it has successfully generated a FlowFile. While this isn't a meaningful data flow yet, let's add another processor that will actually write out the FlowFile to our local file system. Click on the processor icon yet again to bring up the dialog, and let's search for the PutFile processor. PutFile can be used to write out a FlowFile that is present within our data flow. It requires write permissions on the destination directory, and because we are running NiFi locally, it has those permissions on your local machine. Go ahead and add this processor to your data flow as well. For PutFile to work, there are a number of configuration properties that we need to specify, so double-click on PutFile to bring up the configuration dialog for this processor and set it up. Under scheduling, notice that the run schedule says zero seconds. That means this processor will execute as soon as data, that is, a FlowFile, is available for it to process. Let's head over to properties, where we have a few properties to configure. The first property that I set is the directory where the FlowFile will be written out. 
Now this directory has to be accessible by NiFi on your local machine, so I'm just going to write everything out to output_files under my tools subfolder. If this directory doesn't already exist, I want it to be created, so make sure that Create Missing Directories is set to true. output_files does not currently exist under my tools folder; it'll be created as needed. Let's head over to Relationships. On failure, that is, if PutFile fails, let's just terminate the data flow, so check the terminate option. And on success, after PutFile has successfully written out our FlowFile to the local file system, we'll terminate as well. This is the last processor in our data flow, which is why setting both failure and success to terminate makes sense. Let's head over to comments and write something meaningful so that our other team members know what this processor is for, and click on Apply. At this point, we have two processors in our data flow, but we haven't set up any connections. Well, that's what we'll do next.
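The PutFile setup above can be sketched the same way, as a REST-style payload. Again this is only an illustration: the directory path here is hypothetical (substitute a path NiFi can write to on your machine), and auto-terminating both relationships is what marks PutFile as the end of the flow.

```python
import json

# Sketch of the PutFile configuration described above. The directory path
# is a hypothetical stand-in for the output_files folder used in the video.
put_file = {
    "revision": {"version": 0},
    "component": {
        "type": "org.apache.nifi.processors.standard.PutFile",
        "name": "PutFile",
        "config": {
            "schedulingPeriod": "0 sec",  # run as soon as a FlowFile arrives
            "properties": {
                "Directory": "/tools/output_files",    # hypothetical local path
                "Create Missing Directories": "true",  # create the folder if absent
            },
            # Terminating both relationships makes this the last stop in the flow.
            "autoTerminatedRelationships": ["success", "failure"],
            "comments": "Writes each incoming FlowFile to the local file system.",
        },
    },
}

print(json.dumps(put_file, indent=2))
```

Note that the run schedule of 0 seconds doesn't mean the processor runs constantly; it means it is scheduled as often as possible and does work only when a FlowFile is waiting in its incoming queue.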