Pentaho Data Integration Tutorial for Beginners - Part 1: Group By vs Memory Group By

Hi, my fellow devs! Sincere apologies for my long absence from this blog. From now on, I will be regularly posting useful Business Intelligence tutorials in simple language, starting with a mini tutorial series in parts for beginners using Pentaho Data Integration, an ETL software.

Whether you are an IT professional who has worked with other ETL tools or an absolute beginner starting an IT career in data analytics, PDI is a very easy ETL program to learn.

In this post, I demonstrate a few basic scenarios you may encounter during ETL development and how to handle them using Pentaho Data Integration.

Scenario#1: Aggregation of numerical data using Group By & Memory Group By

Aggregation is the process of combining data to get summary details like Sum, Min, Max, Average, Median, etc. In this example, I am going to demonstrate the Pentaho Group By step with an example use case that has a simple text file as the source; the data set used for this example is given below:



Fig 1: Source data set
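
For illustration, a hypothetical comma-separated dataset of this shape (a few identifying fields plus the Department and Salary columns used in this scenario) could look like the following; the actual values for the demo are the ones shown in Fig 1:

    EmpID,EmpName,Department,Salary
    101,Arun,IT,52000
    102,Bala,IT,61000
    103,Chitra,HR,45000
    104,Divya,HR,48500
    105,Eswar,Sales,57000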

Requirement:

The task is to find the department-wise maximum salary from the given data. This can be achieved using the PDI Group By and Memory Group By steps. While both do the same work, 'Group By' requires sorted input whereas 'Memory Group By' does not. If the number of rows to be grouped is too large to be handled within the available memory, the 'Group By' step can be used along with a 'Sort Rows' step instead of 'Memory Group By'.

Note:
If the data comes from a database table, ideally we should do the aggregation within the database and use the pre-aggregated data in the PDI transformation, as a SQL GROUP BY is generally more efficient than aggregating in any ETL environment. Hence, the PDI Group By/Memory Group By steps are better suited to data where aggregation at the source is not possible, e.g. csv, txt, etc.
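
For instance, if the same records lived in a PostgreSQL table, the summary could be computed in the database and only the aggregated rows brought into PDI. A minimal sketch, where the database name 'mydb' and the 'employees' table are hypothetical:

    psql -d mydb -c "SELECT department, MAX(salary) AS max_salary FROM employees GROUP BY department;"

In PDI, a Table Input step could run the same SELECT directly, so the grouping never has to happen in the ETL layer.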
                                                   
Fig 2: PDI Transformation Design
The image above is the design of the PDI transformation file I created to demonstrate the aggregation using both grouping steps (just to show that both work as expected; in actual development, using either one of them is sufficient). Alright, let's build this transformation step by step:

  1. Bringing the components/steps into Design canvas.
  2. Connecting steps using hops.
  3. Configuring the steps.
  4. Saving and running the Transformation file.

1. Bringing the Steps into Design Canvas:

The steps we need for this demo transformation are:

  • Text File Input
  • Sort Rows
  • Group By
  • Memory Group By
  • Text File Output (one for Group By results & another for Memory Group By Results)

First, we need to connect to the source data, which in our case is a simple .txt file. For this, we use the Text File Input step. Simply type the step name in the design explorer pane on the left. Once the step appears in the list, drag it onto the design canvas area as shown in the image below.

Fig 3: Bringing the Text File Input to Design Canvas
In a similar way, bring the above listed steps into the Design Canvas.

Fig. 4: Bringing the components

2. Connecting the Steps using Hops:

Connect the steps using links, aka "hops". Click the Text File Input step (clicking any step on the design canvas gives the same options). Four options appear; select the one shown in the screenshot below.


Fig. 5: Selecting the connector

On clicking the option, a line with an arrow follows the mouse cursor. Simply click the Sort Rows step and the two will be connected. Similarly, click the Sort Rows step and connect it to the Group By step.
Note:
Any two steps can also be connected by selecting the first step, holding the Shift key, and dragging the cursor to the second step; a hop will link the two steps.
Now, connect all the steps as shown in Fig. 2. When connecting the Text File Input with the Memory Group By step, we get a message as shown below:


Fig. 6: Copying the data in two streams
Simply select the Copy option. This passes two copies of the text file dataset downstream: one to the Sort Rows step and another to the Memory Group By step.

3. Configuring the steps:

a. Text File Input:

Once we have brought in this step, we need to configure it: choose the source file folder path and name, and the metadata information such as field names and data types. Let's see this one by one.

Double-click the Text File Input step on the design canvas. In the window that appears, the first tab is the File tab, where we specify the complete file path with the file name. Click the Browse option, navigate to the required folder, and choose the intended file. In the same manner, we can add multiple files, provided the metadata, i.e., the names, data types, and order of the fields/columns, is the same in all the files.


Fig. 7: File Selection
Having selected the file, click the Add button; the file name will be added to the Selected Files list at the bottom, as shown in the screenshot below:
Fig. 8: Adding file to the Selected Files list

Next, navigate to the Content tab. There we need to change the Separator value to a comma (,) instead of the semicolon (;) set by default. Leave all other settings as they are.
                        
Fig. 9: Setting the Separator

Proceed to the Fields tab and click the 'Get Fields' button. It will ask for the number of lines to sample (100 by default) for reading the metadata (details such as field name, data type (e.g., number, text, date), length, precision, etc.). Click OK, and once the metadata has been read, click Close.
Fig. 10: Column metadata details
Since our source dataset in this example is very simple, there is not much else to configure in the Text File Input step. Finally, click the OK button at the bottom, and we are done with this step.

b. Sort Rows Step:


As mentioned earlier, the Group By step requires sorted data to give the expected results. Since our source is a text file, where sorting at the source is not possible, we need to use Group By in association with the Sort Rows step to get sorted data.

Our goal in this example is to find the department-wise maximum salary, so the columns involved are Department and Salary. We sort the Department column first, since it is used in the grouping operation; it can be in either ascending or descending order.

It is followed by the Salary column in descending order, as we are looking for the maximum salary. The Sort Rows step conditions the dataset from the Text File Input step so that the Group By step can consume this conditioned (sorted) data.

Double-click the Sort Rows step and click the Get Fields button to fetch all the incoming columns. Keep only the required columns, and make sure the grouping column comes first, followed by the aggregated column. See the screenshot below for details.

Fig. 11: Sorting the Rows

c. Group By step:

In this step, we select the grouping column and the aggregated column. Please refer to the screenshot below for details.
Fig. 12: Configuring the Group By step

d. Memory Group By Step:

This is an alternative to the Group By step that does not require the incoming data to be sorted. Configuring this step is the same as for the regular Group By step.

Fig. 13: Memory Group By step

e. Text File Output step:

The final step is the Text File Output. In this example, we use two of them: one to capture the Group By results and another for the Memory Group By results. The configuration is the same for both Text File Output steps.
Fig. 14: Text File Output Step - 1
After defining the output file name and path, proceed to the Content tab and select the separator for the final output file.
Fig. 15: Text File Output step - 2
Configure the other Text File Output step in the same manner, but be sure to give it a different output file name.

Save the transformation file (.ktr) in a folder of your choice and run it by selecting Action >> Run. Alternatively, pressing F9 does the same.
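
A saved transformation can also be run outside Spoon using Pan, the command-line transformation runner that ships with PDI (pan.bat on Windows). A minimal sketch, with a hypothetical .ktr path:

    cd /path/to/data-integration
    # -file points to the saved transformation; -level sets the logging verbosity
    ./pan.sh -file="/home/user/ktr/groupby_demo.ktr" -level=Basic

This is handy for scheduling the transformation with cron once it has been tested in Spoon.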

Upon executing the transformation, we can see a green tick mark on each step, indicating that the steps executed successfully without errors. We can preview the results of any step post-execution to understand how the records are manipulated at each step. The previews of the Text File Output data from Group By and from Memory Group By are given in the screenshots below.
Fig. 16: Preview of Text File Output data from Group By step


Fig. 17: Preview of Text File Output data from Memory Group By step
Now let's check the actual output files.
Fig. 18: Final output file from Group By step



Fig. 19: Final output file from Memory Group By step

We can see that in the output file from the Group By step, the departments are arranged in descending (Z-A) order as a result of setting ascending to N in the Sort Rows step, whereas in the output file from the Memory Group By step, the departments are arranged in ascending (A-Z) order by default. The numerical results, however, are the same in both cases. This concludes our discussion of the Group By vs Memory Group By steps in Pentaho Data Integration.

Summary:

In this post, we have seen in detail how to implement an aggregation operation using both the Pentaho Data Integration Group By and Memory Group By steps, along with their limitations. These steps can also be used to find Min, Max, Average, and other summary values.

I have made this post more elaborate since it is the first in the series. Please let me know in the comments section if this post helped, and whether you would like me to keep the tutorials shorter.

Install Pentaho Data Integration CE 8 on Ubuntu


Hi, my fellow devs! It's been a while since my last post on Pentaho DI due to some unavoidable circumstances, and I did not want to put out a random post, as I always prefer quality over quantity. Hereafter, I will be posting frequently, and more awesome topics are on the way.

In this post, we are going to see how to install Pentaho Data Integration on Linux (I have used Ubuntu for this tutorial, but it is pretty much the same for other Linux distros like Fedora, CentOS, etc.).

I have also written a post on how to do the same on a Windows operating system here. If you are new to Pentaho ETL, please have a look at this blog post: Pentaho Data Integration Basics.

The basic requirements to install Pentaho Data Integration CE (aka PDI, aka Kettle) on the Ubuntu operating system are as follows:

  • Pentaho Data Integration Community Edition 
  • Ubuntu 16 or above
  • JDK 1.8 and above (Java Development Kit)

Step-1: Downloading the Pentaho Data Integration (PDI/Kettle) Software

Obviously, the first step to install Pentaho Data Integration on Ubuntu is downloading the PDI Community Edition from the official Sourceforge download page. The latest version at the time of writing this post is 8.2, and the download is about 1.1 GB. It arrives as a zip file named 'pdi-ce-8.2.0.0-342.zip'.


Step-2: Extracting the zip file

Extract the downloaded zip file, which will be in the Downloads folder. Right-click the file and choose 'Extract Here' if you want it extracted in the Downloads folder.
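
Alternatively, the same extraction can be done from the terminal with unzip. A quick sketch, assuming the zip is still in ~/Downloads; the destination folder ~/pentaho is a hypothetical choice:

    cd ~/Downloads
    # -d sets the destination folder for the extracted files
    unzip pdi-ce-8.2.0.0-342.zip -d ~/pentaho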

If you want to choose a different folder, right-click, select the 'Extract To...' option, and give the destination folder path. The default name of the extracted folder is 'data-integration'. Leave the folder name as it is and note down the path of this folder, as it will be used in the following steps. In my tutorial, I am placing it in the following path:



Step-3: Checking Java Availability

Since Pentaho is written in Java, Java is required to run Pentaho Data Integration CE on Ubuntu (or any OS, for that matter). Go to the desktop, right-click anywhere on the desktop screen, select 'Open Terminal', and in the terminal command line, type:

    java -version

Type it in lower case, as the command is case-sensitive. If version information similar to the screenshot below is displayed, we are good to proceed with the next step.


Fig 1. Checking Java Availability
If you did not get the version info, you need to download and install Java on Ubuntu first. Very good instructions on downloading Java and setting the environment variables are provided at this link: How to Install Java on Ubuntu. These instructions can also be used for other Linux distributions. Once this is done, continue with the next step.
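
On Ubuntu, one common route is installing OpenJDK from the standard repositories, which satisfies the JDK 1.8 requirement; the linked guide covers Oracle Java in more detail. A minimal sketch:

    sudo apt-get update
    # OpenJDK 8 corresponds to Java 1.8
    sudo apt-get install openjdk-8-jdk
    # verify the installation
    java -version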


Step-4: Launching Spoon

The last step is to launch the Spoon application. For this, go to the folder where we extracted PDI (the data-integration folder) in Step-2, right-click within the folder, select 'Open Terminal', and type the command below:

    ./spoon.sh

Fig 2. Starting Spoon Application

Alternatively, you can start Spoon as the super user by adding the 'sudo' prefix. This will ask for the system password.

    sudo ./spoon.sh

Fig 3. Starting Spoon Application with sudo command

Once the password is entered, the Spoon application will start.

Fig 4. Pentaho Data Integration Start Screen

Bonus Tip:

Once the Pentaho Data Integration tool is installed on Ubuntu, we launch the Spoon application, the GUI where we create ETL jobs and transformations. Instead of opening Spoon every time via a terminal window (which can be frustrating at times), we can mark the spoon.sh shell script as an executable program so that Spoon opens with a double click.

For this, go to the data-integration folder, select the spoon.sh file, and open the Preferences option at the top of the file manager.


Fig 5. Accessing Preferences in the File explorer


Go to the Behavior tab and set the options as per the screenshot below:

  • Select 'Double Click to open items'.
  • Enable 'Show action to create symbolic links'.
  • Select 'Ask what to do'.

After setting the options, close the window.




Fig 6. Setting Behavior in Preferences


Next, right-click the spoon.sh file, select Properties, go to the Permissions tab, enable the 'Allow executing file as program' checkbox, and close the window.
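
The same permission can be granted from a terminal, which is equivalent to ticking that checkbox; the extraction path below is a hypothetical example:

    cd ~/pentaho/data-integration
    # mark the Spoon launcher script as executable
    chmod +x spoon.sh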



Fig 7. Spoon.sh properties


Right-click the spoon.sh file and select 'Create Link'; a link (similar to a shortcut in Windows) will be created as shown below:



Fig 8. Create Link for Spoon.sh
Fig 9. Link created for Spoon.sh

Cut the link file and paste it on the Desktop. Select the file, press F2, and rename it by removing the 'Link to ' prefix.


Fig 10. Renaming Link to Spoon.sh file


Once again, right-click the Spoon.sh file on the desktop and select Properties. Click the small icon box; it will open a file manager window for selecting a new icon for the file.


Fig 11. Open the properties for Spoon.sh


Fig 12. Changing icon in Properties window
Navigate to the data-integration folder, choose the spoon.ico file, and close the window.


Fig 13. Choosing spoon.ico


Launch Spoon by double-clicking Spoon.sh on the Desktop; in the selection window that appears, click 'Run' and Spoon will open.


Fig 14. Final Spoon.sh Shortcut in the Desktop



Fig 15. Choosing Run option in File execution choice window


Quick Summary:


In this post, we have discussed:

1. How to install Pentaho Data Integration Community Edition on Ubuntu or any other Linux distribution.
2. The installation requires Java on the Linux machine. If it is not already installed, a link covering everything from downloading to configuring Java is given in Step-3.
3. As an additional tip, how to make a link (shortcut) to the Spoon application on the Desktop and run it from there instead of starting it from the terminal every time.

Hope this post is useful to you. Please leave a comment if it helped you or if you have any queries.


Install Pentaho Data Integration (CE) on Windows - A Detailed Step-by-Step Tutorial


Hi, fellow devs! Hope you had a wonderful day. Today's tutorial is all about how to download and install Pentaho Data Integration (Community Edition) on a Windows laptop or desktop.


For those who are new to the ETL process and PDI/Kettle, check my blog post:


When I was new to Pentaho, I had difficulty installing it, particularly the configuration part. I searched the internet and learned these things by trial and error, so I thought it would be worth sharing my experience.


The basic requirements are:

  • Pentaho Data Integration Community Edition
  • JRE 1.8 and above (Java Runtime Environment)
  • JDK 1.8 and above (Java Development Kit)
  • Windows 7 and above (Though PDI can be installed in Linux or Mac OS as well, the scope of this post is limited to Windows Operating System)

Step-1: Downloading the Pentaho Data Integration (PDI/Kettle) Software

The first step is to download the PDI Community Edition from the official Sourceforge download page. The latest version at the time of writing this post is 8.2, and the download is about 1.1 GB. It arrives as a zip file named 'pdi-ce-8.2.0.0-342.zip'.


Step-2: Extracting the zip file

Extract the downloaded zip file, which will be in the Downloads folder. Right-click the file and choose 'Extract Here' if you want it extracted in the Downloads folder.

If you want to choose a different folder, right-click, select the 'Extract Files...' option, and give the destination folder path. The default name of the extracted folder is 'data-integration'.


Step-3: Checking JRE version

The next step is to check the Java Runtime Environment version on your system. First, check whether your machine has Java installed; if not, download it from the official Java download page.


If it is already installed, check the JRE version. To do this:


Go to C:\Program Files\Java\ (64-bit Java on a 64-bit OS, or any Java on a 32-bit OS)
(or)
C:\Program Files (x86)\Java (32-bit Java installed on a 64-bit OS).


Fig 1. Screenshot showing the path containing JRE


Within this path, there will be a folder starting with 'jre' followed by a version number. This version should be 1.8 or higher; if so, we are good to proceed with the next step.


If your PC/laptop does not have a JRE, we need to download JRE 1.8 or a higher version from the official Oracle JRE download page and install it. The latest version at the time of writing this post is '1.9'. Download the correct file for your OS type (32-bit/64-bit), and select the file ending with the '.exe' extension, as it does not require any extraction tools.


Once the file is downloaded, run it and install the JRE.
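
As an alternative to browsing the folders, the installed Java version can also be checked from a Command Prompt, using the same command as on Linux; the version appears in the first line of the output:

    java -version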


Step-4: Checking JDK version

The next step is to check the version of the JDK on your Windows PC. This step is similar to the previous one.


Go to C:\Program Files\Java\


In the same folder, there will be a folder with a name similar to 'jdk1.8.0_191'. If you can find this folder, we are done with this step. If it is not there, we have to download the JDK from the official Oracle JDK download page.


As in the previous step, we need to download the correct file for our OS architecture (32-bit/64-bit); multiple formats of the file will be available. Choose the file that ends with the '.exe' extension and install the JDK by running the file after the download.

Fig 2. Screenshot showing the path containing JDK folder


Step-5: Setting the Environment Variables


The final step is to configure the environment variables to point to the JRE folder path.
  • Open My Computer (or press Win+E).
  • In the left side pane, right-click 'Computer' (Win 7) or 'This PC' (Win 10) and select the 'Properties' option. A new window will appear showing the processor, RAM capacity, computer name, etc.
  • Alternatively, navigating to Control Panel\System and Security\System brings up the same window.
  • In the left side pane, click 'Advanced system settings' to bring up the System Properties window. Go to the 'Advanced' tab and click the 'Environment Variables' button.
  • In the Environment Variables window, click the New button.
  • Enter 'PENTAHO_JAVA' (in upper case) as the variable name and the path to java.exe under the jre folder as the variable value, then click OK. Please refer to the screenshot below.

Fig 3. Creating Environment variable - PENTAHO_JAVA


Again, create a new variable in the same way, but this time set 'PENTAHO_JAVA_HOME' as the variable name and the jre folder path as the variable value, as in the screenshot given below.


Fig 4. Creating Environment variable - PENTAHO_JAVA_HOME

Finally, create another variable with the name 'JAVA_HOME'; its value is the path where the jdk folder is available (see Step-4).

Fig 5. Creating Environment variable -  JAVA_HOME
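
Putting the three variables together, they might look like this on a machine with 64-bit Java; the jre1.8.0_191/jdk1.8.0_191 version folders are examples modeled on the folder name mentioned in Step-4, so use whatever versions exist under your Java folder:

    PENTAHO_JAVA      = C:\Program Files\Java\jre1.8.0_191\bin\java.exe
    PENTAHO_JAVA_HOME = C:\Program Files\Java\jre1.8.0_191
    JAVA_HOME         = C:\Program Files\Java\jdk1.8.0_191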



Note:


In the step above, the example is given for the 64-bit version of Windows. If your machine runs a 32-bit OS, replace the 'C:\Program Files (x86)\Java' string with 'C:\Program Files\Java' in the values of the environment variables.


Restart your machine. Go to the 'data-integration' folder which we extracted in Step-2 and search for the file 'spoon.bat'. Double-click it and PDI will open.


You can also right-click the spoon.bat file and select Send to --> Desktop (create shortcut). This lets you open PDI right from the desktop instead of going to the data-integration folder every time.


It may be slow to start the first time, but it should not take more than 3-4 minutes. If you face difficulty with any of the above steps, please let me know in the comments area. Happy learning!


Troubleshooting


For Win 7 32-bit, if PDI does not open after configuring as per Step-5, try this method.


Go to the data-integration folder and select the spoon.bat file. Right-click it, select Rename, and change the file extension from '.bat' to '.txt'. Now we can edit it in a text editor like Notepad. Search for 'xmx256m' and replace it with 'xmx1024m' (this raises the Java maximum heap size from 256 MB to 1024 MB), then save the file.


Change the file extension back from '.txt' to '.bat' and double-click the spoon.bat file. It will open PDI, aka Kettle.


In case of any queries, please let me know in the comments section.