Talend Studio Demonstration and Best Practices Implementation


Installation of Talend Open Studio is not complicated; the most important point is to ensure that a compatible Java version is configured in your system's PATH environment variable and that Talend launches with that version.

As I have multiple versions of Java on my computer, I created a file “talend_studio_startup.bat” with the command below to ensure my Talend Open Studio launches with the compatible version of Java.

C:\Talend\TOS_DI-Win32-20200219_1130-V7.3.1\TOS_DI-win-x86_64.exe -vm "C:\Program Files\Java\jdk-11.0.12\bin"

Note: Try importing a delimited file under the repository metadata. If you encounter an error during the process, it is highly likely that your Java version is incompatible or that Talend was not launched with the correct version of Java.

If you are able to import the file without error, you have a working Talend Open Studio with a compatible Java configuration. This will save you a lot of time later on.

Let’s start by creating a new empty project:

  • Select Local connection
  • Create a new project and name the project

Below is the user interface of Talend Open Studio

Create the Data Integration Job

A typical Data Integration Job involves 3 different stages:

  1. Read Data from Sources
  • e.g. import resources into the project metadata (reusable)
  • Works with structured files (File Schemas), databases (DB Connections), and real-time streaming (which deserves its own post)

  2. Transform Data
  • e.g. join data from multiple data sources
  • e.g. transform data with filters, expressions, aggregate functions, etc.

  3. Collect Results and Generate Reports

A few recommended best practices for the design of Talend jobs:

  1. Follow a standard naming convention, annotate jobs, and clearly state the purpose of each job and its component settings; these feed into the Job Documentation (generated and auto-maintained).
  2. Prepare the Contexts (Development, Test, Production, Disaster Recovery, etc.) and the relevant variables, and set up the database connection metadata, generic schemas, etc. before development begins. Use the Contexts, variables and metadata in the jobs so that updated variables propagate automatically. Also ensure you have all the JAR files for the Hadoop/Spark prerequisites.
  3. Design flows top-to-bottom and left-to-right, e.g. left = error path, right and/or downward = the desired path; this makes jobs easy to understand. Add notes, comments and descriptions wherever deemed fit to facilitate understanding of the jobs.
  4. Break big/complex jobs into smaller jobs with a clear purpose, using tRunJob for better control and execution as well as easier debugging, fixing and maintenance.
  5. Limit parent/child job calls to a maximum of 5 levels; typically 3 levels will be enough.

Right-click “Job Designs” and create a new Data Integration job

Create a simple “Hello World” Job

Input

Output

Run the job (Ensure Exit Code = 0 for success run)
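For reference, here is a hedged plain-Java sketch of what this job boils down to (assuming an input component feeding a “Hello World” string into a tLogRow; the class name is hypothetical):

public class HelloWorldJob {
    public static void main(String[] args) {
        String message = "Hello World"; // the constant supplied by the input component
        System.out.println(message);    // tLogRow prints each row to the Run console
        System.exit(0);                 // exit code 0 signals a successful run
    }
}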

Let’s create another 2 simple jobs for the demo:

  1. Load delimited Product files into MySQL

Input

Output

Run the job

  2. Sales Report with tMap & tAggregateRow

Inputs – 1 main plus 1 lookup
Note: tMap has only 1 main input; the remaining sources will be lookups

Main source

Lookup source

Transformation – tMap

Combine 2 fields into 1 field in the salesReport using the tMap Expression Builder

Inner-join the customerSales table’s product id field with the product table’s id field, and map the product name field to the salesReport destination
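As a hedged illustration (the actual column names depend on your schemas), the Expression Builder entry for the combined field might look like:

row1.first_name + " " + row1.last_name   // hypothetical tMap output expression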

3 outputs from tMap:

  1. Sales Report Output

  2. Rejects output

Capture the “rejects” records and output them to the reject output table

  3. Aggregate sales grouped by product

Use tAggregateRow to aggregate the records grouped by product

Filter for the records equal to New York (state code = ny)
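A hedged example of what that filter expression might look like (assuming the state code column is named state_code):

"ny".equalsIgnoreCase(row2.state_code)   // hypothetical filter: keep only New York rows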

Output New York Sales Report in XML format

Run the job

Use Data viewer to verify the XML output

Wrong Practices vs Best Practices

Wrong Practices in layout design

Best Practices in layout design

Generated and Auto-maintained Documentation
Add job notes, annotations, component settings, etc. for the final documentation as below:
Right-click on the job => View Documentation

Generate the final documentation as HTML

Version control
Job => Edit properties

M = major, e.g. 0.2 => 1.0 or 1.2 => 2.0
m = minor, e.g. 0.1 => 0.2

Data Integration Basic Demo

Working with source files

  1. Identify the source file properties
  2. Define the schema
  3. Read the files

CSV files

Define the schema

Read from the CSV file

Use Data Viewer to validate imported data

XML files

Edit the schema and import it from an XML Schema file

Update the Loop XPath query according to the source XML’s XPath
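For illustration, a hedged example assuming the source XML nests customer elements under a customers root:

"/customers/customer"   // hypothetical Loop XPath query for tFileInputXML
// per-column XPath queries are then relative to the loop, e.g. "id" and "name"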

Read data from XML

Use Data Viewer to validate the imported data

JSON
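The configuration mirrors the XML case; a hedged example, assuming the file holds a top-level customers array:

"$.customers[*]"   // hypothetical JsonPath loop query for tFileInputJSON
// per-column queries are then relative to the loop, e.g. "id" and "name"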

Run the job and read data from JSON file

Validate the imported data from JSON

Databases

Ensure the schema is correct between source and destination

Run the job and write data to the customer database table

Reading from a MySQL database
The database settings are similar to those for writing, except that you can specify a query with conditions
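A hedged example of what the component’s Query field might contain (table and column names are hypothetical; note that in Talend the query is held as a Java string):

"SELECT id, first_name, last_name, state FROM customer WHERE state = 'NY'"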

Modify the SQL query with the SQL builder

Run the job and read data from MySQL table

Best Practices for Open and Close Database Connection
Open Connection

Close Connection with Post Job

Run the job and validate that it runs successfully

Ensure the job runs successfully with exit code 0

Repository Metadata vs Built-in
Build the reusable metadata (e.g. delimited/XML/JSON file schemas, database connections, database schemas and generic schemas)

Ensure all the relevant components are in repository mode, which enables auto-propagation of updates upon changes to the repository metadata.

Built-in is for one-time-use metadata/schemas

Built-in Context variables

  • Execution variables for Development, Test, Production, Disaster Recovery environments, etc.
  • File paths for input and output
  • Prompt option available (never use in production)
  • Document settings using the comment column in the context table
  • Access context variables via “context.VariableName”, e.g. context.RepoOutputDir (see the example after this list)
  • Make use of “Export as Context” or tContextDump to build the context variables
  • Choose the context before running a job
  • Use tContextLoad to overwrite existing variables for a job
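For example, a hedged use of context.RepoOutputDir in a tFileOutputDelimited File Name field (the file name is hypothetical), followed by the kind of key;value line a tContextLoad flow might read:

context.RepoOutputDir + "/customers_out.csv"   // hypothetical File Name expression
// hypothetical tContextLoad source line:  RepoOutputDir;/data/prod/output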

Demo of creating and using Context Variables
Below is a simple job to extract data from a database

Go to Context tab

Create 2 Contexts for demo

Create a Context Variable for Development and Production Output file Path

Production context

Use the context variable in a job by typing “context” and pressing Ctrl + Space to get the list of context variables (in this case only 1 variable)

and input the output file name

Choose the Context in Development or Production to run the job

Run the job in both contexts

Validate the outputs in the folders

file in Dev Context

file in Prod Context

Connecting to Database via Context Variables
Export the database connection variables as a context

Create a new context based on the exported details

Create and modify the variables based on the contexts
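As a hedged illustration, “Export as Context” typically generates one variable per connection setting; for a hypothetical connection named MyDB, the component fields would then reference variables such as:

context.MyDB_Server     // Host
context.MyDB_Port       // Port
context.MyDB_Database   // Database
context.MyDB_Login      // Username
context.MyDB_Password   // Password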

Upon completion of the context creation, the modifications will propagate to all the relevant jobs

Run the job using the context database connection

Run the job and ensure it exits with code 0

Building Executables and Docker Images from Data Integration Jobs

  • Build and run a standalone job

Right click on a job and build

Configure the build job

Extract the files from the zip and you will find 2 files (one .bat for Windows and one .sh for Unix)

In a Windows environment, drag and drop the .bat file into a cmd prompt and you will see the job run standalone:

The same is applicable for Unix.

If you know how to run a Java job, you will find the command in the .bat and .sh files familiar.

java -Dtalend.component.manager.m2.repository="C:\StudentFiles\DIBasics\BuildingJobs\PrintCustomers_0.1\PrintCustomers/../lib" -Xms256M -Xmx1024M -cp .;../lib/routines.jar;../lib/log4j-slf4j-impl-2.12.1.jar;../lib/log4j-api-2.12.1.jar;../lib/log4j-core-2.12.1.jar;../lib/commons-lang3-3.8.1.jar;../lib/antlr-runtime-3.5.2.jar;../lib/accessors-smart-1.1.jar;../lib/audit-common-1.8.0.jar;../lib/org.talend.dataquality.parser.jar;../lib/slf4j-api-1.7.25.jar;../lib/dom4j-2.1.1.jar;../lib/audit-log4j2-1.8.0.jar;../lib/logging-event-layout-1.8.0.jar;../lib/asm-5.0.3.jar;../lib/job-audit.jar;../lib/json-smart-2.2.1.jar;../lib/crypto-utils.jar;../lib/talend_file_enhanced_20070724.jar;printcustomers_0_1.jar; local_project.printcustomers_0_1.PrintCustomers --context=Development

  • Build a new version of an existing job

Click on the small “m” and create a 0.2 version

Now you have the new version 0.2
Build configuration for a new version

  • Build a job as a Docker image and run it in a Docker container

Launch a cmd command prompt and ensure Docker is installed on the machine

>> docker --help

>> docker -H dockerhost info

>> docker -H dockerhost images
No images yet
>> docker -H dockerhost ps
No running containers yet

Let’s start building a Talend job as a Docker image

Now you can see the Docker image in your cmd:
>> docker -H dockerhost images

Run the Talend job Docker image in a container
>> docker -H dockerhost run printcustomers

Controlling Execution Demo
Managing files in Talend

Run a job similar to the above to generate a list of CSV files

Create a new job and import the Execution Control Context Group to the job.

Create a tFileList, tIterateToFlow, and tLogRow

The job design should be similar to the below

  1. Specify where the source files are located

  2. Add the FilePath to the tIterateToFlow mapping table

Refer to the option below

Select the flow “Current File Name with path”

The value should be as below:
((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))

Use tLogRow to print the output

The output should be as below

Next, process all the files and clean up upon a successful run of the job
Duplicate the previous job, replace the tIterateToFlow component with a tFileInputDelimited, and add a tUnite in between

No changes to tFileList, but copy the schema of the CSV files into tFileInputDelimited

Pass the schema to the tUnite component as well.

Run the job to check that it is OK

We are going to add 2 more flows to the above: 1 for archiving the processed files, the other for deleting the files after archiving

Archiving the files
Add tFileArchive and tWarn as below

Note: make sure your file name doesn’t contain unacceptable symbols, e.g. “:”, which will cause an error when archiving the files into a zip.
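A hedged way to build a colon-free, timestamped zip name in the tFileArchive settings, using Talend’s built-in TalendDate routine (the context variable is hypothetical):

context.ArchiveDir + "/archive_" + TalendDate.getDate("CCYYMMDD-hhmmss") + ".zip"   // e.g. archive_20211001-143005.zip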

You may specify the folder name, and the error message for when tFileArchive returns an error, which is captured with the tWarn component

Cleaning up the staging folder
Add tFileDelete, tFixedFlowInput and tLogRow respectively as below

Link tFileDelete with an On Component OK trigger

Link tFixedFlowInput with an On Subjob OK trigger
and add a schema with 1 line holding the log value

Finally, add the tLogRow to print the log message from the tFixedFlowInput component

Run the job and validate that it exits with code 0 and prints the message indicating the staging folder has been deleted

Next, we will try managing job execution using a master job

Create a master job and drag the previous 2 jobs onto the designer canvas, link them with On Subjob OK triggers, then run the job and ensure it runs successfully with exit code 0

With the above master job, you may configure the relevant context variables (e.g. development staging file path, production staging file path, etc.) for better execution control.

For Enterprise collaboration, you will often need to export the master job for your colleagues to use in other tasks.

Right-click the master job and select Export items

Export it as a zip with dependencies

Ensure the file is exported and working properly by testing import.

Handling Errors and Jobs Debugging

Talend has recommended best practices for return codes in complex Enterprise projects where large groups of developers work together.
https://help.talend.com/r/41_ybDITKyuD01jGtDFvnw/i_IZJ81kqUvR7y_xVsUJiA

Return Code Example

Have the Return Code defined as a 4-digit number where the first digit designates the Priority Level, which allows calling jobs to determine the nature of the Return Code (as organized above).

The second digit designates a System Level which identifies where the code was generated.

The last two digits designate a specific Condition Type which, when coupled with the first two digits, clearly isolates what has occurred to warrant the Return Code.

Where Priority Codes are defined as:

PRIORITY LEVEL PRIORITY CODE
INFO 3
WARNING 4
ERROR 5
FATAL 6

Where System Codes are defined as:

SYSTEM LEVEL SYSTEM CODE
Operating System 1
Memory 2
Storage 3
Network 4
Internet 5
File System 6
Database 7
NoSQL 8
Other 9

Where Type Codes are defined as:

TYPE LEVEL TYPE CODE
Permission 01
Connection 02
Locate 03
Check 04
Open 05
Close 06
Read 07
Write 08
Create 09
Delete 10
Rename 11
Dump 20
Load 21
Get 30
Put 31

To further illustrate this example, here is how some Return Codes may be used:

RETURN CODE SAMPLE MESSAGE
3605 Open File Successful
4304 Disk Space is Low
5701 Invalid DB Login
6205 Insufficient Memory

Therefore, try to maintain these best practices for better collaboration among developers.
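As a hedged sketch (not something Talend provides out of the box), the scheme above can be composed in plain Java and passed to a tDie error code or a tWarn message:

public class ReturnCodes {
    // priority, system and type codes from the tables above
    public static final int INFO = 3, WARNING = 4, ERROR = 5, FATAL = 6;
    public static final int MEMORY = 2, STORAGE = 3, FILE_SYSTEM = 6, DATABASE = 7;
    public static final int PERMISSION = 1, CHECK = 4, OPEN = 5;

    // first digit = priority, second digit = system, last two digits = condition type
    public static int compose(int priority, int system, int type) {
        return priority * 1000 + system * 100 + type;
    }

    public static void main(String[] args) {
        System.out.println(compose(INFO, FILE_SYSTEM, OPEN));     // 3605 Open File Successful
        System.out.println(compose(ERROR, DATABASE, PERMISSION)); // 5701 Invalid DB Login
    }
}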

Below is a simple demo of detecting and handling a “no input files” error

In the tDie, you can customize a precise error message

You may also set the log level in the advanced settings to better understand the error

Run the job and you can check the captured error message

Another common way of capturing errors is via the “Run If” connection, specifying the condition under which a warning appears

For example, when the output records are fewer than the input records

Use Ctrl + Space to fill in the condition as below
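A hedged example of such a condition, comparing the NB_LINE counters that components publish to globalMap (the component names depend on your job):

((Integer)globalMap.get("tFileOutputDelimited_1_NB_LINE")) < ((Integer)globalMap.get("tFileInputDelimited_1_NB_LINE"))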

Customize the warning message

Run the job and you will get the warning message

Next, on to debugging some common job design run errors
When a job fails to run, there will be error messages. More often than not, there is one common simple error which you can quickly identify and fix: the syntax of file paths, file names, volume labels, etc.

Example 1:
The error looks complicated, but actually it is just a missing ” at the end of your input file path

Click on the Code tab, then click on the red indicator, and you will find that a missing ” is causing the error

Adding the ” at the end of the file path will fix the issue

Another syntax issue may not show up in the Code tab, and the job still runs with an exit code 0!

And you will find nothing wrong in your code tab

This might be caused by an unexpected symbol (e.g. “:”) generated by the Ctrl + Space function in your output path

Just remove the “:” between the hours and minutes, and the job will run without any issue.

Working with Web Services

Add tFixedFlowInput, tXMLMap, tESBConsumer, and tLogRow

Configure the tESBConsumer with the endpoint WSDL

The schema will be generated once you have finished configuring the tESBConsumer component

The full job design should be as below

tFixedFlowInput with a constant input of the 94022 US zip code,
connected to tXMLMap via Main

tXMLMap should connect to tESBConsumer via Main; import the schema from the repository into the output table payload

and finally connect the input table to the output table as below

Run the job and you will get the response

I have another post with regards to the best practices for Talend Job Design and Documentation – https://farmoutain.wordpress.com/2021/10/01/talend-development-guideline-and-best-practices/
