Installing Talend Open Studio is not complicated; the most important point is to ensure you have a compatible Java version configured in your system environment path and that Talend launches with that version.
As I have multiple versions of Java on my computer, I created a file "talend_studio_startup.bat" with the command below to ensure my Talend Open Studio launches with the compatible version of Java.
C:\Talend\TOS_DI-Win32-20200219_1130-V7.3.1\TOS_DI-win-x86_64.exe -vm "C:\Program Files\Java\jdk-11.0.12\bin"
Note: Try to import a delimited file under the repository metadata. If you encounter an error during the process, it is highly likely that your Java is either incompatible or Talend is not launching with the correct version of Java.
If you are able to import the file without errors, you have a working Talend Open Studio with a compatible Java configuration. This will save you a lot of time later on.
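A quick way to double-check which JVM Talend will pick up is to run the same Java binary from the command line (using the path passed to the -vm option above):
"C:\Program Files\Java\jdk-11.0.12\bin\java" -version
The reported version should match the one you configured for Talend.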
Let's start with creating a new empty project
- select Local connection
- Create New Project and name the project
Below is the user interface of Talend Open Studio
Create the Data Integration Job
A typical Data Integration Job involves 3 different stages:
1. Read Data from Sources
- e.g. import resources into project metadata (reusable)
- Work with structured files (File Schemas), databases (DB Connections) and real-time streaming (this deserves to be covered in another post)
2. Transform Data
- e.g. Join data from multiple data sources
- e.g. transform data with filters, expressions, aggregate functions, etc.
3. Collect Results and Generate Reports
A few best practices are recommended for the design of Talend jobs:
- Follow a standard naming convention, use job annotations, and clearly state the purpose of each job and its component settings; all of this feeds into the Job Documentation (generated and auto-maintained)
- Prepare the Contexts (Development, Test, Production, Disaster Recovery, etc.) and the relevant variables, and set up the database connection metadata, generic schemas, etc. before development begins. Use the Contexts, variables and metadata in the jobs so that updated variables propagate automatically. Also ensure you have all the JAR files for the Hadoop/Spark prerequisites.
- Design top-to-bottom, left-to-right bound flows, e.g. left = error path, right and/or downward = the desired path; this makes jobs easy to understand. Add notes, comments and descriptions wherever they fit to facilitate understanding of the jobs.
- Break big/complex jobs into smaller jobs with a clear purpose, using tRunJob for better control and execution as well as easier debugging, fixing and maintenance
- Limit parent/child job calls to 5 levels at most; typically 3 levels will be enough
Right click “Job Designs” and create a new Data integration job
Create a simple “Hello World” Job
Run the job (Ensure Exit Code = 0 for success run)
Let's create another 2 simple jobs for the demo:
- Load delimited Product files into MySQL
- Sales Report with tMap & tAggregateRow
Inputs – 1 main plus 1 lookup
Note: tMap has only 1 main input; the remaining sources will be lookups
Combine 2 fields into 1 field in the salesReport output using the tMap Expression Builder
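A tMap expression is plain Java. A minimal sketch, assuming hypothetical firstName and lastName fields on the main input row:
row1.firstName + " " + row1.lastName
The concatenated value is mapped to a single destination column in salesReport.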
Inner join the customerSales table's product id field with the product table's id field, and map the product name field to the salesReport destination
3 outputs from tMap:
- Sales Report Output
- Rejects output
Capture the "rejects" records and output them to the reject output table
- Aggregate sales group by product
Use tAggregateRow to aggregate the records grouped by product
Filter the records equal to New York (state code = "NY")
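A sketch of the filter condition, assuming a hypothetical state_code column on the incoming row (it works the same in a tMap output filter or a tFilterRow):
row1.state_code.equalsIgnoreCase("NY")
Only rows for which the expression returns true pass through.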
Output New York Sales Report in XML format
Use Data viewer to verify the XML output
Wrong Practices vs Best Practices
Wrong Practices in layout design
Best Practices in layout design
Generated and Auto-maintained Documentation
Add job notes, annotations, component settings, etc. for the final documentation as below:
Right click on the job => View Documentation
Generate the final documentation as HTML
Version control
Job => Edit Properties
M = major e.g. 0.2 => 1.0 or 1.2 => 2.0
m = minor e.g. 0.1 => 0.2
Data Integration Basic Demo
Working with source files
- Identify the source files properties
- Define Schema
- Read Files
Define the schema
Use Data Viewer to validate imported data
XML files
Edit the schema and import it from an XML Schema file
Update the Loop XPath query according to the source XML's structure
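A sketch, assuming a hypothetical source file shaped like <customers><customer><id>1</id><name>Ann</name></customer></customers>:
Loop XPath query: "/customers/customer"
Mapping: column id -> XPath query "id", column name -> XPath query "name"
The loop element should be whatever node repeats in your own file.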
Use Data Viewer to validate the imported data
JSON
Run the job and read data from JSON file
Validate the imported data from JSON
Databases
Ensure the schema is correct between source and destination
Run the job and write data to the customer database table
Reading from a MySQL database
The database settings are similar to those for writing, except that you can specify a query with conditions
Modify the SQL query with the SQL builder
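The query itself is held as a Java string in the component. A sketch with hypothetical table and column names:
"SELECT id, first_name, state FROM customer WHERE state = 'NY'"
The SQL Builder generates the base query; you can then refine the WHERE clause by hand.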
Run the job and read data from MySQL table
Best Practices for Open and Close Database Connection
Open Connection
Close Connection with Post Job
Run the job and validate that it runs successfully
Ensure the job finishes with exit code 0
Repository Metadata vs Built-in
Build the reusable metadata (e.g. delimited/XML/JSON file schema, Database connection, database schema and generic schema)
Ensure all the relevant components are in repository mode, which enables automatic propagation of updates upon changes to the repository metadata.
Built-in is for one-time-use metadata/schemas
Built-in Context variables
- Execution variables for the Development, Test, Production, Disaster Recovery environments, etc.
- File paths of inputs and outputs
- Prompt option available (never use it for production)
- Document the settings using the comment column in the context table
- Access context variables via "context.VariableName", e.g. context.RepoOutputDir
- Make use of "Export as Context" or tContextDump to build the context variables
- Choose the context before running a job
- Use tContextLoad to overwrite existing variables for a job (see the sketch after this list)
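For example, an output path built from a context variable (RepoOutputDir, from the list above) is just a Java expression:
context.RepoOutputDir + "/customers_out.csv"
A tContextLoad input is simply key/value rows, e.g. a delimited file with lines such as (hypothetical value; the separator depends on your file definition):
RepoOutputDir;D:/prod/output
Any key that matches an existing context variable overwrites its value at runtime.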
Demo of creating and using Context Variables
Below is a simple job to extract data from database
Create a context variable for the Development and Production output file paths
Use the context variable in a job by typing "context" and Ctrl+Space to get the list of context variables (in this case only 1 variable)
and input the output file name
Choose the Context in Development or Production to run the job
Validate the outputs in the folders
Connecting to Database via Context Variables
Export database connection variable as contexts
Create a new context based on the exported details
Create and modify the variables based on the Contexts
Upon completion of the context creation, the modifications will propagate to all the relevant jobs
Run the job using the context database connection
Run the job and ensure it exits with code 0
Building Executables and Docker Images from Data Integration Jobs
- Build and run a standalone job
Right click on a job and build
Extract the files from the zip and you will find 2 scripts (1 .bat for Windows and 1 .sh for Unix)
In a Windows environment, drag and drop the .bat file into a cmd prompt and you will see the job run standalone:
The same is applicable for Unix.
If you know how to run a Java job, you will find the command in the .bat and .sh files familiar.
java -Dtalend.component.manager.m2.repository="C:\StudentFiles\DIBasics\BuildingJobs\PrintCustomers_0.1\PrintCustomers/../lib" -Xms256M -Xmx1024M -cp .;../lib/routines.jar;../lib/log4j-slf4j-impl-2.12.1.jar;../lib/log4j-api-2.12.1.jar;../lib/log4j-core-2.12.1.jar;../lib/commons-lang3-3.8.1.jar;../lib/antlr-runtime-3.5.2.jar;../lib/accessors-smart-1.1.jar;../lib/audit-common-1.8.0.jar;../lib/org.talend.dataquality.parser.jar;../lib/slf4j-api-1.7.25.jar;../lib/dom4j-2.1.1.jar;../lib/audit-log4j2-1.8.0.jar;../lib/logging-event-layout-1.8.0.jar;../lib/asm-5.0.3.jar;../lib/job-audit.jar;../lib/json-smart-2.2.1.jar;../lib/crypto-utils.jar;../lib/talend_file_enhanced_20070724.jar;printcustomers_0_1.jar; local_project.printcustomers_0_1.PrintCustomers --context=Development
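Note the --context=Development argument at the end: you can switch contexts at launch time, and, if I remember the built-job options correctly, override individual variables with --context_param, e.g.:
PrintCustomers_run.bat --context=Production --context_param RepoOutputDir=D:\prod\out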
- Build a new version of an existing job
Click on the small "m" and create a 0.2 version
Now you have the new version 0.2
Build configuration for a new version
- Build a job as docker image and run in Docker container
Launch a cmd command prompt and ensure Docker is installed on the machine
>> docker -H dockerhost images
No images yet
>> docker -H dockerhost ps
No running container yet
Let's start building a Docker image of the Talend job
Now you can see the Docker image in your cmd:
>>docker -H dockerhost images
Run the Talend job Docker image in a container
>> docker -H dockerhost run printcustomers
Controlling Execution Demo
Managing files in Talend
Run a job similar to the above to generate a list of CSV files
Create a new job and import the Execution Control Context Group to the job.
Create a tFileList, a tIterateToFlow and a tLogRow
The job design should be similar to the below
1. Specify where the source files are located
2. Add the FilePath for the tIterateToFlow mapping table
select the flow “Current File Name with path”
the value should be as below:
((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))
Use tLogRow to print the output
Next, process all files and clean up upon a successful job run
Duplicate the previous job, replace the tIterateToFlow component with a tFileInputDelimited, and add a tUnite in between
No changes to tFileList, but copy the schema of the CSV files into tFileInputDelimited
Pass the schema to the tUnite component as well.
Run the job to check that it is OK
We are going to add 2 more flows to the above: 1 for archiving the processed files, the other for deleting the files after archiving
Archiving the files
Add tFileArchive and tWarn as below
Note: make sure your file names don't contain unacceptable symbols, e.g. ":", which will cause an error when archiving the files into a zip.
You may specify the folder name and the error message; if tFileArchive returns an error, it is captured by the tWarn component
Cleaning up the staging folder
Add tFileDelete, tFixedFlowInput and tLogRow respectively as below
tFileDelete linked upon Component OK
tFixedFlowInput linked upon Subjob OK
Add the schema with 1 line holding the log value
Finally, add the tLogRow to print the log message from the tFixedFlowInput component
Run the job and validate that it exits with code 0 and prints the message indicating the staging folder has been deleted
Next, we will try managing job execution using a master job
Create a master job and drag the previous 2 jobs onto the designer canvas, link them upon Subjob OK, then run the job and ensure it finishes successfully with exit code 0
With the above master job, you may configure the relevant context variables (e.g. development staging file path, production staging file path, etc.) for better execution control.
For enterprise collaboration, you might need to export the master job for your colleagues to work on other tasks.
Right click the master job and export the items
Export it as a zip with dependencies
Ensure the file is exported and working properly by testing import.
Handling Errors and Jobs Debugging
Talend recommends best practices for return codes in complex enterprise projects where large groups of developers work together.
https://help.talend.com/r/41_ybDITKyuD01jGtDFvnw/i_IZJ81kqUvR7y_xVsUJiA
Return Code Example
Define the Return Code as a 4-digit number where the first digit designates the Priority Level, which allows calling jobs to determine the nature of the Return Code (as organized above).
The second digit designates a System Level which identifies where the code was generated.
The last two digits designate a specific Condition Type which, when coupled with the first two digits, clearly isolates what has occurred to warrant the Return Code.
Where Priority Codes are defined as:
PRIORITY LEVEL | PRIORITY CODE
---|---
INFO | 3
WARNING | 4
ERROR | 5
FATAL | 6
Where System Codes are defined as:
SYSTEM LEVEL | SYSTEM CODE
---|---
Operating System | 1
Memory | 2
Storage | 3
Network | 4
Internet | 5
File System | 6
Database | 7
NoSQL | 8
Other | 9
Where Type Codes are defined as:
TYPE LEVEL | TYPE CODE
---|---
Permission | 01
Connection | 02
Locate | 03
Check | 04
Open | 05
Close | 06
Read | 07
Write | 08
Create | 09
Delete | 10
Rename | 11
Dump | 20
Load | 21
Get | 30
Put | 31
To further illustrate this example, here is how some Return Codes may be used:
RETURN CODE | SAMPLE MESSAGE
---|---
3605 | Open File Successful
4304 | Disk Space is Low
5701 | Invalid DB Login
6205 | Insufficient Memory
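If you adopt this scheme, a tiny shared Java routine keeps the codes consistent across jobs. A sketch (a hypothetical helper, not part of Talend's standard routines):
public static int buildReturnCode(int priority, int system, int type) {
    // e.g. buildReturnCode(5, 7, 1) -> 5701 (ERROR, Database, Permission: invalid DB login)
    return priority * 1000 + system * 100 + type;
}
A calling job can then recover the priority with returnCode / 1000 and react accordingly.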
Therefore, try to maintain these best practices for better collaboration among developers.
Below is a simple demo of detecting and handling a "no input files" error
In the tDie component, you can customize a precise error message
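The Die message field accepts a Java expression, so you can embed context values. A sketch, assuming a hypothetical context.StagingDir variable:
"No input files found in " + context.StagingDir
This is the message that appears in the log when the tDie is triggered.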
You may also set the log level in the advanced settings for a better understanding of the error
Run the job and you can check the error message captured
Another common way of capturing errors is via the "Run If" connection, specifying the condition that raises the warning
For example, when the output records are fewer than the input records
Use Ctrl+Space to fill in the condition as below
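The condition is a Java boolean over globalMap values. A sketch using the NB_LINE counters (the component names depend on your job):
((Integer)globalMap.get("tFileOutputDelimited_1_NB_LINE")) < ((Integer)globalMap.get("tFileInputDelimited_1_NB_LINE"))
The "Run If" branch fires only when the expression evaluates to true.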
Run the job and you will get the warning message
Next, on to debugging some common job design and run errors
When a job fails to run, there will be error messages. More often than not, there is 1 common simple error which you can quickly identify and fix: the syntax of the file path, file name, volume label, etc.
Example 1:
The error looks complicated, but actually it is just a missing " at the end of your input file path
Click on the Code tab, then click on the red indicator, and you will find that a missing " is causing the error
Adding the " at the end of the file path will fix the issue
Another syntax issue is one you may not find in the Code tab, and the job runs with an exit code of 0!
You will find nothing wrong in your Code tab
This might be caused by an unexpected symbol (e.g. ":") generated by the Ctrl+Space function in your output path
Just remove the ":" between the hours and the minutes, and the job will run without any issue.
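For example, the default date pattern suggested by Ctrl+Space contains colons, which are illegal in Windows file names; a colon-free variant (a sketch, assuming a hypothetical context.outputDir variable) is:
context.outputDir + "/report_" + TalendDate.getDate("CCYYMMDD_hhmmss") + ".xml"
It is the TalendDate.getDate default pattern "CCYY-MM-DD hh:mm:ss" that introduces the ":" characters.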
Working with Web Services
Add tFixedFlowInput, tXMLMap, tESBConsumer and tLogRow
Configure the tESBConsumer with the endpoint WSDL
The schema will be generated once you have finished configuring the tESBConsumer component
The full job design should be as below
tFixedFlowInput with a constant input of the 94022 US zip code
Connect it to tXMLMap upon Main
tXMLMap should connect to tESBConsumer upon Main; import the schema from the repository into the output table payload
Finally, connect the input table to the output table as below
Run the job and you will get the response
I have another post with regards to the best practices for Talend Job Design and Documentation – https://farmoutain.wordpress.com/2021/10/01/talend-development-guideline-and-best-practices/