Talend Development Guideline and Best Practices

Talend Development Guidelines

https://www.talend.com/blog/2015/12/07/talend-job-design-patterns-and-best-practices/

Methodologies

  • Data Modeling / Schema – Conceptual, Logical & Physical for Databases, NoSQL, Files, EDW
  • SDLC process control – Waterfall or Agile, requirements & Specifications
  • Error Handling & Auditing
  • Data Governance & Stewardship

Technologies & Tools

  • OS & Infrastructure Topology
  • DB Management Systems
  • NoSQL Systems
  • Encryption & Compression
  • 3rd Party Software Integration
  • Web Service Interfaces
  • External Systems Interfaces
    • Note: for example, Kafka, Spark, and HDFS need the relevant JAR files to be in place before the Talend Big Data jobs can use them

Best Practices

  • Environments (DEV/QA/UAT/PROD)
  • Naming Conventions
    • e.g. all jobs organized into repository folders with meaningful names that make sense for your projects
  • Projects & Jobs & Joblets
  • Repository Objects
  • Logging, Monitoring & Notifications
    • e.g. adopt the same style of logging messages, perhaps using a common method wrapper around the System.out.println() function
  • Job Return Codes
  • Code (Java) Routines
  • Context Groups & Global Variables
  • Database & NoSQL Connections
  • Source/Target Data & Files Schemas
  • Job Entry & Exit Points
    • e.g. establish common entry/exit point criteria for job code, with options for alternative requirements (together with a common logging style, this helps realize several precepts at once)
  • Job Workflow & Layout
  • Component Utilization
  • Parallelization
  • Data Quality
  • Parent/Child Jobs & Joblets
  • Data Exchange Protocols
  • Continuous Integration & Deployment
    • Integrated Source Code Control (SVN/GIT)
    • Release Management & Versioning
    • Automated Testing
    • Artifact Repository & Promotion
  • Administration & Operations
    • Configuration
    • User Security & Authorizations
    • Roles & Permissions
    • Project Management
    • Job Tasks, Schedules, & Triggers
  • Archives & Disaster Recovery

Additional Documents

  • Module Library: describing all reusable projects, methods, objects, joblets, & context groups
  • Data Dictionary: describing all data schemas & related stored procedures
  • Data Access Layer: describing all things pertinent to connecting to and manipulating data

Additional examples of best practices (to make jobs easy to read and understand):

  1. Canvas Workflow & Layout
    Start ‘top to bottom’, then work ‘left and right’, where a left-bound flow is generally an error path and a right- and/or downward-bound flow is the desired, or normal, path.
  2. Atomic Job Modules — Parent/Child Jobs

Break big jobs (those with a lot of components) down into smaller jobs, or units of work, wherever possible. Then execute them as child jobs from a parent job (using the tRunJob component) whose purpose includes controlling and executing them.
Smaller jobs with a clear purpose make their intent obvious on the canvas, are almost always easy to debug and fix, and are comparatively a breeze to maintain.
While it is perfectly acceptable to create nested Parent/Child job hierarchies, there are practical limitations to consider. Depending upon job memory utilization, passed parameters, test/debug concerns, and parallelization techniques, a good job design pattern should not exceed 3 nested levels of tRunJob Parent/Child calls. While it is perhaps safe to go deeper, I think that, even with good reasons, 5 levels should be more than enough for any use case.

  3. tRunJob vs Joblets

The simple difference between a child job and a joblet is that a child job is ‘Called’ from your job, while a joblet is ‘Included’ in your job. Both offer the opportunity to create reusable and/or generic code modules. A highly effective strategy in any Job Design Pattern is to incorporate both properly.

  4. Entry and Exit Points

A good Job Design Pattern is to use the tPreJob component to initialize context variables, establish connections, and log important information, and to use the tPostJob component to close connections, perform other important cleanup, and do more logging.
The tWarn and tDie components should also be part of your consideration for job entry and exit points. These components provide programmable control over where and how a job should complete. They also support improved error handling, logging, and recovery opportunities.

  5. Error Handling and Logging

Create a ‘logPROCESSING’ joblet as a consistent, maintainable logging processor that can be included in any job, and incorporate well-defined ‘Return Codes’ that offer conformity, reusability, and high efficiency.
Recent versions of Talend have added support for Log4j and a Log Server. Simply enable the Project Settings > Log4j menu option and configure the Logstash server in the TAC (Talend Administration Center). Incorporating this basic functionality into your jobs is definitely a good practice.
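
As a rough illustration of the common logging wrapper idea, the sketch below shows what a shared Java routine might look like. The class name JobLog, the field layout, and the job name job_orders_load are assumptions made for this example; they are not part of Talend or of the ‘logPROCESSING’ joblet itself.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical shared routine; the name and layout are illustrative only. */
final class JobLog {
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    private JobLog() {}

    /** Builds one line in a fixed layout: timestamp | level | job | code | message. */
    static String format(LocalDateTime when, String level, String jobName,
                         int returnCode, String message) {
        return String.format(Locale.ROOT, "%s | %-7s | %s | %04d | %s",
                when.format(TS), level, jobName, returnCode, message);
    }

    /** The single place where jobs write to standard output. */
    static void log(String level, String jobName, int returnCode, String message) {
        System.out.println(format(LocalDateTime.now(), level, jobName, returnCode, message));
    }

    public static void main(String[] args) {
        log("INFO", "job_orders_load", 3605, "Open File Successful");
    }
}
```

Because every job calls the same method, the message layout can later be changed (or redirected to Log4j) in one place.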

Talend Tests Design Best Practices
https://help.talend.com/r/en-US/Cloud/software-dev-lifecycle-best-practices-guide/ci-test-cases

Talend recommends using the Test Case feature: it automatically creates a Test Case with a skeleton in a Test Instance.

A Test Case is an executable test that consists of an immutable part extracted from the initial Job or Route, along with other editable components that form the skeleton of the Test Case.

A Test Instance is a set of data that allows you to run the Test Case with different parameters that you define (input, reference files, etc.).

Note: When building and deploying your project, Test Cases are generated as JUnit files and are therefore built before the Maven packaging phase.

Best practices:

  • It is recommended to create and use a context adapted to your environment (a Test context to execute test Jobs and Routes with the metadata of this environment, and a Production context to execute Jobs in the Production environment).
  • When the feature is designed and tested, it is recommended to use the Talend Artifact Repository (Nexus, Artifactory) to publish items and retrieve them in the QA and Production environments via the Talend Cloud Public API. See Deploying to QA and Production environments for more information.

Talend CI/CD Best Practices
https://help.talend.com/r/en-US/Cloud/software-dev-lifecycle-best-practices-guide/ci-build

Best practices: To ensure continuous integration during development and to help developers design and build consistent, efficient, and optimised artifacts, here are some best practices we recommend following:

Concept | Best practice example
Naming standards | In the Studio, define a naming convention for Jobs, Routes, or Data Services and for folders, and follow it.

In this document, the naming convention is the following, but feel free to adapt it to your requirements: job_, route_, and service_ prefixes for Job, Route, and Data Service names respectively; a test_ prefix for Test Case names; a pub_ prefix for publishing task names; and a task_ prefix for execution task names.

For example, name your folder xxx. Folders should be used to group Jobs of a similar type. Then create a Job named job_xxx_description and its Test case named test_xxx_description.

At a more granular level, components should also have a meaningful name.

At the project level, name your project in upper case; otherwise the build might fail.

Warning: If you are working on a Git-managed project, do not use any of the following reserved keywords to name your Job or Job folder:

  • tests
  • target
  • src

If any of the above-mentioned keywords is used in the name of a Job, a Job folder, or any level of its parent folders, changes to your Job or to the Jobs in the folder will not get pushed to Git.
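
The naming rules above could be checked automatically before a commit. The following is a minimal sketch under stated assumptions: the class NamingCheck is hypothetical, folder paths are taken to be '/'-separated, and a reserved keyword is treated as a violation only when it matches a whole folder or Job name (case-insensitively). The source does not say whether partial matches are also affected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical pre-commit check; not a Talend API. */
final class NamingCheck {
    // Reserved keywords listed in the warning above.
    private static final Set<String> RESERVED = Set.of("tests", "target", "src");

    private NamingCheck() {}

    /** Returns the problems found for a Job stored at folderPath; an empty list means none. */
    static List<String> check(String folderPath, String jobName) {
        List<String> problems = new ArrayList<>();
        if (!jobName.startsWith("job_")) {
            problems.add("Job name should start with job_: " + jobName);
        }
        if (RESERVED.contains(jobName.toLowerCase(Locale.ROOT))) {
            problems.add("Job name is a reserved keyword: " + jobName);
        }
        // Assumption: every level of the parent folders is one '/'-separated segment.
        for (String segment : folderPath.split("/")) {
            if (RESERVED.contains(segment.toLowerCase(Locale.ROOT))) {
                problems.add("Folder name is a reserved keyword: " + segment);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(check("etl/src/orders", "job_orders_load"));
    }
}
```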

Version control | Use Git branches and tags as well as the Studio to handle artifact versions.

For more information on how to change the version of your artifacts centrally at once to publish them with the version of your choice, see Changing the deployment version of each artifact at once.

Project identifier | When first connecting to the project in the Studio, edit the Studio parameters to set the project identifier (groupID) that will be used at deployment time.

For more information on how to set this project identifier, see Changing the deployment identifier of the project at once.

Metadata | Use schema metadata in your Jobs, Routes, or Data Services to share database connections between several artifacts and to help design source/target components.

Contexts | Use contexts to reuse variables (context parameters locally for artifacts, context groups globally for projects) such as database connectivity, host names, ports, etc. If values need to be changed or are used in multiple places, they should not be hard-coded; use contexts instead.

These contexts are also useful to switch between environments (Development context then QA context then Production context).
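
To make the idea concrete outside the Studio, here is a small sketch of environment-specific values being selected by name instead of being hard-coded. Everything in it is hypothetical: the class ContextSketch, the keys db_host, db_port, and db_name, the host names, and the PostgreSQL URL format are illustrative assumptions. In a real Job the values would come from Talend context variables.

```java
import java.util.Map;

/** Hypothetical stand-in for context groups; values are made up for illustration. */
final class ContextSketch {
    static final Map<String, Map<String, String>> CONTEXTS = Map.of(
            "DEV", Map.of("db_host", "dev-db.example.internal", "db_port", "5432", "db_name", "sales"),
            "QA", Map.of("db_host", "qa-db.example.internal", "db_port", "5432", "db_name", "sales"),
            "PROD", Map.of("db_host", "prod-db.example.internal", "db_port", "5432", "db_name", "sales"));

    private ContextSketch() {}

    /** Picks the context for one environment and fails fast on an unknown name. */
    static Map<String, String> forEnvironment(String env) {
        Map<String, String> ctx = CONTEXTS.get(env);
        if (ctx == null) {
            throw new IllegalArgumentException("Unknown environment: " + env);
        }
        return ctx;
    }

    /** Builds the connection URL only from context values, never from literals in the job. */
    static String jdbcUrl(Map<String, String> ctx) {
        return "jdbc:postgresql://" + ctx.get("db_host") + ":" + ctx.get("db_port")
                + "/" + ctx.get("db_name");
    }

    public static void main(String[] args) {
        System.out.println(jdbcUrl(forEnvironment("DEV")));
    }
}
```

Switching from Development to QA or Production then changes only the environment name, not the job logic.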

Standard Job layout | Use a standard Job layout to ensure readability; this is particularly useful for collaborative work.

Some examples include: putting data flows from left to right, top-to-bottom layout to show the process flow between subJobs, target components on the right, etc.

Complexity | Jobs should follow a logical sequence and be split into steps, called subJobs, when necessary. It is also recommended to use parent Jobs to run one or several child Jobs in order to create a process flow. Although there is no hard limit, you should avoid using more than 20 components in a Job.

Once the artifact is designed in a remote project from the Studio or the CommandLine, it can be published, deployed, and executed in Talend Cloud. Exporting it as an artifact also helps you run quality assurance tests on exactly the same Jobs as those created in the Development environment.

For more information, see Deploying to QA and Production environments.

Building and Deploying
https://help.talend.com/r/en-US/Cloud/software-dev-lifecycle-best-practices-guide/ci-build

Talend offers several ways to publish your project artifacts to Talend Cloud or to artifact repositories (Nexus, Artifactory) and to schedule their executions, so you can choose the one that best suits your needs.

In a continuous integration environment, it is common practice to launch tests at every commit. By default, a new commit is made every time you save artifacts (Git commit mode).

Appendix – Return Code Samples
https://help.talend.com/r/41_ybDITKyuD01jGtDFvnw/i_IZJ81kqUvR7y_xVsUJiA

It is highly recommended that jobs use the tDie/tWarn components and that, in conjunction with the Priority Levels, well-defined Return Codes are established and enforced.

These can be simple binary values that return 1=true/0=false, or 1=unsuccessful/0=successful. They can also be complex, where the returned values encode a specialized meaning about the condition of the job or component.

The following example of a complex Return Code offers one possible convention.
Return Code Example

Define the Return Code as a 4-digit number whose first digit designates the Priority Level, which allows calling jobs to determine the nature of the Return Code (see the Priority Codes listed below).

The second digit designates a System Level which identifies where the code was generated.

The last two digits designate a specific Condition Type which, when coupled with the first two digits, clearly isolates what has occurred to warrant the Return Code.

Where Priority Codes are defined as:

PRIORITY LEVEL | PRIORITY CODE
INFO | 3
WARNING | 4
ERROR | 5
FATAL | 6

Where System Codes are defined as:

SYSTEM LEVEL | SYSTEM CODE
Operating System | 1
Memory | 2
Storage | 3
Network | 4
Internet | 5
File System | 6
Database | 7
NoSQL | 8
Other | 9

Where Type Codes are defined as:

TYPE LEVEL | TYPE CODE
Permission | 01
Connection | 02
Locate | 03
Check | 04
Open | 05
Close | 06
Read | 07
Write | 08
Create | 09
Delete | 10
Rename | 11
Dump | 20
Load | 21
Get | 30
Put | 31

To further illustrate this example, here is how some Return Codes may be used:

RETURN CODE | SAMPLE MESSAGE
3605 | Open File Successful
4304 | Disk Space is Low
5701 | Invalid DB Login
6205 | Insufficient Memory
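
The 4-digit convention is simple arithmetic: code = priority × 1000 + system × 100 + type. The sketch below composes and decomposes such codes. The class ReturnCodes and its method names are assumptions made for illustration, and the rule in shouldAbort (stop on ERROR or FATAL) is only one possible policy a parent job might apply.

```java
/** Illustrative helper for the 4-digit convention; names are hypothetical. */
final class ReturnCodes {
    private ReturnCodes() {}

    /** Combines a priority (3-6), system (1-9), and type (01-99) into one code. */
    static int encode(int priority, int system, int type) {
        if (priority < 3 || priority > 6) throw new IllegalArgumentException("priority must be 3-6");
        if (system < 1 || system > 9) throw new IllegalArgumentException("system must be 1-9");
        if (type < 1 || type > 99) throw new IllegalArgumentException("type must be 1-99");
        return priority * 1000 + system * 100 + type;
    }

    static int priorityOf(int code) { return code / 1000; }

    static int systemOf(int code) { return (code / 100) % 10; }

    static int typeOf(int code) { return code % 100; }

    /** Maps the first digit back to the Priority Level name from the table above. */
    static String priorityName(int code) {
        switch (priorityOf(code)) {
            case 3: return "INFO";
            case 4: return "WARNING";
            case 5: return "ERROR";
            case 6: return "FATAL";
            default: return "UNKNOWN";
        }
    }

    /** Assumed policy: a calling job stops on ERROR or FATAL. */
    static boolean shouldAbort(int code) { return priorityOf(code) >= 5; }

    public static void main(String[] args) {
        int code = encode(5, 7, 1); // ERROR, Database, Permission
        System.out.println(code + " " + priorityName(code) + " abort=" + shouldAbort(code));
        // prints "5701 ERROR abort=true"
    }
}
```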