
Friday, September 30, 2016

IBM and HCL partnership: a positive change for IBM Workload Scheduler people

After 18 years spent working for IBM, mainly on Workload Scheduler in different roles, the time has come to leave and move to a different company.
On September 1st I started working for a new company, HCL Technologies, a global IT services company with more than 100,000 employees, but I'm not leaving IBM Workload Scheduler development.


This change is part of a 15+ years strategic partnership between HCL and IBM to invest together in IBM Workload Scheduler.

In this article I'll present my personal view of this partnership and why I think this is a positive change for the product and for all the people working with it.

All the information in this article reflects my own opinions and must not be considered as coming from either IBM or HCL.

For an official statement from IBM and HCL about the partnership, you can look at the related posts on the HCL and IBM web sites.

Monday, August 31, 2015

Keep your plan clean with IWS 9.3

Since version 8.3, Workload Scheduler has carried forward by default all workload that is not complete to the next plan, including the workload that is in error.
This is generally good and helps ensure that all workload in the plan is actually run, not deleted just because the day is over or an admin has run JnextPlan -for 0000.

The drawback of this behavior is that if end users are not really interested in that workload, they may leave jobs that ended in error in the plan forever, causing the Symphony file and the Pre-Production plan (LTP) to grow without control, eventually impacting overall performance.
So admins have to clean up the plan, removing jobs that are not really required. Starting with IBM Workload Scheduler 9.3 there is a new option that can make this cleanup easier.
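As a sketch, the 9.3 option I'm referring to is, to the best of my knowledge, the untilDays global option; since option names can vary by release and fix pack, verify it in your own environment with optman ls before changing anything:

```
# List the current global options and locate the carry-forward cleanup option
optman ls

# Example: automatically add an until time to carried-forward job streams,
# so that stale ones are removed from the plan after 7 days (value is illustrative)
optman chg untilDays=7
```

As with most global options, the change normally takes effect at the next plan extension.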




Wednesday, August 19, 2015

Plan Mirroring on Database

Since version 9.1, IBM Workload Scheduler maintains a new copy of the plan in the relational database; this is often called "plan mirroring".

This copy of the plan is used only for plan monitoring from the UI (and from the Java APIs); it is still not used for scheduling purposes, which continue to rely on the Symphony file to assure consistency between master and agents.
This change has tremendously improved the scalability of the UI: performance tests have shown almost no degradation as the number of users monitoring the plan increases.




For end users or cloud customers this is completely transparent, but for IWS administrators managing an on-premises environment it introduces new components in the product architecture that need to be understood, monitored and, when necessary, managed and recovered.

A new name for Tivoli Workload Scheduler


In June we released the new 9.3 version. The new release, in addition to great new features like the what-if analysis, also changes the name of the product, aligning it with the IBM organization and strategy.

Starting with 9.3, IBM Tivoli Workload Scheduler is now just IBM Workload Scheduler.

 

Monday, March 2, 2015

Scheduling FINAL on backup master

FINAL is the job stream that, in TWSd, extends the plan for the next period.
I've already covered some best practices about FINAL in this article: Scheduling FINAL (Best Practices)
In this article I'll show a common best practice used to automatically schedule FINAL on the current active master, in order to assure high availability. Even if common (we use it in our SaaS environments), this best practice is not known to all users and requires some customization.

The built-in mechanism for high availability / disaster recovery in TWSd is based on the backup master: this is an active-passive configuration where the active master role can be moved to the backup master using the switchmgr command. This can be done for planned or unplanned outages and removes the single point of failure of the master.
However, this is not sufficient for a long-term unavailability of the original master: by default the FINAL job stream is scheduled to run on the master, so in this condition the scheduled FINAL job streams will not run and the plan will not be automatically extended.

The immediate solution is to cancel the FINAL on the old master and submit a new FINAL running on the new master; in addition, the workstation definitions in the database must be changed to set the current master as the master in the database too.

If the master is running on any Unix platform, a more automated solution is available using a unixlocl XA (extended agent):
  • Create a new XA workstation (e.g. MASTER_XA) with "unixlocl" access method and "$MASTER" as host workstation.
  • Change localopts on master and backup masters to set "mm resolve master = no".
  • Change the FINAL and FINALPOSTREPORTS job streams to move the job streams and all their jobs from the master to the new XA (using composer, be careful: just using modify will actually clone your FINAL; use rename instead of modify, or delete the old one at the end).
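As a sketch, the workstation definition for such an extended agent could look like the following; the composer syntax below is written from memory, so check it against the composer reference for your release before using it:

```
CPUNAME MASTER_XA
  DESCRIPTION "XA hosted by the active master"
  OS UNIX
  NODE NULL
  FOR MAESTRO
    HOST $MASTER
    ACCESS "unixlocl"
    TYPE X-AGENT
END
```

With "mm resolve master = no" in localopts, $MASTER is not resolved to a fixed workstation name when the plan is generated, so after a switchmgr the XA (and the FINAL scheduled on it) follows the new active master.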


Friday, February 20, 2015

Netman, ITA Agent and their sons

When TWS runs there are many processes involved: some are historical, others have been added in the latest releases for new functionality.

In order to better manage the environment and autonomously troubleshoot issues, it can be useful to know when they run, which command starts / stops each process, and which main files and which TCP ports they use.

For this reason (and since some customers were asking for this information) I've decided to consolidate it here, hoping it can be useful to other TWS admins too.


Monday, January 26, 2015

"Start Of Day" in Tivoli Workload Scheduler

While writing the article about Scheduling FINAL I realized the need to write a specific article about the meaning of the "Start Of Day" and how it works.

Most customers are still running TWS with the Start Of Day set to 0600 (6 in the morning), or they are scheduling FINAL 1 minute before the Start Of Day. This has no longer been required since TWS 8.3, but changing this setting in an existing production environment is not easy, and only since 8.6 has the default value for a fresh installation been changed to 0000.
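For reference, the Start Of Day is controlled by the startOfDay global option in optman; a sketch of how to inspect and change it (changing it in an existing production environment has side effects on run cycles and time dependencies, so treat this purely as an illustration):

```
# Show the current global options, including startOfDay
optman ls

# Change the start of day to midnight (takes effect at the next plan extension)
optman chg startOfDay=0000
```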


Monday, January 19, 2015

Java DNS caching

I just spent some hours over the last few days understanding an issue caused by the default Java behavior when resolving hostnames, and I think sharing this configuration detail can help other people.

In my case I was experiencing failures because some of our application servers were unable to connect to another server. The problem initially appeared random: some servers were working, others not, even though they had the same configuration.
The failing servers were receiving a "connection refused" error while connecting to the backend server, but contacting the same URL from the command line worked successfully.
Restarting the application server fixed the issue for that machine, but provided no clue about what had caused it.

Using the tcpdump command I was able to trace the actual IP address used for the connection attempts, confirming that the application server was contacting an IP address different from the current one (the one returned by the nslookup command). Investigating with the remote server team, they confirmed that the other IP address belonged to a backup system where the service was down at that moment.
My failing servers were contacting an old server, currently inactive.



As we found out, the HA (High Availability) architecture for the remote server is based on DNS resolution, with the hostname resolved to the IP address of the currently active server.
The default behavior of Java is to cache DNS resolutions forever, with the result that our servers kept using the IP address cached inside Java even after the active server had changed and the DNS had been updated.

This technote documents how to tune the JVM and change this behavior.
In our case we changed the java.security file, setting networkaddress.cache.ttl=30.

If the HA strategy requires updating the DNS, this Java behavior can impact several scenarios where a TWS server or TWS agent has to contact a remote server using this strategy, e.g. a remote LDAP server or an application scheduled via plug-in.
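For reference, a sketch of the change we applied; the exact file path varies by JVM vendor and version, so treat it as an example:

```
# In <JAVA_HOME>/jre/lib/security/java.security
# Cache successful DNS lookups for 30 seconds instead of forever
networkaddress.cache.ttl=30
```

The same security property can also be set programmatically, before any lookup is performed, with java.security.Security.setProperty("networkaddress.cache.ttl", "30").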

If you like this article and you find it useful, please share it on social media so other people may take advantage of it.

Monday, January 12, 2015

Using HTTP Server - Part 1: Introduction

During the development of our SaaS infrastructure, we have found the usage of IBM HTTP Server in front of our WAS servers very useful: not only for load balancing on the TDWC cluster, but also for security, for performance and to modify some behaviors.

Setting up IBM HTTP Server is pretty simple and includes the following phases:
  • Define architecture and SSL certificates
  • Configure TDWC in cluster
  • Install both IBM HTTP Server and Web Server Plugin
  • Configure HTTP server
  • Configure web server plugin
I'll dedicate a specific article to each of the above phases.

On our SaaS, in addition to TDWC access, we use the HTTP server also for connections from dynamic agents and to handle a few redirects:
- to display a disclaimer at the beginning of each session
- to replace the logout page with a custom one.
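Since IBM HTTP Server is Apache-based, a redirect like the logout replacement above can be sketched with the standard mod_alias Redirect directive; the URL paths below are placeholders for illustration, not the actual TDWC paths:

```
# httpd.conf excerpt: send the standard logout page to a custom one
Redirect /console/logout.jsp /custom/logout.html
```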

The HTTP server can also be used to enable browser caching, reducing network traffic and TDWC server load.
The presence of the HTTP server also improves TDWC scalability because it reduces the impact of network latency on the server. In this configuration TDWC can return the result back to the HTTP server very quickly, with the HTTP server keeping a thread active to return the data back to the browser. This reduces the number of active threads in the TDWC server.


Monday, January 5, 2015

Recover FINAL on Tivoli Workload Scheduler

As said in the Scheduling FINAL post, and as any TWS administrator knows, the extension of the plan is one of the most important processes to monitor in the product. If it fails, the plan is not extended and the new job stream instances are not available to run.
For this reason it's important that any TWS administrator is able to recover the FINAL quickly, possibly without the need to open a PMR and wait for L2 or L3 support to come online to help with the recovery, at least in the most common situations.

Of course, if you are using IBM Workload Automation SaaS you don't have to worry about this: IBM is managing the environment, monitoring it and ready to recover it in case of failure.

In this post I'll explain the role of each job in the FINAL and FINALPOSTREPORTS job streams and how each of them can be recovered in case of failure.

Monday, December 22, 2014

Prevent and solve queuing issues in Tivoli Workload Scheduler

In this article I've talked about Tivoli Workload Scheduler message queues and how they work as input queue for TWS processes.
In large and busy environments these message files can start growing, creating delay issues.

I've just published, together with Paolo Salerno, a new article on developerWorks about monitoring message files and detecting issues, and about a new feature introduced with 9.2 FP1 that allows TWS administrators to prevent and solve delay issues by monitoring how much of the mailman capacity is used and which workstations should be moved under a mailman server to have a more reliable environment.

The article is available here: http://bit.ly/wablog-mailman-queues
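As a minimal illustration of the kind of monitoring the article describes, the sketch below (a hypothetical helper, not part of the product) warns when any *.msg message file in a given TWS directory grows beyond a threshold:

```shell
#!/bin/sh
# check_msg_queues DIR [THRESHOLD_BYTES]
# Warn when any TWS message file (*.msg) in DIR exceeds the threshold.
# The 10 MB default below is illustrative, not a product recommendation.
check_msg_queues() {
  dir="$1"
  threshold="${2:-10485760}"   # default ~10 MB
  for f in "$dir"/*.msg; do
    [ -f "$f" ] || continue
    size=$(wc -c < "$f")
    if [ "$size" -gt "$threshold" ]; then
      echo "WARNING: $f is $size bytes (threshold $threshold)"
    fi
  done
}
```

Run periodically (from cron, or from TWS itself) against the TWS home directory, this kind of check gives an early signal before queuing delays become visible to end users.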


Friday, December 19, 2014

How to replace CA7 eXtended Agent in Tivoli Workload Scheduler

I've just published, together with Silvano Lutri, an article on developerWorks about two possible solutions to replace the CA7 XA: http://bit.ly/wablog-raplace-CA7-XA

This specific Tivoli Workload Scheduler agent is used to coordinate workload scheduled by TWS with workload running on z/OS under CA7.

This agent goes out of support on September 30, 2015, together with Tivoli Workload Scheduler for Applications 8.4, the last version including that eXtended Agent.

If you are currently running that agent, check the article to verify which solution can work for you. You can comment on this blog or contact me on social media if you need any further clarification or help.


Wednesday, December 17, 2014

Running What If Analysis on Tivoli Workload Scheduler

I'm very excited to tell you about the new "What If" feature that we published yesterday on IBM Workload Automation SaaS and in the beta refresh we are publishing right now.
You may already be aware of this new capability if you are participating in the Transparent Development program.



This new capability is targeted at answering the following questions:
  • How much time do I have to fix this failure without impacting the SLAs for my critical workload?
  • What will happen if this job takes longer today?
  • Why is my workflow completing so late? Which jobs and dependencies should I work on to anticipate it?

Monday, December 15, 2014

Message flow and processes on FTAs and classic E2E

I've received a request from a customer for information about the flow of messages (events) in Tivoli Workload Scheduler for z/OS classic End-to-End.
With messages I'm referring to the information exchanged between TWS agents and servers in order to start and track job execution, submit new workload, modify existing workload, etc.
I'll not use the word event, which is sometimes used in this context, to avoid confusion with the events of Event Driven Workload Automation (EDWA).

This is a very specific and technical topic; however, understanding this flow was the first thing I did when I started working on the TWSd code, to start the porting to z/OS and the integration with TWSz (OPC at that time) in order to make the first release of the classic E2E. It was the year 2000, TWS development was still in Santa Clara, while OPC was already here in Rome. I created the diagrams that I'll use in this article and they stayed on the wall in front of me for several months.
Even if this information is also available in the manuals, I think it is useful to have it on this blog as well.


The picture above represents the basic message flows for a Fault Tolerant Agent (FTA).

Friday, December 12, 2014

Workload Service Assurance and Ideal Batch

Workload Service Assurance is one of the most powerful features in Tivoli Workload Scheduler, present both in TWS for z/OS and in TWS distributed.
It is also known as "Dynamic Critical Path" or by the acronym WSA.
It was made to address the need to keep the completion of complex critical workflows under control.

Scheduling batch processes is a fundamental backbone of every IT infrastructure, from the smallest organizations to the largest, and very critical processes are delegated to the scheduler: creating payrolls, driving money transfers, calculating and distributing price lists, creating financial statements, automatically processing orders, processing claims in insurance companies, etc.
Most of these processes are still scheduled and usually need to complete within a specific time, otherwise there will be significant business impacts, often resulting in fees to pay. Dynamic workload may also have SLAs that impose processing each request within a specific amount of time.
Traditionally these critical processes are constantly monitored by operations or application teams to assure they complete on time; the main challenge is to identify all the jobs that are part of the flows and assure that none has issues that can impact the completion of the overall process. If there is a high number of jobs in the flow, users try to identify the critical path that needs to be monitored with more attention in order to react quickly to any issue. This remains complex and time-consuming work.
Workload Service Assurance is made to address those needs, but at the same time makes a step forward, removing the need to constantly monitor those processes.
The ideal batch is the one that you can forget: the one you can assume is working and that will provide the expected result on time. You should care about it only when there is an unexpected issue, and in that case you should be notified and able to easily find where the issue is.

Monday, December 8, 2014

Using Tivoli Workload Scheduler to automate complex reboots

A few weeks ago I published, with Enrica Alberti, an article on IBM developerWorks about how we automated the reboot of machines in our IBM Workload Automation SaaS environment.


This is an example of how we used TWS itself to manage our infrastructure for Workload Automation SaaS. On our SaaS we are running tens of servers hosting the product for customers; in addition, we have a couple of machines used to control the infrastructure, where we run an internal TWS used to automate any recurring task:
- create, configure and deprovision VMs used to run customer subscriptions
- create, delete, suspend, resume customer subscriptions
- add and remove users to customer subscriptions

For these tasks we have created some REST APIs that are invoked by the Service Engage common infrastructure components and that submit to TWS the appropriate job stream to actually modify the environment.

In addition we have some scheduled housekeeping job streams and now the reboot process described in the article.

The actual work we need to do on the environment is minimal, with all the operations running automatically and easy to monitor and recover thanks to TWS.

This experience reinforces the message that automation can save a lot of effort, especially when considered and planned at the beginning of a project, or quickly recognized later if missed in the first analysis.


Tuesday, December 2, 2014

Scheduling FINAL (Best Practices)

One of the most important decisions when setting up a TWSd environment is when and how to extend the plan, and there are several options in optman related to this process.

I already presented this topic during the ASAP conference in 2011 but, due to its importance, and because many users are not yet familiar with it, I think it is a good topic to start the blog with.


The first decision to make is when to run the FINAL job stream to extend the plan. The decision has to take into consideration several elements, like:

  • At what time is most of the batch workload complete, so that the plan extension has the minimum impact?
  • When are the administrators available in case there is any issue with the plan extension?
  • Are the administrators available on weekends and holidays?
  • How many hours of buffer do I want to keep in order to be able to fix extension problems without impacting production?

The result of this decision could be, for example, that I want to schedule FINAL only on working days, running at 7 AM and with 5 hours of buffer.
That means we need a FINAL scheduled at 7 AM on working days that extends the plan until 12 PM of the next working day.
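A sketch of what the run cycle for such a FINAL could look like in composer is below; the iCalendar rule is only an illustration of "working days" (real environments would normally handle holidays with a calendar), the workstation name is a placeholder, and the jobs of the FINAL job stream are omitted:

```
SCHEDULE MASTERDM#FINAL
ON RUNCYCLE WORKDAYS "FREQ=WEEKLY;BYDAY=MO,TU,WE,TH,FR"
AT 0700
CARRYFORWARD
:
(jobs of the FINAL job stream)
END
```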