Since version 8.3, Workload Scheduler by default carries forward to the next plan all workload that is not complete, including workload that is in error.
This is generally good: it ensures that all workload in the plan is actually run and is not deleted just because the day is over or an admin has run a JnextPlan -for 0000.
The drawback of this behavior is that, if end users are not really interested in that workload, jobs that ended in error can stay in the plan forever, causing the Symphony file and the Pre-Production plan (LTP) to grow without control and eventually impacting overall performance.
So admins have to clean up the plan, removing the jobs that are not really required. Starting with IBM Workload Scheduler 9.3 there is a new option that can make this cleanup easier.
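As a quick reminder of where this behavior is controlled, here is a minimal sketch using optman on the master; the values are only examples, so check the output of optman ls in your environment and the 9.3 documentation for the exact name of the new cleanup option.

    # List the global options and look at the carry forward settings
    # (enCarryForward / cf and carryStates / cs)
    optman ls

    # Example: make sure incomplete job streams are carried forward to the new plan
    optman chg cf=yes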
Monday, August 31, 2015
Wednesday, August 19, 2015
Plan Mirroring on Database
Since version 9.1, IBM Workload Scheduler keeps an additional copy of the plan in the relational database; this is often called "plan mirroring".
This copy of the plan is used only for plan monitoring from the UI (and from the Java APIs); it is not used for scheduling purposes, which continue to rely on the Symphony file to ensure consistency between master and agents.
This change has tremendously improved the scalability of the UI: performance tests have shown almost no degradation as the number of users monitoring the plan increases.
For end users and cloud customers this is completely transparent, but for IWS administrators managing an on-premise environment it introduces new components in the product architecture that need to be understood, monitored and, when necessary, managed and recovered.
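If the plan replica in the database gets out of sync with the Symphony file (for example after a database outage), it can usually be rebuilt from the master without touching the Symphony file. A minimal sketch, assuming a recent 9.x master where the planman resync option is available; verify the exact command and options for your release before relying on it.

    # Run on the master as the TWS user, after sourcing the TWS environment
    # (the path below is just an example)
    . /opt/IBM/TWA/TWS/tws_env.sh

    # Rebuild the plan replica in the database from the current Symphony file
    planman resync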
A new name for Tivoli Workload Scheduler
In June we released the new 9.3 version. The new release, in addition to great new features like the what-if analysis, also changes the name of the product, aligning it with IBM organization and strategy.
Starting with 9.3, IBM Tivoli Workload Scheduler is now just IBM Workload Scheduler.
Monday, March 2, 2015
Scheduling FINAL on backup master
FINAL is the job stream that in TWSd extends the plan for the next period.
I've already covered some best practices about FINAL in this article: Scheduling FINAL (Best Practices)
In this article I'll show a common best practice used to automatically schedule FINAL on the currently active master, in order to ensure high availability. Even though it is common, and we use it in our SaaS environments, this best practice is not known to all users and requires some customization.
The built-in mechanism for high availability / disaster recovery in TWSd is based on the backup master: an active-passive configuration where the active master role can be moved to the backup master using the switchmgr command. This can be done for planned or unplanned outages and removes the single point of failure of the master.
However this is not sufficient for a long-term unavailability of the original master: by default the FINAL job stream is scheduled to run on the master, and in this condition the scheduled FINAL job stream instances will not run and the plan will not be automatically extended.
The immediate solution is to cancel FINAL on the old master and submit a new FINAL running on the new master; in addition, the workstation definitions in the DB must be changed so that the current master is set as the master in the database as well.
If the master is running on any Unix platform, a more automated solution is available using a unixlocl XA (extended agent):
- Create a new XA workstation (e.g. MASTER_XA) with the "unixlocl" access method and "$MASTER" as the host workstation (see the example definition after this list).
- Change localopts on the master and backup masters to set "mm resolve master = no".
- Change the FINAL and FINALPOSTREPORTS job streams to move the job streams and all their jobs from the master to the new XA (with composer, be careful: just using modify will actually clone your FINAL; use rename instead of modify, or delete the old one at the end).
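As a reference for the first step, this is a minimal sketch of what the XA workstation definition could look like; the workstation name and node are just examples and the exact syntax can change between releases, so check the composer reference for your version.

    CPUNAME MASTER_XA
      DESCRIPTION "XA hosted by the currently active master"
      OS UNIX
      NODE localhost TCPADDR 31111
      FOR MAESTRO
        HOST $MASTER ACCESS "unixlocl"
        TYPE X-AGENT
    END

The definition can be loaded with composer (for example composer add on a file containing the text above); from that point on, FINAL and FINALPOSTREPORTS jobs defined on MASTER_XA will always run on whichever workstation is currently acting as master.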
Friday, February 20, 2015
Netman, ITA Agent and their sons
When TWS runs there are many processes involved: some are historical, others have been added in the latest releases for new functionality.
In order to better manage the environment and autonomously troubleshoot issues, it can be useful to know when they run, which commands start and stop each process, and which main files and TCP ports they use.
For this reason (and since some customers were asking for this information) I've decided to consolidate the information here, hoping it can be useful to other TWS admins as well.
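As a quick way to see which of these processes are active and which port netman is listening on, something like the following can be used on a Unix workstation; 31111 is only the default netman port, check the nm port setting in your localopts.

    # Show the main TWS processes (netman, mailman, batchman, jobman, writer, ...)
    ps -ef | grep -E 'netman|mailman|batchman|jobman|writer' | grep -v grep

    # Check what is listening on the netman port (nm port in localopts, default 31111)
    netstat -an | grep 31111

    # Typical lifecycle commands, run as the TWS user:
    #   StartUp        starts netman
    #   conman start   starts the other production processes on the workstation
    #   conman stop    stops them
    #   conman shut    stops netman as well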
Friday, February 13, 2015
Workload Scheduler for Bluemix
Do you want to create an application on the web, but you don't want to spend time on finding a server, internet hosting, security, high availability, and other time consuming activities?
Bluemix is the Platform as a Service (PaaS) solution from IBM that does this work for you: you just select the runtime and the services you need and you are ready to create your applications.
Do you also need to schedule some activities or run part of your workflow as batch? Now you can use the Workload Scheduler for Bluemix service.
With Workload Scheduler for Bluemix you can address simple and complex scheduling needs.
You can periodically call your Bluemix application to gather or process data.
You can wake up your code at a specific time.
You can split your application into a front-end transactional part and a back-end batch part.
The transactional part can start the batch part on Workload Scheduler for Bluemix via APIs, and the scheduler will take care of the batch process.
With the scheduler you are able to monitor the batch process, recover errors if needed, and distribute the workload.
Optionally you can install agents on your own cloud or on-premise servers and run batch processes that span the cloud and your premises, connecting your systems of engagement running on Bluemix with the systems of record that you need to keep on premise.
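Just to illustrate the last pattern, here is a sketch of the kind of call the transactional part could make; the endpoint, process name and payload below are hypothetical placeholders, not the actual Workload Scheduler for Bluemix API, so refer to the service documentation for the real interface.

    # Hypothetical example: the front end asks the scheduler to run a batch process now
    curl -X POST "https://scheduler.example.bluemix.net/api/processes/nightly-load/run" \
         -H "Authorization: Bearer $API_TOKEN" \
         -H "Content-Type: application/json" \
         -d '{"parameters": {"date": "2015-02-13"}}'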
If you like this article and you find it useful, please share it on social media so other people may take advantage of it.
Monday, January 26, 2015
"Start Of Day" in Tivoli Workload Scheduler
While writing the article about Scheduling FINAL I realized the need for a specific article about the meaning of "Start Of Day" and how it works.
Most customers are still running TWS with Start Of Day set to 0600 (6:00 in the morning), or they are scheduling FINAL 1 minute before the Start Of Day. This is no longer required since TWS 8.3, but changing this setting in an existing production environment is not easy, and only since 8.6 has the default value for a fresh installation been changed to 0000.
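For reference, Start Of Day is a global option and can be checked and changed with optman on the master; a minimal sketch (the new value takes effect with the next plan extension):

    # Show the current value of startOfDay (sd)
    optman ls | grep -i startOfDay

    # Move the Start Of Day to midnight, effective from the next JnextPlan
    optman chg sd=0000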
Monday, January 19, 2015
Java DNS caching
I just spent some hours in the last days understanding an issue caused by the default Java behavior when resolving hostnames, and I think sharing this configuration detail can help other people.
In my case I was experiencing failures because some of our application servers were unable to connect to another server. The problem initially appeared random: some servers were working, others not, even though they had the same configuration.
The failing servers were receiving a "connection refused" error while connecting to the backend server, but contacting the same URL from the command line worked successfully.
Restarting the application server fixed the issue for that machine, but provided no clue about what had caused it.
Using the tcpdump command I was able to trace the actual IP address used for the connection attempts, confirming that the application server was contacting an IP address different from the current one (the one returned by the nslookup command). Investigating with the remote server team, they confirmed that the other IP address is a backup system where the service was down at that moment.
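The checks were along these lines (interface, hostname and port are just examples). Note that filtering tcpdump by destination port rather than by hostname is important here: a hostname filter would be resolved to the current IP and would miss the traffic going to the stale one.

    # What does DNS return right now for the backend?
    nslookup backend.example.com

    # Which destination IP is the application server really connecting to?
    tcpdump -nn -i eth0 'tcp and dst port 443'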
My failing server was contacting an old server, currently inactive.
As we found, the HA (High Availability) architecture for the remote server is based on DNS resolution, with the hostname resolved to the IP address of the currently active server.
The default behavior of Java is to cache DNS resolutions forever, with the result that our servers kept using the IP address cached inside Java even after the active server had changed and the DNS had been updated.
This technote documents how to tune the JVM and change this behavior.
In our case we changed the java.security file, setting networkaddress.cache.ttl=30.
If the HA strategy requires updating the DNS, this Java behavior can impact several scenarios where a TWS server or TWS agent has to contact a remote server using this strategy, e.g. a remote LDAP server or an application scheduled via plug-in.
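A minimal sketch of the change, assuming a Java 7/8 style JVM where the security properties file lives under jre/lib/security; 30 is the TTL in seconds we chose, any small positive value works.

    # In <java_home>/jre/lib/security/java.security
    # Cache successful DNS lookups for 30 seconds instead of forever
    networkaddress.cache.ttl=30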
If you like this article and you find it useful, please share it on social media so other people may take advantage of it.
Monday, January 12, 2015
Using HTTP Server - Part 1: Introduction
During the development of our SaaS infrastructure, we have found it very useful to put IBM HTTP Server in front of our WAS servers: not only for load balancing on the TDWC cluster, but also for security, for performance, and to modify some behaviors.
Setting up IBM HTTP Server is pretty simple and includes the following phases:
- Define architecture and SSL certificates
- Configure TDWC in cluster
- Install both IBM HTTP Server and the Web Server Plugin
- Configure the HTTP server
- Configure the web server plugin
I'll dedicate a specific article to each of the above phases.
On our SaaS, in addition to TDWC access, we use the HTTP server also for connections from dynamic agents and to handle a few redirects:
- to display a disclaimer at the beginning of each session
- to replace the logout page with a custom one.
The HTTP server can also be used to set browser caching, reducing network traffic and TDWC server load.
The presence of the HTTP server also improves TDWC scalability because it reduces the impact of network latency on the server: TDWC can return the result to the HTTP server very quickly, and the HTTP server keeps a thread active to return the data back to the browser. This reduces the number of active threads on the TDWC server.
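To give an idea of the configuration involved, here is a minimal httpd.conf sketch covering the plug-in routing, the browser caching and the custom logout redirect mentioned above; module paths, URLs and cache times are only examples, and the plug-in directives depend on how the WAS Web Server Plugin was installed.

    # Route TDWC traffic to the WAS cluster through the WebSphere plug-in
    LoadModule was_ap22_module /opt/IBM/WebSphere/Plugins/bin/mod_was_ap22_http.so
    WebSpherePluginConfig /opt/IBM/WebSphere/Plugins/config/webserver1/plugin-cfg.xml

    # Let browsers cache static content to reduce network traffic and TDWC load
    LoadModule expires_module modules/mod_expires.so
    ExpiresActive On
    ExpiresByType image/png "access plus 1 day"
    ExpiresByType text/css  "access plus 1 day"

    # Replace the standard logout page with a custom one (paths are examples)
    Redirect /console/logout.jsp /custom/logout.html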
If you like this article and you find it useful, please share it on social media so other people may take advantage of it.
Monday, January 5, 2015
Recover FINAL on Tivoli Workload Scheduler
As said in the Scheduling FINAL post, and as every TWS administrator knows, the extension of the plan is one of the most important processes to monitor in the product. If it fails, the plan is not extended and the new job stream instances are not available to run.
For this reason it's important that any TWS administrator is able to recover FINAL quickly, possibly without the need to open a PMR and wait for L2 or L3 support to come on-line to help with the recovery, at least in the most common situations.
Of course, if you are using IBM Workload Automation SaaS you don't have to worry about this: IBM is managing the environment, monitoring it, and ready to recover it in case of failure.
In this post I'll explain the role of each job in the FINAL and FINALPOSTREPORTS job streams and how each of them can be recovered in case of failure.
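Before going job by job, it is worth remembering the blunt fallback: if FINAL has failed and you just need a plan for the next period, the plan can be extended manually from the master command line, exactly as the FINAL jobs would do. A minimal sketch, to be run as the TWS user after checking the current plan status:

    # Extend the current plan by 24 hours
    JnextPlan -for 0024

After that, the failed FINAL and FINALPOSTREPORTS instances still need to be cleaned up or rerun, as the rest of the post explains for each job.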