Monday, January 5, 2015

Recover FINAL on Tivoli Workload Scheduler

As mentioned in the Scheduling FINAL post, and as every TWS administrator knows, the extension of the plan is one of the most important processes to monitor in the product. If it fails, the plan is not extended and the new job stream instances are not available to run.
For this reason it's important that any TWS administrator is able to recover FINAL quickly, possibly without the need to open a PMR and wait for L2 or L3 support to come online and help with the recovery, at least in the most common situations.

Of course, if you are using IBM Workload Automation SaaS you don't have to worry about this: IBM manages the environment, monitors it, and is ready to recover it in case of failure.

In this post I'll explain the role of each job in the FINAL and FINALPOSTREPORTS job streams and how each of them can be recovered in case of failure.

Someone on an older release may initially be surprised by the second job stream, FINALPOSTREPORTS. It was introduced in version 9.1, when we split FINAL into two separate job streams in order to keep in FINAL only the jobs that are critical for the plan extension, leaving the less important jobs in FINALPOSTREPORTS. This makes it easier to monitor the plan extension and avoids unneeded "Oh! My FINAL has ABENDed!!!" panic when only a report or the job statistics collection has failed. Those jobs still need to be analyzed and eventually recovered, but this is less urgent.

Final

Let's start with the jobs in the FINAL job stream, the job stream that actually extends the plan.
[Figure: the FINAL job stream]

StartAppServer

This is just a simple job that checks the availability of the application server, restarting it in case it's not running. The next jobs depend on the application server availability and would abend if it were down.
In case of failure: 
  • Rerun the job. 
If the application server is down and does not restart, check the WAS logs for any error you can recognize and recover before calling IBM support.
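
A minimal recovery sketch follows; the job stream and job names and the log path are those of a default installation, so adjust them to your environment (depending on your release you may also use the start/stop scripts under the wastools directory):

  # Rerun the failed job
  conman "rerun FINAL.STARTAPPSERVER"
  # If it keeps failing, look for recent errors in the WAS log
  tail -200 $TWA_HOME/WAS/TWSProfile/logs/server1/SystemOut.log
  # As a last resort, stop and start the application server manually
  conman stopappserver
  conman startappserver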

MakePlan

This job:
  • Replans or, if needed, extends the Pre-Production plan (LTP)
  • Produces the Symnew file, covering the extension time window and containing all the new job stream instances.
  • Generates Pre-Production reports in the joblog. If you don't need them, this part can be removed or moved to a different job.
In case of failure: 
  • The global lock may be left set; use "planman unlock" to reset it.
  • Rerun the job to recover
    • Pre-Production plan is automatically re-verified and updated.
    • Symnew is recreated.
If MakePlan is run before the end of the active plan, there is time to recover it without impacting production.
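
A minimal sketch of this recovery, assuming the default job stream and job names (adjust them if yours differ):

  # Release the global lock left by the failed run
  planman unlock
  # Rerun the MAKEPLAN job from today's FINAL instance
  conman "rerun FINAL.MAKEPLAN"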

It may be necessary to stop this process if it hangs. Killing the job may not stop it completely, since some processing may still be running inside WAS or on the DB.

Since 9.2 FP1 and in later releases:
  • When you run "planman unlock" the planner should automatically stop within a few seconds.
  • If not, it's probably hanging on a DB statement and you need to force the statement closure from the DB.
Before 9.2 FP1:
  • Force the DB statement closure if a DB statement is running too long, causing MakePlan to abend.
  • If processing is still running in WAS and MakePlan does not terminate, you need to restart WAS to ensure every process is stopped.
In all cases MakePlan will abend and you can then recover by rerunning the job.
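
As an illustration only, on DB2 forcing the hanging statement could look roughly like this (the database name and the application handle are examples; on Oracle you would kill the corresponding session instead):

  # Find the connection that is running the long statement
  db2 list applications for db TWS show detail
  # Force it, using the application handle taken from the previous output (12345 is an example)
  db2 "force application (12345)"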

If MakePlan stdlist shows the following messages:
  • AWSBEH023E Unable to establish communication with the server on host "127.0.0.1" using port "31116".
    This error means that the application server (eWAS) is down and MakePlan is not able to continue. In this case, start WAS and check the WAS logs to identify why WAS stopped.
  • AWSBEH021E The user "twsuser" is not authorized to access the server on host "127.0.0.1" using port "31116".
    This is an authorization error; check the twsuser credentials in the useropts file.
  • AWSJPL018E The database is already locked.
    This means that a previous MakePlan run was stopped and the global lock was not reset. To recover, run "planman unlock" and rerun MakePlan.
  • AWSJPL006E An internal error has occurred. A database object "xxxx" cannot be loaded from the database.
    In general "xxxx" is an object such as a workstation, a job or a job stream. This error means that a connection with the database is broken. In this case check the SystemOut.log and the ffdc directory, because additional information related to the database issue is logged there.
  • AWSJPL017E The production plan cannot be created because a previous action on the production plan did not complete successfully. See the message help for more details.
    This error means that a previous operation on the preproduction plan was performed but finished with an error. It typically appears when a "ResetPlan -scratch" was performed but did not finish successfully.
  • AWSJPL704E An internal error has occurred. The planner is unable to extend the preproduction plan.
    This error means that MakePlan is not able to extend the preproduction plan. Different root causes are associated with this issue, almost always related to the database, such as no space left in the tablespace or full transaction logs. The suggestion is to look for more information in the SystemOut.log or in the ffdc directory.

SwitchPlan

This is the most critical job during the plan extension. While this job runs, scheduling is down and no other jobs can run.
It is made of 4 main steps:
  • Stops all the CPUs (workstations)
  • Runs stageman:
    • to merge the old Symphony file with SymNew
    • to archive the old Symphony file in the schedlog directory
  • Runs "planman confirm" to update the plan status information in the DB (e.g. plan end date and current run number)
  • Restarts the master to distribute the Symphony file and restart scheduling.
In case of failure:
  1. If "planman confirm" has not been run yet (check the logs and "planman showinfo"):
    • Rerun SwitchPlan
  2. If "planman confirm" has failed:
    • Manually run "planman confirm" and then "conman start"
  3. If "planman confirm" has already been run (i.e. the plan end date has been updated):
    • Run "conman start"
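
For example, to tell the three cases apart and to complete the switch manually when "planman confirm" itself has failed (case 2), a minimal sketch is:

  # Check whether the plan end date and run number have already been updated
  planman showinfo
  # If "planman confirm" failed, run it manually and then restart scheduling
  planman confirm
  conman start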
Sometimes it may hang while trying to stop a remote workstation; in that case just kill the conman command. This may impact plan distribution, which will need to stop the agents left running before distributing the new Symphony. To prevent this issue, consider using "conman stop; progressive" and tuning the connect timeout in localopts, as in the sketch below.
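
For example (the timeout value is only an illustration, and the exact localopts keyword can vary between releases, so check the localopts reference for your version):

  # Stop the workstations progressively instead of waiting for each one in turn
  conman "stop; progressive"
  # In localopts, reduce the time spent waiting for unreachable agents, e.g.
  # tcp connect timeout = 15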

If SwitchPlan stdlist shows the following messages:
  • STAGEMAN:AWSBHV082E The previous Symphony file and the Symnew file have the same run number. They cannot be merged to form the new Symphony file.
    There are several possible causes for the Symphony and Symnew run numbers to be the same:
    1. MAKEPLAN did not extend the run number in the Symnew file.
    2. SWITCHPLAN was executed before MAKEPLAN.
    3. The stageman process has been run twice on the same Symnew file without resetting the plan or deleting the Symphony file.

  • AWSJCL054E The command "CONFIRM" has failed.
    AWSJPL016E An internal error has occurred. A global option "confirm run number" cannot be set.
    In general, these error messages appear when the last step of SwitchPlan, "planman confirm", fails. Analyze the SystemOut.log for more information, then rerun "planman confirm" (followed by "conman start" as described above).

FinalPostReports

The jobs in FINALPOSTREPORTS are less critical: they just check that the plan is available in the DB, update the job statistics, and generate a report on the previous plan.
[Figure: the FINALPOSTREPORTS job stream]

CheckSync

This job was introduced in 9.1 and just checks that the plan has been automatically loaded into the DB.
In case of failure:
  • run "planman checksync" manually
  • if it fails and the plan is not loaded you can try running a "planman resync"
  • if also the resync is not working, check WAS logs for errors happening during the resync process.
  • When the plan is successfully loaded on the DB you can manually complete the job with a confirm succ.
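
A minimal sketch of this recovery (the job stream and job names are the defaults, adjust them to your environment):

  # Retry the synchronization of the plan into the database
  planman checksync
  # If it keeps failing, force a full resynchronization
  planman resync
  # Once the plan is loaded into the DB, mark the job as successful
  conman "confirm FINALPOSTREPORTS.CHECKSYNC;succ"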

UpdateStats

This job runs logman to update job statistics and history and to extend the Pre-Production plan if its length is shorter than the value specified for minLen in optman.

In case of failure:
  • Rerun the job or manually run "logman <file>" on the latest schedlog file.
If you don't run this job, the statistics and history will be partial. The Pre-Production plan is updated anyway at the beginning of MakePlan.

If you need to stop it, kill the job or the logman process; the statistics and history will be partial until the job or logman is rerun.
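
A minimal sketch of the manual rerun, assuming the default schedlog directory under the TWS home and that the most recent file there is the plan you want to process:

  # Run logman against the latest archived plan
  cd $TWS_HOME
  logman schedlog/$(ls -t schedlog | head -1)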

CreatePostReports

This job just creates a report.

In case of failure:
  • Rerun the job if you need the report.

If you like this article and you find it useful, please share it on social media so other people may take advantage of it.

7 comments:

  1. Such an awesome and detailed blog on the most important job of TWS.
    Please create more such blogs, these are very helpful to many of us.

  2. Thank you for all of these detailed posts! Info on configuring High-Availability Dynamic Scheduling Brokers would be appreciated. I know many of us are looking to extend the Workload Scheduler functionality with dynamic scheduling, but have questions about how best to configure this in our existing MDM/BKM topology.

    Replies
    1. The Broker always runs on the actual MDM.
      To configure it in HA, you just need to have an MDM and a BKM linked to the same DB (as usual) and switch the broker together with the master. You can find the full switch procedure in this article: http://runtws.blogspot.co.uk/2015/03/scheduling-final-on-backup-master.html

    2. Hi, thank you for this post. I would be more interested to know about the Pre-Production plan. I know Symnew is the intermediate production plan that is created by MakePlan and later used to create the new Symphony file. What about the Pre-Production plan: is it something created under the TWS directories or in the DB?

    3. Hi, the Pre-Production plan is stored in the DB, in the JSI_JOB_STREAM_INSTANCES and JDP_JOB_STREAM_INSTANCE_DEPS tables.
      This is the same concept that in TWSz is called the LTP: it's used to calculate in advance the job stream instances for the next days (from 7 to 14 by default) and to resolve external dependencies, calculating which instance of the predecessor job stream should be the actual predecessor.
      The management of the Pre-Production plan is completely automated, with extensions / replans triggered automatically at the end of UpdateStats or at the beginning of MakePlan.

  3. Hi
    I get this message:
    "AWSJCL070I Symphony file load is not yet started."
    when I run 'planman checksync'.
    It was OK before, but now it keeps looping with this message.

    This only started recently.
    Can anybody help me?

  4. Hello,

    Thank you Franco for your blog.
    I would like to share my experience with a failed MakePlan.
    Last week, MakePlan failed with the following message:
    AWSBEH021E The user "twsuser" is not authorized to access the server on host "127.0.0.1" using port "31116".

    After calling IBM support, and some investigation, we stopped all processes and restarted them.
    After that, the 'optman ls' command returned an Oracle message:
    ORA-28001: the password has expired

    We changed the password of the TWS Oracle user and ran the changeSecurityProperties.sh script.

