Wednesday, August 19, 2015

Plan Mirroring on Database

Since version 9.1, IBM Workload Scheduler (IWS) keeps a copy of the plan in the relational database; this is often called "Plan Mirroring".

This copy of the plan is used only for plan monitoring from the UI (and from the Java APIs). It is not used for scheduling, which continues to rely on the Symphony file to ensure consistency between the master and the agents.
This change has greatly improved the scalability of the UI: performance tests have shown almost no degradation as the number of users monitoring the plan increases.




For end users or cloud customers this is completely transparent, but for IWS administrators managing an on-premises environment it introduces new components in the product architecture that need to be understood, monitored and, when necessary, managed and recovered.

As said, the plan mirroring on the DB is just a copy of the Symphony file and has no impact on job scheduling. Of course, since it is used for monitoring, it is still a critical component.


Sync phases

To understand how it works, we first need to look at how information is replicated from the Symphony file to the Plan Mirroring.
There are two different flows/phases that alternate during normal operations and keep the Plan Mirroring aligned with the Symphony file. Both run on the current master:

  • Resync: This phase takes a snapshot of the Symphony file, which is then loaded into the DB.
  • Apply Messages: This is the normal running phase. Just as batchman updates the Symphony file by applying the messages in Intercom.msg coming from other IWS components or other agents, the Plan Mirroring is updated by applying the same messages queued in mirrorbox.msg.
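
The two phases above can be pictured with a toy model. This is only an illustrative sketch: `ToyPlan`, `apply_messages` and `resync` are hypothetical names, and a plan is reduced to a dictionary of job statuses, which is far simpler than the real Symphony file or DB tables.

```python
class ToyPlan:
    """Toy stand-in for one copy of the plan: job name -> status."""

    def __init__(self):
        self.jobs = {}

    def apply(self, message):
        job, status = message
        self.jobs[job] = status


def apply_messages(messages, symphony, mirror):
    """Apply Messages phase: both copies receive the same updates."""
    for message in messages:
        symphony.apply(message)  # role of batchman, reading Intercom.msg
        mirror.apply(message)    # role of PlanUpdate, reading mirrorbox.msg


def resync(symphony, mirror):
    """Resync phase: replace the mirror content with a Symphony snapshot."""
    mirror.jobs = dict(symphony.jobs)


symphony, mirror = ToyPlan(), ToyPlan()
apply_messages([("JOB_A", "EXEC"), ("JOB_A", "SUCC")], symphony, mirror)
symphony.apply(("JOB_B", "EXEC"))  # simulate an update the mirror missed
resync(symphony, mirror)           # the snapshot realigns the two copies
print(mirror.jobs == symphony.jobs)
```

The point of the sketch is that a resync never depends on the message history: whatever the mirror missed, loading a fresh snapshot brings it back in line.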


Resyncs are triggered automatically when a new Symphony file is started (when batchman starts at the end of SwitchPlan), when a switch master is performed, or when the mirrorbox.msg queue is full (this prevents DB issues from having any impact on scheduling activities).
Resyncs can also be forced manually using planman resync.
At any time you can check the synchronization status of the mirroring using planman checksync. This is also used inside FINALPOSTREPORTS to verify that the automatic resync of the new Symphony file has completed successfully.
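
If you want to script these two commands, a small wrapper could look like the sketch below. It assumes that planman is on the PATH with the IWS environment already set, and that a non-zero return code from planman checksync means the mirroring is not in sync; neither assumption is stated above, so verify them in your environment. The helper names `run_planman` and `check_then_resync` are hypothetical.

```python
import subprocess


def run_planman(action, runner=subprocess.run):
    """Run `planman checksync` or `planman resync`; return (rc, output)."""
    if action not in ("checksync", "resync"):
        raise ValueError(f"unsupported planman action: {action}")
    result = runner(["planman", action], capture_output=True, text=True)
    return result.returncode, result.stdout


def check_then_resync(runner=subprocess.run):
    """Check the sync status and force a resync only if the check fails.

    Assumption: a non-zero return code from checksync means "not in sync".
    """
    rc, _ = run_planman("checksync", runner)
    if rc == 0:
        return "in sync"
    rc, _ = run_planman("resync", runner)
    return "resync requested" if rc == 0 else "resync failed"
```

The `runner` parameter exists only so the logic can be exercised without a real IWS installation; in normal use you would call `check_then_resync()` with no arguments.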

Components

Before looking at these phases in detail, we need to see the old and new components involved in them.

(Diagram: components marked with a star are new for Plan Mirroring.)

We already know mailman and batchman and their msg files, Mailbox.msg and Intercom.msg respectively.
The new component introduced for plan mirroring is PlanUpdate, a set of threads running inside the application server that are responsible for every update to the Plan Mirroring.
In detail, there is a main PlanUpdate thread that receives messages from mirrorbox.msg, and a set of sub threads, each with its own mirrorbox_#.msg message queue.
The main thread handles the resyncs, distributes the messages to the sub threads and processes some kinds of messages directly. The sub threads mainly process messages about jobs and job streams.
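
The main thread / sub thread split described above resembles a classic dispatcher pattern. The sketch below is a toy illustration only, not the PlanUpdate implementation: the number of sub threads, the message shape `(kind, key)` and the routing rule (same key goes to the same sub thread) are all assumptions made for the example.

```python
import queue
import threading
import zlib

NUM_SUB_THREADS = 3  # hypothetical value

sub_queues = [queue.Queue() for _ in range(NUM_SUB_THREADS)]
applied = []  # (handler, message) pairs, kept for inspection
applied_lock = threading.Lock()


def record(handler, message):
    with applied_lock:
        applied.append((handler, message))


def sub_thread(index):
    """Drain one sub queue until a None sentinel arrives."""
    while True:
        message = sub_queues[index].get()
        if message is None:
            break
        record(f"sub-{index}", message)


def dispatch(message):
    """Main-thread logic: route job/job stream messages, handle the rest."""
    kind, key = message
    if kind in ("job", "jobstream"):
        # Hypothetical routing rule: the same key always goes to the same queue.
        index = zlib.crc32(key.encode()) % NUM_SUB_THREADS
        sub_queues[index].put(message)
    else:
        record("main", message)


workers = [threading.Thread(target=sub_thread, args=(i,))
           for i in range(NUM_SUB_THREADS)]
for worker in workers:
    worker.start()
for message in [("job", "JS1.JOB_A"), ("jobstream", "JS1"),
                ("resource", "R1"), ("job", "JS1.JOB_A")]:
    dispatch(message)
for q in sub_queues:
    q.put(None)  # ask each sub thread to stop
for worker in workers:
    worker.join()
```

Giving each sub thread its own queue means a slow update on one job stream delays only the messages routed to that queue, while the main thread stays free to handle resyncs.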

Resync



  1. Resync is usually started automatically after the new Symphony starts, after a switch master or when mirrorbox.msg is full. In all these cases the process is started by mailman, which detects the situation and sends a resync message to both Batchman and PlanUpdate; both messages contain the same syncid.
    When the resync is started manually using planman the flow is similar: planman sends the resync message to Mailman, which then sends it to Batchman and PlanUpdate.
  2. The resync message has several effects on Batchman and PlanUpdate:
    1. All the previous messages in mirrorbox.msg are immediately discarded. This speeds up processing, since PlanUpdate is immediately available to start loading the new plan.
    2. PlanUpdate sends the resync message to the sub threads so that their mirrorbox queues can be purged as well.
    3. When Batchman receives the resync, it makes a copy of the Symphony file named Sinfonia.<syncid>, using the syncid received in the message.
  3. After sending the resync to the sub threads, PlanUpdate waits for Sinfonia.<syncid> to be created by Batchman. When it is available, PlanUpdate starts loading Sinfonia.<syncid> into the Mirroring DB tables.
  4. For the whole duration of the load, the APIs and the UI still see the previous plan. When the load is complete, the new plan is made available and a background thread is started to delete from the DB the rows related to the previous plan.
    If the load fails, the UI users are automatically switched to query the Symphony file, limiting the impact on end users.

During the resync process, the current status is tracked with some internal status properties stored in the DB. This information is used by planman checksync to show a human-readable message. If the resync is in progress, checksync keeps running, monitoring and reporting the resync progress.
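
The resync steps can be condensed into a small simulation. Everything here is a simplified assumption for illustration: `ToyMirror`, the status strings and the `ui_source` flag are hypothetical, the syncid is treated as an opaque string, and the real load goes through DB tables rather than dictionaries.

```python
class ToyMirror:
    """Toy stand-in for PlanUpdate plus the mirroring DB tables."""

    def __init__(self, plan):
        self.mirrorbox = []            # pending messages
        self.active_plan = dict(plan)  # what the UI and the APIs see
        self.status = "idle"           # hypothetical status values
        self.ui_source = "db"


def resync(symphony, mirror, syncid, load_ok=True):
    mirror.mirrorbox.clear()              # step 2.1: discard queued messages
    snapshot_name = f"Sinfonia.{syncid}"  # step 2.3: name of Batchman's copy
    snapshot = dict(symphony)
    mirror.status = f"loading {snapshot_name}"
    if not load_ok:
        mirror.status = "load failed"
        mirror.ui_source = "symphony"     # UI falls back to the Symphony file
        return mirror.status
    staged = dict(snapshot)               # step 3: load; UI still sees old plan
    mirror.active_plan = staged           # step 4: switch to the new plan
    mirror.status = "loaded"
    return mirror.status


mirror = ToyMirror({"JOB_A": "SUCC"})
mirror.mirrorbox.append(("JOB_A", "ABEND"))
print(resync({"JOB_B": "READY"}, mirror, "42"))
```

Note how the old plan stays visible until the very last assignment: that is the toy equivalent of the UI continuing to see the previous plan for the whole duration of the load.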

Apply Messages

This is the normal operating flow.
As it has for many years, mailman routes messages to the local Batchman via Intercom.msg, as well as to remote agents. Since 9.1, every message that needs to be sent to the local Batchman is duplicated and also sent to PlanUpdate via mirrorbox.msg.
PlanUpdate processes the message and applies the change to the Plan Mirroring, just as Batchman does for the Symphony file. Job and job stream messages are not processed directly by the main PlanUpdate thread; they are first distributed to a sub thread that handles them.

As anticipated, an issue with the DB can slow down or stop PlanUpdate message processing, causing mirrorbox.msg to grow. To avoid impacts on scheduling in this situation, when mailman detects that mirrorbox.msg is full, instead of waiting it starts a new resync, which has the immediate effect of purging mirrorbox.msg. This is also fine from a performance perspective, since it's faster to load a large Symphony file into the DB than to apply 20MB of messages.
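
The "purge instead of wait" behaviour can be sketched as follows. This is a toy model under stated assumptions: the function name, the `RESYNC` marker and the limit expressed as a message count are hypothetical (the real queue is a file limited by size), and queuing the triggering message right after the marker is a simplification.

```python
RESYNC = "RESYNC"  # hypothetical marker standing in for the resync message


def send_to_mirrorbox(mirrorbox, message, max_messages):
    """Queue a message; if the queue is full, purge it and start a resync."""
    if len(mirrorbox) >= max_messages:
        mirrorbox.clear()          # the backlog is dropped, never waited on
        mirrorbox.append(RESYNC)   # the snapshot load replaces the backlog
        mirrorbox.append(message)
        return "resync"
    mirrorbox.append(message)
    return "queued"


box = []
results = [send_to_mirrorbox(box, f"msg{i}", 3) for i in range(5)]
print(results)
print(box)
```

The sender never blocks: when the consumer falls too far behind, the accumulated messages are traded for a single snapshot load.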

Recovery

There are 2 main error scenarios that an IWS administrator needs to be able to handle quickly.


Resync fails (e.g. the CHECKSYNC job in FINALPOSTREPORTS abends). In this case:
  • Run "planman checksync" manually to check whether the resync has worked.
  • If it fails and the plan is not loaded, you can try running "planman resync".
  • If the resync does not work either, check the WAS logs for errors occurring during the resync process.
  • When the plan is successfully loaded into the DB, you can manually complete the CHECKSYNC job with a confirm succ.
The mirroring is out of sync, i.e. the information you see from the UI is different from what you see from conman:
  • This can simply be a delay: messages follow different flows and may be applied at different times to the Symphony and the Plan Mirroring. You can check the size of mirrorbox*.msg and Intercom.msg (preferably using evtsize -show) to see whether there is any delay in applying messages.
  • If this is not a delay issue, the problem needs to be investigated by IBM support. After you have collected the required documentation for the analysis, you can realign what users see from the UI by running a planman resync.
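
For a quick first look at the queues, the on-disk size of the msg files can be listed with a few lines of Python. This is only a rough sketch: `queue_file_sizes` is a hypothetical helper, the directory holding the msg files is passed in because its location depends on your installation, and the file size on disk is a coarser indicator than what evtsize -show reports.

```python
import glob
import os


def queue_file_sizes(directory):
    """Return {file name: size in bytes} for the mirroring-related queues."""
    sizes = {}
    for pattern in ("mirrorbox*.msg", "Intercom.msg"):
        for path in glob.glob(os.path.join(directory, pattern)):
            sizes[os.path.basename(path)] = os.path.getsize(path)
    return sizes
```

Calling it a few times at short intervals shows whether a queue keeps growing, which would point to a delay in applying messages and not to a real misalignment.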


Conclusion

Even though the topic is not easy and touches very low-level details of how IWS works, I hope I was successful in providing the background needed to understand how things work and how to analyze and recover from error situations.
Let me know in the comments if you have any questions.

If you like this article and you find it useful, please share it on social media so other people may take advantage of it.

13 comments:

  1. Hi, Beautiful post. thanks for the details. I would like to know whether there is a way we could disable the mirroring feature in TWS.

    1. Hi, why do you want to disable mirroring?
      The mirroring is intended to be always on; however, IBM support has a procedure to bypass mirroring as a temporary workaround until they provide a fix.
      This procedure is not public because we would like to know about and fix any issue related to mirroring; we don't want customers to just disable it without reporting the issue.
      So, if you have an issue with mirroring that requires temporarily disabling it, please open a PMR.

  2. Great post, thanks Franco!

  3. Just started reading these great blogs as we're upgrading to v9.3. Extremely useful insight into the new functionality - will need to re-read a few times to fully absorb!

    Many thanks,

    Ray

  4. Very good article Franco...Does it mean the mirrorbox.msg will never corrupt and it will take care of itself if it increases abruptly

    1. About corruption: the management of error conditions has been improved over the last year for both the msg files and the Symphony file, but it's not possible to guarantee that no file corruption will ever happen, since it can be due to external causes.

      About a full mirrorbox.msg: correct, TWS handles this automatically for mirrorbox, cleaning up the queue and re-syncing the mirroring from the Symphony file.
      You don't have to worry about this situation, but of course this doesn't mean it's not a problem: a growing mirrorbox.msg is still the symptom of a bottleneck that is introducing a delay.

  5. Hello, my name is Antonio.
    We opened a PMR because the DWC did not cool properly, in the open PMR we were told to run planman resync from time to time, while leaving a fix pack to fix the problem.
    We are having many problems with resynchronization. fills up mirrorbox.msg and starts to duplicate Symphony continuously, filling the MDM file system and stopping all production. When will the fixpack for version 9.2 come out?

    1. Hi Antonio, you should ask this question via the official PMR.
      If you need to run planman resync too often, you can also ask in the PMR how to disable the mirroring until the fix is available.

  7. Hello,

    If you disable Plan Mirroring, TDWC does not update correctly, right?

    thanks in advance

    1. L3 should be able to temporarily disable mirroring and let DWC access Symphony directly. I suggest opening a PMR describing the problem you have with mirroring, in order to get a solution that lets you keep using mirroring, with disabling it only as a temporary solution.

  8. Hi Franco,
    Great Article..

    When we issue a planman resync command,we get output as "Symphony load not yet started".
    Does this mean the mirrorbox.msg file is corrupt or the DB server is not accepting the msg from the mirrorbox.
    Would a DB restart help to mitigate the issue?
    Regards,
    Sid

    1. "AWSJCL070I Symphony file load is not yet started" should mean that the application server has not yet received the resync request via mirrorbox.msg.
      Corruption is rare, and you should see other error messages in the application server log or from planman itself.
      It could be either that the mirrorbox thread is stuck, e.g. due to a deadlock, or that there is some other problem with mirrorbox.msg.
      E.g. on Linux, when you delete a file, it is removed from the directory, but the processes that have the file open continue to read the old file. So if you delete mirrorbox.msg without restarting the application server, the application server is still using the old mirrorbox, while planman is sending the resync message to the new mirrorbox.msg.
