Monitoring System Health

System Health allows administrators to check the operation of the various components of the Manager. If it detects a process error, it sends a trap notification. By default, System Health is not enabled upon installation; you must enable it through the Manager UI. Once you enable SNMP trap notifications, the trap packet contains the name of the process that failed, as well as the address of the Manager.

Configure

You can set the times related to process monitor testing. The times include how often the test should run (its interval) and how many seconds the test can run with no response before a timeout occurs.

To configure the process monitor tests:

  1. From the Manager, select Administration>Alarms>System Health.
  2. Click Configure.
  3. In the System Health Configuration window, select the Settings tab.
  4. Configure the Process Monitor Settings (described below). The default value for each test interval is 60 seconds, and the default timeout is 15 seconds (Web Server Test Interval is 60).
  5. Click OK.
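
The relationship between a test interval and a timeout can be sketched in Python. This is only an illustration of the concept, not the Manager's implementation: the function names run_test_with_timeout and monitor are hypothetical, and the test is assumed to be any callable that returns True on success.

```python
import concurrent.futures
import time


def run_test_with_timeout(test, timeout_seconds):
    """Run one health test and classify the outcome.

    `test` is any zero-argument callable that returns True on success.
    Returns "ok", "failed", or "timeout".
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(test)
    try:
        passed = future.result(timeout=timeout_seconds)
        return "ok" if passed else "failed"
    except concurrent.futures.TimeoutError:
        # No response within the timeout window.
        return "timeout"
    except Exception:
        # The test itself raised an error, which counts as a failure.
        return "failed"
    finally:
        # Do not block on a test that is still running.
        executor.shutdown(wait=False)


def monitor(test, interval_seconds, timeout_seconds, rounds):
    """Run `test` `rounds` times, pausing `interval_seconds` between runs."""
    results = []
    for i in range(rounds):
        results.append(run_test_with_timeout(test, timeout_seconds))
        if i < rounds - 1:
            time.sleep(interval_seconds)
    return results
```

With the documented defaults, a call would look like monitor(some_test, 60, 15, rounds), where some_test stands in for one of the checks described below.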

Process Monitor Settings

Database Test Interval
Checks whether the process monitor can log in to the database using JDBC. The login connection parameters are found in /usr/signiant/dds/web/signiant.ini [JDBCCLASSNAME, DBURL, DBUSER, DBPASS].

To simulate an error condition, shut down the database without using the normal siginit scripts, or manually change a connection parameter in signiant.ini.
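
Before simulating an error by editing signiant.ini, it can help to confirm which connection parameters are currently set. The following Python sketch does that under a loud assumption: the file is taken to contain plain KEY=VALUE lines, which this guide does not confirm. The function names are hypothetical.

```python
# Keys named in this guide for the database test.
DATABASE_KEYS = ("JDBCCLASSNAME", "DBURL", "DBUSER", "DBPASS")


def read_settings(text, keys):
    """Collect the requested keys from KEY=VALUE text.

    Assumes one KEY=VALUE pair per line; blank lines and lines
    starting with '#' or ';' are skipped. The real file syntax
    may differ.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in keys:
            values[key] = value.strip()
    return values


def missing_keys(text, keys=DATABASE_KEYS):
    """Return the keys that are absent or empty in the given text."""
    found = read_settings(text, keys)
    return [k for k in keys if not found.get(k)]
```

To use it, read the file contents into a string and pass them to missing_keys; an empty list means every parameter the database test needs has a value.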

Web Server Test Interval
Checks whether the process monitor can get the index page for the Manager UI from the web server. The index page is defined by combining values from /usr/signiant/dds/web/signiant.ini [APPROOTURL, SCHDREPORTINGSTYLESHEETPATH] and making a connection to the local host.

To simulate an error condition, shut down the database without using the normal siginit scripts, or manually change a connection parameter in signiant.ini.

Scheduler Server Test Interval
Checks whether the process monitor can connect to the scheduler (i.e., ddsschsrvr) using the port defined as SCHDSRVRPORT in the file /usr/signiant/dds/web/signiant.ini. The client will return a prompt string.

To simulate an error condition, kill the scheduler from the shell.
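
A comparable check can be made by hand with a short Python sketch: open a TCP connection to the scheduler port and see whether anything comes back. The function name scheduler_responds is hypothetical, and because this guide does not specify the content of the prompt string, the sketch only checks that some bytes are received.

```python
import socket


def scheduler_responds(host, port, timeout_seconds=15.0):
    """Return True if a TCP connection succeeds and any data is received.

    `port` is expected to be the SCHDSRVRPORT value from signiant.ini.
    The prompt content is not validated, because it is not documented here.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds) as conn:
            conn.settimeout(timeout_seconds)
            prompt = conn.recv(1024)
            return len(prompt) > 0
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

After killing the scheduler from the shell, the same call should return False, because the connection is refused.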

Rules Server Test Interval
Checks whether the process monitor can log in to the rules server (database server port as defined in /etc/dds.conf).

To simulate an error condition, shut down the database without using the normal siginit scripts, or manually change a connection parameter in /etc/dds.conf.

Process Controller Test Interval
Checks whether the process monitor can connect to the process controller port as defined by /etc/dds.conf and get a process ID returned.

To simulate an error condition, shut down the database without using the normal siginit scripts, or manually change a connection parameter in /etc/dds.conf.

Certificate Authority Test Interval
Checks if the process dds_ca is running.
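
One way to make the same kind of check manually is sketched below in Python. It assumes a Linux system where each process exposes its command name in /proc/<pid>/comm; how System Health itself detects dds_ca is not described in this guide, and the function name is hypothetical.

```python
import os


def process_is_running(name, proc_root="/proc"):
    """Return True if any process under proc_root has the given command name.

    Assumes a Linux-style /proc layout (one numeric directory per
    process, each with a 'comm' file holding the command name).
    """
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return False
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    return True
        except OSError:
            # The process exited while scanning, or access was denied.
            continue
    return False
```

For example, process_is_running("dds_ca") returns True only while a process with that command name exists.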

Free Disk Space

From time to time you may want to check how much disk space is free on any of the drive mounts on the Manager. Once the amount of free disk space on any of the drive mounts drops below the user-specified threshold amount, a warning is triggered and emailed to the person identified for notification in the Process Monitor Notification Configuration screen.

To configure the free disk space test, use the Free Disk Space settings on the Settings tab in the System Health Configuration window:

  1. In Test Interval, specify test frequency (in minutes).
  2. In Drive Mounts, specify the drives on the Manager that you want to test.
  3. In Threshold, specify the minimum amount of free disk space, expressed as a percentage, allowed before a warning is triggered.
  4. Click OK.
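
The threshold comparison described above can be illustrated with a small Python sketch. It is not the Manager's code: the function names are hypothetical, and the mount paths passed in are whatever you would enter in Drive Mounts.

```python
import shutil


def free_percent(mount):
    """Return the free space on a mount as a percentage of its total size."""
    usage = shutil.disk_usage(mount)
    return 100.0 * usage.free / usage.total


def mounts_below_threshold(mounts, threshold_percent, free_percent_fn=free_percent):
    """Return the mounts whose free space is below the threshold percentage.

    `free_percent_fn` can be replaced for testing without touching real disks.
    """
    return [m for m in mounts if free_percent_fn(m) < threshold_percent]
```

For example, mounts_below_threshold(["/", "/var"], 10) lists the mounts with less than 10 percent free space, which are the ones that would trigger a warning.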

Notification

To configure System Health email or SNMP notifications:

  1. From the Manager, select Administration>Alarms>System Health.

  2. In the System Health Configuration window, select the Notification tab.

  3. To send a notification email, select Enable Notification Mail. When this option is enabled, System Health sends any error messages to the specified email addresses. The message contains the name of the process that failed, as well as the address of the Manager. This notification email is sent only on the initial detection.

  4. Configure the settings in the following fields:

    • Enable Timeout Mail: When timeout mail is enabled, System Health will send notification of any timeouts it detects. The message contains the name of the process that timed out, as well as the address of the Manager.
      Note: When this option is enabled, you need to set the timeout values appropriately under Process Monitor Test Configuration. By default they are all set to 5 seconds, but depending on the processing power and size of your system, these may need to be adjusted.
    • Notification Mail To: The email address (or addresses, separated by commas) to which you want email notification messages to go.
    • Notification Mail CC: The email address (or addresses, separated by commas) to which you want to carbon copy email notification messages.
    • Notification Mail BCC: The email address (or addresses, separated by commas) to which you want to blind carbon copy email notification messages.
    • Notification Mail From Name: The email address you want to appear in the From field for any notification messages from System Health.
    • Notification Mail From Mail: The email address you want to appear in the Mail field for any notification messages from System Health.
    • Notification Mail Subject: The subject of notification emails.
  5. To enable SNMP, select Enable SNMP and configure the following settings:

    • SNMP Trap Hosts: Specify the IP address(es) or domain name(s) of the machine(s) that are SNMP trap host(s).
    • SNMP Community String: Specify the SNMP password.
  6. Click OK.
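
To show how the mail fields above fit together, here is a minimal Python sketch that assembles a comparable message. It rests on several assumptions: the function name, the default subject, and the body wording are hypothetical; the sketch treats From Name as a display name paired with the From Mail address; and sending the message (for example with smtplib) is not shown.

```python
from email.message import EmailMessage
from email.utils import formataddr


def build_notification(process_name, manager_address, mail_to, mail_from,
                       from_name="", subject="System Health notification",
                       mail_cc="", mail_bcc=""):
    """Build a failure notification carrying the process name and Manager address.

    `mail_to`, `mail_cc`, and `mail_bcc` are comma-separated address strings,
    matching the Notification Mail To/CC/BCC fields.
    """
    msg = EmailMessage()
    msg["From"] = formataddr((from_name, mail_from))
    msg["To"] = mail_to
    if mail_cc:
        msg["Cc"] = mail_cc
    if mail_bcc:
        msg["Bcc"] = mail_bcc
    msg["Subject"] = subject
    # Hypothetical body layout; the real message format is not documented here.
    msg.set_content(
        f"Process failed: {process_name}\nManager: {manager_address}\n"
    )
    return msg
```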

System Health traps (in standard MIB format)

pmuStatusFailure TRAP-TYPE
Indicates that System Health is reporting one or more failed tests. The names of the failed tests are sent in separate traps.
::= 100

pmuStatusOk TRAP-TYPE
Indicates that System Health is reporting that all tests have been passed successfully. This trap will only be sent if a previous pmuStatusFailure trap has been sent.
::= 101

pmuTestFailure TRAP-TYPE
Indicates that System Health is reporting a failed test. The name of the test that failed is sent as a string along with the trap, one test per trap.
::= 110

pmuTestOk TRAP-TYPE
Indicates that System Health is reporting that a test which had previously failed has now passed. The name of the test is sent as a string along with the trap, one test per trap.
::= 111
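
A trap receiver needs to map these numbers back to their meanings. The Python sketch below shows one way to do that, using only the four trap numbers listed above. The class and function names are hypothetical, and receiving or decoding the SNMP packet itself is outside the scope of the sketch.

```python
from enum import IntEnum


class SystemHealthTrap(IntEnum):
    """Trap numbers as listed in this section."""
    pmuStatusFailure = 100
    pmuStatusOk = 101
    pmuTestFailure = 110
    pmuTestOk = 111


# Traps that carry the name of a single test as a string.
PER_TEST_TRAPS = (SystemHealthTrap.pmuTestFailure, SystemHealthTrap.pmuTestOk)


def describe_trap(trap_number, test_name=None):
    """Return a short label for a trap; raises ValueError for unknown numbers."""
    trap = SystemHealthTrap(trap_number)
    if trap in PER_TEST_TRAPS:
        return f"{trap.name}: {test_name}"
    return trap.name
```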

Run Tests

Checking the process status allows users to see the state of each of the Manager components. The state is displayed as Running, Starting, Stopping, Stopped, Problem or Timing Out.

To check the process status:

  1. From the Manager, select Administration>Alarms>System Health.
  2. Click Run All Tests to display the current status of the Manager components.