IBM Power Systems™ servers are designed to offer very high
stand-alone availability in the industry. Even so, enterprises must occasionally
restructure their infrastructure to meet new IT requirements and handle
scheduled outages (such as system maintenance).
MPIO best practices have never been officially documented. There have been some documents and IBM Redbooks® that have briefly mentioned certain MPIO aspects for specific scenarios and environments, but recommendations pertaining to MPIO configurations, in general, have been lacking.
System reliability and availability are increased by a careful consideration of the user-modifiable options in each system configuration. This article outlines the best practice configuration considerations that pertain to MPIO on AIX.
Some of the features described in this article are specific to particular technology levels of AIX, or are specific to the path control module (PCM) supplied with AIX. If using Subsystem Device Driver Path Control Module (SDDPCM) or a vendor-supplied Object Data Manager (ODM) package (often referred to as a host attachment kit, or something similar), then some of these options might be unavailable, and other options can be added.
The AIX MPIO infrastructure allows IBM or third-party storage vendors to supply ODM definitions, which have unique default values for the important disk attributes. Thus, for example, the default value for attributes on an hdisk representing a logical unit number (LUN) from an IBM System Storage® SAN Volume Controller (SVC) might be different from the default values for an hdisk representing a LUN on an IBM System Storage DS8000® system. As a result, the default values for the attributes are appropriate for most situations. Generally, the hdisk attributes should be left at their default values, especially the attributes that are not mentioned in this article.
The disk attributes described in the following sections can be displayed by using the lsattr command and can be changed with the chdev command. The path attributes, such as path_priority, can be displayed or set by using the lspath and chpath commands. Refer to the AIX publications or AIX man pages for details on those commands.

This article does not address the attributes associated with the adapters being used to attach to the MPIO devices. Some of those attributes might also affect error detection and recovery times. In particular, the fc_err_recov attribute for Fibre Channel adapters is an important one to consider.
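For reference, a minimal sketch of displaying these attributes follows (hdisk0 and fscsi0 are placeholder device names; substitute the devices on your system):

# Display the effective MPIO-related attributes of a disk
lsattr -El hdisk0

# Display the paths, and their status, for the same disk
lspath -l hdisk0

# Display the Fibre Channel protocol device attributes, including fc_err_recov
lsattr -El fscsi0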
Consideration 1: MPIO algorithm and path_priority

The MPIO algorithm setting determines whether:
- The PCM can attempt to distribute I/O across all available paths to a given LUN
- The I/O will be active only on one path at a time
- The I/O flow will be weighted based on a combination of the algorithm setting and the path_priority settings per disk

algorithm = fail_over

This is the default algorithm for most disks using the ODM definitions included with AIX. Some third-party ODMs use a different default value.
With this algorithm, I/O can only be routed down one path at a time. With algorithm=fail_over, the PCM keeps track of all the enabled paths (per disk) in an ordered list. If the path being used to send I/O fails or is disabled, the next enabled path in the list is selected and I/O is routed to that path. The sequence for path selection within the list is customizable by modifying the path_priority attribute on each path, which will then sort the list by ascending path_priority value.

The fail_over algorithm is always used for virtual SCSI (vSCSI) disks on a Virtual I/O Server (VIOS) client, although the backing devices on the VIOS instance might still use round_robin, if required. fail_over is also the only algorithm that can be used with SCSI-2 reserves (reserve_policy=single_path).
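For illustration, the following hedged sketch shows how the path list and a path's path_priority attribute might be inspected (hdisk0, fscsi0, and the connection string are placeholders; the exact flags can vary by AIX level, so check the lspath man page):

# List the paths for the disk along with their status and parent adapter
lspath -l hdisk0

# Display the attributes (including path_priority) of one specific path;
# the -w connection value below is a placeholder for the real value
lspath -AHE -l hdisk0 -p fscsi0 -w "500507680110aaaa,4010400000000000"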
algorithm = round_robin

With this algorithm, I/O will be distributed and activated across all enabled paths to a disk. The percentage of I/O routed down each path can be weighted by setting the path_priority attribute on each path for each disk. If a path fails or is disabled, it is no longer used for sending I/O. The priority of the remaining paths is then recalculated to determine the percentage of I/O that should be sent down each path. If all paths have the same path_priority value, the PCM attempts to distribute I/O equally across all enabled paths. For optimal performance in a failed path scenario, ensure that the ordered path list alternates paths between separate fabrics.
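As a sketch of weighting the round_robin distribution (the device names are placeholders; -w is also required when more than one path shares the same parent adapter, and the mapping from priority values to I/O share is described in the chpath man page):

# Give the two paths different path_priority values to weight the I/O split
chpath -l hdisk0 -p fscsi0 -a path_priority=1
chpath -l hdisk0 -p fscsi1 -a path_priority=2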
algorithm = shortest_queue

This algorithm is available in the latest technology levels of AIX for some devices. The algorithm behaves very similarly to round_robin when the load is light. When the load increases, this algorithm favors the path that has the fewest active I/O operations. Thus, if one path is slow due to congestion in the storage area network (SAN), the other less-congested paths are used for more of the I/O operations. The path priority values are ignored by this algorithm.

Recommendation: If using SCSI-2 reserves or vSCSI disks, then fail_over must be used. For other situations, shortest_queue (if available) or round_robin enables maximum use of the SAN resources.
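A minimal sketch of applying this recommendation follows (hdisk0 is a placeholder name; the disk must be out of use unless the dynamic update described at the end of this article is available):

# Check which algorithm values this device supports
lsattr -Rl hdisk0 -a algorithm

# Use shortest_queue if it is listed; otherwise fall back to round_robin
chdev -l hdisk0 -a algorithm=shortest_queue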
Consideration 2: Path health check settings

Path health check mode (hcheck_mode)
The path health check mode determines the paths that the MPIO path health checker will probe for path availability during normal business operations. The health checker never probes paths that are in a Disabled or Missing state. Paths in those two states must be recovered manually with chpath (for Disabled paths) or with cfgmgr (for Missing paths). If a disk is not open and in use, as is the case, for instance, when its volume group is varied off, no path health checks will take place down any path for that disk.

There are three possible modes for the MPIO path health checker.
hcheck_mode = nonactive: In this mode, the PCM sends health check commands down paths which have no active I/O. That includes paths with a state of failed. If the algorithm selected is fail_over, then the health check command is also sent on each of the paths that have a state of enabled but have no active I/O. If the algorithm selected is round_robin or shortest_queue, then the health check command is only sent on paths with a state of failed, because the round_robin and shortest_queue algorithms both keep all enabled paths active with I/O when the disk is in use. If the disk is idle, the health check command is sent on any paths that do not have a pending I/O at the expiration of the health check interval.

hcheck_mode = enabled: In this mode, the PCM sends health check commands down all enabled paths, even paths that have other active I/O at the time of the health check.

hcheck_mode = failed: In this mode, the PCM only sends path health checks down paths that are marked as failed.

Recommendation: The default value of hcheck_mode for all devices is nonactive, and there is little reason to change this value unless business or application requirements dictate otherwise.
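To confirm that a disk is using the default, a quick check such as the following can be used (hdisk0 is a placeholder):

# Display the current health check mode for the disk
lsattr -El hdisk0 -a hcheck_mode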
Path health check interval (hcheck_interval)
The path health check interval is the interval, in seconds, at which MPIO path health checks will probe and check path availability of open disks, based on the hcheck_mode setting. An hcheck_interval = 0 setting disables MPIO's path health check mechanism, which means any failed paths require manual intervention to recover or re-enable.

Recommendation: The best practice guideline for hcheck_interval is that it should be greater than or equal to the rw_timeout (read/write timeout) value on the disks. Also note that it is not a good idea to lower the rw_timeout value in order to set a lower health check interval. The default rw_timeout values set in ODM are based on the recommendations of the device manufacturers for each device type. The following section provides technical details regarding this best practice recommendation.

It might be tempting to think that a smaller health check interval is preferable as it might lead to faster detection or recovery of failed paths. However, the cost of setting a lower health check interval far outweighs the benefits. There are several reasons for this:
- Because the health check commands can be sent on every path of every open disk (depending on hcheck_mode) at the expiration of the health check interval, a small health check interval can quickly use up a lot of bandwidth on the SAN if there are a large number of disks.
- The health check commands count against the disk's queue_depth (only to be changed upon recommendation from the storage vendor), and they receive a higher priority for processing than normal user I/O. Because error scenarios typically take longer than good path scenarios, a small health check interval can negatively impact the user I/O on good paths when there are one or more failing paths. Note that because queue_depth is a function of the disk driver, queue_depth is on a per-LUN basis rather than a per-path basis. For example, assume that a device has a queue_depth of 8, with eight paths. If four of those paths have failed, the health check commands on those paths might take anywhere from a few seconds up to rw_timeout to fail. During that time, at least four of the eight commands in the queue_depth will be consumed by the health check commands, leaving an effective queue_depth of only four commands for the good paths and regular I/O for that disk.
- It is not always desirable to recover a path quickly. In a situation where a link is suffering from repeated, intermittent failures, the more quickly the link is recovered by a health check command, the more likely it is that a user I/O will be sent on that link only to fail due to the intermittent errors. A longer health check interval reduces the use of links with frequent but intermittent failures.
- AIX implements an emergency last gasp health check to recover paths when needed. If a device has only one non-failed path and an error is detected on that last path, AIX sends a health check command on all of the other failed paths before retrying the I/O, regardless of the health check interval setting. This eliminates the need for a small health check interval to recover paths quickly. If there is at least one good path, AIX discovers it and uses it before failing user I/O, regardless of the health check interval setting.
For most cases, the default value of hcheck_interval is appropriate. There have been some storage vendors who, in older versions of their ODM definitions, had set hcheck_interval to a value smaller than the rw_timeout value. The previous recommendation from AIX development stands in those cases: increase hcheck_interval so that it is greater than or equal to the rw_timeout value. It is much more likely to be a good idea to increase the health check interval than to decrease it. Better performance is achieved when hcheck_interval is slightly greater than the rw_timeout value on the disks.

Extreme cases of the problems described in bullets 2 and 3 above can cause severe degradation of I/O performance if the health check interval is set to a small value.
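A hedged sketch of verifying this relationship on a disk follows (hdisk0 is a placeholder; the chdev step is only needed if hcheck_interval turns out to be smaller than rw_timeout, and the value 60 is just an illustration):

# Compare the health check interval with the read/write timeout
lsattr -El hdisk0 -a hcheck_interval -a rw_timeout

# If hcheck_interval is smaller than rw_timeout, raise it so that it is
# greater than or equal to rw_timeout
chdev -l hdisk0 -a hcheck_interval=60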
Consideration 3: Timeout policy
Recent technology levels of AIX include a timeout_policy attribute for some devices. This attribute indicates the action that the PCM should take when a command timeout occurs. A command timeout occurs when an I/O operation fails to complete within the rw_timeout value on the disk. There are three possible values for timeout_policy.

timeout_policy = retry_path: This represents the legacy behavior, where a command may be retried on the same path that just experienced a command timeout. This is likely to lead to delays in the I/O recovery, as it is likely that the command will continue to fail on this path. Only after several consecutive failures will AIX fail the path and try the I/O on an alternate path.

timeout_policy = fail_path: This setting causes AIX to fail the path after a single command timeout, assuming that the device has at least one other path that is not in the failed state. Failing the path forces the I/O to be retried on a different path. This can lead to much quicker recovery from a command timeout and also much quicker detection of situations where all paths to a device have failed. A path that is failed due to timeout policy can later be recovered by the AIX health check commands. However, AIX avoids using the path for user I/O for a period of time after it recovers to help ensure that the path is not experiencing repeated failures. (Other PCMs might not implement this grace period.)

timeout_policy = disable_path: This setting causes the path to be disabled. A disabled path is only recovered by manual user intervention using the chpath command to re-enable the path.

Recommendation: If this attribute is available on the device, a value of fail_path is the recommended setting.
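If the attribute is present on the device, a minimal sketch of checking and applying the recommended value follows (hdisk0 is a placeholder):

# Verify that the device supports timeout_policy and list the allowed values
lsattr -Rl hdisk0 -a timeout_policy

# Apply the recommended setting
chdev -l hdisk0 -a timeout_policy=fail_path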
Consideration 4: How many paths to configure for AIX MPIO

In an MPIO configuration, more is not necessarily better. In fact, an excessive number of paths in an MPIO configuration can actually contribute to system and application performance degradation in the event of SAN, storage, or Fibre Channel fabric issues or failures.

The general recommendation for the number of paths to configure in an MPIO environment is 4 to 8 per LUN, with 16 paths recommended as the maximum, to be used only in specialized situations. It is important to note that MPIO does support many more paths than 8 or 16, but from a design and functional perspective, four to eight paths have proven to be the most effective.
Businesses that need to configure more than eight paths per LUN should carefully consider the following details:
- When an error is encountered on an MPIO disk, error recovery normally takes place down all configured paths. The most common types of disk or SAN errors that occur will also lead to multiple retry attempts on each path for each failed I/O. With "N" paths, there could easily be a situation where a disk encounters an error that would lead to five tries on each path, multiplied by the rw_timeout value on the disks. So, total recovery time per I/O could potentially be (N * rw_timeout value * 5). This situation is somewhat ameliorated by setting the timeout_policy attribute to fail_path, if that attribute is available with the device type that is being used. However, the timeout_policy attribute cannot account for all possible error scenarios.
- With the round_robin algorithm, having too many paths results in overhead as the PCM attempts to load balance I/O among the many paths.
- With the fail_over algorithm, the PCM encounters additional overhead in determining the paths to use for failover in a failed path scenario.
- Each configured path requires additional memory in AIX, as each path is represented by data structures in the MPIO device drivers. Having too many paths to a large number of disks can reduce the amount of memory available to the rest of the system for running applications.
- As noted above, the health check commands count against the queue depth for the device. So, health check processing has a greater effect on devices with a large number of paths, especially with devices that have smaller queue depths, and especially when there are paths in the failed state.
One possible eight-path configuration that provides full redundancy uses two distinct SAN fabrics. The AIX node and the storage device each have two ports connected to each of the two SAN fabrics, using a total of four ports on AIX and four ports on the storage device. There are four paths between AIX and the storage device for each of the two distinct SAN fabrics, for a total of eight paths. Thus, there is no single point of failure for either SAN fabric, and there are redundant SAN fabrics. (Note: This is just an example. It is completely possible to have full redundancy with four paths per LUN using dual fabrics.)
The only case for more than eight paths is for specialized storage devices that configure a cluster of controllers, or for devices using Peer-to-Peer Remote Copy (PPRC). For example, an hdisk representing an IBM HyperSwap® pair of LUNs on two DS8000 devices could have 16 paths if each of the DS8000 systems used to form the HyperSwap pair are configured in the 8-path configuration described above. After the two 8-path hdisks are configured as a single HyperSwap enabled hdisk, it will have 16 paths.
There are other possible configurations beyond what is described here that can be considered. However, as noted above, going beyond eight paths can be more problematic than helpful, and should be carefully considered.
Recommendation: Configure 4 or 8 paths per disk, or up to 16 paths for rare situations. Carefully consider the impacts of extra, unnecessary redundancy before using more paths.
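To review how many paths are currently configured per LUN, a simple sketch such as the following can be used (it assumes the default lspath output of status, disk name, and parent):

# Count the configured paths for each MPIO disk
lspath | awk '{count[$2]++} END {for (d in count) print d, count[d]}'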
Consideration 5: Operational considerations
Scheduled maintenance: AIX MPIO is capable of robust error detection and recovery. However, this error detection and recovery might take some time, and that delay might impact applications. If scheduled maintenance is planned for a SAN or for a storage device, it is best to identify the disk paths that will be impacted by that maintenance and use the rmpath command to manually disable those paths before starting the maintenance. AIX MPIO stops using any Disabled or Defined paths, and therefore, no error detection or recovery will be done as a result of the scheduled maintenance. This ensures that the AIX host does not go into extended error recovery for a scheduled maintenance activity. After the maintenance is complete, the paths can be re-enabled with cfgmgr. (Note: When disabling multiple paths for multiple LUNs, rmpath is simpler than chpath, as it does not have to be run on a per-disk basis.) The lspath command (or in newer technology levels, the lsmpio command) can be used to determine the MPIO paths that are associated with specific SAN ports.
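A hedged sketch of this procedure, assuming that the maintenance affects everything reachable through the adapter fscsi0 (a placeholder name):

# Identify the paths that go through the affected adapter
lspath -p fscsi0

# Move those paths to the Defined state before the maintenance window
rmpath -p fscsi0

# After the maintenance is complete, reconfigure the paths
cfgmgr -l fscsi0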
Changing attributes: For most attributes and most levels of AIX, attributes could historically only be changed on devices that were not in use. For disks, this meant that the disk must be closed (for example, volume group varied off) in order to change attributes. If the disk could not be closed, such as the disks containing rootvg, then the user had to include the -P flag in the chdev command to write the attribute change to ODM and then restart AIX in order for the attribute to take effect.

For the newest technology levels of AIX (at the time of publishing this article), some disk attributes on some devices support the -U flag on the chdev command. This flag instructs chdev to attempt a dynamic update of the attribute value. With this flag, the attribute value can be changed without closing the disk and the change takes effect immediately.
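As an illustration of the three approaches (hdisk0 and datavg are placeholder names; the -U flag works only for attributes and AIX levels that support dynamic updates):

# Traditional approach: close the disk, change the attribute, reopen it
varyoffvg datavg
chdev -l hdisk0 -a algorithm=shortest_queue
varyonvg datavg

# For a disk that cannot be closed (for example, rootvg): defer until reboot
chdev -l hdisk0 -a algorithm=shortest_queue -P

# On newer technology levels, some attributes can be changed while the disk is open
chdev -l hdisk0 -a algorithm=shortest_queue -U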