What should I check in a customer's cluster environment if I suspect that the cluster setup is causing the problem?

General Setup

The Cluster Service (ClusSvc) must run under a domain account that is a member of each node's local Administrators group (directly, not via a global group) and that has the "Log on as a service" and "Lock pages in memory" user rights. Make sure the password for the account does not expire.

Set the boot delay time to a different value on each server (for example, 5 s on one and 30 s on the other).

Do not turn both servers on simultaneously while they are connected to the shared storage unit; wait until the Cluster Service is running on at least one of them.

Ensure reliable connectivity to a domain controller or set both servers as backup domain controllers.

Change the Cluster log size from 64kB (default) to 128kB.

Do not configure immediate automatic failback. Set failback to occur during off-peak hours (e.g. between 21 and 6), or run it manually if needed.
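A failback window such as 21 to 6 wraps past midnight, which is easy to get wrong when scripting around it. The following is a minimal Python sketch of the hour check, not anything the Cluster service itself provides. The function name in_failback_window is made up for illustration, and treating the end hour as exclusive is an assumption of the sketch.

```python
def in_failback_window(hour, start=21, end=6):
    """Return True if `hour` (0-23) falls inside the failback window.

    The window runs from `start` (inclusive) to `end` (exclusive, an
    assumption of this sketch) and may wrap past midnight, as the
    21-to-6 example in the text does.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in the range 0-23")
    if start == end:
        # Assumption: equal bounds mean the window covers the whole day.
        return True
    if start < end:
        # Ordinary window inside a single day, e.g. 9 -> 17.
        return start <= hour < end
    # Window wraps past midnight, e.g. 21 -> 6.
    return hour >= start or hour < end
```

With the defaults, hours 21 through 23 and 0 through 5 fall inside the window, and midday hours fall outside it.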

Place the quorum on its own disk resource (pick a small, mirrored drive if possible; the quorum uses no more than a few hundred kB).

Make sure that the cluster name group contains only the cluster IP address and the cluster name.

For every resource, set the advanced "Looks Alive" and "Is Alive" settings to "Use value from resource type".

If running NT 4.0 EE with SP5, run Cluster Administrator from machines that also have SP5 installed (or that run the SP5 version of cluadmin.exe).

The limit on the total number of resources is 1,600 (starting with SP4).

Network Configuration

The only supported configuration uses a private interconnect for cluster-only communication and a public network for both client-to-cluster and cluster-to-cluster communication (this requires at least two NICs on each node). A configuration with no private network for cluster communication is not supported.

Do not use DHCP assigned addresses for any of the Cluster networking interfaces.

Set all NICs to a specific speed (DO NOT use autodetect) and specify appropriate duplex settings.

For the private interconnect in Windows NT 4.0, unbind WINS and do not set a default gateway. Also disable the Server, Workstation, and NetBIOS Interface bindings for the interconnect NIC. In Windows 2000, disable NetBIOS over TCP/IP on the Advanced WINS tab in TCP/IP properties. The Cluster Service uses Windows Sockets with RPC, not NetBIOS, for internal communication. To optimize response on the public network, move the public network adapter higher in the TCP/IP binding order.

The subnet mask for the heartbeat and client networks should be the same.
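The subnet mask rule above can be checked from the statically assigned addresses. This is a small sketch using Python's standard ipaddress module; the function name and the sample addresses in the usage note are invented for illustration and do not come from any cluster tool.

```python
import ipaddress


def same_subnet_mask(heartbeat_if, client_if):
    """Return True if both interfaces are configured with the same mask.

    Each argument is an IPv4 address with its mask, written either as
    "10.0.0.1/255.255.255.0" or as "10.0.0.1/24".
    """
    heartbeat = ipaddress.ip_interface(heartbeat_if)
    client = ipaddress.ip_interface(client_if)
    return heartbeat.network.netmask == client.network.netmask
```

For example, same_subnet_mask("10.0.0.1/255.255.255.0", "192.168.1.10/24") is true, while a /16 mask on one interface and a /24 mask on the other would be reported as a mismatch.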

The private interconnect should use a crossover cable or an isolated hub.

Windows 2000 uses its Plug and Play capabilities to detect disconnected network cables and connectivity problems, which allows the cluster to fail over properly. It does this by extending connectivity testing beyond the simple heartbeat (the Cluster service confirms connectivity between nodes by sending a heartbeat signal, a single UDP packet, every 1.2 seconds) and pinging an external host on the same subnet (typically the local gateway). If the heartbeat alone does not give conclusive information, the failover decision depends on which node receives an ICMP echo reply.
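The tie-break described above can be pictured with a short Python sketch. It is a deliberate simplification of the paragraph, not the Cluster service's actual algorithm; the function name, the node names, and the "no-change" and "inconclusive" return values are all made up for illustration.

```python
def pick_public_network_owner(heartbeat_ok, echo_replies):
    """Illustrate the heartbeat-then-ping decision.

    heartbeat_ok  -- True if heartbeats between the nodes are arriving.
    echo_replies  -- dict mapping node name to True/False, where True
                     means that node received an ICMP echo reply from
                     the external host (typically the local gateway).
    """
    if heartbeat_ok:
        # Connectivity is confirmed by the heartbeat; nothing to decide.
        return "no-change"
    reachable = sorted(node for node, replied in echo_replies.items() if replied)
    if len(reachable) == 1:
        # Only one node can still reach the external host, so it is
        # the one that should own the resources.
        return reachable[0]
    # Either both nodes or neither node got a reply: the ping test
    # does not settle the question in this simplified model.
    return "inconclusive"
```

Returning a label instead of acting on the result keeps the sketch side-effect free.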

Installing File Share Resource

Do not create file shares using NT Explorer or Server Manager; use Cluster Administrator instead.

When setting up File Shares, make sure that they do not "Affect the Group".

Subdirectory sharing requires SP4 or later; dynamic discovery of new subdirectories requires SP5.

When assigning share permissions to the resource, use the Cluster Administrator interface rather than Explorer or Server Manager. Make sure that the Cluster Service account has Full Control at both the share level and the NTFS level.

Installing Print Spooler Resource

The Print Spooler resource depends on a Physical Disk resource and a Network Name resource; the Network Name in turn depends on an IP Address resource.
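One way to read this dependency chain is as the order in which the resources have to come online. The sketch below feeds the dependencies stated above to Python's standard graphlib module (Python 3.9 or later); the DEPENDS_ON table and the online_order function are illustrative names only.

```python
from graphlib import TopologicalSorter

# Each resource maps to the set of resources it depends on,
# exactly as stated in the text.
DEPENDS_ON = {
    "Print Spooler": {"Physical Disk", "Network Name"},
    "Network Name": {"IP Address"},
}


def online_order(depends_on):
    """Return the resources in an order where dependencies come first."""
    return list(TopologicalSorter(depends_on).static_order())
```

Any order returned lists IP Address before Network Name, and both Network Name and Physical Disk before Print Spooler. The relative position of Physical Disk and IP Address is not fixed, because neither depends on the other.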

The Print Spooler resource needs to be configured by:

Installing ports on both nodes (printer ports must have the same name on both nodes)

Installing printer drivers on both nodes (after the installation, printers can be deleted)

Running the Add Printer wizard over the network
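Because printer ports must carry the same name on both nodes, comparing the two port lists is a quick sanity check. The sketch below assumes the port names have already been collected from each node by some other means; the function name and the exact, case-sensitive comparison are choices of this example.

```python
def port_name_mismatches(node1_ports, node2_ports):
    """Report printer port names that exist on only one of the nodes."""
    first, second = set(node1_ports), set(node2_ports)
    return {
        "only_on_node1": sorted(first - second),
        "only_on_node2": sorted(second - first),
    }
```

Two empty lists in the result mean the port names match on both nodes.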

Testing and Troubleshooting

In Windows 2000, logging is enabled by default and written to %SystemRoot%\Cluster\Cluster.log. The logs are more "reader friendly" and refer to resources and groups by name rather than by GUID (as NT 4.0 does).

When running into problems while connecting to the cluster via Cluster Administrator, connect to the node name rather than to the cluster name.

If the Clusdb file (containing a backup of the cluster registry hive) becomes corrupted, it can be restored from its copy, named Chkxxxx.tmp (the "xxxx" part varies), located in the MSCS folder on the quorum drive. After stopping the Cluster service on both nodes, copy this file over the damaged Clusdb file in the Winnt\Cluster folder on one of the nodes. Once that node starts successfully, copy the tmp files from the MSCS folder to the other node and restart it.
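The copy step above could be scripted roughly as follows. This is only a sketch: both folder paths are supplied by the caller, the Cluster service is assumed to be stopped on both nodes already, picking the most recently modified Chkxxxx.tmp file is an assumption, and keeping the damaged file as Clusdb.damaged is an extra precaution that the procedure itself does not mention.

```python
import shutil
from pathlib import Path


def restore_clusdb(mscs_dir, cluster_dir):
    """Copy a Chkxxxx.tmp checkpoint over a damaged Clusdb file.

    mscs_dir    -- the MSCS folder on the quorum drive
    cluster_dir -- the Winnt\\Cluster folder on the node being repaired
    Returns the path of the checkpoint file that was used.
    """
    checkpoints = list(Path(mscs_dir).glob("[Cc]hk*.tmp"))
    if not checkpoints:
        raise FileNotFoundError("no Chkxxxx.tmp file found in the MSCS folder")
    # Assumption: the most recently modified checkpoint is the one to use.
    newest = max(checkpoints, key=lambda path: path.stat().st_mtime)

    target = Path(cluster_dir) / "Clusdb"
    if target.exists():
        # Extra precaution: keep the damaged file under another name.
        shutil.copy2(target, target.with_name("Clusdb.damaged"))
    shutil.copy2(newest, target)
    return newest
```

The function does not stop or start any service; that remains a manual step before and after the copy.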

Maintenance

Before you shut down a node, use Cluster Administrator to move all groups to the other node. You can also use a batch file in which you first stop the Cluster service (with the "net stop clussvc" command), and then use Shutdown.exe from the Resource Kit to shut down the node. If this procedure is not followed, you might receive the following event log entry during the shutdown of the cluster node that owns a resource:

System Process - Lost Delayed-Write Data
The system was attempting to transfer file data from buffers to Device\Harddisk#\Partition#\. The write operation failed, and only some of the data may have been written to the file.

This happens because, during the shutdown of one node, the Cluster service stops the network heartbeat to the other node (by design). This, in turn, initiates failover to the surviving node. If the first node is still writing data to a disk resource that it owns, the message is generated and data corruption can occur.
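The ordering that matters in the batch-file approach (stop the Cluster service first, shut down only afterwards) can be sketched in Python as well. Only the "net stop clussvc" command comes from the text above; the function name is invented, and the Shutdown.exe command line is left entirely to the caller because its switches are not given here.

```python
import subprocess


def stop_cluster_then_shutdown(shutdown_cmd, run=subprocess.run):
    """Stop the Cluster service, then run the caller's shutdown command.

    shutdown_cmd -- the full Shutdown.exe command line as a list of
                    strings, supplied by the caller.
    run          -- injectable runner, so the sequence can be tested
                    without touching a real machine.
    """
    steps = [["net", "stop", "clussvc"], list(shutdown_cmd)]
    for step in steps:
        # check=True raises if a step fails, so the node is never shut
        # down while the Cluster service is still running.
        run(step, check=True)
    return steps
```

Passing a recording function as run shows the two commands in the order they would be executed, without running them.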

Replacing a failed disk in a shared disk resource: the Cluster Service may fail to start and log Event ID 1034, because it relies on disk signatures to identify and mount volumes. Refer to

http://support.microsoft.com/support/kb/articles/q243/1/95.asp and

http://support.microsoft.com/support/kb/articles/q217/2/24.asp for the solution.

Install service packs in the same way on both nodes, after taking a proper backup. Before installing on a node, move all resource groups to the other node. Once the failover completes, stop all non-critical services, including the Cluster service. Then follow the same process on the other node.