This homepage offers practical notes on building a PC cluster, aimed at intermediate or advanced UNIX users (if you use Outlook to read your mail you should probably stop reading). It is not a self-contained document that explains everything step by step. Instead, it highlights the points that need attention from those who already have a rough idea of how to build a cluster.

I. OS
We installed RedHat 8.0, mainly because we planned to run production software that is supported on RedHat. Installation was very easy. For anybody with some experience, I recommend the Custom installation only. The following should be installed: Ethernet and Internet support; rsh/ssh/ftp, including the corresponding daemons; NIS and NFS. X-Windows is probably not needed by most scientific users, although it is probably worth having in order to run diagnostics and some system utilities while setting the machines up. We installed the OS on two machines: one as the front-end machine ("master") with lots of packages, and one with a much smaller set for the rest (slaves).

II. Cloning
We chose to clone the slaves rather than use BOOTP. For this an external "cloning machine" had to be used, because the Dell 4500C has no space for two HDDs. There is a nice reference on cloning, so we did as it said, except that the MBR did not seem to be copied properly with only 446 bytes, so we copied 64K. We did not use "nc" because the Dell 4500C comes without a floppy drive and we did not bother with creating a bootable CDROM (perhaps that would have been easier). So we took the installed disk, connected it as secondary IDE slave (hdd) and attached a fresh disk as secondary IDE master (hdc). (Western Digital was a bit mean in not putting jumper info on the disks, so it had to be found on their homepage; thanks, WD, at least for that.) The disk on hdc had to be swapped 22 times (I scratched my fingers and lost some blood doing it). Finally, /etc/fstab had an entry LABEL=/work rather than /dev/hda2. This means /work was also a part of the extended partition. After cloning, /work did not seem to work even after being formatted, so I simply replaced LABEL=/work with /dev/hda2 and there was no further problem.
Reference: "How to clone two identical disks"
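
The two manual fixes mentioned above (copying 64K instead of 446 bytes, and replacing the LABEL entry in /etc/fstab) can be sketched roughly as follows. This is a minimal sketch, not the exact procedure we ran: the function names copy_boot_area and fix_work_label are made up for illustration, the device names are the ones from our setup, and the partitions themselves still have to be copied as the reference describes.

```shell
#!/bin/sh
# Sketch only: copy_boot_area and fix_work_label are illustrative names.

# Copy the first 64K (128 sectors of 512 bytes) from source to target.
# This covers the 446-byte boot code, the partition table and the
# sectors that follow the MBR.
copy_boot_area() {
    src=$1    # e.g. /dev/hdd, the installed disk
    dst=$2    # e.g. /dev/hdc, the fresh disk
    dd if="$src" of="$dst" bs=512 count=128
}

# Print an fstab with the LABEL=/work entry replaced by /dev/hda2.
# Review the output before moving it over the original file.
fix_work_label() {
    fstab=$1  # e.g. /etc/fstab on the cloned disk
    sed 's|^LABEL=/work\([[:space:]]\)|/dev/hda2\1|' "$fstab"
}

# Usage (as root, and only after double-checking the device names):
#   copy_boot_area /dev/hdd /dev/hdc
#   fix_work_label /etc/fstab > /etc/fstab.new
```

Getting if= and of= the wrong way round overwrites the installed disk, so it is worth checking which disk is which before running dd.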


Copying one disk, including reboot time (the SCSI card would hog enough time to turn your hair gray), took about 1 hour. After the disk is put back and RedHat has booted, one does:

III. System set-up (some parts can be done before cloning and the rest individually).


IV. What did not work or misbehaved

The USB mouse (a Logitech optical 3-button USB mouse labelled Dell) behaved strangely: it would work at first, then it would produce garbage, and finally it would stop being recognised. This may have been due in part to my attempts to install X-Windows, but system reboots did not clear it. Magically, it came back to life later. X-Windows I could not set up. Most likely the monitor is to blame, although the USB mouse may have something to do with it too. I do not think the graphics adapter, with its annoyingly large video memory (64M), caused any problem, though; the graphical installation of RedHat had no problem with either graphics or mouse.

At some point xinetd on the master would not accept any IP connection from the slaves ("Connection refused"). I am strongly convinced that restoring the /etc/hosts* files to the version that had worked, after I had slightly changed them, did not cure the problem. It seems that "setup" did something weird. I reran it and rebooted the machine many times that day, and somehow things ended up the way I wanted: rsh from the slaves to the master resumed working (the opposite direction never gave any problem, perhaps because it was the master's network files I was messing with).

The default shell on all slaves is bash, no matter what you do with the chsh command. The same trouble occurred on another cluster here, running SUSE, so it may be a general problem with NIS on Linux. On the master one can choose any shell. Possible solution: BEFORE CLONING, log in as each ordinary user and change the shell (not tried yet).

A note on mounting the master's /usr on the slaves as /usr/global: it worked quite well. Most programs installed in master:/usr work on the slaves right away. Some don't, because they have directories hardwired in their scripts. A possible solution is to do something like ln -s /usr/global/local/software /usr/local/software if the software expects to be found in /usr/local/software. This has to be executed on each slave, and since it requires root it takes time and is somewhat bothersome. A better solution: do it once, ahead of time, BEFORE CLONING (think WISELY about what software you may need). Warning: in some cases running the master's software on the slaves will slow down the network. If you think or know this might be the case, consider installing the software on the slaves directly.
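
To avoid retyping the ln -s command for each hardwired directory, the link creation can be wrapped in a small function. This is a minimal sketch: link_from_global is a made-up name, and the paths in the usage line are just the example from the paragraph above.

```shell
#!/bin/sh
# Sketch only: link_from_global is an illustrative name.

# Create a symbolic link at $2 pointing to $1, unless something
# already exists at $2 (nothing is ever overwritten).
link_from_global() {
    target=$1  # where the software really lives, under /usr/global
    link=$2    # where the software expects to be found
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "skipping $link: already exists" >&2
        return 0
    fi
    mkdir -p "$(dirname "$link")" && ln -s "$target" "$link"
}

# Usage (as root, on a slave or on the slave image before cloning):
#   link_from_global /usr/global/local/software /usr/local/software
```

Because existing paths are skipped, the function can be run again on a slave without clobbering anything that was installed locally in the meantime.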

A funny thing happened with the additional 11 nodes. They had an on-board FastEthernet interface and an add-on PCI Gigabit card. These were found by kudzu and labelled eth0 (Giga) and eth1 (Fast). The interesting thing about multiple cards is that the last interface seems to be the default one. This means that if you connect the nodes by Gigabit they cannot reach each other with rsh, because they think you are using FastEthernet. The solution is to switch eth0 and eth1 around. This is easy: in /etc/modules.conf, change
alias eth0 e100
alias eth1 e1000
into
alias eth0 e1000
alias eth1 e100
AND change /etc/sysconfig/network-scripts/ifcfg-eth? accordingly. I have just realised that, alternatively, one could simply change GATEWAY in /etc/sysconfig/network (not tried).
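
With many nodes, the edit to /etc/modules.conf can also be scripted. This is a minimal sketch, assuming the file contains one alias line for eth0 and one for eth1 as shown above; swap_eth_aliases is a made-up name, and the ifcfg-eth? files still have to be changed separately.

```shell
#!/bin/sh
# Sketch only: swap_eth_aliases is an illustrative name.

# Print the given modules.conf with the eth0 and eth1 aliases
# exchanged, so each driver ends up on the other interface name.
# All other lines are passed through unchanged.
swap_eth_aliases() {
    conf=$1  # e.g. /etc/modules.conf
    awk '
        $1 == "alias" && $2 == "eth0" { $2 = "eth1"; print; next }
        $1 == "alias" && $2 == "eth1" { $2 = "eth0"; print; next }
        { print }
    ' "$conf"
}

# Usage (as root; inspect the new file before replacing the old one):
#   swap_eth_aliases /etc/modules.conf > /etc/modules.conf.new
```

The two alias lines come out in the opposite order from the listing above, but each interface name ends up bound to the other driver, which is what matters.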

Something mysterious happened when cloning 20 new machines that came with Seagate Barracuda ST340016A disks instead of the previously used Western Digital disks. The latter were cloned without the slightest trouble. The Seagate disks could only be copied when nothing else, such as X Windows, was running on the machine. Running dd from X Windows would lead to core dumps after a while, and in other cases dd would just freeze. If one opened a couple of text consoles with Alt-F? (without X Windows), the cloning would become several times slower. I have no explanation for this; it looks like a bug in Linux, in the firmware or drivers, or in the dd command. However, if left alone, the Seagate disks were copied fine. To be precise, there is one difference between copying the WD and the Seagate disks. The WD disks had an extended partition, which had to be cloned by rebooting several times. The Seagate disks were set up with no extended partition, just 4 primary ones: /, /work, /usr and swap. It is not clear if this has anything to do with the dd failures. The Seagate disks were extremely quiet compared to the noisy WD disks.

Trouble also occurred when installing the OS on the Seagate disks. I did not want to reinstall the whole system just because I could not clone WD->Seagate (slightly different geometry). My plan was to copy the / and /usr file systems with dd, because they had the same size, and then install GRUB to the MBR with:
root (hd2,0)
find /boot/grub/stage
setup (hd2)
(These are commands typed in after executing "grub".) This seemed to work, but the disk would not boot (even GRUB would not start). I could not use the recommended way of installing GRUB from a floppy, since there is no floppy drive and the machine would not boot from an external USB floppy drive. This having failed, I installed the most minimal Linux from CDROM for the sole purpose of installing GRUB. After that I recopied /usr and / (I think I did cp -av /mnt/dsk1 /mnt/dsk2 for both, after mounting what was needed). Thus I made a master Seagate disk to be cloned. Interestingly, cp seemed faster than dd, although no exact check was done, perhaps because /usr uses only 1/3 of its space and dd copies everything.
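
The file-level copy can be sketched as below. This is a minimal sketch, assuming both file systems are already mounted; copy_tree is a made-up name and the mount points are the ones mentioned above. The trailing /. makes cp copy the contents of the source directory, rather than creating a dsk1 directory inside the target when the target already exists.

```shell
#!/bin/sh
# Sketch only: copy_tree is an illustrative name.

# Copy everything under $1 into $2, preserving permissions,
# ownership (when run as root), timestamps and symbolic links.
copy_tree() {
    src=$1  # e.g. /mnt/dsk1, the mounted source file system
    dst=$2  # e.g. /mnt/dsk2, the mounted (empty) target file system
    cp -a "$src"/. "$dst"/
}

# Usage (as root):
#   copy_tree /mnt/dsk1 /mnt/dsk2
```

Unlike dd, this only reads the files that exist, which fits the observation that a partition that is one third full copies faster with cp.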