I. OS
We installed RedHat 8.0, mainly because we planned to run production
software that is supported on RedHat.
Installation was very easy. For anybody with some experience I recommend
the Custom installation only. The following should be installed:
Ethernet and Internet support; rsh/ssh/ftp, including the corresponding
daemons; NIS and NFS. X-Windows is probably not needed by most scientific
users, although it is probably worth having in order to run diagnostics and
some system utilities for setting the machines up.
We installed the OS onto two machines: one as the front-end machine ("master")
with lots of packages, and one with a much smaller set for the rest (slaves).
II. Cloning
We chose to clone the slaves rather than use BOOTP. For this an
external "cloning machine" had to be used, because the Dell 4500C has
no space for two HDDs. There is a nice reference on cloning, so
we did as it said, except that copying only the first 446 bytes did not
seem to copy the MBR, so we copied 64K instead. We did not use "nc": the
Dell 4500C comes without a floppy drive and we did not bother with
creating a bootable CDROM (perhaps that would have been
easier). So we took the installed disk, connected it
as secondary IDE slave (hdd), and attached a fresh disk
as secondary IDE master (hdc). (Western Digital was a bit mean
in not putting jumper info on the disks, so it had to be found on their
homepage; thanks, WD, at least for that.) The hdc disk had to be swapped 22 times
(I scratched my fingers and lost some blood doing it).
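The copy itself can be sketched with dd as follows. This is a minimal sketch, not the exact commands we ran: the function names clone_head and clone_part are made up for illustration, and the device names in the comments assume the layout described above (installed disk on hdd, fresh disk on hdc).

```shell
#!/bin/sh
# Sketch only: clone_head and clone_part are illustrative names,
# not commands taken from the cloning reference.

# Copy the first 64K (128 sectors of 512 bytes) of $1 onto $2.
# This covers the MBR (boot code plus partition table) and the
# sectors right after it.
clone_head() {
    dd if="$1" of="$2" bs=512 count=128 conv=notrunc 2>/dev/null
}

# Copy a whole partition (or any file) block for block.
clone_part() {
    dd if="$1" of="$2" bs=65536 conv=notrunc 2>/dev/null
}

# Assumed usage on the cloning machine, as root (not run here):
#   clone_head /dev/hdd  /dev/hdc     # boot sector area
#   clone_part /dev/hdd1 /dev/hdc1    # first partition, and so on
# The kernel has to re-read the new partition table (for example
# after a reboot) before /dev/hdc1 becomes available.
```

Both functions also work on ordinary files, so the procedure can be tried on disk images before touching a real disk.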
Finally, /etc/fstab had the entry LABEL=/work rather than /dev/hda2.
This means /work was also a part of the extended
partition. After cloning, /work did not seem to work
even after being formatted, so I simply replaced LABEL=/work with /dev/hda2 and
there was no further problem.
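The fstab change can be sketched as a one-line sed edit. This is a sketch under assumptions: the helper name fstab_label_to_dev is hypothetical, and it takes the file as an argument so that it can be tried on a copy before touching /etc/fstab. One plausible reason for the failure is that reformatting creates a filesystem without the old label (mke2fs only sets one when given -L), so LABEL=/work no longer matches anything; relabelling with e2label /dev/hda2 /work would be an untested alternative.

```shell
#!/bin/sh
# fstab_label_to_dev FILE LABEL DEVICE
# Replace a leading "LABEL=<label>" field in FILE with DEVICE.
# The original is kept as FILE.bak. (Hypothetical helper name.)
fstab_label_to_dev() {
    file=$1 label=$2 dev=$3
    cp "$file" "$file.bak" &&
    sed "s|^LABEL=$label\([[:space:]]\)|$dev\1|" "$file.bak" > "$file"
}

# Assumed usage on a freshly cloned slave, as root (not run here):
#   fstab_label_to_dev /etc/fstab /work /dev/hda2
```

The pattern requires whitespace right after the label, so an entry such as LABEL=/ is left alone when /work is being replaced.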
Reference: "How to clone two identical disks".
III. System set-up
(Some parts can be done before cloning and the rest on each machine individually.)
IV. What did not work or misbehaved
USB mouse (a USB optical 3-button mouse made by Logitech and labelled Dell) behaved strangely: it would work at first, then it would copy garbage, and finally it would stop being recognised. This may have been due in part to my attempts to install X-Windows, but system reboots did not clear it. Magically, it came back to life later.
X-Windows I could not set up. Most likely the monitor is to blame, although the USB mouse may have had something to do with it too. I do not think the graphics adapter, with its annoyingly large video memory (64M), gave any problem, though. The graphical installation of RedHat had no problem with graphics or the mouse either.
At some point xinetd on the master would not accept any IP connection from the slaves ("Connection refused"). I am strongly convinced that restoring the /etc/hosts* files, which I had slightly changed, to what had worked before did not cure the problem. It seems that "setup" did something weird. I reran it and rebooted the machine many times that day, and somehow it went the way I wanted: rsh from the slaves to the master resumed working (the opposite direction never gave any problem, perhaps because I was messing with the master's network files).
The default shell on all slaves is bash no matter what you do with the chsh command. The same trouble occurred on a SUSE cluster here, so it may be a general problem of NIS on Linux. On the master one can choose any shell. Possible solution: BEFORE CLONING, log in as each ordinary user and change the shell (not tried yet).
Note on mounting the master's /usr on the slaves as /usr/global: it worked quite well. Most programs installed in master:/usr work on the slaves right away. Some don't, because they have directories hardwired inside their scripts. A possible solution is to do something like ln -s /usr/global/local/software /usr/local/software if the program expects to be found in /usr/local/software. This has to be executed on each slave, and since it requires root it takes time and is somewhat bothersome. A better solution: do it once, ahead of time, BEFORE CLONING (think WISELY about what software you may need). Warning: in some cases running the master's software on the slaves will slow down the network. If you think or know this might be the case, consider installing the software on the slaves directly.
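If the links do have to be made after cloning, the per-slave step can be scripted from the master. A minimal sketch, assuming root rsh from the master to the slaves works without a password; the host names node01 and node02, the helper name link_on_slaves and the RUN variable (echo by default, for a dry run) are illustrative assumptions, not part of our setup.

```shell
#!/bin/sh
# link_on_slaves TARGET LINK HOST...
# Run "ln -s TARGET LINK" on every HOST through rsh.
# RUN defaults to "echo", which only prints the commands;
# set RUN to the empty string to really execute them.
RUN=${RUN-echo}
link_on_slaves() {
    target=$1 link=$2
    shift 2
    for host in "$@"; do
        $RUN rsh "$host" ln -s "$target" "$link"
    done
}

# Dry run for two hypothetical slaves:
link_on_slaves /usr/global/local/software /usr/local/software node01 node02
```

The dry run prints one rsh command per slave, which makes it easy to check the list before running it for real with RUN= set in the environment.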
A funny thing happened with the additional 11 nodes. They had an on-board
FastEthernet and an added-on PCI Gigabit card. Those were found by kudzu and
labelled eth0 (Giga) and eth1 (Fast). The interesting thing about multiple cards
is that the last interface seems to be the default one. This means that if you
connect the nodes by Gigabit, they cannot reach each other when you do rsh,
because they think you are using FastEthernet. The solution is to swap
eth0 and eth1. This is easy: in /etc/modules.conf, change
alias eth0 e100
alias eth1 e1000
into
alias eth0 e1000
alias eth1 e100
AND change /etc/sysconfig/network-scripts/ifcfg-eth? accordingly.
Alternatively, one could perhaps just change GATEWAY in
/etc/sysconfig/network (not tried).
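With many nodes, the modules.conf part of the swap can be done with sed instead of an editor. A minimal sketch: swap_eth_aliases is a hypothetical helper name, and it only exchanges the two alias lines; the ifcfg-eth? files still have to be adjusted separately.

```shell
#!/bin/sh
# swap_eth_aliases FILE
# Exchange the drivers assigned to eth0 and eth1 in a
# modules.conf-style FILE. The original is kept as FILE.bak.
# "ethX" is a temporary placeholder used during the swap.
swap_eth_aliases() {
    file=$1
    cp "$file" "$file.bak" &&
    sed -e 's/^alias eth0 /alias ethX /' \
        -e 's/^alias eth1 /alias eth0 /' \
        -e 's/^alias ethX /alias eth1 /' "$file.bak" > "$file"
}

# Assumed usage on each node, as root (not run here):
#   swap_eth_aliases /etc/modules.conf
```

The two alias lines end up in the opposite order in the file compared with the listing above, but the assignment is the same: eth0 gets e1000 and eth1 gets e100.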
Something mysterious happened when cloning 20 new machines that came with
Seagate Barracuda ST340016A disks instead of the previously used Western Digital
disks. The latter were cloned without a trifle of trouble. The Seagate disks
could only be copied when nothing else was running on the machine, such as
X Windows. Running dd from X Windows would lead to core dumps after a while,
and in other cases dd would just freeze. If one opened a couple of text
consoles with Alt-F? (without X Windows), then cloning would
become several times slower. I have no explanation for this; it looks like a
bug either in Linux, in the firmware drivers, or in the dd command. However, if
left alone, the Seagate disks were copied fine. To be precise, there is one
difference between copying the WD and the Seagate disks. The WD disks had an
extended partition, which had to be cloned by rebooting several times. The
Seagate disks were set up with no extended partition, just 4 primary ones:
/, /work, /usr and swap.
It is not clear whether this has anything to do with the dd failures.
The Seagate disks were extremely quiet compared to the noisy WD disks.
Trouble also occurred with installing the OS on the Seagate disks. I did not
want to reinstall the whole system just because I could not clone WD->Seagate
(slightly different geometry). My plan was to copy the / and /usr filesystems with dd,
because they had the same size, and then install GRUB to the MBR with:
root (hd2,0)
find /boot/grub/stage1
setup (hd2)
(These are the commands typed in after executing "grub".)
This seemed to work, but the disk would not boot (even GRUB would not start).
I could not use the recommended way of installing GRUB from a floppy, since
there is no floppy drive and an external USB floppy drive could not be booted from.
This having failed, I installed the most minimal Linux from CDROM for the sole
purpose of installing GRUB. After that I recopied /usr and / (I think I did
cp -av /mnt/dsk1 /mnt/dsk2 for both, after mounting what was needed). Thus I made
a master Seagate disk to be cloned. Interestingly, cp seemed faster than dd,
although no exact check was done, perhaps because /usr uses only 1/3 of its
space and dd copies everything.
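The file-level copy can be sketched as below. The exact invocation was not recorded, so treat this as a sketch under assumptions: copy_tree is a hypothetical helper, and the trailing "/." makes cp copy the contents of the source mount point into the target rather than creating a dsk1 subdirectory inside it.

```shell
#!/bin/sh
# copy_tree SRC DST
# Copy everything under SRC into DST, preserving owners, modes,
# timestamps and symlinks (-a). The "/." copies the contents of
# SRC, hidden files included, instead of the directory SRC itself.
copy_tree() {
    cp -a "$1/." "$2/"
}

# Assumed usage with both filesystems mounted, as root (not run here):
#   copy_tree /mnt/dsk1 /mnt/dsk2
```

Unlike dd, this only reads the blocks that are actually in use, which would fit the observation that cp seemed faster on a filesystem that is one third full.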