Some notes on how to build a Linux cluster
Compiled in 2003, based on many FAQs and Usenet group contributions.
"SUPERCOMPUTER: what it sounded like before you bought it."
On this page:
- Cluster And Grid Computing
- Cluster Hardware
- Software / distributed libraries
- Other distributed OSs
- Distributed Numerical processing
- Installing OpenMosix on RedHat
- IP Choice
- Cluster Books
- Bugs
- Presentation
- Working with OpenMosix
- DSH, The Distributed Shell
- Cluster administration
- Hard Drives
- Installed software and services
Cluster And Grid Computing
Linux-based clusters are very much in fashion these days. The ability to use cheap off-the-shelf hardware together with open source software makes Linux an ideal platform for supercomputing. Linux clusters are either "Beowulf" clusters, "MOSIX" clusters or "High-Availability" clusters. The cluster type is chosen according to the application requirements, which usually fall into one of the following categories: computationally intensive (Beowulf, Mosix), I/O intensive (supercomputers) or high availability (failover, servers...). In addition to clusters, there are two related architectures: distributed systems (seti@home, folding@home, Condor...), which can run on computers with totally different architectures spread out over the Internet; and multiprocessor machines (those, like the SGIs, are far superior to clusters when memory needs to be shared quickly between processes).
New 2006: Although I don't work on clusters anymore, I try to follow what's going on. There are several recent developments, like Starfish or Amazon EC2 (Elastic Compute Cloud). The former is implemented in Ruby and takes only a few lines of code to get up and running. The latter runs a Linux virtual image remotely on as many servers as you need. There are many grid services available, but their economics is sometimes hard to fathom; Sun's service, for instance, has been a failure because it takes far too much work on the programmer's part.
High Availability
High Availability (HA) clusters are highly fault-tolerant server systems for services where 100% uptime is required. If a node fails, the other nodes forming the cluster take over its functionality transparently. HA clusters are typically used for DNS, proxy and web servers.
High-Availability cluster Links:
Beowulf clusters
Clusters are used where tremendous raw processing power is required, as in simulations and rendering. Clusters use parallel computing to reduce the time required for a CPU-intensive job: the workload is distributed among the nodes that make up the cluster and the instructions are executed in parallel, so more nodes mean faster execution.
The difference between the two main cluster types, Mosix and Beowulf, is that OpenMosix is a kernel implementation of process migration, whereas Beowulf is a programming model for parallel computation. Beowulf clusters need a distributed programming environment such as PVM (Parallel Virtual Machine) or MPI (Message Passing Interface). PVM was the de facto interface for parallel computing, but MPI is becoming the industry standard. PVM still has an edge over MPI in that there are more PVM-aware applications than MPI-based ones, but this could soon change as MPI gains popularity.
Beowulf Links:
- The Beowulf Project
- Beowulf FAQ (3 years old)
- The parallel execution framework
- Linux Clusters Finally Break the TeraFLOP barrier
Mosix clusters
MOSIX is a software package that enhances the Linux kernel with cluster computing capabilities. The core of MOSIX is a set of adaptive (on-line) load-balancing, memory-ushering and file I/O optimization algorithms that respond to variations in the use of the cluster resources, e.g. uneven load distribution or excessive disk swapping due to lack of free memory on one of the nodes. In such cases MOSIX migrates processes from one node to another to balance the load, to move a process to a node with sufficient free memory, or to reduce the number of remote file I/O operations. MOSIX clusters are typically used in data centers and data warehouses.
Mosix works best when running plenty of separate CPU-intensive tasks. Shared memory is its big drawback, as with Beowulf: applications using shared memory, such as web servers or database servers, gain nothing from [Open]Mosix because all processes accessing that shared memory must reside on the same node.
MPI and OpenMosix play very nicely together: the processes that MPI spawns on remote nodes are migrated by the OpenMosix load-balancing algorithm like any other process. So OpenMosix has an edge in a dynamic environment where several MPI programs, unaware of each other, might otherwise overload an individual node.
Mosix Links:
- A good read about Distributed OSs
- Mosix, Posix and Linux
- The MOSIX Project
- Mosix on FreshMeat.
- OpenMosix and Slashdot comments.
- OpenMosix Internals
- The Mandrake Mosix Terminal Server Project.
- LTSP+OpenMosix: A Mini How-To
- A Recipe for a diskless MOSIX cluster using Cluster-NFS
- K12LTSP MOSIX Howto... for diskless clusters and the Slashdot comments.
- Sparse: Linux cluster with diskless clients, SuSE 8.0, ClusterNFS and MOSIX
- ClusterKnoppix
- Debian GNU/Linux openmosix
- openMosix programs in perl with Parallel::ForkManager (also available from CPAN).
Cluster Hardware
What's in a cluster ?
Commodity hardware components (usually PCs), interconnected by a private high-speed network. The nodes run the jobs. The cluster is often connected to the outside world through a single node, if at all. The overall cost may be a third to a tenth of the price of a traditional supercomputer.
What's the most important: CPU speed, memory speed, memory size, cache size, disk speed, disk size, or network bandwidth ? Should I use dual-CPU machines ? Should I use Alphas, PowerPCs, ARMs, x86s or Xeons ? Should I use Fast Ethernet, Gigabit Ethernet, Myrinet, SCI, FDDI, FiberChannel ? Should I use Ethernet switches or hubs ? ...
The short answer is: it all depends upon your application! Benchmark, profile, find the bottleneck, fix it, repeat.
Some people have reported that dual-CPU machines scale better than single-CPU machines because your computation can run uninterrupted on one CPU while the other CPU handles all the network interrupts.
Can I make a cluster out of different kinds of machines: single-processor, dual-processor, 2.0GHz, 2.4GHz, etc ?
Yes. Depending upon the application mix, the cluster can be configured to look like multiple clusters, each optimized for the specific application.
Is it possible to make a cluster of already used workstations ?
Yes, you can use your secretary's computer's unused CPU cycles: install a cluster package like Mosix or Condor on it and add its IP to the overall cluster. This kind of architecture is called a COW (Cluster Of Workstations). The main drawbacks are security (any compromised machine compromises the entire cluster), instability (if someone crashes a machine, part or even all of the cluster could be affected) and deciding who gets to run the CPU-intensive jobs.
How do you connect the nodes ?
Lately, with the fashion for simple clusters of low-cost hardware, the most common interconnect has been cheap Gigabit Ethernet. But there are other types that can be faster, have lower latency, or both: InfiniBand, Myrinet, SCI, Quadrics... Those can be expensive though.
Hardware Links:
- The cluster cookbook
- x86 64 bits, not quite ready yet but good reviews...
- Alpha processors, reliable but being phased out (that's what you get for going from DEC to Compaq to HP to Alient or whatever it's called now).
- Double up your cluster with dual-CPU boards, which are a good way to double CPU power for very little extra money. On the other hand, quad (or more) motherboards are prohibitively expensive.
- Sharing an IEEE 1394 Device Between Machines
- Different types of interconnects.
Hardware cost for a sample cluster (March 2003)
Here's an Excel spreadsheet of the hardware list and total cost of our cluster at CSU: about $24,000 (including shipping) for a 12-node, 24-processor cluster. Key characteristics: AMD Athlon MP 2400+ CPUs with better heatsinks, Tyan Thunder K7X Pro motherboards (without SCSI), an internal 2 TB RAID-5 IDE/ATA array, Gigabit LAN, 24 GB of RAM, RedHat Linux 8.0 with OpenMosix...
Software / distributed libraries
Where's the Beowulf software?
There isn't a software package called "Beowulf". There are, however, several pieces of software many people have found useful for building Beowulf clusters: MPI, LAM, PVM, Linux....
How much faster will my software run?
It's very application dependent. You'll need to split it into parallel tasks that communicate using MPI or PVM. Then you need to recompile it. Some "highly parallel" tasks can take advantage of running on multiple processors simultaneously, but then a Mosix cluster can do that more easily.
What are PVM and MPI?
PVM and MPI are libraries that let you write message-passing parallel programs in Fortran or C to run on a cluster. PVM was the de facto standard until MPI appeared, but it is still widely used and very good. MPI (Message Passing Interface) is the de facto standard for portable message-passing parallel programs, standardized by the MPI Forum and available on all massively parallel supercomputers.
Is there a compiler that will automatically parallelize my code for a Beowulf?
No. But BERT from plogic.com will help you manually parallelize your Fortran code, and NAG's and the Portland Group's Fortran compilers can also build parallel versions of your Fortran code given some hints from you (in the form of HPF and OpenMP directives). These versions may not run any faster than the non-parallel ones. GNU gcc, Portland Group, Intel, Fujitsu and Absoft are the most commonly used compilers.
- PVM (Parallel Virtual Machine)
- Condor: High Throughput Computing (HTC)
- Harness: the next generation system beyond PVM.
- MPI (Message Passing Interface). Note that not every MPI implementation runs on Mosix (see the MPICH notes further down this page).
- Religious war about the Best Linux distro ?
Other distributed OSs
Rocks
With ROCKS, you get:
- SCSI Support
- Easily reinstalled nodes
- Pre-installed queue software
- brain dead admin tools
Mandrake 'Click'
The new Mandrake 'Click' Turn-Key Clustering Distribution (in beta). Pretty easy to install.
ClusterKnoppix/Quantian
ClusterKnoppix and Quantian are Knoppix boot CDs with openMosix autodiscovery daemons and network boot. They're an excellent way to get an instant cluster: put the CD in, boot the master, reboot the slaves and you are clustering! Mathematical software is included. You don't need hard drives to run it, and it's very quick and easy to install. If you want to try out clustering, start with this first.
Distributed Numerical processing
- About GridMathematica and some Slashdot comments.
- For interactive symbolic manipulation, Maxima is an excellent open-source alternative.
- For numerical applications, Numerical Python and its associated packages beat both Matlab and Mathematica.
- For 3D visualization, you can get VTK, which also has Python bindings.
- You can get a copy of Numerical Python and run PyPVM or PyMPI with it for distributed computing.
Installing OpenMosix on RedHat
You will need the following (use more recent versions where appropriate; I won't keep this list updated):
- RedHat 8.0
- openmosix-kernel-2.4.20-openmosix3.i686.rpm (or the most current, or Athlon architecture)
- openmosix-tools-0.2.4-1.i386.rpm (or Athlon)
- openMosixUserland-0.2.4.tgz
- openMosix-2.4.20-4.gz (the kernel patch; its version must match your kernel)
- 2.4.20 (or latest) kernel source installed in /usr/src/linux/
- DSH (Distributed Shell)
Then (see the command sketch after this list):
- Install RedHat
- Install openMosix rpms
- Install kernel source to /usr/src/linux and patch with openMosix-2.4.20-4.gz
- Compile and install openMosixUserland-0.2.4
- Configure /etc/openmosix.map
- Reboot new openMosix kernel
- setpe -W -f; edit /etc/openmosix.map on each node
- Install libdshconfig and DSH
- Put the other nodes' hard drives as slaves in your just-configured box, remount your main drive read-only (mount -o remount,ro /) and clone it with dd if=/dev/sda of=/dev/sdb. Then mount the new disk to update a few files (like /etc/sysconfig/network-scripts/ifcfg-eth0).
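Roughly, as shell commands (a sketch only; the package and patch file names are the ones listed above, and the exact patch and build invocations are assumptions to adapt to your own versions):

    # install the pre-built openMosix kernel and tools (RedHat 8.0)
    rpm -ivh openmosix-kernel-2.4.20-openmosix3.i686.rpm
    rpm -ivh openmosix-tools-0.2.4-1.i386.rpm

    # only if you rebuild your own kernel: patch the matching source tree
    cd /usr/src/linux
    zcat /root/openMosix-2.4.20-4.gz | patch -p1

    # build and install the userland tools
    tar xzf openMosixUserland-0.2.4.tgz
    cd openMosixUserland-0.2.4 && make all && make install

    # describe the cluster, then reboot into the openMosix kernel
    vi /etc/openmosix.map
    reboot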
Using OpenMosix
What do you need to know to use OpenMosix ?
In short: nothing. Log in to the master node, compile your programs (with optional AMD Athlon optimizations) and run them; they will migrate to an available node. You never need to log in to the slave nodes.
In long: it depends if the task is CPU intensive or I/O intensive. Read on...
CPU intensive tasks
Those are perfect for Mosix and should migrate well (there are some exceptions, like Java applications), making the cluster feel like an SMP machine. You can use pmake or make -j 24 to compile in parallel.
I/O intensive tasks
Use mosmon to find a node (N) that doesn't have much running, copy your data to its disk (/mfs/N/dir) and just launch your code; it will move to that node in order to minimize I/O.
The Mosix Filesystem (MFS)
All disks are shared between nodes in the following way: /mfs/Node#/Dir. In addition you can refer to process dependent directories: /mfs/here (the process' current node), /mfs/home (the process' home node).
The Distributed Shell (DSH)
DSH allows you to run the same command on all the nodes: dsh -r ssh "uname -a"
More Mosix control
The following commands give you information on running processes, or some control over how they are migrated by OpenMosix: migrate, mon, mosctl, mosrun, setpe, tune, or the graphical programs openMosixview, openMosixprocs, openMosixcollector, openMosixanalyser, openMosixhistory, openMosixmigmon... Use man.
More info ?
Read the Mosix HowTo (only 99 pages).
Possible hostnames for a cluster
Lists kindly generated by Google sets.
Northern mythology
Accalia, Aegir, Aesir, Aldwin, Asgard, Avalon, Balder, Baldric, Beowulf, Bragi, Brahma, Dolvar, Drexel, Fenrir, Forseti, Frey, Freya, Freyr, Frigg, Fulla, Grendel, Heimdall, Hel, Hermod, Hildreth, Hod, Hofstra, Hrothgar, Hyglac, Hyoga, Hyster, Idun, Jord, Loki, Megalon, Midgard, Naegling, Njord, Norns, Odin, Ragnarok, Ran, Rhung, Sif, Skadi, Snotra, Thor, Tyr, Ull, Unferth, Valhalla, Valkyrie, Vidar, Vor, Wiglaf, Yggdrasil...
More technical
criticalregion, dde, deadlock, distributed, globallock, messagepassing, multiprocess, mutex, namedpipe, parallel, rendezvous, rpc, semaphore, setiathome, sharedmemory, thread...
Hungry
Anise, Basil, Chamomile, Chive, Coriander, Dill, Fennel, Hyssop, laurel, Lavender, Mint, Oregano, Parsley, Rosemary, Tarragon, Thyme...
IP Choice
Private IP Networks
RFC 1918 reserves three IP network ranges for private networks: 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16. They can be used by anyone to set up internal IP networks.
Class 'E' Addresses
Class 'E' IP addresses are reserved for future use. In Class 'E' addresses, the four highest-order bits are set to 1111 (240.0.0.0 through 255.255.255.254). Routers and firewalls currently ignore class 'E' addresses, so you can use them between the machines and give the master node the only normal IP, making it the only way into the cluster.
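As an illustration (hypothetical addresses and hostnames), a 12-node cluster on a private 192.168.1.0/24 network can simply be declared in /etc/hosts on every node, with only the master also holding a public address:

    # /etc/hosts (identical on all nodes)
    192.168.1.1    neumann      # master, also has a public interface
    192.168.1.2    neumann2
    192.168.1.3    neumann3
    # ... and so on, up to
    192.168.1.12   neumann12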
Cluster Books
Except for the first one, most of these books are either outdated, very vague, too specific (only source code) or have bad reviews. You've been warned.
- The Linux Enterprise Cluster by Karl Kopper.
- Linux Clustering: Building and Maintaining Linux Clusters by Charles Bookman.
- Building Linux Clusters by David H. M. Spector.
- Linux Cluster Architecture by Alex Vrenios.
- Beowulf Cluster Computing with Linux (Scientific and Engineering Computation) by Thomas Sterling (Editor), et al.
- How to Build a Beowulf cluster by Thomas L. Sterling, et al.
- Shared Data Clusters: Scaleable, Manageable, and Highly Available Systems (VERITAS Series) by Dilip M. Ranade.
- Cluster Computing by Rajkumar Buyya (Editor), Clemens Szyperski.
- High Performance Cluster Computing: Architectures and Systems by Rajkumar Buyya.
- High Performance Cluster Computing: Programming and Applications, Volume 2 by Rajkumar Buyya.
- In Search of Clusters (2nd Edition) by Gregory F. Pfister.
- High Performance Mass Storage and Parallel I/O: Technologies and Applications by Hai Jin.
And a little wallpaper as a gift: [wallpaper image: "Welcome to the Neumann cluster"]
Bugs
There is a bug in MFS that causes some recursive commands to fail painfully. For instance, find / or du -ks / will cause the node to hang. In other words, do not launch recursive commands from / (the root directory) or /mfs. A workaround is to use find / \! -fstype mfs or locate. Note: that problem may be solved now (by unmounting a shared memory module), but I don't want to risk it...
Run rpm -Uvh --oldpackage fileutils-4.1-10.1.i386.rpm to avoid a bug where cp and mv return "cp: skipping file `foo', as it was replaced while being copied".
Presentation
Neumann is an x86 Linux cluster running OpenMosix. There are twelve nodes in the cluster, each with two AMD Athlon MP 2400+ CPUs, 2 GB of RAM and 60 GB of disk space. The main node, on which you log in, also has a 2 TB RAID array. A 1 Gbps Ethernet switch interconnects the nodes, and only the master node is connected to the outside world, via a 100 Mbps LAN. Each CPU has three floating point units able to work simultaneously.
Each node runs Red Hat Linux 8.0, fully patched, with a customized 2.4.20 kernel. OpenMosix is a kernel modification that makes a group of computers act like a multiprocessor (SMP) machine, so the simplest way to use Neumann is to just ignore it: it will run your CPU-intensive jobs on whatever node has spare CPU.
Working with OpenMosix
In addition to the current page, you might want to take a look at the following information:
- The OpenMosix project site.
- I strongly recommend taking a look at the OpenMosix HowTo pdf file, even if you are only a user.
- The most recent version of this HowTo is also available in other formats (including html) here.
The simplest way to use OpenMosix is to just ignore it. Just compile and run your CPU intensive jobs on the master node, and they will be dispatched automagically to whatever node has spare CPU cycles. And when a node becomes overloaded, processes will migrate to a more idle node.
MFS
One very useful part of OpenMosix is MFS (the Mosix FileSystem). You can access any drive on any node from any node using a path like /mfs/3/usr/src/: just prepend /mfs/NodeNum to the normal path. You don't need to do this for your home directory as it's already linked to /mfs/1/home/username. Note that when doing I/O on a local disk (like /), a process needs to migrate back so it can read the proper drive (/ on node 1, not / on node N); if on the other hand you use MFS, processes can read without migrating. So Command </mfs/1/dir/InputFile >/mfs/1/dir/OutputFile is much preferable to Command <InputFile >OutputFile. Remember this syntax when hard-coding pathnames in your programs; it is the simplest and best way to optimize your programs for OpenMosix.
In addition to /mfs/NodeNum/, there are virtual directories accessible on a 'per process' base such as /mfs/home or /mfs/here for quick access. Read the HowTo for more.
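For instance, a small job script could look like the following (hypothetical paths and program name; the point is only that every file goes through /mfs/1/... instead of the local mount):

    #!/bin/sh
    # read and write through MFS so the process can keep running on
    # whatever node it migrated to, instead of bouncing back to node 1
    DATA=/mfs/1/home/me/project
    ./solver <"$DATA/input.dat" >"$DATA/output.dat"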
Mosix Commands
If you want more in-depth control, here are a few tools, both X-windows and command line:
X-windows tool | Description |
---|---|
openmosixview | Displays the state of the various nodes and their load (the load includes both CPU and outstanding I/O). |
openmosixmigmon | Displays the jobs running on the various nodes. The green squares are jobs that have migrated away from the master node. You can click on one and drag it onto another 'penguin' to migrate it manually, or issue migrate PID node# from the command line. I would advise you to run your large jobs with mosrun -j2-12 -F job and then move them around manually with openmosixmigmon. In other words, try not to run your big jobs on the master node (neumann, 1) as it's already busy handling network I/O, interactive users and more. |
openmosixprocs | Process box (similar to the Linux system monitor). Double-click a process to change its settings (nice, locked, migration...). |
openmosixhistory | Process history for the cluster. |
openmosixanalyser | Similar to xload for the entire cluster. |

Command-line tool | Description |
---|---|
mosrun | Starts a job in a specific way; for instance mosrun -j2-12 -F job will start a job on one of the slave nodes and it will not migrate automatically after that. mosrun -c job for CPU-intensive jobs, mosrun -i job for I/O-intensive... man mosrun for more. Several of the mosrun options have command-line aliases, for instance nomig to launch a process not allowed to migrate. |
migrate PID [Node] | Manually migrates a process to another node. Note that if the process is not node-locked it may change node later. |
mps and mtop | OpenMosix-aware replacements for the ps and top utilities. |
mosctl | Controls the way a node behaves in the cluster (whether it's willing to run remote jobs and more...). man mosctl for more. |
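A typical session with these tools might look like this (a sketch; the PID and node numbers are of course examples):

    # keep big jobs off the master: start them on nodes 2-12 and
    # prevent automatic migration afterwards
    mosrun -j2-12 -F ./bigjob &

    # watch the load, then move a process by hand if a node gets crowded
    mtop
    migrate 12345 7      # send PID 12345 to node 7

    # discourage new jobs from landing on the master
    mosctl setspeed 7000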
Advanced clusterization
If you write shell scripts, you can use the power of OpenMosix simply by forking jobs to other nodes with '&'. In C, use the fork() function; there is a code sample in chapter 12 of the OpenMosix HowTo. Also see the HowTo for more information on Python, Perl, Blast, PVM, MPICH... Note that Java programs don't migrate.
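A minimal shell sketch (hypothetical program and file names): each backgrounded process is an ordinary Linux process, so OpenMosix is free to migrate it to an idle node.

    #!/bin/sh
    # fork one simulation per input file; openMosix spreads them over the nodes
    for f in run*.in; do
        ./simulate "$f" >"${f%.in}.out" &
    done
    wait    # block until every forked job has finished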
OpenMosix is designed to let you run multiple jobs simultaneously in a simple way; you cannot run a single job (even a multithreaded one) on more than one processor. If you need to do that, you need to use an MPI (Message Passing) library. MPICH is the MPI implementation that is compatible with OpenMosix; version 1.2.5 is installed in /usr/local/mpich for use with the PGI compilers (not the GNU compilers). There is also an MPI library shipped with RedHat Linux, but there is no guarantee about its OpenMosix compatibility, so make sure you are using the proper one. Also read this short page about MPICH on OpenMosix (it tells you to use a MOSRUN_ARGS environment variable), and a more recent discussion about Looking for Portable MPI I/O Implementation.
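A hedged sketch of that workflow (the wrapper names and the MOSRUN_ARGS value are assumptions based on the /usr/local/mpich location and the page mentioned above; check your local installation):

    # compile with the MPICH wrappers built for the PGI compilers
    export PATH=/usr/local/mpich/bin:$PATH
    mpicc -O2 -tp athlonxp -fast -o mpijob mpijob.c

    # hint openMosix to place the spawned ranks on the slave nodes
    export MOSRUN_ARGS="-j2-12"      # illustrative; see the page linked above
    mpirun -np 8 ./mpijob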
If you need checkpoint and restart facility for extremely time consuming problems (the cluster can reboot and the computation continues), you can ask for Condor to be installed on top of OpenMosix.
You normally do not need to log onto the slave nodes, but it is possible: just type ssh neumann2 (or neumann3 ... neumann12). You can use the usual ssh-keygen -t dsa, cat ~/.ssh/*.pub >>~/.ssh/authorized_keys, ssh-agent $SHELL and ssh-add to define a (possibly empty) passphrase so that you won't have to enter your password each time. Note that you can access neumann# only from neumann, '#' being the node number (2 to 12). See below.
DSH, The Distributed Shell
DSH allows you to run a command on all the nodes of the cluster simultaneously. For instance:
- dsh -a reboot: reboots the entire (-a) cluster.
- dsh -c -g slaves "shutdown -h now": gracefully shuts down all the slave nodes.
- dsh -a df: executes df on every node, in order (option -c executes concurrently instead).
- dsh -c -g custom /home/CustomJob: runs a custom job on the computers defined in the file ~/.dsh/group/custom (see the sketch below), so that even if the master node crashes they shouldn't be affected (but bear in mind that /home is a symlink to /mfs/1/home).
- dsh -a passwd: the password files need to be the same on all the nodes, so you need to change your password everywhere at once... Sorry, but you have to type it in 24 times!
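Group membership for -g is just a plain list of hostnames under ~/.dsh/group/ (the machines.list file used by -a is an assumption about the default dsh configuration; adjust to your installation):

    # one hostname per line in each group file
    mkdir -p ~/.dsh/group
    printf '%s\n' neumann2 neumann3 neumann4 > ~/.dsh/group/slaves
    # (add neumann5 ... neumann12 the same way)
    # ~/.dsh/machines.list plays the same role for "dsh -a"

    dsh -c -g slaves "uptime"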
Setting up password-less access (necessary for dsh).
- Run ssh-keygen -t dsa from a shell. It creates a secret key and a public key. Use an empty passphrase.
- Copy the public key onto any machine you want to access: cat ~/.ssh/*.pub >>~/.ssh/authorized_keys. Here we copy to the same machine because the home directory is shared by every node (except for root, where you need to do cat ~/.ssh/*.pub >>/mfs/Node#/root/.ssh/authorized_keys for every node).
- Add ssh-agent $SHELL to your login file (.profile or .bash_profile).
- Type ssh-add each time after you log in, and type your passphrase.
- You can now type ssh neumann# to log onto any node (or use dsh) without a password. The first time, you have to answer yes to add the node to the list of known hosts.
Cluster administration
Kernels
Several kernels are available to boot from, depending on use and issue to troubleshoot:
2.4.20-openmosix-gd
Optimized OpenMosix kernel. This is the default kernel. Note that this kernel is slightly different on the master than on the slaves (although named the same). Everything that wasn't strictly necessary has been removed from the slave kernels (no CD, no floppy, no SCSI, no enhanced video, no sound...) and almost as much from the master kernel.
2.4.20gd
Optimized 2.4.20 kernel
2.4.20-openmosix2smp
Basic openmosix kernel
2.4.20
Vanilla 2.4.20 kernel, no customization.
2.4.18-14smp
Default multiprocessor RedHat 8.0 kernel
2.4.18-14
Basic RedHat 8.0 kernel, if all else fails.
A tarred copy of the slave kernel is kept in /boot on the master node. Note that when you run make xconfig dep clean bzImage modules modules_install install, the last phase (install) cannot work if you try to recompile the kernel currently in use; reboot into another kernel first.
Important note: if you recompile a kernel, you need to edit /boot/grub/grub.conf after running make install and change root=LABEL=/ to root=/dev/hda3 (master) or root=/dev/hda7 (slaves). If you get a kernel panic on boot, you probably forgot to do that.
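For reference, the resulting grub.conf entry looks something like this (illustrative; the kernel and initrd file names must match your own build, and the root device comes from the partition table further down):

    title openMosix (2.4.20-openmosix-gd)
            root (hd0,0)
            kernel /vmlinuz-2.4.20-openmosix-gd ro root=/dev/hda3
            initrd /initrd-2.4.20-openmosix-gd.img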
There is a complete disk image of the neumann2 slave in the file Slave.img.gz. It was created by putting the slave drive in the master node and running dd if=/dev/hdd | gzip >Slave.img.gz. You can create a new slave by putting a blank hard drive in the master (or any node) and running zcat Slave.img.gz >/dev/hdd. It takes about 12 hours, and OpenMosix should be disabled while you do it (/etc/init.d/openmosix stop). After completion, you need to change the node hostname and IP with /usr/bin/neat, as well as various settings that may have changed since the image file was created (latest kernel...). Note: if you connect the drive as primary slave it is /dev/hdb, secondary master is /dev/hdc and secondary slave is /dev/hdd.
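Put together, cloning a new slave goes roughly like this (a sketch; the device name depends on where you plug the blank drive, as noted above):

    # on the master, with the blank drive connected as secondary slave
    /etc/init.d/openmosix stop        # keep jobs from migrating during the copy
    zcat Slave.img.gz >/dev/hdd       # takes about 12 hours
    /etc/init.d/openmosix start

    # move the drive to the new node, boot it, then fix its identity
    /usr/bin/neat                     # set hostname and IP, check other settings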
Boot sequence
- ECC memory scrubbing is the first thing to start and makes it look like the computer is stuck for a minute when just turned on (screen stays blank, no POST beeps...). Important: do not turn the machine off during the sequence.
- The 3ware hardware raid controller BIOS setup runs briefly (press [Alt][3] to enter its configuration mode)
- Then the BIOS starts several tests. Press [F2] to interrupt and enter the BIOS. Password is the same as the root password. Everything that wasn't necessary has been disabled: no floppy, CD, USB, serial or parallel port. Keyboard and mouse are optional. Full ECC (memory parity) recovery is enabled. At the end the BIOS shows a summary page for 30 seconds.
- The Intel boot agent takes over; press [Ctrl][S] to enter it, but there's nothing useful in it.
- The GRUB boot loader starts and waits for 30 seconds to boot alternate kernels.
- The Linux kernel starts. If there has been an unclean shutdown, you have 5 seconds to press Y to run a disk integrity check. Once the kernel has finished loading, OpenMosix is operational.
OpenMosix administration
The user IDs must be the same on all the nodes. In other words, /etc/passwd, /etc/shadow and /etc/group must be identical on all nodes. You can edit and run /root/AddUser.sh to add a user on all nodes.
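The actual /root/AddUser.sh is site-specific; a minimal sketch of the idea, assuming dsh is set up with a 'slaves' group and forcing the same UID everywhere so the accounts match across nodes:

    #!/bin/sh
    # usage: AddUser.sh login uid
    LOGIN=$1
    NEWUID=$2
    # master: create the account and its home directory (shared via /mfs/1/home)
    /usr/sbin/useradd -u "$NEWUID" -m "$LOGIN"
    # slaves: create the same account, but no local home directory
    dsh -c -g slaves "/usr/sbin/useradd -u $NEWUID -M $LOGIN"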
Use /etc/init.d/openmosix {start|stop|status|restart} to control OpenMosix on the current node.
/etc/mosix.map contains the list of nodes and must be identical on all nodes; keep it consistent with /etc/hosts. Autodiscovery is not used.
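The map format is 'node-number IP-address number-of-consecutive-nodes' (see the HowTo for details), so a cluster like this one can be described with a single line; the addresses below are illustrative:

    # /etc/mosix.map -- must be identical on every node
    # node   IP            range
    1        192.168.1.1   12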
If a slave node crashes, the jobs that had migrated to it are lost (and the cluster may hang for a while while it sorts things out). If the master node crashes, all jobs are lost (except those started somewhere else). This is because a job that has migrated away keeps some references/pointers/resources on its home node.
To keep the master node from running too many jobs, either migrate them manually with migrate or openmosixmigmon, or set its speed to some low value with mosctl setspeed 7000.
Hard Drives
Partitions
Partitions are slightly different on the master than on the slaves.
Mount | Type | Size | Master | Slaves |
---|---|---|---|---|
/boot | ext3 | 100 MB | /dev/hda1 | /dev/hda1 |
swap | swap | 4 GB | /dev/hda2 | /dev/hda3 |
/ | ext3 | 1 GB | /dev/hda3 | /dev/hda7 |
/usr | ext3 | 4 GB | /dev/hda5 | /dev/hda2 |
/var | ext3 | 1 GB | /dev/hda6 | /dev/hda6 |
/tmp | ext3 | 1 GB | /dev/hda7 | /dev/hda5 |
/home | ext3 | 47 GB | /dev/hda8 | |
/spare | ext3 | 47 GB | | /dev/hda8 |
/raid | ext3 | 1.7 TB | /dev/sda1 | |
Note that there's no specific use for the 11 /spare partitions (totaling about 500 GB) present on the slaves. Here is my suggestion: if you have I/O-intensive jobs, put your data on one of those partitions (say /mfs/7/spare/) and then run your job in a similar way, mosrun -j7 -F -i "job </mfs/7/spare/LargeDataFile", so that the job stays close to its data and doesn't use the network. Remember to delete your data afterwards (or back it up on /raid).
RAID management
The RAID is based on a 3ware Escalade 7500-8 hardware controller with eight 250 GB Maxtor hard drives, configured as RAID-5 with parity spread over all the drives. One drive may fail without causing any data loss. After a drive failure, the array must either be rebuilt on the remaining drives, or a blank drive may be inserted to replace the failed one prior to rebuilding. In case of drive failure, an email is sent to the admin.
Configuration is done either at boot in the 3ware utility (for basic configuration), with cli, or preferably with 3dm through a web browser. From neumann itself, type mozilla http://localhost:1080/ to manage the array.
U: drive and remote mounts
Some NFS mounts are available under /mnt/, for instance personal network files are mounted under /mnt/udrive/yourusername. Note that this is the Engineering network username, not the Neumann username, which may be different.
Installed software and services
Compilers and their options
The AMD Athlon MP processors can run native x86 code, but for full performance you should use the athlon-specific compiler options (SSE and 3DNow!) whenever applicable. One of the main differences between Intel and AMD is that the Athlon has 3 floating point units (FPUs), leading to a theoretical performance of 2 double-precision multiplications and 2 single-precision additions per clock cycle. The actual performance is 3.4 FLOPs per clock, leading to a peak performance of 160 Gflops for the entire cluster.
The GNU compilers (g77, gcc, gcj) are the Linux defaults. Use g77 --target-help or gcc --target-help to get the relevant optimization options for your programs and add them to your makefile (a compile example follows the tables below):
For the PGI compilers (pgcc, pgCC, pgdbg, pgf77, pgf90, pghpf...) you need the proper variables set ($PGI, $MANPATH...) in your ~/.profile or ~/.bash_profile before you can use them or read their man pages. Note that PGI Workstation is limited to 4 processors per thread and that you can compile only from node 1.
CC | CFLAGS | Description |
---|---|---|
gcc | -O3 -march=athlon-mp | GNU compiler 3.2 |
pgcc | -O2 -tp athlonxp -fast | Portland group C compiler |
pgCC | -O2 -tp athlonxp -fast | Portland group C++ compiler |
bcc | -3 | Bruce's C compiler |
FC | FFLAGS | Description |
---|---|---|
g77 | -O3 -march=athlon-mp | G77 3.2 |
pgf77 | -O2 -tp athlonxp -fast | Portland group Fortran 77 compiler |
pgf90 | -O2 -tp athlonxp -fast | Portland group Fortran 90 compiler |
pghpf | -O2 -tp athlonxp -fast | Portland group High Performance Fortran compiler |

Other languages | Flags | Description |
---|---|---|
gcj | ? | Ahead-of-time GNU compiler for the Java language |
perl | ? | The Perl interpreter |
python | ? | An interpreted, interactive, object-oriented programming language |
pgdbg | ? | The PGI X-windows debugger |
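For example, building the same program with the options from the tables above (illustrative file names):

    # GNU compilers with Athlon MP optimizations
    gcc -O3 -march=athlon-mp -o solver solver.c
    g77 -O3 -march=athlon-mp -o solver solver.f

    # Portland Group compilers, same idea
    pgcc -O2 -tp athlonxp -fast -o solver solver.c
    pgf77 -O2 -tp athlonxp -fast -o solver solver.f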
Services running on Neumann (5 is the normal runlevel)
Active services | Disabled services |
---|---|
anacron apmd atd autofs crond gpm httpd isdn keytable kudzu lpd netfs network nfslock ntpd openmosix pcmcia portmap random rawdevices rhnsd sendmail sshd syslog xfs xinetd | aep1000 bcm5820 firstboot iptables irda named nfs nscd saslauthd smb snmpd snmptrapd tux winbind |
xinetd based services: | |
mosstatd sgi_fam | chargen chargen-udp daytime daytime-udp echo echo-udp rsync servers services time time-udp |
Specific software
Type /mnt/fluent/bin/fluent to start the Fluent application (remotely mounted from gohan). Also installed is xfig. I'm having trouble compiling Grace, a successor to xmgr.
If you need to work with MS Office documents, you can start OpenOffice with /usr/lib/openoffice/program/soffice, /usr/lib/openoffice/program/scalc or the other programs in that directory.
The PGI Workstation suite of compilers is installed in /usr/pgi (see above).
Software packages installed on Neumann
Here's the short list below. For a complete list with version numbers and comments, see the complete RPM list or type rpm -qa --info at the command line.
4Suite gal libf2c netconfig setuptool a2ps gawk libgal19 netpbm sgml acl gcc libgcc newt sh alchemist gconf libgcj nfs shadow anacron GConf libghttp nmap slang apmd GConf2 libglade nscd slocate ash gd libglade2 nss_ldap slrn aspell gdb libgnat ntp sox at gdbm libgnome ntsysv specspo atk gdk libgnomecanvas oaf splint attr gdm libgnomeprint octave star audiofile gedit libgnomeprint15 Omni statserial authconfig gettext libgnomeprintui omtest strace autoconf gftp libgnomeui openjade stunnel autofs ggv libgtop2 openldap sudo automake ghostscript libIDL openmosix swig automake14 gimp libjpeg openmosixview switchdesk automake15 glib libmng openmotif sysklogd basesystem glib2 libmrproject openoffice syslinux bash glibc libogg openssh SysVinit bc Glide3 libole2 openssl talk bdflush glut libpcap orbit tar bind gmp libpng ORBit tcl binutils gnome libpng10 ORBit2 tcpdump bison gnumeric librpm404 pam tcp_wrappers bitmap gnupg librsvg2 pam_krb5 tcsh blas gnuplot libstdc++ pam_smb telnet bonobo gpg libtermcap pango termcap byacc gphoto2 libtiff parted tetex bzip2 gpm libtool passivetex texinfo cdecl gqview libungif passwd textutils chkconfig grep libunicode patch time chkfontpath groff libusb patchutils timeconfig compat grub libuser pax tk comps gtk libvorbis pciutils tmpwatch control gtk+ libwnck pcre traceroute cpio gtk2 libwvstreams perl transfig cpp gtkam libxml php ttfprint cracklib gtkhtml libxml2 pilot tux crontabs gtkhtml2 libxslt pine umb cups guile lilo pinfo units curl gzip linc pkgconfig unix2dos cvs hdparm linuxdoc pmake unzip cyrus hesiod lockdev pnm2ppa up2date db4 hotplug logrotate popt urw desktop hpijs logwatch portmap usbutils dev htmlview lokkit postfix usermode dev86 httpd losetup ppp utempter dhclient hwbrowser LPRng procmail util dia hwcrypto lrzsz procps VFlib2 dialog hwdata lsof psmisc vim diffstat ImageMagick ltrace pspell vixie diffutils imlib lvm psutils vte docbook indent lynx pvm w3c dos2unix indexhtml m4 pygtk2 webalizer dosfstools info magicdev pyOpenSSL wget doxygen initscripts mailcap python which dump intltool mailx pyxf86config whois e2fsprogs iproute make PyXML WindowMaker ed iptables MAKEDEV qt wireless eel2 iputils man quota words eject irda maximum raidtools wvdial emacs isdn4k mc rcs Xaw3d eog jadetex memprof rdate xdelta esound jfsutils metacity rdist xfig ethereal kbd mingetty readline XFree86 ethtool kbdconfig minicom redhat Xft evolution kernel mkbootdisk reiserfs xinetd expat krb5 mkinitrd rhl xinitrc expect krbafs mktemp rhn xisdnload fam ksymoops mod_perl rhnlib xml fbset kudzu mod_python rhpl xmltex fetchmail lam mod_ssl rmt xmlto file lapack modutils rootfiles xpdf filesystem less mount rp xsane fileutils lesstif mouseconfig rpm xscreensaver findutils lftp mozilla rpm404 xsri finger lha mpage rsh Xtest firstboot libacl mrproject rsync yelp flex libaio mt samba yp fontconfig libart_lgpl mtools sane ypbind foomatic libattr mtr screen zip fortune libbonobo mutt scrollkeeper zlib freetype libbonoboui nautilus sed ftp libcap ncurses sendmail gail libcapplet0 nedit setserial gaim libelf net setup