HEP Cluster
General information
Hardware:
- CPU: Intel® Core™ i7 920 @ 2.67 GHz (quad core)
- Memory: 12GB/node (DDR3 1333MHz)
- Network: 1 Gbit
Operating System:
- CentOS 5.4
Queuing System:
- Torque & Maui
Compilers:
- GNU C, C++, FORTRAN 90 & 77 (version 4.1.2 20080704 (Red Hat 4.1.2-46))
- Intel C, C++, FORTRAN 90 compiler (version 10.1 20070913)
Programming libraries:
- LAPACK, BLAS, ATLAS, GSL
Parallel execution environment:
- LAM-MPI (version 7.1.2)
Queues:
- long:
  - walltime: INFINITY
- short:
  - walltime: 2 hours
- parallel:
  - walltime: INFINITY
- batch:
  - type: route
Connecting
The preferred method of connecting to Soliton is "ssh" (Secure Shell). The following is an example of connecting to Soliton:
- ssh username@soliton.physics.uoc.gr
Changing Password
One of the first things you should do after successfully logging into Soliton is to change your password. To change your password, simply type yppasswd. You will first be prompted for your current password, and then for a new password twice. If you don't change your initial password within two weeks of getting your account, you will be locked out of the system.
- yppasswd
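An illustrative session follows; the exact wording of the prompts may differ slightly on Soliton:
% yppasswd
Changing NIS account information for username on soliton.physics.uoc.gr.
Please enter old password:
Please enter new password:
Please retype new password:
The NIS password has been changed on soliton.physics.uoc.gr.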
Job submission and control
After creating the executable of a serial or parallel application on the front-end node, you are ready to run a job on a set of compute nodes. Torque is a resource manager providing control over batch jobs and distributed compute nodes. The following describes the fundamental usage of Torque; for further detail, please refer to the Torque man pages online.
Submitting a job:
In a PBS system, a submitted job must be a shell script file, not a binary file. The script is executed on the compute node(s). Let us assume that the script is saved as test_pbs.sh. To submit it, use the qsub command, which prints the identifier of the new job:
% qsub test_pbs.sh
2176.cl-hepX.physics.uoc.gr
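Directives can also be passed as qsub command-line options, which override the corresponding #PBS lines in the script. A sketch using standard Torque flags (the queue, limit, and name shown are just examples):
% qsub -q short -l walltime=01:30:00 -N mytest test_pbs.sh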
To check the status of submitted jobs, use the qstat command:
% qstat
Job id            Name           User        Time Use  S  Queue
----------------  -------------  ----------  --------  -  ------
2176.cl-hepX      test_pbs.sh    username    93:06:00  R  serial
To delete a submitted job, use the qdel command. Let's assume the qstat command gives the following output:
% qstat
Job id            Name           User        Time Use  S  Queue
----------------  -------------  ----------  --------  -  ------
2176.cl-hepX      test_pbs0.sh   username1   93:06:00  R  serial
2177.cl-hepX      test_pbs1.sh   username2   93:06:00  R  serial
% qdel 2177
% qstat
Job id            Name           User        Time Use  S  Queue
----------------  -------------  ----------  --------  -  ------
2176.cl-hepX      test_pbs0.sh   username1   93:06:00  R  serial
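To inspect the queues themselves, qstat -q (a standard Torque option) prints a per-queue summary. The output below is abridged and illustrative; the counts are not Soliton's actual figures:
% qstat -q

server: cl-hepX.physics.uoc.gr

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
long               --      --       --      --    2   0 --   E R
short              --      --    02:00:00   --    0   1 --   E R
parallel           --      --       --      --    1   0 --   E R
batch              --      --       --      --    0   0 --   E R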
Another command of interest is "pbsnodes", which shows job allocation per node.
% pbsnodes
cl-hepX
     state = free
     np = 2
     ntype = cluster
     jobs = 0/2177.cl-hepX.physics.uoc.gr
     status = opsys=linux,uname=Linux cl-hepX 2.6.10nodes #6 SMP Tue Jan 25 09:31:07 EET 2005 i686,sessions=10354,nsessions=1,nusers=1,idletime=38310582,totmem=2076908kb,availmem=2050752kb,physmem=2076908kb,ncpus=2,loadave=1.00,netload=2674130409,state=free,rectime=1186135909
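As a quick health check, pbsnodes -l (also a standard option) lists only the nodes that are down or offline; empty output means all nodes are up. The node name below is hypothetical:
% pbsnodes -l
cl-hepY              down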
Serial (sequential) job script:
#PBS -N Myjob
#PBS -l nodes=4:ppn=2
#PBS -l walltime=01:00:00
#PBS -o mypath/myjob.out
#PBS -e mypath/myjob.err
#PBS -j oe
#PBS -q serial
#PBS -M mail_list
#PBS -m mail_options
#PBS -d path
/path/program_name
Directive | Description
--------- | -----------
#PBS -N Myjob | Defines the job name in the queue.
#PBS -l walltime=01:00:00 | Maximum execution time for this job.
#PBS -o /mypath/myjob.out | Full path and filename for the job's standard output.
#PBS -e /mypath/myjob.err | Full path and filename for the job's standard error.
#PBS -j oe | Declares whether the standard error stream of the job is merged with its standard output stream.
#PBS -q serial | Defines the destination of the job: a queue, a server, or a queue at a server.
#PBS -M mail_list | Declares the list of users to whom the execution server sends mail about the job.
#PBS -m mail_options | Defines the conditions under which the execution server sends mail about the job.
#PBS -d path | Defines the working directory path used for the job.
/path/program_name | The name of the executable file with its full path.
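As a concrete illustration, here is a minimal filled-in serial script (the program name and paths are hypothetical), saved e.g. as hello_serial.sh:

#PBS -N hello_serial
#PBS -l walltime=00:10:00
#PBS -j oe
#PBS -o /home/username/hello.out
#PBS -q serial
#PBS -d /home/username
/home/username/bin/hello

Submit it with:
% qsub hello_serial.sh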
Parallel job script:
#PBS -N Myjob
#PBS -l nodes=4:ppn=2
#PBS -l walltime=01:00:00
#PBS -o /mypath/myjob.out
#PBS -e /mypath/myjob.err
#PBS -j oe
#PBS -q parallel
#PBS -M mail_list
#PBS -m mail_options
#PBS -d path
MACHINEFILE=$PBS_NODEFILE
lamboot -v $MACHINEFILE
mpirun -np X /path/program_name
lamclean -v
lamhalt -v
Directive | Description
--------- | -----------
#PBS -N Myjob | Defines the job name in the queue.
#PBS -l nodes=4:ppn=2 | Number of nodes and number of CPUs per node. For parallel jobs only.
#PBS -l walltime=01:00:00 | Maximum execution time for this job.
#PBS -o /mypath/myjob.out | Full path and filename for the job's standard output.
#PBS -e /mypath/myjob.err | Full path and filename for the job's standard error.
#PBS -j oe | Declares whether the standard error stream of the job is merged with its standard output stream.
#PBS -q parallel | Defines the destination of the job: a queue, a server, or a queue at a server.
#PBS -M mail_list | Declares the list of users to whom the execution server sends mail about the job.
#PBS -m mail_options | Defines the conditions under which the execution server sends mail about the job.
#PBS -d path | Defines the working directory path used for the job.
MACHINEFILE=$PBS_NODEFILE | Captures the list of allocated execution nodes (required; do not change).
lamboot -v $MACHINEFILE | Initializes the LAM/MPI parallel execution environment (required; do not change).
mpirun -np X /path/program_name | Executes the job in the MPI parallel environment; the -np X flag defines the number of required CPUs (np: number of processors).
lamclean -v | Cleans the LAM environment.
lamhalt -v | Stops the LAM environment.
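A filled-in parallel example follows (the program name and paths are hypothetical). Since Torque lists each node in $PBS_NODEFILE once per requested processor, counting its lines gives the CPU count, which keeps -np consistent with the nodes=...:ppn=... request instead of hard-coding it:

#PBS -N mpi_hello
#PBS -l nodes=4:ppn=2
#PBS -l walltime=00:30:00
#PBS -j oe
#PBS -o /home/username/mpi_hello.out
#PBS -q parallel
#PBS -d /home/username
MACHINEFILE=$PBS_NODEFILE
NP=$(wc -l < $PBS_NODEFILE)   # 4 nodes x 2 CPUs/node = 8 processors
lamboot -v $MACHINEFILE
mpirun -np $NP /home/username/bin/mpi_hello
lamclean -v
lamhalt -v

Submit it with qsub as before; once the job finishes, its output appears in the file given by -o.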