HEP Cluster


General information

Hardware:
  • CPU: Intel® Core i7 920 @ 2.67 GHz (quad core)
  • Memory: 12GB/node (DDR3 1333MHz)
  • Network: 1 Gbit/s
Software:
  • Operating System:
      CentOS 5.4
  • Queuing System
      Torque & Maui
  • Compilers
      GNU C, C++, FORTRAN 90 & 77 (version 4.1.2 20080704 (Red Hat 4.1.2-46))
      Intel C, C++, FORTRAN 90 compiler (version 10.1 20070913)
  • Programming libraries:
      LAPACK, BLAS, ATLAS, GSL
  • Parallel execution environment:
      LAM-MPI (version 7.1.2)
  • Queues (see the submission example after this list):
      long: walltime INFINITY (no limit)
      short: walltime 2 hours
      parallel: walltime INFINITY (no limit)
      batch: type route (a routing queue that forwards jobs to an execution queue)
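
If no queue is given at submission time, the job normally goes to the server's default queue (on most setups the routing queue); a specific queue can also be requested with the -q flag of qsub. A minimal sketch, where test_pbs.sh is an illustrative script name:

      qsub -q short test_pbs.sh        # short jobs, up to 2 hours walltime
      qsub -q long test_pbs.sh         # long serial jobs, no walltime limit
      qsub -q parallel test_pbs.sh     # parallel MPI jobs, no walltime limit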

Connecting

The preferred method of connecting to Soliton is "ssh" (Secure Shell). For example:
      ssh username@soliton.physics.uoc.gr
where "username" is your user name on Soliton and soliton.physics.uoc.gr is the front-end node.

Changing Password

One of the first things you should do after successfully logging into Soliton is to change your password. To change it, simply type yppasswd: you will first be asked for your current password, then for the new password twice. If you do not change your initial password within two weeks of getting your account, you will be locked out of the system.
      yppasswd
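
After typing yppasswd, the prompts look roughly like the following (the exact wording depends on the NIS tools installed):

      Changing NIS account information for username on soliton.physics.uoc.gr.
      Please enter old password:
      Please enter new password:
      Please retype new password: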

Job submission and control

After creating the executable of a serial or parallel application on the front-end node, you are ready to run a job on a set of compute nodes. Torque is a resource manager that provides control over batch jobs and distributed compute nodes. The following describes the fundamental usage of Torque; for further details, please refer to the online Torque man pages.
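
For reference, a serial program is typically compiled on the front-end node with one of the installed compilers, and an MPI program with the LAM/MPI compiler wrapper. The commands below are a sketch; the source and executable names are illustrative:

      # serial program built with the GNU compiler, linked against GSL
      gcc -O2 -o myprog myprog.c -lgsl -lgslcblas -lm
      # parallel program built with the LAM/MPI wrapper
      mpicc -O2 -o myprog_mpi myprog_mpi.c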

Submitting a job:
In a PBS system, a submitted job must be a shell script, not a binary; the script is executed on the compute node(s). Assume the script is saved as test_pbs.sh. To submit it, use the qsub command:

      qsub test_pbs.sh
If the submission succeeds, the allocated job ID is displayed:
      2176.cl-hepX.physics.uoc.gr
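
The submitted script itself can be very simple. The following is a minimal sketch of what test_pbs.sh might contain; the job name, queue, and walltime are illustrative:

      #!/bin/bash
      #PBS -N test_pbs
      #PBS -q short
      #PBS -l walltime=00:10:00
      # report which compute node ran the job
      hostname
      date
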
If the Torque command qstat is run just after submitting a job with qsub, you can see that the submitted job is in the queue:
    % qstat
    Job id           Name             User             Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    2176.cl-hepX     test_pbs.sh      username         93:06:00 R serial
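
For the full record of a single job (requested resources, output paths, and so on), qstat also accepts the -f flag, for example:

      qstat -f 2176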

Deletion of a submitted job:
To delete a submitted job, use the qdel command. Suppose the qstat command outputs the following:

    % qstat
    Job id           Name             User             Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    2176.cl-hepX     test_pbs0.sh     username1        93:06:00 R serial
    2177.cl-hepX     test_pbs1.sh     username2        93:06:00 R serial
qdel takes one argument, the ID of the job to delete:
      qdel 2177
    % qstat
    Job id           Name             User             Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    2176.cl-hepX     test_pbs0.sh     username1        93:06:00 R serial

Show job allocation per node:
Another command of interest is "pbsnodes", which shows the job allocation per node:

    cl-hepX
      state = free
      np = 2
      ntype = cluster
      jobs = 0/2177.cl-hepX.physics.uoc.gr
      status = opsys=linux,uname=Linux cl-hepX 2.6.10 #6 SMP Tue Jan 25 09:31:07 EET 2005 i686,
               sessions=10354,nsessions=1,nusers=1,idletime=38310582,
               totmem=2076908kb,availmem=2050752kb,physmem=2076908kb,
               ncpus=2,loadave=1.00,netload=2674130409,state=free,
               rectime=1186135909
    ...
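
To get a quick list of nodes that are currently down or offline, pbsnodes can also be run with the -l flag:

      pbsnodes -l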

Torque Script examples:
Serial (sequential) job script:

#!/bin/bash
#PBS -N Myjob
#PBS -l walltime=01:00:00
#PBS -o mypath/myjob.out
#PBS -e mypath/myjob.err
#PBS -j oe
#PBS -q serial
#PBS -M mail_list
#PBS -m mail_options
#PBS -d path
/path/program_name

Quick explanation:
#PBS -N Myjob                    Defines the job name in the queue
#PBS -l walltime=01:00:00        Maximum execution time for this job
#PBS -o mypath/myjob.out         The full path and filename in which to store the standard output of the program
#PBS -e mypath/myjob.err         The full path and filename in which to store the standard error of the program
#PBS -j oe                       Declares that the standard error stream of the job is merged with its standard output stream
#PBS -q serial                   Defines the destination of the job. The destination names a queue, a server, or a queue at a server.
#PBS -M mail_list                Declares the list of users to whom mail is sent by the execution server when it sends mail about the job
#PBS -m mail_options             Defines the set of conditions under which the execution server will send a mail message about the job
#PBS -d path                     Defines the working directory path to be used for the job
/path/program_name               The name of the executable file with its full path
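
Assuming the script above is saved as myjob_serial.sh (an illustrative name), it is submitted and monitored like any other Torque job:

      qsub myjob_serial.sh
      qstat -u username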

Parallel job script:

#!/bin/bash
#PBS -N Myjob
#PBS -l nodes=4:ppn=2
#PBS -l walltime=01:00:00
#PBS -o mypath/myjob.out
#PBS -e mypath/myjob.err
#PBS -j oe
#PBS -q parallel
#PBS -M mail_list
#PBS -m mail_options
#PBS -d path
MACHINEFILE=$PBS_NODEFILE
lamboot -v $MACHINEFILE
mpirun -np X /path/program_name
lamclean -v
lamhalt -v

Quick explanation:
#PBS -N Myjob                    Defines the job name in the queue
#PBS -l nodes=4:ppn=2            Number of nodes and number of CPUs per node (parallel jobs only)
#PBS -l walltime=01:00:00        Maximum execution time for this job
#PBS -o mypath/myjob.out         The full path and filename in which to store the standard output of the program
#PBS -e mypath/myjob.err         The full path and filename in which to store the standard error of the program
#PBS -j oe                       Declares that the standard error stream of the job is merged with its standard output stream
#PBS -q parallel                 Defines the destination of the job. The destination names a queue, a server, or a queue at a server.
#PBS -M mail_list                Declares the list of users to whom mail is sent by the execution server when it sends mail about the job
#PBS -m mail_options             Defines the set of conditions under which the execution server will send a mail message about the job
#PBS -d path                     Defines the working directory path to be used for the job
MACHINEFILE=$PBS_NODEFILE        The list of execution nodes allocated to the job by Torque (required - do not change)
lamboot -v $MACHINEFILE          Initializes the LAM/MPI parallel execution environment on those nodes (required - do not change)
mpirun -np X /path/program_name  Executes the job in the MPI parallel environment. The -np X flag defines the number of required CPUs (np: number of processors); X should match the CPUs requested with nodes and ppn.
lamclean -v                      Cleans up the LAM environment
lamhalt -v                       Stops the LAM environment
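
As a usage sketch, assuming the MPI executable was built with the LAM/MPI wrapper and the script above is saved as myjob_parallel.sh (both names are illustrative), set X to the number of CPUs requested (nodes x ppn, i.e. 8 in this example) and submit:

      mpicc -O2 -o /path/program_name program_name.c
      qsub myjob_parallel.sh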