[toc]
SGE - High-Performance Cluster Management
SGE manages multi-node workloads: jobs are submitted from a single control node, and users do not need to care which node each job lands on, which makes it easy to draw on the cluster's resources. For example, with 5 machines of 8 cores each there are 40 cores in total; if I submit 1000 jobs from one of the machines, the system automatically distributes those 1000 jobs across the 40 cores.
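For a batch of many independent jobs like this, the usual pattern is an array job; a minimal sketch (the script name run_one.sh and the task count are placeholders; inside the script, $SGE_TASK_ID tells each task which index it is):
# qsub -cwd -t 1-1000 run_one.sh
# qstat   # the scheduler spreads the pending tasks over whatever slots are free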
Further reading:
- N1 Grid Engine 6 User's Guide
- N1 Grid Engine 6 Installation Guide
- N1 Grid Engine 6 Administration Guide
I. Prerequisites
- NFS is already set up and /data is shared across the nodes
- NIS is already set up (so user accounts are consistent across nodes)
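For reference, a minimal sketch of the NFS layout these notes assume (addresses follow the /etc/hosts examples below; adjust paths and networks to your own environment):
# On the NFS server (here the master, 192.168.100.1), /etc/exports:
/data 192.168.100.0/24(rw,sync,no_root_squash)
# On every node, /etc/fstab:
192.168.100.1:/data  /data  nfs  defaults  0 0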
II. Installation
- Operations on the master server
vi /etc/hosts
After opening /etc/hosts, it already contains the following lines; it is not clear whether they affect the later steps:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
Append the following after those two lines:
192.168.100.1 C01 C01.xx
192.168.100.2 G02 G02.xx
192.168.100.3 G03 G03.xx
- Change the hostname (it must match the name used in /etc/hosts, e.g. C01 on the master and G02/G03 on the compute nodes):
# Method 1:
# Edit the configuration file /etc/sysconfig/network:
NETWORKING=yes
HOSTNAME=C01
# Also write the name into /proc/sys/kernel/hostname (takes effect immediately but does not survive a reboot):
C01
# or:
# echo "C01" > /proc/sys/kernel/hostname
# Method 2:
# hostnamectl set-hostname C01
- Install the build dependencies
# yum -y install epel-release
# yum -y install jemalloc-devel openssl-devel ncurses-devel pam-devel libXmu-devel hwloc-devel hwloc hwloc-libs java-devel javacc ant-junit libdb-devel motif-devel csh ksh xterm db4-utils perl-XML-Simple perl-Env xorg-x11-fonts-ISO8859-1-100dpi xorg-x11-fonts-ISO8859-1-75dpi
- User and permissions
# groupadd -g 490 sgeadmin
# useradd -u 495 -g 490 -r -m -d /home/sgeadmin -s /bin/bash -c "SGE Admin" sgeadmin
# visudo, then add the following line (to allow the same operations without a password)
%sgeadmin ALL=(ALL) NOPASSWD: ALL
- Build and install
# cd /data/software/src
# wget -c https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge-8.1.9.tar.gz
# tar zxvfp sge-8.1.9.tar.gz
# cd sge-8.1.9/source/
# sh scripts/bootstrap.sh
# ./aimk
This step failed with:
BUILD FAILED
/data/src/sge-8.1.9/source/build.xml:85: The following error occurred while executing this line:
/data/src/sge-8.1.9/source/build.xml:30: Java returned: 1
Total time: 56 seconds
not done
The documentation explains:
This tries to build all the normal targets, some of which might be
problematic for various reasons (e.g. Java). Various `aimk` switches
provide selective compilation; use `./aimk -help` for the options, not
all of which may actually work, especially in combination.
Useful aimk options:
[horizontal]
`-no-qmon`:: Don't build `qmon`;
`-no-qmake`:: don't build `qmake`;
`-no-qtcsh`:: don't build `qtcsh`;
`-no-java -no-jni`:: avoid all Java-related stuff;
`-no-remote`:: don't build `rsh` etc.
(obsoleted by use of `ssh` and the SGE PAM module).
For the core system (daemons, command line clients, but not `qmon`) use
# ./aimk -only-core
In other words, not every component has to be built. Since the failure clearly happened in the Java step, I chose to drop the Java parts:
# ./aimk -no-java -no-jni
That build succeeded. Next, build the man pages:
# ./aimk -man
Then create SGE_ROOT and install the freshly built files into it:
# export SGE_ROOT=/data/software/gridengine && mkdir $SGE_ROOT
# echo Y | ./scripts/distinst -local -allall -libs -noexit
# chown -R sgeadmin.sgeadmin /data/software/gridengine
# cd $SGE_ROOT
# ./install_qmaster
The installer then walks through a series of prompts:
1.press enter at the intro screen
2.press “y” and then specify sgeadmin as the user id (sgeadmin)
3.leave the install dir as $SGE_ROOT (here /data/software/gridengine)
4.You will now be asked about port configuration for the master, normally you would choose the default (2) which uses the /etc/services file
5.accept the sge_qmaster info
6.You will now be asked about port configuration for the execution daemon; again, normally you would choose the default (2), which uses the /etc/services file
7.accept the sge_execd info
8.leave the cell name as “default”
9.Enter an appropriate cluster name when requested (Enter new cluster name or hit to use default [p6444] >>; I just pressed Enter, and the result was: creating directory: /data/software/gridengine/default/common, Your $SGE_CLUSTER_NAME: p6444)
10.leave the spool dir as is (press Enter to accept the default)
11.press "n" for no Windows hosts! (I chose "n" here; note it is not the default)
12.press “y” (permissions are set correctly)
13.press “y” for all hosts in one DNS domain
14.If you have Java available on your Qmaster and wish to use SGE Inspect or SDM then enable the JMX MBean server and provide the requested information - probably answer "n" at this point! (I chose "n" here; otherwise this step also throws errors)
15.press enter to accept the directory creation notification
16.enter “classic” for classic spooling (berkeleydb may be more appropriate for large clusters)
17.press enter to accept the next notice
18.enter “20000-20100” as the GID range (increase this range if you have execution nodes capable of running more than 100 concurrent jobs)
19.accept the default spool dir or specify a different folder (for example if you wish to use a shared or local folder outside of SGE_ROOT
20.enter an email address that will be sent problem reports
21.press “n” to refuse to change the parameters you have just configured
An error appeared here:
Command failed: ./utilbin/lx-amd64/spooldefaults
Command failed: configuration
Command failed: /tmp/configuration_2018-03-16_09:00:40.43362
Probably a permission problem. Please check file access permissions. Check read/write permission. Check if SGE daemons are running.
Re-running the installer made this error go away.
22.press enter to accept the next notice
23.press “y” to install the startup scripts
24.press enter twice to confirm the following messages
You should see messages like:
cp /data/software/gridengine/default/common/sgemaster /etc/init.d/sgemaster.p6444
/usr/lib/lsb/install_initd /etc/init.d/sgemaster.p6444
25.press “n” for a file with a list of hosts
26.enter the names of the hosts that will be able to administer and submit jobs (press Enter alone to finish adding hosts). I entered C01, Enter, G02, Enter, G03, Enter. (A string of garbage characters was also entered here by mistake, which may have caused problems later.)
27.skip shadow hosts for now (press "n")
28.choose "1" for normal configuration and agree with "y"
29.press enter to accept the next message and "n" to refuse to see the previous screen again, then finally press enter to exit the installer. You may verify your administrative hosts with the command # qconf -sh and add new administrative hosts with # qconf -ah
After the installation finishes:
# cp /data/software/gridengine/default/common/settings.sh /etc/profile.d/
# source /etc/profile
# qconf -ah G02
adminhost "G02" already exists
# qconf -ah G03
adminhost "G03" already exists
- Installation on the slave (execution) servers
Operations on the slave server G02:
compute01# yum -y install hwloc-devel
compute01# hostnamectl set-hostname G02
compute01# vi /etc/hosts
192.168.100.1 C01 C01.shhrp
192.168.100.2 G02 G02.shhrp gpuserver.hengrui.com
192.168.100.3 G03 G03.shhrp
compute01# groupadd -g 490 sgeadmin
The group sgeadmin already existed (probably from an earlier installation), so edit /etc/group with vim and change the GID of sgeadmin from 991 to 490.
compute01# useradd -u 495 -g 490 -r -m -d /home/sgeadmin -s /bin/bash -c "SGE Admin" sgeadmin
This reports that the user already exists, so edit /etc/passwd and change
sgeadmin:x:993:991:Grid Engine admin:/:/sbin/nologin
to
sgeadmin:x:495:490:SGE Admin:/:/bin/bash
visudo, then add the following (passwordless sudo, same as on the master):
%sgeadmin ALL=(ALL) NOPASSWD: ALL
Then:
compute01# export SGE_ROOT=/data/software/gridengine
compute01# export SGE_CELL=default
compute01# cd $SGE_ROOT
compute01# ./install_execd   # accept all the defaults
compute01# cp /data/software/gridengine/default/common/settings.sh /etc/profile.d/
The installation reported an error:
Checking hostname resolving
---------------------------
Cannot contact qmaster. The command failed:
./bin/lx-amd64/qconf -sh
The error message was:
denied: host "pp" is neither submit nor admin host
You can fix the problem now or abort the installation procedure.
The problem could be:
- the qmaster is not running
- the qmaster host is down
- an active firewall blocks your request
Fix:
# qconf -ah pp   # run this on the master node
Add the SGE paths to the environment variables:
# vim /etc/profile
# add a "# SGE" section containing:
#   export SGE_ROOT=/data/software/gridengine
#   export PATH="${SGE_ROOT}/bin/lx-amd64:$PATH"
# then reload:
# source /etc/profile
- Do the same for G03
# vim /etc/group
change
sgeadmin:x:981:
to
sgeadmin:x:490:
# vim /etc/passwd
change
sgeadmin:x:986:981:Grid Engine admin:/:/sbin/nologin
to
sgeadmin:x:495:490:SGE Admin:/:/bin/bash
Then follow the same steps as on G02.
- Finally, check on the master node whether everything succeeded
qhost
HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS
----------------------------------------------------------------------------------------------
global - - - - - - - - - -
G02 lx-amd64 32 2 16 32 2.96 125.6G 10.1G 120.0G 0.0
G03 lx-amd64 72 2 36 72 0.04 94.1G 3.2G 4.0G 0.0
If the output does not look like this, the daemons need to be restarted.
If a reinstall fails, run ps -ef | grep sge
and kill everything related to SGE.
This completes the installation.
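As a quick smoke test of the fresh installation (a sketch; -b y submits a binary directly without a wrapper script, and sam stands for any ordinary user that exists on all nodes):
$ qsub -b y -cwd /bin/hostname   # submit a trivial job as a normal user
$ qstat                          # the job should move from qw to r and then disappear
$ cat hostname.o*                # the output file contains the execution node's hostname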
III. Cluster restart
# On the control node:
# /etc/init.d/sgemaster.p6444 restart
# On the execution nodes:
# /etc/init.d/sgeexecd.p6444 restart
"au" means the node has a problem and needs a restart.
With the firewall confirmed to be off, one execution node was still in the "au" state. Looking at the SGE processes showed that sge_execd had been started by the user pp, which is clearly wrong; after killing it, SGE was restarted as root:
[root@g03 ~]# q
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@C01 BIP 0/0/64 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@G02 BIP 0/0/24 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@G03 BIP 0/0/64 -NA- lx-amd64 au
[root@g03 ~]# ps -ef |grep sge
pp 35025 1 0 09:46 ? 00:00:09 /data/software/gridengine/bin/lx-amd64/sge_execd
root 41269 41142 0 13:16 pts/0 00:00:00 grep --color=auto sge
[root@g03 ~]# kill -9 35025
[root@g03 ~]# /etc/init.d/sgeexecd.p6444 start
Starting Grid Engine execution daemon
[root@g03 ~]# q
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@C01 BIP 0/0/64 0.75 lx-amd64
---------------------------------------------------------------------------------
all.q@G02 BIP 0/0/24 0.01 lx-amd64
---------------------------------------------------------------------------------
all.q@G03 BIP 0/0/64 0.04 lx-amd64
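When several nodes show "au" at the same time, a small loop from the master saves some typing; a sketch assuming passwordless root ssh to the compute nodes and the init-script name used above:
# for h in G02 G03; do ssh $h '/etc/init.d/sgeexecd.p6444 restart'; done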
IV. Queue creation and management
- Common queue-management commands:
Further reading: http://www.softpanorama.org/HPC/Grid_engine/sge_queues.shtml
qconf takes the following options:
Show:
-sql show queue list Show a list of all currently defined cluster queues.
-sq queue_list show queues Displays one or multiple cluster queues or queue instances.
Modify:
-mq queuename modify queue configuration Retrieves the current configuration for the specified queue, executes an editor, and registers the new configuration with the sge_qmaster.
Delete:
# qconf -dq queue_name
Add:
-Aq fname add new queue. Adds the queue defined in fname to the cluster; the name of the new queue is specified inside the file, not by fname itself.
-aq queue_name add new queue. In this case qconf retrieves the default queue configuration (see queue_conf man page) and invokes an editor for customizing the queue configuration. Upon exit from the editor, the queue is registered with sge_qmaster. A minimal configuration requires only that the queue name and queue hostlist be set.
Related configuration files:
$SGE_ROOT/$SGE_CELL/common/act_qmaster Grid Engine master host file
$SGE_ROOT/$SGE_CELL/spool/qmaster/cqueues/ Queues Configuration Directory
- Configuring a GPU queue
By default, SGE puts all nodes (and the CPU cores, i.e. slots, on them) into a single queue, all.q. SGE therefore knows nothing about the GPUs in each node and has no way to assign jobs to them. The plan is to:
1.Make SGE aware of available GPUs;
2.set every GPU in every node in compute exclusive mode;
3.split all.q into two queues: cpu.q and gpu.q;
4.make sure a job running on cpu.q does not access GPUs;
5.make sure a job running on gpu.q uses only one CPU core and one GPU
1. Make SGE aware of the GPUs
# cd /data/backup
# qconf -sc > qconf_sc.txt
# cp qconf_sc.txt qconf_sc_gpu.txt
# open qconf_sc_gpu.txt and add this line
Attempt 1 (did not work):
gpu gpu BOOL == FORCED NO 0 0
Jobs then failed with:
Job 64 does not request 'forced' resource "gpu" of host G02
Job 64 does not request 'forced' resource "gpu" of host G03
verification: no suitable queues
Exiting.
So the line was changed to:
gpu gpu BOOL == YES NO 0 0
Attempt 2 (the YES variant above; this works). Load the modified complex file containing the new gpu line:
qconf -Mc qconf_sc_gpu.txt
The command reports:
root@G03 added "gpu" to complex entry list
Check that the gpu complex now exists:
qconf -sc | grep gpu
All three servers now show the gpu complex.
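Once a queue or host advertises this complex (see the gpu.q configuration below), a job can request it at submission time; a sketch (the script name is a placeholder):
# qsub -q gpu.q -l gpu=true run_on_gpu.sh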
2. Set every GPU to compute-exclusive mode (I am not sure how this step was actually done)
rocks run host compute 'nvidia-smi -c 1'
The manual page for nvidia-smi indicates that this setting does not persist across reboots.
3. Disable all.q
qconf -sq all.q > all.q.txt     # dump the all.q configuration to reuse as a template
qmod -f -d all.q                # force-disable all.q
Configuring the cpu queue
cp all.q.txt cpu.q.txt
Edit cpu.q.txt so that it contains:
qname cpu.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make smp mpi
rerun FALSE
slots 1,[G02=32],[G03=72]
tmpdir /tmp
shell /bin/sh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
# Notes on the configuration:
Maximum number of slots each node provides:
slots 20
If per-host values are given, the leading number is the default slot count (1) and the bracketed entries set the maximum slots for each node:
slots 1,[b08=16],[b09=32]
qconf -mhgrp @allhosts   # shows and edits the nodes contained in @allhosts; if the queue's hostlist is changed to
hostlist G02 G03
then the queue contains only the G02 and G03 nodes.
The entries that usually need changing are:
hostlist lus
processors 32
slots 32
shell /bin/bash
pe_list ms
qconf -Aq cpu.q.txt
Configuring the gpu queue
cp all.q.txt gpu.q.txt
Edit gpu.q.txt so that it contains:
qname gpu.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make smp mpi
rerun FALSE
slots 1,[G02=8],[G03=8]
tmpdir /tmp
shell /bin/sh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values gpu=True
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
Add the queue:
qconf -Aq gpu.q.txt
Then modify the execution hosts G02 and G03 separately:
qconf -me G02
change
complex_values NONE
to
complex_values gpu=1
To modify a queue's configuration:
qconf -mq queue_name   (for example: cpu.q)
To request a parallel environment at submission time, for example:
qsub -pe gMPI 64
- Adding extra complexes (resource counters) as controllers
hard resource_list (the default resource counters attached to a job)
If a requested counter does not exist as a complex, submission fails with an error such as:
Unable to run job: unknown resource "FEP_GPGPU"
# Add the new complex names
cd /data/user/sam/sge/config
qconf -sc > qconf_sc.txt
cp qconf_sc.txt qconf_sc_new.txt
Edit qconf_sc_new.txt and add these two lines:
multisim ms INT <= YES YES 0 1000
gpus g INT <= YES YES 0 1000
Then load the modified file to replace the defaults:
# qconf -Mc qconf_sc_new.txt
root@C01 added "multisim" to complex entry list
root@C01 added "gpus" to complex entry list
If instead you see messages like:
complex with name CANVAS_ELEMENTS or shortcut CANVAS_ELEMENTS already exists
complex with name CANVAS_FULL or shortcut CANVAS_FULL already exists
complex with name CANVAS_SHARED or shortcut CANVAS_SHARED already exists
the new entries were probably not added because of these duplicates; remove the duplicated parts and try again.
Check whether the addition succeeded:
qconf -sc | grep multisim
3.3.2. Modify the queues so they know these resources exist
[root@C01 config]# qconf -sc |grep gpus
gpus g INT <= YES YES 0 1000
Modify gpu.q:
qconf -mq gpu.q
change
complex_values gpu=True
to
complex_values gpu=True,gpus=16
Modify cpu.q:
qconf -mq cpu.q
change
complex_values NONE
to
complex_values multisim=8
3.3.3. Modify the execution host configuration
qconf -me C01
change
complex_values NONE
to
complex_values gpus=0,multisim=0
qconf -me G02
change
complex_values gpu=TRUE
to
complex_values gpu=TRUE,gpus=8,multisim=8
qconf -me G03
change
complex_values gpu=TRUE
to
complex_values gpu=TRUE,gpus=8,multisim=8
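With the consumables in place, jobs request them at submission time and SGE decrements the per-host counters for as long as each job runs; a sketch (the script names are placeholders):
# qsub -q gpu.q -l gpus=1 fep_job.sh      # consumes one of the 8 gpus counters on the chosen host
# qsub -q cpu.q -l multisim=1 md_job.sh   # consumes one multisim token
# qhost -F gpus,multisim                  # show the remaining consumable values per host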
V. Common commands
Various configuration tasks
Official documentation: http://arc.liv.ac.uk/hpc_background/SGE.html
1) Configure execution hosts from the command line
qconf -ae hostname   add an execution host (precondition: the execution daemon must already be installed on that host; if the master is also to act as an execution host, install_execd must be run on it as well)
qconf -de hostname   delete an execution host
qconf -sel           show the list of execution hosts
2) Configure administrative hosts from the command line
qconf -ah hostname   add an administrative host
qconf -dh hostname   delete an administrative host
qconf -sh            show the list of administrative hosts
3) Configure submit hosts from the command line
qconf -as hostname   add a submit host
qconf -ds hostname   delete a submit host
qconf -ss            show the list of submit hosts
4) Configure queues from the command line
qconf -aq queuename  add a cluster queue
qconf -dq queuename  delete a cluster queue
qconf -mq queuename  modify a cluster queue's configuration
qconf -sq queuename  show a cluster queue's configuration
qconf -sql           show the list of cluster queues
5) Configure host groups from the command line
qconf -ahgrp groupname  add a host group
qconf -mhgrp groupname  modify the members of a host group
qconf -shgrp groupname  show the members of a host group
6) Configure parallel environments
qconf -ap PE_name   add a parallel environment
qconf -mp PE_name   modify a parallel environment
qconf -dp PE_name   delete a parallel environment
qconf -sp PE_name   show a parallel environment
qconf -spl          show the list of parallel environment names
We usually meet our needs by modifying the queue configuration and the host-group configuration. To inspect a queue: qconf -sq all.q
Submit a job to the specific queue main.q
Method 1: qsub -cwd -l vf=*G -q main.q *.sh
Method 2: qsub -cwd -S /bin/bash -l vf=*G -q main.q *.sh
-cwd       submit from the current directory; SGE's log files are written there.
-l vf=*G   the job's estimated memory; the estimate should be slightly larger than the real usage, since underestimating it may bring a node down.
-q         the queue to submit to; if omitted, SGE picks a suitable queue among those the user is allowed to use.
For the GPU queue, specify gpu.q instead.
Note: both methods submit the job to the specified queue, but method 1 may print the warning "Warning: no access to tty (Bad file descriptor). Thus no job control in this shell." This is because SGE uses tcsh by default while *.sh scripts use bash, so the interpreter should be stated explicitly at submission time. If you really want to use method 1, add #$ -S /bin/bash at the top of the script *.sh.
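A minimal job script with the interpreter set inside the script (the file name myjob.sh is arbitrary), which can then be submitted with method 1 as qsub -cwd -q main.q myjob.sh:
#!/bin/bash
#$ -S /bin/bash   # force bash as the interpreter for this job
#$ -cwd           # run in, and write the logs to, the submission directory
echo "running on $(hostname)"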
Submitting jobs
a single core job:
qsub -q comp
a 16-core shared-memory job:
qsub -q comp -l cores=16
a 64-process parallel job:
qsub -pe MPI 64
a large shared memory job:
qsub -q himem -l cores=16
a single gpu job:
qsub -q gpu
a 4-gpu shared-memory job:
qsub -q gpu -l cores=4
a 64-gpu parallel job:
qsub -q gpu -pe MPI 64
4.3. Submitting a job to a specific node
qsub -cwd -l vf=*G -l h=node1 *.sh
qsub -cwd -l vf=*G -l h=node1 -P project -q main.q *.sh
-P specifies the project the job belongs to
qsub -cwd -e /dev/null myscript.sh
Querying jobs
qstat -f          show all jobs (full queue listing)
qstat -j jobId    query by job id
qstat -u user     query by user
qstat -f -u '*'   show all users' jobs
qstat -a          show all users' jobs
Host states:
1) 'au' – host is in alarm and unreachable
2) 'u'  – host is unreachable; usually SGE or the machine itself is down. Check this.
3) 'a'  – host is in alarm; this is normal when the node is heavily loaded, i.e. using most of its resources.
4) 'as' – host is in alarm and suspended; when the node is using most of its resources, SGE suspends it so that it takes no further jobs until resources become available.
5) 'd'  – host is disabled.
6) 'E'  – ERROR; this requires 'qmod -c' to clear the error state.
If you think a node has a problem, you can disable the queue instance on that node:
- it will NOT affect any jobs already running there
- it WILL block any new work from landing there
- the disabled state "d" persists until it is cleared
Command:
qmod -d <queue name>
To re-enable:
qmod -e <queue name>
Job states:
qw    waiting in the queue
Eqw   an error occurred during submission
r     running
dr    appears when a job is deleted after its node has gone down; the job only disappears once the node is restarted
'w' – job waiting
's' – job suspended
't' – job transferring and about to start
'r' – job running
'h' – job hold
'R' – job restarted
'd' – job has been marked to deletion
Clear the error state of a problematic job:
qmod -c job_id
Delete a job:
qdel 1111
deletes the job with id 1111.
Other commands
qrsh    interactive job submission (compared with qsub); note the option:
-now yes|no   default is yes.
With yes, the job is scheduled immediately; if no resources are available it is rejected, the submission fails, and the job state becomes Eqw.
With no, the job is queued when no resources are available and waits to be scheduled.
Example: qrsh -l vf=*G -q all.q -now no -w n *sh
qacct   extract accounting information from the cluster logs
qalter  change the attributes of a submitted job that is still pending
qconf   the user interface for cluster and queue configuration
qhold   hold back a submitted job from execution
qhost   show status information about the SGE execution hosts (the compute nodes)
qlogin  start a telnet-like login session
To see why a job fails to be scheduled:
qalter -w v jobid
To list the SGE execution hosts:
qhost
To check GPU usage:
qhost -F gpu
VI. Testing
A first test
# vi uname.sge
#!/bin/bash
uname -a
# qsub uname.sge
Your job 3557 ("uname.sge") has been submitted
If the job runs successfully, two files appear in the submitting user's home directory (I am using the sam account, so /home/sam): the output ends up in uname.sge.o<jobid> and any errors in uname.sge.e<jobid>.
Because -cwd was not given, the output and error files are created in the user's home directory by default.
However, nothing happened no matter how long I waited:
[sam@C01 test]$ qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@G02 BIP 0/0/32 -NA- -NA- au
---------------------------------------------------------------------------------
all.q@G03 BIP 0/0/72 -NA- -NA- au
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
2 0.55500 uname.sge sam qw 03/16/2018 10:15:31 1
3 0.55500 uname.sge sam qw 03/16/2018 10:17:43 1
4 0.55500 uname.sge sam qw 03/16/2018 10:18:23 1
5 0.55500 uname.sge sam qw 03/16/2018 10:22:07 1
6 0.55500 uname.sge sam qw 03/16/2018 10:24:46 1
SGE sends an email describing what went wrong; in my case the problem was that /home on the execution node had no directory for the user.
VII. Discussion
SGE and NFS user management
1. SGE user management: SGE identifies the same user across nodes by user name. If a job submitted by user1 on node A is to run on node B, node B must also have a user named user1; when the job executes, node B automatically runs it as its local user1.
2. NFS user management: NFS identifies the same user across nodes by user id. If a user with id 1000 on node A puts a file into an NFS-shared directory, then on every other host sharing that directory the file is owned by whichever account has id 1000 there. This exposes a conflict: SGE matches users across nodes by name, while NFS matches them by id. Our system needs several nodes to work together, so the following can happen: user1 on node A finishes part of a job and creates a directory on NFS, and user1 on node B then needs to write files into that directory; if user1 has different ids on A and B, NFS treats them as different users, so node B's user1 has no write permission on the directory created by node A's user1 (on node B the directory belongs to whichever account shares the id of node A's user1). To avoid this, all machines must give identically named users identical ids.
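A minimal sketch of keeping the ids aligned by hand (the name user1 and the ids are examples; with NIS or LDAP this is handled centrally, which is why NIS is listed as a prerequisite):
# groupadd -g 1001 user1           # same GID on every node
# useradd -u 1001 -g 1001 user1    # same UID on every node
# id user1                         # verify: uid=1001, gid=1001 everywhere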
- Adding another execution node to the cluster
Host name: node4
IP: 172.16.192.200
Add the following line to /etc/hosts on all the other machines:
172.16.192.200 node4.bnu.edu.cn node4
On the master server: qconf -ah node4
Then repeat the G02 steps, taking care to keep the user ids consistent.
- unable to send message to qmaster using port 6444 on host
[root@g02 test]# qsub -cwd uname.sge
error: commlib error: got select error (Connection refused)
Unable to run job: unable to send message to qmaster using port 6444 on host "G02": got send error
Exiting.
The problem can be:
- the qmaster is not running
- the qmaster host is down
- an active firewall blocks your request
I reinstalled G02 and G03. At first everything worked, but then the problem came back. It turned out that qhost was still being resolved to the old installation's binaries, which `which` makes obvious:
# which qhost
# /opt/sge/bin/lx-amd64/qhost
# vim /etc/profile
# export SGE_ROOT=/opt/sge
# export PATH="${SGE_ROOT}/bin/lx-amd64:$PATH"
# source /etc/profile
Remember to also update the queue's hostlist.
- scheduling info reports an error
scheduling info: (-l FEP_GPGPU=16,gpus=1) cannot run in queue "G02" because it offers only hc:gpus=0.000000
scheduling info explains, based on the available resources, why the job is not running right now. In this case the gpus resource had been used up.
If you believe resources should still be available, adjust the value of this resource on the execution host and in the queue:
# qconf -me G02
# qconf -mq gpu.q
- cannot run in PE "smp" because it only offers 0 slots
When submitting with the smp PE, this error appeared:
cannot run in queue "all.q" because it is not contained in its hard queue list (-q)
cannot run in queue "cpu.q" because it is not contained in its hard queue list (-q)
cannot run in PE "smp" because it only offers 16 slots
This was puzzling for a long time: the job was submitted to gpu.q, so why do messages about all.q and cpu.q appear? In fact,
# qalter -w v jobid
makes the problem clear: the GPU queue did not have enough slots for the request, so SGE checked whether any other queue could be used, found that no other queue had been specified, and listed all of these messages. So why were there not enough slots?
1. Make sure the PEs are installed
# qconf -spl
Check whether smp is listed.
2. Modify the smp configuration (both the G02 and G03 servers need to be configured)
# qconf -mp smp   # modify an existing PE
# qconf -ap smp   # add a new PE
Depending on whether the PE already exists, choose modify or add.
change
slots 0
to
slots 999
Check that the change took effect:
# qconf -sp smp
Also change
allocation_rule $pe_slots
to
allocation_rule $round_robin
$pe_slots requires all slots allocated to a job to come from the same node; $round_robin or $fill_up allow slots from different nodes.
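For reference, after these edits the PE definition might look roughly like the following (a sketch of qconf -sp smp output; apart from slots and allocation_rule these are the stock defaults):
pe_name            smp
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE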
3. Add smp to the queue
# qconf -mq gpu.q, change pe_list to:
pe_list smp make mpich mpi openmpi orte lam_loose_rsh pvm matlab
4. Fix the number of slots requested in the submission
# qargs: -q gpu.q -l gpus=1 -pe smp 16
This request asks for 16 slots, while my G02 and G03 only provide 8 slots each, so of course there are not enough slots and the error appears.
6.3 The debugging process:
Any error can be looked at on two levels: the program level and the job level. The following are the tricks needed in practice to locate bugs quickly.
For the program level:
# qstat -f
1. Log files:
SGE messages and logs are usually very helpful:
$SGE_ROOT/default/spool/qmaster/messages
$SGE_ROOT/default/spool/qmaster/schedd/messages
The execd spool logs often hold job-specific error data; remember that local spooling may be used (!):
$SGE_ROOT/default/spool/<node>/messages
SGE panic location: SGE will log to /tmp on any node when $SGE_ROOT is not found or not writable.
2. Show problems quickly: qsub -w v
Combined with the other parameters it points out the problem immediately, for example:
qsub -w v -cwd -e error12 -q cpu.q simple.sh
To get an email explaining why a job failed: qsub -m a -M user@host [rest of command]
3. Check a specific job: qstat -j job_id
The error information in the output tells you what went wrong.
6.4. Bash scripts and Linux environment variables
#!/bin/bash
# SGE Options
#$ -S /bin/bash
#$ -N MyJob
# Create Working Directory
WDIR=/state/partition1/$USER/$JOB_NAME-$JOB_ID
mkdir -p $WDIR
if [ ! -d $WDIR ]
then
  echo $WDIR not created
  exit
fi
cd $WDIR
# Copy Data and Config Files
cp $HOME/Data/FrogProject/FrogFile .
# Put your Science related commands here
/share/apps/runsforever FrogFile
# Copy Results Back to Home Directory
RDIR=$HOME/FrogProject/Results/$JOB_NAME-$JOB_ID
mkdir -p $RDIR
cp NobelPrizeWinningResults $RDIR
# Cleanup
rm -rf $WDIR
To make sure the script can find its environment variables, it is best to start every submitted bash script with the following two lines (for the reason explained in the tcsh/bash note above):
#!/bin/bash
#$ -S /bin/bash
VIII. Submitted jobs do not run
# qconf -sconf
#global:
execd_spool_dir /data/gridengine/default/spool
mailer /bin/mail
xterm /usr/bin/xterm
load_sensor none
prolog none
epilog none
shell_start_mode posix_compliant
login_shells sh,bash,ksh,csh,tcsh
min_uid 0
min_gid 0
user_lists none
xuser_lists none
projects none
xprojects none
enforce_project none
enforce_user auto
load_report_time 00:00:40
max_unheard 00:05:00
reschedule_unknown 00:00:00
loglevel log_warning
administrator_mail none
set_token_cmd none
pag_cmd none
token_extend_time none
shepherd_cmd none
qmaster_params none
execd_params none
reporting_params accounting=true reporting=false \
flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs 100
gid_range 20000-20100
qlogin_command builtin
qlogin_daemon builtin
rlogin_command builtin
rlogin_daemon builtin
rsh_command builtin
rsh_daemon builtin
max_aj_instances 2000
max_aj_tasks 7500
max_u_jobs 100
max_jobs 2000
max_advance_reservations 1000
auto_user_oticket 1000
auto_user_fshare 1000
auto_user_default_project none
auto_user_delete_time 86400
delegated_file_staging true
reprioritize 0
jsv_url none
jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w
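When submitted jobs simply sit in qw, this global configuration is worth checking alongside qstat -j; for instance, max_u_jobs 100 above appears to cap each user at 100 jobs in the system at once. The global configuration is edited with (a sketch):
# qconf -mconf global   # opens the global configuration in an editor; raise max_u_jobs etc. if needed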
# qconf -mc
mem_free mf MEMORY <= YES NO 0 0
num_proc p INT <= YES NO 0 0