[toc]
SGE - High-Performance Cluster Management
SGE manages multi-node workloads: jobs are submitted from a single control node, and users do not need to care which node each job lands on, which makes it easy to draw on the cluster's resources. For example, with 5 machines of 8 cores each there are 40 cores in total; if I submit 1000 jobs from one of the machines, the system automatically distributes those 1000 jobs across the 40 cores.
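For a batch of many independent jobs like this, the usual pattern is an array job; a minimal sketch (the script name run_one.sh and the task count are placeholders; inside the script, $SGE_TASK_ID tells each task which index it is):
# qsub -cwd -t 1-1000 run_one.sh
# qstat   # the scheduler spreads the pending tasks over whatever slots are free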
Further reading:
- N1 Grid Engine 6 User's Guide
- N1 Grid Engine 6 Installation Guide
- N1 Grid Engine 6 Administration Guide
I. Prerequisites
- NFS is already set up and /data is shared across the nodes
- NIS is already set up (so user accounts are consistent across nodes)
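For reference, a minimal sketch of the NFS layout these notes assume (addresses follow the /etc/hosts examples below; adjust paths and networks to your own environment):
# On the NFS server (here the master, 192.168.100.1), /etc/exports:
/data 192.168.100.0/24(rw,sync,no_root_squash)
# On every node, /etc/fstab:
192.168.100.1:/data  /data  nfs  defaults  0 0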
II. Installation
- Operations on the master server
vi /etc/hosts
After opening /etc/hosts, it already contains the following lines; it is not clear whether they affect the later steps:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
Append the following after those two lines:
192.168.100.1 C01 C01.xx
192.168.100.2 G02 G02.xx
192.168.100.3 G03 G03.xx
- Change the hostname (it must match the name used in /etc/hosts, e.g. C01 on the master and G02/G03 on the compute nodes):
# Method 1:
# Edit the configuration file /etc/sysconfig/network:
NETWORKING=yes
HOSTNAME=C01
# Also write the name into /proc/sys/kernel/hostname (takes effect immediately but does not survive a reboot):
C01
# or:
# echo "C01" > /proc/sys/kernel/hostname
# Method 2:
# hostnamectl set-hostname C01
- Install the build dependencies
# yum -y install epel-release
# yum -y install jemalloc-devel openssl-devel ncurses-devel pam-devel libXmu-devel hwloc-devel hwloc hwloc-libs java-devel javacc ant-junit libdb-devel motif-devel csh ksh xterm db4-utils perl-XML-Simple perl-Env xorg-x11-fonts-ISO8859-1-100dpi xorg-x11-fonts-ISO8859-1-75dpi
- User and permissions
# groupadd -g 490 sgeadmin
# useradd -u 495 -g 490 -r -m -d /home/sgeadmin -s /bin/bash -c "SGE Admin" sgeadmin
# visudo, then add the following line (to allow the same operations without a password)
%sgeadmin ALL=(ALL) NOPASSWD: ALL
- Build and install
# cd /data/software/src
# wget -c https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge-8.1.9.tar.gz
# tar zxvfp sge-8.1.9.tar.gz
# cd sge-8.1.9/source/
# sh scripts/bootstrap.sh
# ./aimk
This step failed with:
BUILD FAILED
/data/src/sge-8.1.9/source/build.xml:85: The following error occurred while executing this line:
/data/src/sge-8.1.9/source/build.xml:30: Java returned: 1
Total time: 56 seconds
not done
The documentation explains:
This tries to build all the normal targets, some of which might be
problematic for various reasons (e.g. Java). Various `aimk` switches
provide selective compilation; use `./aimk -help` for the options, not
all of which may actually work, especially in combination.
Useful aimk options:
[horizontal]
`-no-qmon`:: Don't build `qmon`;
`-no-qmake`:: don't build `qmake`;
`-no-qtcsh`:: don't build `qtcsh`;
`-no-java -no-jni`:: avoid all Java-related stuff;
`-no-remote`:: don't build `rsh` etc.
(obsoleted by use of `ssh` and the SGE PAM module).
For the core system (daemons, command line clients, but not `qmon`) use
# ./aimk -only-core
In other words, not every component has to be built. Since the failure clearly happened in the Java step, I chose to drop the Java parts:
# ./aimk -no-java -no-jni
That build succeeded. Next, build the man pages:
# ./aimk -man
Then create SGE_ROOT and install the freshly built files into it:
# export SGE_ROOT=/data/software/gridengine && mkdir $SGE_ROOT
# echo Y | ./scripts/distinst -local -allall -libs -noexit
# chown -R sgeadmin.sgeadmin /data/software/gridengine
# cd $SGE_ROOT
# ./install_qmaster
The installer then walks through a series of prompts:
1.press enter at the intro screen
2.press “y” and then specify sgeadmin as the user id (sgeadmin)
3.leave the install dir as $SGE_ROOT (here /data/software/gridengine)
4.You will now be asked about port configuration for the master, normally you would choose the default (2) which uses the /etc/services file
5.accept the sge_qmaster info
6.You will now be asked about port configuration for the execution daemon; again, normally you would choose the default (2), which uses the /etc/services file
7.accept the sge_execd info
8.leave the cell name as “default”
9.Enter an appropriate cluster name when requested (Enter new cluster name or hit to use default [p6444] >>; I just pressed Enter, and the result was: creating directory: /data/software/gridengine/default/common, Your $SGE_CLUSTER_NAME: p6444)
10.leave the spool dir as is (press Enter to accept the default)
11.press "n" for no Windows hosts! (I chose "n" here; note it is not the default)
12.press “y” (permissions are set correctly)
13.press “y” for all hosts in one DNS domain
14.If you have Java available on your Qmaster and wish to use SGE Inspect or SDM then enable the JMX MBean server and provide the requested information - probably answer "n" at this point! (I chose "n" here; otherwise this step also throws errors)
15.press enter to accept the directory creation notification
16.enter “classic” for classic spooling (berkeleydb may be more appropriate for large clusters)
17.press enter to accept the next notice
18.enter “20000-20100” as the GID range (increase this range if you have execution nodes capable of running more than 100 concurrent jobs)
19.accept the default spool dir or specify a different folder (for example if you wish to use a shared or local folder outside of SGE_ROOT
20.enter an email address that will be sent problem reports
21.press “n” to refuse to change the parameters you have just configured
An error appeared here:
Command failed: ./utilbin/lx-amd64/spooldefaults
Command failed: configuration
Command failed: /tmp/configuration_2018-03-16_09:00:40.43362
Probably a permission problem. Please check file access permissions. Check read/write permission. Check if SGE daemons are running.
Re-running the installer made this error go away.
22.press enter to accept the next notice
23.press “y” to install the startup scripts
24.press enter twice to confirm the following messages
You should see messages like:
cp /data/software/gridengine/default/common/sgemaster /etc/init.d/sgemaster.p6444
/usr/lib/lsb/install_initd /etc/init.d/sgemaster.p6444
25.press “n” for a file with a list of hosts
26.enter the names of the hosts that will be able to administer and submit jobs (press Enter alone to finish adding hosts). I entered C01, Enter, G02, Enter, G03, Enter. (A string of garbage characters was also entered here by mistake, which may have caused problems later.)
27.skip shadow hosts for now (press "n")
28.choose "1" for normal configuration and agree with "y"
29.press enter to accept the next message and "n" to refuse to see the previous screen again, then finally press enter to exit the installer. You may verify your administrative hosts with the command # qconf -sh and add new administrative hosts with # qconf -ah
After the installation finishes:
# cp /data/software/gridengine/default/common/settings.sh /etc/profile.d/
# source /etc/profile
# qconf -ah G02
adminhost "G02" already exists
# qconf -ah G03
adminhost "G03" already exists
- Installation on the slave (execution) servers
Operations on the slave server G02:
compute01# yum -y install hwloc-devel
compute01# hostnamectl set-hostname G02
compute01# vi /etc/hosts
192.168.100.1 C01 C01.shhrp
192.168.100.2 G02 G02.shhrp gpuserver.hengrui.com
192.168.100.3 G03 G03.shhrp
compute01# groupadd -g 490 sgeadmin
The group sgeadmin already existed (probably from an earlier installation), so edit /etc/group with vim and change the GID of sgeadmin from 991 to 490.
compute01# useradd -u 495 -g 490 -r -m -d /home/sgeadmin -s /bin/bash -c "SGE Admin" sgeadmin
This reports that the user already exists, so edit /etc/passwd and change
sgeadmin:x:993:991:Grid Engine admin:/:/sbin/nologin
to
sgeadmin:x:495:490:SGE Admin:/:/bin/bash
visudo, then add the following (passwordless sudo, same as on the master):
%sgeadmin ALL=(ALL) NOPASSWD: ALL
Then:
compute01# export SGE_ROOT=/data/software/gridengine
compute01# export SGE_CELL=default
compute01# cd $SGE_ROOT
compute01# ./install_execd   # accept all the defaults
compute01# cp /data/software/gridengine/default/common/settings.sh /etc/profile.d/
The installation reported an error:
Checking hostname resolving
---------------------------
Cannot contact qmaster. The command failed:
./bin/lx-amd64/qconf -sh
The error message was:
denied: host "pp" is neither submit nor admin host
You can fix the problem now or abort the installation procedure.
The problem could be:
- the qmaster is not running
- the qmaster host is down
- an active firewall blocks your request
Fix:
# qconf -ah pp   # run this on the master node
Add the SGE paths to the environment variables:
# vim /etc/profile
# add a "# SGE" section containing:
#   export SGE_ROOT=/data/software/gridengine
#   export PATH="${SGE_ROOT}/bin/lx-amd64:$PATH"
# then reload:
# source /etc/profile
- Do the same for G03
# vim /etc/group
change
sgeadmin:x:981:
to
sgeadmin:x:490:
# vim /etc/passwd
change
sgeadmin:x:986:981:Grid Engine admin:/:/sbin/nologin
to
sgeadmin:x:495:490:SGE Admin:/:/bin/bash
Then follow the same steps as on G02.
- Finally, check on the master node whether everything succeeded
qhost
HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS
----------------------------------------------------------------------------------------------
global - - - - - - - - - -
G02 lx-amd64 32 2 16 32 2.96 125.6G 10.1G 120.0G 0.0
G03 lx-amd64 72 2 36 72 0.04 94.1G 3.2G 4.0G 0.0
If the output does not look like this, the daemons need to be restarted.
If a reinstall fails, run ps -ef | grep sge
and kill everything related to SGE.
This completes the installation.
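As a quick smoke test of the fresh installation (a sketch; -b y submits a binary directly without a wrapper script, and sam stands for any ordinary user that exists on all nodes):
$ qsub -b y -cwd /bin/hostname   # submit a trivial job as a normal user
$ qstat                          # the job should move from qw to r and then disappear
$ cat hostname.o*                # the output file contains the execution node's hostname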
III. Cluster restart
# On the control node:
# /etc/init.d/sgemaster.p6444 restart
# On the execution nodes:
# /etc/init.d/sgeexecd.p6444 restart
"au" means the node has a problem and needs a restart.
With the firewall confirmed to be off, one execution node was still in the "au" state. Looking at the SGE processes showed that sge_execd had been started by the user pp, which is clearly wrong; after killing it, SGE was restarted as root:
[root@g03 ~]# q
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@C01 BIP 0/0/64 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@G02 BIP 0/0/24 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@G03 BIP 0/0/64 -NA- lx-amd64 au
[root@g03 ~]# ps -ef |grep sge
pp 35025 1 0 09:46 ? 00:00:09 /data/software/gridengine/bin/lx-amd64/sge_execd
root 41269 41142 0 13:16 pts/0 00:00:00 grep --color=auto sge
[root@g03 ~]# kill -9 35025
[root@g03 ~]# /etc/init.d/sgeexecd.p6444 start
Starting Grid Engine execution daemon
[root@g03 ~]# q
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@C01 BIP 0/0/64 0.75 lx-amd64
---------------------------------------------------------------------------------
all.q@G02 BIP 0/0/24 0.01 lx-amd64
---------------------------------------------------------------------------------
all.q@G03 BIP 0/0/64 0.04 lx-amd64
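When several nodes show "au" at the same time, a small loop from the master saves some typing; a sketch assuming passwordless root ssh to the compute nodes and the init-script name used above:
# for h in G02 G03; do ssh $h '/etc/init.d/sgeexecd.p6444 restart'; done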
IV. Queue creation and management
- Common queue-management commands:
Further reading: http://www.softpanorama.org/HPC/Grid_engine/sge_queues.shtml
qconf takes the following options:
Show:
-sql show queue list Show a list of all currently defined cluster queues.
-sq queue_list show queues Displays one or multiple cluster queues or queue instances.
Modify:
-mq queuename modify queue configuration Retrieves the current configuration for the specified queue, executes an editor, and registers the new configuration with the sge_qmaster.
Delete:
# qconf -dq queue_name
Add:
-Aq fname add new queue. Adds the queue defined in fname to the cluster; the name of the new queue is specified inside the file, not by fname itself.
-aq queue_name add new queue. In this case qconf retrieves the default queue configuration (see queue_conf man page) and invokes an editor for customizing the queue configuration. Upon exit from the editor, the queue is registered with sge_qmaster. A minimal configuration requires only that the queue name and queue hostlist be set.
Related configuration files:
$SGE_ROOT/$SGE_CELL/common/act_qmaster Grid Engine master host file
$SGE_ROOT/$SGE_CELL/spool/qmaster/cqueues/ Queues Configuration Directory
- Configuring a GPU queue
By default, SGE puts all nodes (and the CPU cores, i.e. slots, on them) into a single queue, all.q. SGE therefore knows nothing about the GPUs in each node and has no way to assign jobs to them. The plan is to:
1.Make SGE aware of available GPUs;
2.set every GPU in every node in compute exclusive mode;
3.split all.q into two queues: cpu.q and gpu.q;
4.make sure a job running on cpu.q does not access GPUs;
5.make sure a job running on gpu.q uses only one CPU core and one GPU
1. Make SGE aware of the GPUs
# cd /data/backup
# qconf -sc > qconf_sc.txt
# cp qconf_sc.txt qconf_sc_gpu.txt
# open qconf_sc_gpu.txt and add this line
Attempt 1 (did not work):
gpu gpu BOOL == FORCED NO 0 0
Jobs then failed with:
Job 64 does not request 'forced' resource "gpu" of host G02
Job 64 does not request 'forced' resource "gpu" of host G03
verification: no suitable queues
Exiting.
So the line was changed to:
gpu gpu BOOL == YES NO 0 0
Attempt 2 (the YES variant above; this works). Load the modified complex file containing the new gpu line:
qconf -Mc qconf_sc_gpu.txt
The command reports:
root@G03 added "gpu" to complex entry list
Check that the gpu complex now exists:
qconf -sc | grep gpu
All three servers now show the gpu complex.
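Once a queue or host advertises this complex (see the gpu.q configuration below), a job can request it at submission time; a sketch (the script name is a placeholder):
# qsub -q gpu.q -l gpu=true run_on_gpu.sh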
2. Set every GPU to compute-exclusive mode (I am not sure how this step was actually done)
rocks run host compute 'nvidia-smi -c 1'
The manual page for nvidia-smi indicates that this setting does not persist across reboots.
3. Disable all.q
qconf -sq all.q > all.q.txt     # dump the all.q configuration to reuse as a template
qmod -f -d all.q                # force-disable all.q
Configuring the cpu queue
cp all.q.txt cpu.q.txt
Edit cpu.q.txt so that it contains:
qname cpu.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make smp mpi
rerun FALSE
slots 1,[G02=32],[G03=72]
tmpdir /tmp
shell /bin/sh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
# Notes on the configuration:
Maximum number of slots each node provides:
slots 20
If per-host values are given, the leading number is the default slot count (1) and the bracketed entries set the maximum slots for each node:
slots 1,[b08=16],[b09=32]
qconf -mhgrp @allhosts   # shows and edits the nodes contained in @allhosts; if the queue's hostlist is changed to
hostlist G02 G03
then the queue contains only the G02 and G03 nodes.
The entries that usually need changing are:
hostlist lus
processors 32
slots 32
shell /bin/bash
pe_list ms
qconf -Aq cpu.q.txt
Configuring the gpu queue
cp all.q.txt gpu.q.txt
Edit gpu.q.txt so that it contains:
qname gpu.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make smp mpi
rerun FALSE
slots 1,[G02=8],[G03=8]
tmpdir /tmp
shell /bin/sh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values gpu=True
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
Add the queue:
qconf -Aq gpu.q.txt
Then modify the execution hosts G02 and G03 separately:
qconf -me G02
change
complex_values NONE
to
complex_values gpu=1
To modify a queue's configuration:
qconf -mq queue_name   (for example: cpu.q)
To request a parallel environment at submission time, for example:
qsub -pe gMPI 64
- Adding extra complexes (resource counters) as controllers
hard resource_list (the default resource counters attached to a job)
If a requested counter does not exist as a complex, submission fails with an error such as:
Unable to run job: unknown resource "FEP_GPGPU"
# Add the new complex names
cd /data/user/sam/sge/config
qconf -sc > qconf_sc.txt
cp qconf_sc.txt qconf_sc_new.txt
Edit qconf_sc_new.txt and add these two lines:
multisim ms INT <= YES YES 0 1000
gpus g INT <= YES YES 0 1000
Then load the modified file to replace the defaults:
# qconf -Mc qconf_sc_new.txt
root@C01 added "multisim" to complex entry list
root@C01 added "gpus" to complex entry list
If instead you see messages like:
complex with name CANVAS_ELEMENTS or shortcut CANVAS_ELEMENTS already exists
complex with name CANVAS_FULL or shortcut CANVAS_FULL already exists
complex with name CANVAS_SHARED or shortcut CANVAS_SHARED already exists
the new entries were probably not added because of these duplicates; remove the duplicated parts and try again.
Check whether the addition succeeded:
qconf -sc | grep multisim
3.3.2. Modify the queues so they know these resources exist
[root@C01 config]# qconf -sc |grep gpus
gpus g INT <= YES YES 0 1000
Modify gpu.q:
qconf -mq gpu.q
change
complex_values gpu=True
to
complex_values gpu=True,gpus=16
Modify cpu.q:
qconf -mq cpu.q
change
complex_values NONE
to
complex_values multisim=8
3.3.3. Modify the execution host configuration
qconf -me C01
change
complex_values NONE
to
complex_values gpus=0,multisim=0
qconf -me G02
change
complex_values gpu=TRUE
to
complex_values gpu=TRUE,gpus=8,multisim=8
qconf -me G03
change
complex_values gpu=TRUE
to
complex_values gpu=TRUE,gpus=8,multisim=8
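With the consumables in place, jobs request them at submission time and SGE decrements the per-host counters for as long as each job runs; a sketch (the script names are placeholders):
# qsub -q gpu.q -l gpus=1 fep_job.sh      # consumes one of the 8 gpus counters on the chosen host
# qsub -q cpu.q -l multisim=1 md_job.sh   # consumes one multisim token
# qhost -F gpus,multisim                  # show the remaining consumable values per host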
V. Common commands
Various configuration tasks
Official documentation: http://arc.liv.ac.uk/hpc_background/SGE.html
1) Configure execution hosts from the command line
qconf -ae hostname   add an execution host (precondition: the execution daemon must already be installed on that host; if the master is also to act as an execution host, install_execd must be run on it as well)
qconf -de hostname   delete an execution host
qconf -sel           show the list of execution hosts
2) Configure administrative hosts from the command line
qconf -ah hostname   add an administrative host
qconf -dh hostname   delete an administrative host
qconf -sh            show the list of administrative hosts
3) Configure submit hosts from the command line
qconf -as hostname   add a submit host
qconf -ds hostname   delete a submit host
qconf -ss            show the list of submit hosts
4) Configure queues from the command line
qconf -aq queuename  add a cluster queue
qconf -dq queuename  delete a cluster queue
qconf -mq queuename  modify a cluster queue's configuration
qconf -sq queuename  show a cluster queue's configuration
qconf -sql           show the list of cluster queues
5) Configure host groups from the command line
qconf -ahgrp groupname  add a host group
qconf -mhgrp groupname  modify the members of a host group
qconf -shgrp groupname  show the members of a host group
6) Configure parallel environments
qconf -ap PE_name   add a parallel environment
qconf -mp PE_name   modify a parallel environment
qconf -dp PE_name   delete a parallel environment
qconf -sp PE_name   show a parallel environment
qconf -spl          show the list of parallel environment names
We usually meet our needs by modifying the queue configuration and the host-group configuration. To inspect a queue: qconf -sq all.q
Submit a job to the specific queue main.q
Method 1: qsub -cwd -l vf=*G -q main.q *.sh
Method 2: qsub -cwd -S /bin/bash -l vf=*G -q main.q *.sh
-cwd       submit from the current directory; SGE's log files are written there.
-l vf=*G   the job's estimated memory; the estimate should be slightly larger than the real usage, since underestimating it may bring a node down.
-q         the queue to submit to; if omitted, SGE picks a suitable queue among those the user is allowed to use.
For the GPU queue, specify gpu.q instead.
Note: both methods submit the job to the specified queue, but method 1 may print the warning "Warning: no access to tty (Bad file descriptor). Thus no job control in this shell." This is because SGE uses tcsh by default while *.sh scripts use bash, so the interpreter should be stated explicitly at submission time. If you really want to use method 1, add #$ -S /bin/bash at the top of the script *.sh.
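A minimal job script with the interpreter set inside the script (the file name myjob.sh is arbitrary), which can then be submitted with method 1 as qsub -cwd -q main.q myjob.sh:
#!/bin/bash
#$ -S /bin/bash   # force bash as the interpreter for this job
#$ -cwd           # run in, and write the logs to, the submission directory
echo "running on $(hostname)"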
Submitting jobs
a single core job:
qsub -q comp
a 16-core shared-memory job:
qsub -q comp -l cores=16
a 64-process parallel job:
qsub -pe MPI 64
a large shared memory job:
qsub -q himem -l cores=16
a single gpu job:
qsub -q gpu
a 4-gpu shared-memory job:
qsub -q gpu -l cores=4
a 64-gpu parallel job:
qsub -q gpu -pe MPI 64
4.3. Submitting a job to a specific node
qsub -cwd -l vf=*G -l h=node1 *.sh
qsub -cwd -l vf=*G -l h=node1 -P project -q main.q *.sh
-P specifies the project the job belongs to
qsub -cwd -e /dev/null myscript.sh
Querying jobs
qstat -f          show all jobs (full queue listing)
qstat -j jobId    query by job id
qstat -u user     query by user
qstat -f -u '*'   show all users' jobs
qstat -a          show all users' jobs
Host states:
1) 'au' – host is in alarm and unreachable
2) 'u'  – host is unreachable; usually SGE or the machine itself is down. Check this.
3) 'a'  – host is in alarm; this is normal when the node is heavily loaded, i.e. using most of its resources.
4) 'as' – host is in alarm and suspended; when the node is using most of its resources, SGE suspends it so that it takes no further jobs until resources become available.
5) 'd'  – host is disabled.
6) 'E'  – ERROR; this requires 'qmod -c' to clear the error state.
If you think a node has a problem, you can disable the queue instance on that node:
- it will NOT affect any jobs already running there
- it WILL block any new work from landing there
- the disabled state "d" persists until it is cleared
Command:
qmod -d <queue name>
To re-enable:
qmod -e <queue name>
Job states:
qw    waiting in the queue
Eqw   an error occurred during submission
r     running
dr    appears when a job is deleted after its node has gone down; the job only disappears once the node is restarted
'w' – job waiting
's' – job suspended
't' – job transferring and about to start
'r' – job running
'h' – job hold
'R' – job restarted
'd' – job has been marked to deletion
Clear the error state of a problematic job:
qmod -c job_id
Delete a job:
qdel 1111
deletes the job with id 1111.
Other commands
qrsh    interactive job submission (compared with qsub); note the option:
-now yes|no   default is yes.
With yes, the job is scheduled immediately; if no resources are available it is rejected, the submission fails, and the job state becomes Eqw.
With no, the job is queued when no resources are available and waits to be scheduled.
Example: qrsh -l vf=*G -q all.q -now no -w n *sh
qacct   extract accounting information from the cluster logs
qalter  change the attributes of a submitted job that is still pending
qconf   the user interface for cluster and queue configuration
qhold   hold back a submitted job from execution
qhost   show status information about the SGE execution hosts (the compute nodes)
qlogin  start a telnet-like login session
To see why a job fails to be scheduled:
qalter -w v jobid
To list the SGE execution hosts:
qhost
To check GPU usage:
qhost -F gpu
VI. Testing
A first test
# vi uname.sge
#!/bin/bash
uname -a
# qsub uname.sge
Your job 3557 ("uname.sge") has been submitted
If the job runs successfully, two files appear in the submitting user's home directory (I am using the sam account, so /home/sam): the output ends up in uname.sge.o<jobid> and any errors in uname.sge.e<jobid>.
Because -cwd was not given, the output and error files are created in the user's home directory by default.
However, nothing happened no matter how long I waited:
[sam@C01 test]$ qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@G02 BIP 0/0/32 -NA- -NA- au
---------------------------------------------------------------------------------
all.q@G03 BIP 0/0/72 -NA- -NA- au
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
2 0.55500 uname.sge sam qw 03/16/2018 10:15:31 1
3 0.55500 uname.sge sam qw 03/16/2018 10:17:43 1
4 0.55500 uname.sge sam qw 03/16/2018 10:18:23 1
5 0.55500 uname.sge sam qw 03/16/2018 10:22:07 1
6 0.55500 uname.sge sam qw 03/16/2018 10:24:46 1
SGE sends an email describing what went wrong; in my case the problem was that /home on the execution node had no directory for the user.
VII. Discussion
SGE and NFS user management
1. SGE user management: SGE identifies the same user across nodes by user name. If a job submitted by user1 on node A is to run on node B, node B must also have a user named user1; when the job executes, node B automatically runs it as its local user1.
2. NFS user management: NFS identifies the same user across nodes by user id. If a user with id 1000 on node A puts a file into an NFS-shared directory, then on every other host sharing that directory the file is owned by whichever account has id 1000 there. This exposes a conflict: SGE matches users across nodes by name, while NFS matches them by id. Our system needs several nodes to work together, so the following can happen: user1 on node A finishes part of a job and creates a directory on NFS, and user1 on node B then needs to write files into that directory; if user1 has different ids on A and B, NFS treats them as different users, so node B's user1 has no write permission on the directory created by node A's user1 (on node B the directory belongs to whichever account shares the id of node A's user1). To avoid this, all machines must give identically named users identical ids.
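A minimal sketch of keeping the ids aligned by hand (the name user1 and the ids are examples; with NIS or LDAP this is handled centrally, which is why NIS is listed as a prerequisite):
# groupadd -g 1001 user1           # same GID on every node
# useradd -u 1001 -g 1001 user1    # same UID on every node
# id user1                         # verify: uid=1001, gid=1001 everywhere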
- Adding another execution node to the cluster
Host name: node4
IP: 172.16.192.200
Add the following line to /etc/hosts on all the other machines:
172.16.192.200 node4.bnu.edu.cn node4
On the master server: qconf -ah node4
Then repeat the G02 steps, taking care to keep the user ids consistent.
- unable to send message to qmaster using port 6444 on host
[root@g02 test]# qsub -cwd uname.sge
error: commlib error: got select error (Connection refused)
Unable to run job: unable to send message to qmaster using port 6444 on host "G02": got send error
Exiting.
The problem can be:
- the qmaster is not running
- the qmaster host is down
- an active firewall blocks your request
I reinstalled G02 and G03. At first everything worked, but then the problem came back. It turned out that qhost was still being resolved to the old installation's binaries, which `which` makes obvious:
# which qhost
# /opt/sge/bin/lx-amd64/qhost
# vim /etc/profile
# export SGE_ROOT=/opt/sge
# export PATH="${SGE_ROOT}/bin/lx-amd64:$PATH"
# source /etc/profile
Remember to also update the queue's hostlist.
- scheduling info reports an error
scheduling info: (-l FEP_GPGPU=16,gpus=1) cannot run in queue "G02" because it offers only hc:gpus=0.000000
scheduling info explains, based on the available resources, why the job is not running right now. In this case the gpus resource had been used up.
If you believe resources should still be available, adjust the value of this resource on the execution host and in the queue:
# qconf -me G02
# qconf -mq gpu.q
- cannot run in PE "smp" because it only offers 0 slots
When submitting with the smp PE, this error appeared:
cannot run in queue "all.q" because it is not contained in its hard queue list (-q)
cannot run in queue "cpu.q" because it is not contained in its hard queue list (-q)
cannot run in PE "smp" because it only offers 16 slots
This was puzzling for a long time: the job was submitted to gpu.q, so why do messages about all.q and cpu.q appear? In fact,
# qalter -w v jobid
makes the problem clear: the GPU queue did not have enough slots for the request, so SGE checked whether any other queue could be used, found that no other queue had been specified, and listed all of these messages. So why were there not enough slots?
1. Make sure the PEs are installed
# qconf -spl
Check whether smp is listed.
2. Modify the smp configuration (both the G02 and G03 servers need to be configured)
# qconf -mp smp   # modify an existing PE
# qconf -ap smp   # add a new PE
Depending on whether the PE already exists, choose modify or add.
change
slots 0
to
slots 999
Check that the change took effect:
# qconf -sp smp
Also change
allocation_rule $pe_slots
to
allocation_rule $round_robin
$pe_slots requires all slots allocated to a job to come from the same node; $round_robin or $fill_up allow slots from different nodes.
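For reference, after these edits the PE definition might look roughly like the following (a sketch of qconf -sp smp output; apart from slots and allocation_rule these are the stock defaults):
pe_name            smp
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE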
3. Add smp to the queue
# qconf -mq gpu.q, change pe_list to:
pe_list smp make mpich mpi openmpi orte lam_loose_rsh pvm matlab
4. Fix the number of slots requested in the submission
# qargs: -q gpu.q -l gpus=1 -pe smp 16
This request asks for 16 slots, while my G02 and G03 only provide 8 slots each, so of course there are not enough slots and the error appears.
6.3 The debugging process:
Any error can be looked at on two levels: the program level and the job level. The following are the tricks needed in practice to locate bugs quickly.
For the program level:
# qstat -f
1. Log files:
SGE messages and logs are usually very helpful:
$SGE_ROOT/default/spool/qmaster/messages
$SGE_ROOT/default/spool/qmaster/schedd/messages
The execd spool logs often hold job-specific error data; remember that local spooling may be used (!):
$SGE_ROOT/default/spool/<node>/messages
SGE panic location: SGE will log to /tmp on any node when $SGE_ROOT is not found or not writable.
2. Show problems quickly: qsub -w v
Combined with the other parameters it points out the problem immediately, for example:
qsub -w v -cwd -e error12 -q cpu.q simple.sh
To get an email explaining why a job failed: qsub -m a -M user@host [rest of command]
3. Check a specific job: qstat -j job_id
The error information in the output tells you what went wrong.
6.4. Bash scripts and Linux environment variables
#!/bin/bash
# SGE Options
#$ -S /bin/bash
#$ -N MyJob
# Create Working Directory
WDIR=/state/partition1/$USER/$JOB_NAME-$JOB_ID
mkdir -p $WDIR
if [ ! -d $WDIR ]
then
  echo $WDIR not created
  exit
fi
cd $WDIR
# Copy Data and Config Files
cp $HOME/Data/FrogProject/FrogFile .
# Put your Science related commands here
/share/apps/runsforever FrogFile
# Copy Results Back to Home Directory
RDIR=$HOME/FrogProject/Results/$JOB_NAME-$JOB_ID
mkdir -p $RDIR
cp NobelPrizeWinningResults $RDIR
# Cleanup
rm -rf $WDIR
To make sure the script can find its environment variables, it is best to start every submitted bash script with the following two lines (for the reason explained in the tcsh/bash note above):
#!/bin/bash
#$ -S /bin/bash
VIII. Submitted jobs do not run
# qconf -sconf
#global:
execd_spool_dir /data/gridengine/default/spool
mailer /bin/mail
xterm /usr/bin/xterm
load_sensor none
prolog none
epilog none
shell_start_mode posix_compliant
login_shells sh,bash,ksh,csh,tcsh
min_uid 0
min_gid 0
user_lists none
xuser_lists none
projects none
xprojects none
enforce_project none
enforce_user auto
load_report_time 00:00:40
max_unheard 00:05:00
reschedule_unknown 00:00:00
loglevel log_warning
administrator_mail none
set_token_cmd none
pag_cmd none
token_extend_time none
shepherd_cmd none
qmaster_params none
execd_params none
reporting_params accounting=true reporting=false \
flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs 100
gid_range 20000-20100
qlogin_command builtin
qlogin_daemon builtin
rlogin_command builtin
rlogin_daemon builtin
rsh_command builtin
rsh_daemon builtin
max_aj_instances 2000
max_aj_tasks 7500
max_u_jobs 100
max_jobs 2000
max_advance_reservations 1000
auto_user_oticket 1000
auto_user_fshare 1000
auto_user_default_project none
auto_user_delete_time 86400
delegated_file_staging true
reprioritize 0
jsv_url none
jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w
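When submitted jobs simply sit in qw, this global configuration is worth checking alongside qstat -j; for instance, max_u_jobs 100 above appears to cap each user at 100 jobs in the system at once. The global configuration is edited with (a sketch):
# qconf -mconf global   # opens the global configuration in an editor; raise max_u_jobs etc. if needed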
# qconf -mc
mem_free mf MEMORY <= YES NO 0 0
num_proc p INT <= YES NO 0 0