A compelling scenario: every week a large volume of bioinformatics data comes off the sequencers and must be published on a set overdue cycle, so that the responsible owner, the operations manager, and the bioinformatician in each business unit can be found and the data deleted, while the overdue cycle itself stays adjustable. By pairing gawk's ARGIND variable with hash lookups, we escaped the slow, CPU- and memory-hungry file merges our old Bash scripts suffered through: a merge now finishes in a few dozen seconds, the Bash analysis of 279 TB of overdue data takes about 3 minutes, and the Python step that builds the Excel workbook and pivot tables takes about 4 minutes. We also built scheduled tasks to keep the data accurate, so the automation combines web-page publication with self-service retrieval of the results by the bioinformatics staff.
[toc]
SGE+bash+awk+sed+Python+WebStackPage: a petabyte-scale concurrent disk-scan and data-report publication system
- Requirements and motivation
- Every week sequencing produces a large volume of bioinformatics data that must be published to the business units on an overdue cycle, so the responsible owner, operations manager, and bioinformatics analyst can be found for deletion; the overdue cycle must stay adjustable;
- The publication script should become an efficient disk scanner with distributed job submission and controlled recovery when a scan is interrupted, without affecting the weekly publication; it must match scanned data to people, keep the analysis and pivots accurate, honor the overdue cycle, track and match task progress, and guarantee that every published entry still exists and is computed correctly;
- When tracking deletion progress, only an email is sent; each administrator downloads the analyzed overdue data from the web page, so analysis and updates happen in real time and each team actively drives its own cleanup forward.
- Implementation and breakthroughs
Using rg together with stat, bash concurrency, and multi-job qsub submission broke through the slow-scan problem: a petabyte-scale scan takes roughly 2-8 hours in total; within about 2 hours most of the volume is covered, and a few directories with very fragmented files take up to about 8 hours, but those fragments occupy little space and barely affect the totals;
using gawk's ARGIND variable with hash lookups broke through the bash script's slow, CPU- and memory-hungry two-file merge: the merge now completes in ten-odd seconds, the bash analysis of the (279 TB) overdue data takes about 3 minutes, and the Python Excel generation plus pivoting about 4 minutes in total. Scheduled scan tasks confirm the data's accuracy, and the automation ties together web-page publication and self-service retrieval by the bioinformatics staff.
- Environment requirements
- an SGE high-performance cluster;
- distributed mounted storage;
Implementation logic diagram
Choosing a scan command — rg
- If a directory holds few files and little data, scan efficiency hardly matters;
- but once the file count passes tens of millions, even 100 million, and the total volume passes a petabyte, picking a high-performance scan command as the foundation is a must;
- after some research and testing, rg (ripgrep) can replace ls and find for fast scans and stays efficient on large directory trees; it is reportedly many times faster than find;
- The command:
rg --files /PATH/DIR | xargs -I {} stat -c "%n####%s####%U####%z" {}
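Each output record is one file, with #### as the field separator: path, size in bytes, owner, and change time. A sample line (the path is hypothetical):
/PATH/DIR/F22XX001/raw_1.fq.gz####1073741824####zhangxxx####2023-05-01 10:20:30.000000000 +0800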
First, collect the department and administrator information
Here is a sample of that information; sensitive fields are masked with xx
The xj_id_class file maps the profit-center number in each scanned entry to the administrator, department details, administrator email, and so on
02xx,TESTDPART1,leixxx,XX中心-XX中心-XX转化研究部-生信分析组,leixxx@xxx.com
19xx,TESTANI1,zhangxxx,XX中心-XX医学研究部-生信分析,zhangxxx@xxx.com
19xx,TESTCAN,leixxx,XX中心-XX作物研究部-生信分析,leixxx@xxx.com
10xx,TESTHEAL,jianxxx,XX中心-XX事业部-生信分析组,jianxxx@xxx.com
14xx,TESTHW,weixxx,海外XX-交付组,weixxx.xxx.com
Global variables of the xjscan main script
source /etc/profile
source /root/.bashrc
source /root/.bash_profile
project_dir='/test/data/xjscan/'
other_storage_xxxj1='/XXPROJ1/XJ'
other_storage_xxxj4='/XXPROJ4/XJ'
xxxj5_name='XXPROJ5'
xxxj1_name='XXPROJ1'
id_group_file='xj_id_class'
library_file='project_library.xls'
s5_dir='/XXPROJ4/XJ/S5_production'
dir_10x='/XXPROJ4/XJ/Data_production/10xgen/'
share_dir='/XXPROJ4/XJ/department_data/shared'
now_time=`date +"%Y-%m"`
[ -d $now_time ] || mkdir $now_time
other_storage_dir="$now_time/other"
[ -d $other_storage_dir ] || mkdir $other_storage_dir
other_storage_data="$other_storage_dir/data/"
[ -d $other_storage_data ] || mkdir $other_storage_data
other_storage_cache="$other_storage_dir/cache/"
[ -d $other_storage_cache ] || mkdir $other_storage_cache
other_storage_shell="$other_storage_dir/shell/"
[ -d $other_storage_shell ] || mkdir $other_storage_shell
Main-script modules: implementation and function
create_data_xx1 function module
Scans the XJ volume on the XXPROJ1 storage, generates the concurrent scan scripts, and submits them to all.q and other compute-node queues;
to keep the submitted script from exiting while its concurrent scans are still running, append & after the block, i.e. the { ... } & form;
or have the generated scan script itself check whether the scan processes still exist, and exit only once they are gone (a minimal sketch of a generated script follows).
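A minimal sketch of one generated scan script (paths hypothetical): each subdirectory gets a backgrounded rg pipeline, wait collects them, and the polling loop guards against stragglers (the grep matches itself, hence the -gt 1 threshold):
#!/bin/bash
{
rg --files /XXPROJ1/XJ/proj/sub0 | xargs -I {} stat -c "%n####%s####%U####%z" {} &> proj_sub0.FINDD2.0 &
rg --files /XXPROJ1/XJ/proj/sub1 | xargs -I {} stat -c "%n####%s####%U####%z" {} &> proj_sub1.FINDD2.1 &
}
wait
rg_ps_num=`ps aux | grep 'rg --files' | wc -l`
while [ $rg_ps_num -gt 1 ] ; do
    sleep 60
    rg_ps_num=`ps aux | grep 'rg --files' | wc -l`
done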
create_data_xx1() {
### Collect the files in the top two levels of the storage ###
> $other_storage_data/$xxxj1_name.FINDF2
for i in `find $other_storage_xxxj1 -maxdepth 2 -type f `; do
echo $i | xargs -I {} stat -c "%n####%s####%U####%z" {} >> $other_storage_data/$xxxj1_name.FINDF2
done
### Collect the second-level directories and generate the submission commands ###
for i in `find $other_storage_xxxj1 -maxdepth 2 -mindepth 2 -type d`;
do
shell_name=`echo $i |sed 's/^\///g' | sed 's/\//_/g' | awk '{print $0".rg.qsub.sh"}'`
data_name=`echo $i | sed 's/^\///g' | sed 's/\//_/g' | awk '{print $0".FINDD2"}'`
file_num=`ls -l $i | sed '1d' |wc -l`
## Decide the script's concurrency from the number of entries in the directory;
if [ $file_num -le 10 ] ;then
echo "#!/bin/bash" > $other_storage_shell$shell_name
echo "rg --files $i | xargs -I {} stat -c \"%n####%s####%U####%z\" {} &> $project_dir$other_storage_data$data_name" &>> $other_storage_shell$shell_name
fi
## With more than 10 entries, launch one background scan per subdirectory
if [ $file_num -gt 10 ] ; then
echo "#!/bin/bash " > $other_storage_shell$shell_name
echo "{" >> $other_storage_shell$shell_name
num=0
echo "> $project_dir$other_storage_data$data_name" >>$other_storage_shell$shell_name
for b in `ls -1 $i`;
do
echo "rg --files $i/$b | xargs -I {} stat -c \"%n####%s####%U####%z\" {} &>> $project_dir$other_storage_data$data_name$num & " &>> $other_storage_shell$shell_name
num=$[$num+1]
done
echo "}" >> $other_storage_shell$shell_name
echo "wait" >> $other_storage_shell$shell_name
cat <<EOF >> $other_storage_shell$shell_name
rg_ps_num=\`ps aux | grep 'rg --files' | wc -l \`
while [ \$rg_ps_num -gt 1 ] ; do
sleep 60
rg_ps_num=\`ps aux | grep 'rg --files' | wc -l \`
done
EOF
fi
done
## Submit the generated scripts to multiple nodes that mount the storage (concurrent execution begins)
for i in ` cd $project_dir$other_storage_shell && ls $xxxj1_name*"rg"*"sh"`;do
qsub -l vf=0,p=0 $project_dir$other_storage_shell$i
sleep 2.5
done
}
data_aggregation function module
- Merge the scan output files into one large file, which makes separating the overdue data and the later analysis easier;
- run get_proj_info.sh to merge the existing sequencing-run personnel records into the project directory for the later detail matching;
data_aggregation () {
time_days=`date +"%Y-%m-%d-"`
cd $project_dir && mv $other_storage_cache$xxxj1_name $other_storage_cache.$time_days$xxxj1_name
cd $project_dir && find $other_storage_data -name "$xxxj1_name*FIND*" | xargs -I {} cat {} |awk '{$NF=""}{print $0}'| awk '{$NF=""}{print $0}' &> $other_storage_cache$xxxj1_name
cd $project_dir && bash get_proj_info.sh
}
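The two chained awk '{$NF=""}' passes strip the whitespace-separated time-of-day and timezone tokens that stat's %z appends after the date, so each ####-delimited record ends in a plain date. A quick illustration with a hypothetical record:
echo '/XJ/dir/a.fq.gz####1024####zhangxxx####2023-05-01 10:20:30.000000000 +0800' | awk '{$NF=""}{print $0}' | awk '{$NF=""}{print $0}'
# output: /XJ/dir/a.fq.gz####1024####zhangxxx####2023-05-01 (plus a trailing blank)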
over_due_xx1 function module
- What it does
- Splits the merged scan file by the specified overdue window (cycle to stop_cycle);
- collapses the per-file records into per-directory records: extracts file and directory information and sums file sizes into a per-directory size;
- re-processes the data, matching each directory against the department information defined in xj_id_class, and stores the result in a temporary cache file;
- separates entries that match no department, plus special directories, into another cache for a second pass;
over_due_xx1 () {
### Separate overdue files from all data sources ###
awk -v ntime="$ntime" -v cycle="$cycle" -v stop_cycle="$stop_cycle" -F '####' '{split($4,ti,"-");filetime=mktime(ti[1]" "ti[2]" "ti[3]" ""0 0 0")}{moretime=(ntime-filetime)/86400}{if(moretime > cycle && moretime <stop_cycle){print $0}}' $other_storage_cache$xxxj1_name &> $other_storage_cache$xxxj1_name$over_due_name
### Merge file size to directory size ###
cat $other_storage_cache$xxxj1_name$over_due_name | grep -v "$s5_dir" | awk -F '####' '{OFS="####"}{dirname = gensub("/*[^/]*/*$", "", "", $1);print dirname,$2,$3,$4}' | awk -F '####' '{a[$1"####"$3"####"$4]+=$2}END{for(i in a) print i"####"a[i]}' &> $other_storage_cache.$xxxj1_name$over_due_name
cat $other_storage_cache$xxxj1_name$over_due_name | grep "$s5_dir" | awk -F '####' -v OFS='####' '{print $1,$3,$4,$2}' &>> $other_storage_cache.$xxxj1_name$over_due_name
mv $other_storage_cache.$xxxj1_name$over_due_name $other_storage_cache$xxxj1_name$over_due_name
### Add the group and ID project Numbers ###
awk -F '####' -v OFS='####' '{gsub(/ /,"",$3)}{print $0}' $other_storage_cache$xxxj1_name$over_due_name &> $other_storage_cache.$xxxj1_name$over_due_name
mv $other_storage_cache.$xxxj1_name$over_due_name $other_storage_cache$xxxj1_name$over_due_name
grep_group_id=''
for i in `cat $id_group_file`; do
group_id=`echo $i | awk -F ',' '{print $1}'`
group_name=`echo $i | awk -F ',' '{print $2}'`
awk -v group_id="$group_id" -v group_name="$group_name" -v OFS="\t" -F '####' '{split($1,wenku,"/");wenku_num=length(wenku)}{if($1 ~ "/"group_id"/"){print $4,$3,group_id,$2,$1,wenku[wenku_num],group_name}}' $other_storage_cache$xxxj1_name$over_due_name &>> $other_storage_cache$libproj
grep_group_id="$grep_group_id/$group_id/|"
done
grep_group_id_v=`echo $grep_group_id | sed s'/.$//'`
cat $other_storage_cache$xxxj1_name$over_due_name | egrep -v "$grep_group_id_v" | awk -F "####" -v OFS="\t" '{print $1,$2,$3,$4}' &>> $other_storage_cache$other_file
awk -F '\t' -v cycle="$cycle" '{sum+=$4}END{print "Expiration date "cycle,sum}' $other_storage_cache$other_file &>> $other_storage_cache$other_file
}
add_over_date function module
Computes the overdue-days value and appends it to each directory entry
add_over_date() {
awk -F '\t' -v ntime="$ntime" -v OFS='\t' '{split($2,ti,"-");filetime=mktime(ti[1]" "ti[2]" "ti[3]" ""0 0 0")}{moretime=(ntime-filetime)/86400}{print $0,moretime}' $other_storage_cache$libproj &> $other_storage_cache.$libproj
mv $other_storage_cache.$libproj $other_storage_cache$libproj
awk -F '\t' -v ntime="$ntime" -v OFS='\t' '{split($3,ti,"-");filetime=mktime(ti[1]" "ti[2]" "ti[3]" ""0 0 0")}{moretime=(ntime-filetime)/86400}{print $4,$3,$2,$1,moretime}' $other_storage_cache$other_file &> $other_storage_cache.$other_file
mv $other_storage_cache.$other_file $other_storage_cache$other_file
}
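The overdue-days arithmetic relies on gawk's mktime: the date is split on '-', rebuilt as a "YYYY MM DD 0 0 0" spec, converted to epoch seconds, and the difference from now (ntime) is divided by 86400. A standalone illustration:
ntime=`date +%s`
echo '2023-05-01' | awk -v ntime="$ntime" '{split($1,ti,"-");filetime=mktime(ti[1]" "ti[2]" "ti[3]" ""0 0 0");print (ntime-filetime)/86400}'
# prints how many days ago 2023-05-01 was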
overall_xx function module
- First run an equi-join merge on the batch number and related keys taken from the directory entries;
- Note: in
a[s]=$4"\t"$6"\t"$7"\t"
we are building a hash table: a is the array name, s the key, and everything after it the value; num is the number of parts when column 1 is split on "/", so s=wenku[num] takes the last path component; c is the key built from the other file, and only when the hash holds a value for c does the following action run. This join is the core of the script's efficiency (a minimal standalone example follows). Detail-page sizes are in bytes (B); the summary page aggregates and converts them to GB.
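A minimal, self-contained sketch of that ARGIND hash join (gawk; the file names are made up for the demo): the first file is loaded into an associative array keyed on the join column, then the second file streams against it — an O(n+m) pass instead of the O(n*m) nested loops of the old bash merge:
printf 'F001\tdeptA\nF002\tdeptB\n' > lookup.txt
printf 'F001\t100\nF003\t7\n' > data.txt
awk -F '\t' -v OFS='\t' 'ARGIND==1{a[$1]=$2}ARGIND==2{if(a[$1]){print $0,a[$1]} else {print $0,"-"}}' lookup.txt data.txt
# F001    100    deptA
# F003    7      -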
> $other_storage_cache$overall
## Generating the detail page
## Exact match and merge for entries whose owner information was found
awk -F '\t' -v OFS='\t' 'ARGIND==1{split($1,wenku,"/");num=length(wenku);s=wenku[num]""$5;t=wenku[num]"-"$2""$5;a[s]=$6"\t"$7"\t"$8"\t"$9"\t"$10;b[t]=$6"\t"$7"\t"$8"\t"$9"\t"$10}ARGIND==2{c=$6""$3;if(a[c]){print $0,a[c]} else if(b[c]){print $0,b[c]}}' $library_file $other_storage_cache$libproj &> $other_storage_cache.$libproj_location
## Output and merge the entries whose owner information was not matched
awk -F '\t' -v OFS='\t' 'ARGIND==1{split($1,wenku,"/");num=length(wenku);s=wenku[num]""$5;t=wenku[num]"-"$2""$5;a[s]=$6"\t"$7"\t"$8"\t"$9;b[t]=$6"\t"$7"\t"$8"\t"$9}ARGIND==2{c=$6""$3;if(!a[c] && ! b[c]){print $0,"-","-","-","-","-"}}' $library_file $other_storage_cache$libproj &>> $other_storage_cache.$libproj_location
mv $other_storage_cache.$libproj_location $other_storage_cache$libproj_location
## Generate and analyze the summary-page data, matching in department and administrator information
bash xjsan_nodepartment.sh $cycle $stop_cycle
cat $id_group_file | awk -F ',' -v OFS='\t' '{print $1,$2,$3,$4,$5}' &> .$id_group_file
awk -F '\t' -v OFS='\t' 'ARGIND==1{s=$1;a[s]=$4"-"$2"-"$3}ARGIND==2{c=$3;if(a[c]){print $0,a[c]} else {print $0,"-"}}' .$id_group_file $other_storage_cache$libproj_location | awk -F '\t' -v OFS='\t' '{print $14,$3,$6,$1,$9,$10,$11,$12,$8}' | awk -F'\t' -v OFS='\t' '{a[$1"\t"$2"\t"$6"\t"$7"\t"$8"\t"$5]+=$4;b[$1"\t"$2"\t"$6"\t"$7"\t"$8"\t"$5]=b[$1"\t"$2"\t"$6"\t"$7"\t"$8"\t"$5]$3",";c[$1"\t"$2"\t"$6"\t"$7"\t"$8"\t"$5]+=$9;d[$1"\t"$2"\t"$6"\t"$7"\t"$8"\t"$5]++}END{for (i in a)print i,a[i],c[i]/d[i],b[i]}' | sed 's/,$//g' | awk -F '\t' '{printf "%s\t%s\t%s\t%s\t%s\t%.2f\t%s\t%s\t%s\n",$1,$2,$3,$4,$6,$7/1024/1024/1024,$8,$5,$9}' &> $other_storage_cache.$overall
mv $other_storage_cache.$overall $other_storage_cache$overall
xjsan_nodepartment.sh — when no exact match is possible, use regex
Some special cases in the data have to be handled with regular expressions
## The project environment variables are not repeated here; they can come from the main shell script, or be redefined here.
cycle="$1"
stop_cycle="$2"
time_year_month_day=`date -d -$cycle\day +'%Y-%m-%d'`
libproj="able_lib.xls"
libproj_location="able.$time_year_month_day.lib.xls"
overall="$time_year_month_day.ProjInfo.XJpath.xls"
other_file="$time_year_month_day.other.xls"
pub_other_file="pub_$other_file"
work_other_file="work_$other_file"
dir_file="$time_year_month_day.dir.xls"
over_due_name='_overdue'
> $other_storage_cache$dir_file
dir10x_func() {
grep $dir_10x $other_storage_cache$pub_other_file | egrep "F[[:upper:]]{1,4}[[:digit:]]{6,13}-[[:alnum:]][[:alnum:]]" | sed -r 's/(.*)(F[[:upper:]]{1,4}[[:digit:]]{6,13}-[[:alnum:]][[:alnum:]])(.*)/\1\2\3\t\2/g' &> $other_storage_cache.$dir_file
awk -F '\t' -v OFS='\t' 'ARGIND==1{split($1,wenku,"/");num=length(wenku);s=wenku[num];a[s]=$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10;}ARGIND==2{c=$6;if(a[c]){print $0,a[c]}}' $library_file $other_storage_cache.$dir_file &>> $other_storage_cache$dir_file
rm -f $other_storage_cache.$dir_file
grep $dir_10x $other_storage_cache$pub_other_file | egrep -v "F[[:upper:]]{1,4}[[:digit:]]{6,13}-[[:alnum:]][[:alnum:]]" |egrep "/.*[[:digit:]]{4,8}_[[:alnum:]]{4,8}_[[:digit:]]{3,5}_[[:alnum:]]{8,12}-?[[:alnum:]]{1,4}?-?[[:alnum:]]{1,4}?-?[[:alnum:]]{1,4}?" | sed -r 's#(.*)(/.*/.*/.*/.*/.*[[:digit:]]{4,8}_[[:alnum:]]{4,8}_[[:digit:]]{3,5}_[[:alnum:]]{8,12}-?[[:alnum:]]{1,4}?-?[[:alnum:]]{1,4}?-?[[:alnum:]]{1,4}?)(.*)#\1\2\3\t\2#g' &> $other_storage_cache.$dir_file
awk -F '\t' -v OFS='\t' 'ARGIND==1{split($1,wenku,"/");s="/"wenku[2]"/"wenku[3]"/"wenku[4]"/"wenku[5]"/"wenku[6];a[s]=$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10;}ARGIND==2{c=$6;if(a[c]){print $0,a[c]}}' $library_file $other_storage_cache.$dir_file &>> $other_storage_cache$dir_file
rm -f $other_storage_cache.$dir_file
}
dirs5_func() {
grep $s5_dir $other_storage_cache$pub_other_file | egrep "F[[:upper:]]{1,4}[[:alnum:]]{6,13}-?[[:alnum:]][[:alnum:]]" | sed -r 's/(.*)(F[[:upper:]]{1,4}[[:alnum:]]{6,13}-?[[:alnum:]][[:alnum:]])(.*)/\1\2\3\t\2/g' &> $other_storage_cache.$dir_file
awk -F '\t' -v OFS='\t' 'ARGIND==1{split($1,wenku,"/");num=length(wenku);s=wenku[num];a[s]=$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10;}ARGIND==2{c=$6;if(a[c]){print $0,a[c]}}' $library_file $other_storage_cache.$dir_file &>> $other_storage_cache$dir_file
rm -f $other_storage_cache.$dir_file
}
share_func() {
grep $share_dir $other_storage_cache$pub_other_file | egrep "F[[:upper:]]{1,4}[[:alnum:]]{6,13}-?[[:alnum:]][[:alnum:]]" | sed -r 's/(.*)(F[[:upper:]]{1,4}[[:alnum:]]{6,13}-?[[:alnum:]][[:alnum:]])(.*)/\1\2\3\t\2/g' &> $other_storage_cache.$dir_file
awk -F '\t' -v OFS='\t' 'ARGIND==1{split($1,wenku,"/");num=length(wenku);s=wenku[num];a[s]="share\t-\t-\t-\t"$9"\t-\t-";} ARGIND==2{c=$6;if(a[c]){print $0,a[c]}}' $library_file $other_storage_cache.$dir_file &>> $other_storage_cache$dir_file
rm -f $other_storage_cache.$dir_file
}
create_func() {
awk -F '\t' -v OFS='\t' '{print $1,$2,$7,$3,$4,$6,"-",$5,$8,$9,$10,$11,$12}' $other_storage_cache$dir_file &> $other_storage_cache.$dir_file
mv $other_storage_cache.$dir_file $other_storage_cache$dir_file
cat $other_storage_cache$dir_file &>> $other_storage_cache$libproj_location
}
dir10x_func
dirs5_func
share_func
create_func
delete_old function module
Splits the exactly matched detail data into chunks and checks directory existence concurrently;
note: given the nature of this data, once sequencing output is deleted its directory is empty or gone;
so the function below keeps the published entries current (a standalone illustration of the per-line check follows the function).
delete_old () {
rm -f $project_dir$other_storage_cache\ablelib-split*
rm -f $project_dir$other_storage_cache$libproj\_cache*
cd $project_dir$other_storage_cache && split $libproj ablelib-split -l 2000
num=0
for i in `ls $project_dir$other_storage_cache\ablelib-split*`; do
awk '{cmd="ls -A " $5 " 2> /dev/null | wc -w"}{cmd| getline dir;close(cmd)}{if(dir>0){print $0}}' $i &> $project_dir$other_storage_cache$libproj\_cache$num &
num=$[$num+1]
sleep 0.5
done
awk_ps_num=`ps aux | grep 'awk' | grep 'cmd' | grep 'getline' |wc -l`
while [ $awk_ps_num -gt 1 ] ; do
sleep 10
awk_ps_num=`ps aux | grep 'awk' | grep 'cmd' | grep 'getline' |wc -l`
done
cat $project_dir$other_storage_cache$libproj\_cache* &> $project_dir$other_storage_cache$libproj
rm -f $project_dir$other_storage_cache\ablelib-split*
rm -f $project_dir$other_storage_cache$libproj\_cache*
}
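The per-line existence test shells out from awk: for each record it runs ls -A on the directory in column 5, counts the entries with wc -w, and keeps the record only when the directory still holds something. A tiny illustration with hypothetical paths:
mkdir -p /tmp/demo_kept /tmp/demo_empty && touch /tmp/demo_kept/x
printf 'a\tb\tc\td\t/tmp/demo_kept\na\tb\tc\td\t/tmp/demo_empty\n' | awk '{cmd="ls -A " $5 " 2> /dev/null | wc -w"}{cmd| getline dir;close(cmd)}{if(dir>0){print $0}}'
# only the /tmp/demo_kept line survives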
delete_no_exists function module
Verifies that the entries with no department match, re-matched by regex, are still current;
delete_no_exists () {
cat $project_dir$other_storage_cache$other_file | egrep "($dir_10x|$share_dir|$s5_dir)" &> $project_dir$other_storage_cache$pub_other_file
cat $project_dir$other_storage_cache$other_file | egrep -v "($dir_10x|$share_dir|$s5_dir)" &> $project_dir$other_storage_cache$work_other_file
rm -f $project_dir$other_storage_cache\other-file-split*
rm -f $project_dir$other_storage_cache$other_file\_other\_cache*
cd $project_dir$other_storage_cache && split $pub_other_file other-file-split -l 2000
num=0
for i in `ls $project_dir$other_storage_cache\other-file-split*`; do
awk '{cmd="ls -A " $4 " 2> /dev/null | wc -w"}{cmd| getline dir;close(cmd)}{if(dir>0){print $0}}' $i &> $project_dir$other_storage_cache$other_file\_other\_cache$num &
num=$[$num+1]
sleep 0.5
done
awk_ps_num=`ps aux | grep 'awk' | grep 'cmd' | grep 'getline' |wc -l`
while [ $awk_ps_num -gt 1 ] ; do
sleep 10
awk_ps_num=`ps aux | grep 'awk' | grep 'cmd' | grep 'getline' |wc -l`
done
cat $project_dir$other_storage_cache$other_file\_other\_cache* &> $project_dir$other_storage_cache$pub_other_file
rm -f $project_dir$other_storage_cache\other-file-split*
rm -f $project_dir$other_storage_cache$other_file\_other\_cache*
}
Wiring the functions together
if [ -z "$1" ];then
echo -e "Please input (delivery: 'submit the scan jobs' OR merge: 'merge the generated data' OR e.g. 38: 'the overdue-days window')"
exit 1
fi
if [ $1 == "delivery" ] ; then
if_days=`date +"%d"`
if [ $if_days -ge 27 ]; then
now_time=`date -d +1\Month +'%Y-%m'`
[ -d $now_time ] || mkdir $now_time
other_storage_dir="$now_time/other"
[ -d $other_storage_dir ] || mkdir $other_storage_dir
other_storage_data="$other_storage_dir/data/"
[ -d $other_storage_data ] || mkdir $other_storage_data
other_storage_cache="$other_storage_dir/cache/"
[ -d $other_storage_cache ] || mkdir $other_storage_cache
other_storage_shell="$other_storage_dir/shell/"
[ -d $other_storage_shell ] || mkdir $other_storage_shell
fi
cd $project_dir && rm -f $other_storage_data$xxxj5_name*FIND*2
cd $project_dir && rm -f $other_storage_data$xxxj4_name*FIND*2
cd $project_dir && rm -f $other_storage_data$xx5test_name*FIND*2
cd $project_dir && rm -f $other_storage_shell$xxxj5_name*sh
cd $project_dir && rm -f $other_storage_shell$xxxj4_name*sh
cd $project_dir && rm -f $other_storage_shell$xx5test_name*sh
create_data_xx1
fi
if [ $1 == "merge" ] ; then
data_aggregation
fi
if [[ $1 =~ ^[0-9]+$ ]] && [[ $2 =~ ^[0-9]+$ ]] ; then
over_due_name='_overdue'
cycle="$1"
stop_cycle="$2"
time_year_month_day=`date -d -$cycle\day +'%Y-%m-%d'`
libproj="able_lib.xls"
libproj_location="able.$time_year_month_day.lib.xls"
overall="$time_year_month_day.ProjInfo.XJpath.xls"
other_file="$time_year_month_day.other.xls"
ntime=`date +%s`
pub_other_file="pub_$other_file"
work_other_file="work_$other_file"
dir_file="$time_year_month_day.dir.xls"
> $other_storage_cache$dir_file
> $other_storage_cache$libproj
> $other_storage_cache$other_file
over_due_xx1
add_over_date
delete_old
delete_no_exists
fi
if [ $1 == "create" ] && [[ $2 =~ [0-9] ]] ;then
over_due_name='_overdue'
cycle=$2
time_year_month_day=`date -d -$cycle\day +'%Y-%m-%d'`
echo $time_year_month_day
libproj="able_lib.xls"
libproj_location="able.$time_year_month_day.lib.xls"
overall="$time_year_month_day.ProjInfo.XJpath.xls"
other_file="$time_year_month_day.other.xls"
ntime=`date +%s`
overall_xx
fi
Supplementary: the project-info collection script (get_proj_info.sh)
#!/bin/bash
project_dir='/test/data/xjscan/'
library_file='project_library.xls'
lms_dir_01='/home/xxx/xxx/'
lms_dir_02='/home/xxx/xxx/'
> $project_dir$library_file
for i in `ls $lms_dir_02[0-9][0-9][0-9][0-9].[0-9][0-9]`;do
awk -F '\t' -v OFS='\t' '{gsub(/;/,"-",$13)}{gsub(/;/,"-",$12)}{print $20"/"$10,$12,$13,$1,$24,$3,$22,$23,$4,$"there is over days num" }' $i &>> $project_dir$library_file
done
for j in `ls $lms_dir_01[0-9][0-9][0-9][0-9].[0-9][0-9]`;do
awk -F '\t' -v OFS='\t' '{gsub(/;/,"-",$13)}{gsub(/;/,"-",$12)}{print $20"/"$10,$12,$13,$1,$24,$3,$22,$23,$4,$"there is over days num" }' $j &>> $project_dir$library_file
done
Report-generation scripts
So far, the modules for data scanning, merging, separation, analysis, and report-file generation are complete;
what follows uses Python to generate the web-page tables, the Excel workbook, and the vlookup pivot analysis (task tracking).
The generated tables contain the following:
Overview: department - group - cluster administrator, bioinformatics analyst, data volume involved, average days overdue;
Overall: department name, department number, operations manager, analyst, project number, data volume involved, project name, and so on;
Data with no matched owner: department number and data volume;
Detail page: every directory entry with its owner, department, number, data volume, days overdue, and related fields.
Generating the web-page tables
The generated HTML tables are finally published to a URI path served by nginx, httpd, or a similar server;
the WebStackPage open-source static JavaScript page then links to the freshly generated table pages.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import os
import datetime
import sys
import pandas as pd
import html
report_dir = "./report/"
cycle=float(sys.argv[1])
time_year_month_day = str(datetime.date.today()-datetime.timedelta(days=cycle))
def proj_xls_to_html(file_path,sheet_name,save_name,table_title):
lst1 = []
with open(file_path) as f:
data = f.readlines()
for d in data:
value = d.split('\t')
value[-1] = value[-1].replace('\n','')
try:
value[5]=float(value[5])
except:
pass
lst1.append(value)
pd.set_option('display.width', 1000)
pd.set_option('colheader_justify', 'center')
frame = pd.DataFrame(lst1,index=None,columns=table_title)
frame = pd.DataFrame(frame)
#print(frame['超期天数'])
frame['超期天数'] = frame['超期天数'].map(lambda x: f'<font color="red">{x}</font>' if float(x) >= 44 else x)
pd.set_option('colheader_justify', 'center') # FOR TABLE <th>
html_string = '''
<html>
<head>
<title>####</title>
<link rel="stylesheet" type="text/css" href="gs_style.css"/>
<script type="text/javascript" src="jquery.min.js"></script>
<style>
.df tbody tr:last-child { background-color: #FF0000;}
</style>
<body>
<p align="center"> <br><input id="myInput" name="myInput"> <span class="search"> 搜索 </span> <br> </p>
<div>
<canvas width="1777" height="841" style="position: fixed; left: 0px; top: 0px; z-index: 2147483647; pointer-events: none;"></canvas>
<script src="maodian.js"></script>
<canvas width="1390" height="797" style="position: fixed; left: 0px; top: 0px; z-index: 2147483647; pointer-events: none;"></canvas>
##table##
</div>
<script>
$(function() {
$('.search').on('click', function() {
// console.log($('#myInput').val());
$('table tbody tr').hide()
.filter(":contains('" + ($('#myInput').val()) + "')")
.show();
})
})
</script>
</body>
</html>
'''.replace('####',sheet_name)
with open(save_name, 'w') as f:
f.write(html.unescape(html_string.replace("##table##",frame.to_html(classes='mystyle',table_id='mytable'))))
def proj_xls2_to_html(file_path,sheet_name,save_name,table_title):
lst1 = []
with open(file_path) as f:
data = f.readlines()
for d in data:
value = d.split('\t')
value[-1] = value[-1].replace('\n','')
value = value[0:-1]
try:
value[5]=float(value[5])
except:
pass
lst1.append(value)
pd.set_option('display.width', 1000)
pd.set_option('colheader_justify', 'center')
frame = pd.DataFrame(lst1,index=None,columns=table_title)
frame = pd.DataFrame(frame)
#print(frame['超期天数'])
pd.set_option('colheader_justify', 'center') # FOR TABLE <th>
html_string = '''
<html>
<head>
<title>####</title>
<link rel="stylesheet" type="text/css" href="gs_style.css"/>
<script type="text/javascript" src="jquery.min.js"></script>
<style>
.df tbody tr:last-child { background-color: #FF0000;}
</style>
<body>
<p align="center"> <br><input id="myInput" name="myInput"> <span class="search"> 搜索 </span> <br> </p>
<div>
<canvas width="1777" height="841" style="position: fixed; left: 0px; top: 0px; z-index: 2147483647; pointer-events: none;"></canvas>
<script src="maodian.js"></script>
<canvas width="1390" height="797" style="position: fixed; left: 0px; top: 0px; z-index: 2147483647; pointer-events: none;"></canvas>
##table##
</div>
<script>
$(function() {
$('.search').on('click', function() {
// console.log($('#myInput').val());
$('table tbody tr').hide()
.filter(":contains('" + ($('#myInput').val()) + "')")
.show();
})
})
</script>
</body>
</html>
'''.replace('####',sheet_name)
with open(save_name, 'w') as f:
f.write(html.unescape(html_string.replace("##table##",frame.to_html(classes='mystyle',table_id='mytable'))))
def projother_xls_to_html(file_path,sheet_name,save_name,table_title):
lst1 = []
with open(file_path) as f:
data = f.readlines()
for d in data:
value = d.split('\t')
value[-1] = value[-1].replace('\n','')
try:
value[5]=float(value[5])
except:
pass
lst1.append(value)
pd.set_option('display.width', 1000)
pd.set_option('colheader_justify', 'center')
frame = pd.DataFrame(lst1,index=None,columns=table_title)
frame = pd.DataFrame(frame)
pd.set_option('colheader_justify', 'center') # FOR TABLE <th>
html_string = '''
<html>
<head>
<title>####</title>
<link rel="stylesheet" type="text/css" href="gs_style.css"/>
<script type="text/javascript" src="jquery.min.js"></script>
<style>
.df tbody tr:last-child { background-color: #FF0000;}
</style>
<body>
<p align="center"> <br><input id="myInput" name="myInput"> <span class="search"> 搜索 </span> <br> </p>
<div>
<canvas width="1777" height="841" style="position: fixed; left: 0px; top: 0px; z-index: 2147483647; pointer-events: none;"></canvas>
<script src="maodian.js"></script>
<canvas width="1390" height="797" style="position: fixed; left: 0px; top: 0px; z-index: 2147483647; pointer-events: none;"></canvas>
##table##
</div>
<script>
$(function() {
$('.search').on('click', function() {
// console.log($('#myInput').val());
$('table tbody tr').hide()
.filter(":contains('" + ($('#myInput').val()) + "')")
.show();
})
})
</script>
</body>
</html>
'''.replace('####',sheet_name)
with open(save_name, 'w') as f:
f.write(html.unescape(html_string.replace("##table##",frame.to_html(classes='mystyle',table_id='mytable'))))
def proj_ztxls_to_html(file_path,sheet_name,save_name,table_title):
lst1 = []
with open(file_path) as f:
data = f.readlines()
for d in data:
value = d.split('\t')
value[-1] = value[-1].replace('\n','')
try:
value[5]=float(value[5])
except:
pass
lst1.append(value)
pd.set_option('display.width', 1000)
pd.set_option('colheader_justify', 'center')
frame = pd.DataFrame(lst1,index=None,columns=table_title)
frame = pd.DataFrame(frame)
pd.set_option('colheader_justify', 'center') # FOR TABLE <th>
html_string = '''
<html>
<head>
<title>####</title>
<link rel="stylesheet" type="text/css" href="gs_style.css"/>
<script type="text/javascript" src="jquery.min.js"></script>
<style>
.df tbody tr:last-child { background-color: #FF0000;}
</style>
<body>
<p align="center"> <br><input id="myInput" name="myInput"> <span class="search"> 搜索 </span> <br> </p>
<div>
<canvas width="1777" height="841" style="position: fixed; left: 0px; top: 0px; z-index: 2147483647; pointer-events: none;"></canvas>
<script src="maodian.js"></script>
<canvas width="1390" height="797" style="position: fixed; left: 0px; top: 0px; z-index: 2147483647; pointer-events: none;"></canvas>
##table##
</div>
<script>
$(function() {
$('.search').on('click', function() {
// console.log($('#myInput').val());
$('table tbody tr').hide()
.filter(":contains('" + ($('#myInput').val()) + "')")
.show();
})
})
</script>
</body>
</html>
'''.replace('####',sheet_name)
with open(save_name, 'w') as f:
f.write(html.unescape(html_string.replace("##table##",frame.to_html(classes='mystyle',table_id='mytable'))))
def crate_shuju_toushi(file_path,sheet_name2,index_name,values_name,
department_name,new_department_name,personnel_name,table_title,values_name_over,over_name):
f = pd.read_table(file_path,sep='\t',header=None)
f.columns=table_title
res = pd.pivot_table(f, index=[index_name], values=[values_name, values_name_over],
aggfunc={values_name: np.sum, values_name_over: np.mean}, margins=True)
all_xinxi_title = list(zip(list((res.index)), list(res[values_name]),list(res[values_name_over])))
# print(all_xinxi_title)
department_personnel = list(zip(f[department_name], f[index_name],f[values_name_over]))
department_personnel_size_list = []
#
#
for user in all_xinxi_title:
for department_personnel_list in department_personnel:
if user[0] in department_personnel_list:
if user[0] == '-':
continue
if user[0] == ' ':
continue
department_personnel_size_list.append((department_personnel_list[0],user[0],user[1],user[2]))
end_department_personnel_size_list = sorted(list(set(department_personnel_size_list)), key=lambda x:x[2] ,reverse=True)
all_xinxi_title_end = all_xinxi_title[-1]
all_xinxi_title_end = list(all_xinxi_title_end)
all_xinxi_title_end.insert(0,'')
end_department_personnel_size_list.append(all_xinxi_title_end)
#end_department_personnel_size_list.insert(0,[new_department_name,personnel_name,values_name])
end_department_personnel_size_list.pop()
input_list=[]
user_volue = []
for user2 in end_department_personnel_size_list:
user2 = list(user2)
if user2[1] == user_volue:
continue
user_volue = user2[1]
input_list.append(user2)
sum_list = np.array(input_list)
all_size = sum(list(map(float,sum_list[:,-2][:])))
input_list.append([" ","ALL size",all_size])
pd.set_option('display.width', 1000)
pd.set_option('colheader_justify', 'center')
#frame = pd.DataFrame(input_list,index=None)
frame = pd.DataFrame(input_list,columns=[new_department_name,personnel_name,values_name,over_name])
pd.set_option('colheader_justify', 'center') # FOR TABLE <th>
html_string = '''
<html>
<head>
<title>####</title>
<link rel="stylesheet" type="text/css" href="gs_style.css"/>
<script type="text/javascript" src="jquery.min.js"></script>
<style>
.df tbody tr:last-child { background-color: #FF0000;}
</style>
<body>
<p align="center"> <br><input id="myInput" name="myInput"> <span class="search"> 搜索 </span> <br> </p>
<div>
<canvas width="1777" height="841" style="position: fixed; left: 0px; top: 0px; z-index: 2147483647; pointer-events: none;"></canvas>
<script src="maodian.js"></script>
<canvas width="1390" height="797" style="position: fixed; left: 0px; top: 0px; z-index: 2147483647; pointer-events: none;"></canvas>
##table##
</div>
<script>
$(function() {
$('.search').on('click', function() {
// console.log($('#myInput').val());
$('table tbody tr').hide()
.filter(":contains('" + ($('#myInput').val()) + "')")
.show();
})
})
</script>
</body>
</html>
'''.replace('####',sheet_name2)
with open(save_name, 'w') as f:
f.write(html.unescape(html_string.replace("##table##",frame.to_html(classes='mystyle',table_id='mytable'))))
def create_space_toushi(save_name,sheet_name2,department_num,
sum_name,data_total,index_name,values_name,
file_path,table_title,values_name_over,over_name):
f = pd.read_table(file_path,sep='\t',header=None,dtype={'业务部门利润编号':str})
f.columns = table_title
res = pd.pivot_table(f, index=index_name, values=[values_name,values_name_over], aggfunc={values_name:np.sum,values_name_over:np.mean},margins=True)
all_list = list(zip(list(res.index), list(res[values_name]),list(res[values_name_over])))
space_list = []
for user in all_list:
user2 = list(user)[1]
user4 = list(user)[2]
user3 = list(list(user)[0])
if user3[0] == '-':
space_list.append((user3[1],user2,user4))
if user3[0] == ' ':
space_list.append((user3[1],user2,user4))
sum_list = []
space_data = pd.DataFrame(space_list)
space_data_group = space_data.groupby([0]).agg({1: 'sum', 2: 'mean'})
space_list_list = list(zip(space_data_group.index.tolist(), space_data_group.values.tolist()))
space_list = []
for space in space_list_list:
print(space)
space_index_value = space[0], space[1][0], space[1][1]
space_list.append(space_index_value)
for slist in space_list:
sum_list.append(slist[1])
sum_list=sum(sum_list)
sort_space_list = sorted(list(set(space_list)), key=lambda x:x[1] ,reverse=True)
sort_space_list.append((sum_name, sum_list))
#sort_space_list.insert(0,(department_num,data_total))
pd.set_option('display.width', 1000)
pd.set_option('colheader_justify', 'center')
frame = pd.DataFrame(sort_space_list,index=None,columns=[department_num,data_total,over_name])
#frame = pd.DataFrame(sort_space_list,index=None)
#frame = pd.DataFrame(frame)
pd.set_option('colheader_justify', 'center') # FOR TABLE <th>
html_string = '''
<html>
<head>
<title>####</title>
<link rel="stylesheet" type="text/css" href="gs_style.css"/>
<script type="text/javascript" src="jquery.min.js"></script>
<style>
.df tbody tr:last-child { background-color: #FF0000;}
</style>
<body>
<p align="center"> <br><input id="myInput" name="myInput"> <span class="search"> 搜索 </span> <br> </p>
<div>
<canvas width="1777" height="841" style="position: fixed; left: 0px; top: 0px; z-index: 2147483647; pointer-events: none;"></canvas>
<script src="maodian.js"></script>
<canvas width="1390" height="797" style="position: fixed; left: 0px; top: 0px; z-index: 2147483647; pointer-events: none;"></canvas>
##table##
</div>
<script>
$(function() {
$('.search').on('click', function() {
// console.log($('#myInput').val());
$('table tbody tr').hide()
.filter(":contains('" + ($('#myInput').val()) + "')")
.show();
})
})
</script>
</body>
</html>
'''.replace('####',sheet_name2)
with open(save_name, 'w') as f:
f.write(html.unescape(html_string.replace("##table##",frame.to_html(classes='mystyle',table_id='mytable'))))
# for all_space_list in sort_space_list:
# sheet.append(all_space_list)
# wb.save(save_name)
if __name__=='__main__':
sheet_name = '详情'
save_name = report_dir + 'XX-XJ-DATASHOW-' + sheet_name + '.html'
file_path = report_dir + 'able.' + time_year_month_day +'.lib.xls'
table_title = ['数据大小B','文件生成时间','部门编号',
'文件属主','项目目录','文库编号','所属组','超期天数',
'项目编号', '运营经理', '信息分析', '项目名称','分期号']
proj_xls_to_html(file_path=file_path,
sheet_name=sheet_name,
table_title=table_title,
save_name=save_name)
sheet_name2 = '概况'
file_path = report_dir + time_year_month_day + '.ProjInfo.XJpath.xls'
save_name = report_dir + 'XX-XJ-DATASHOW-' + sheet_name2 + '.html'
table_title = ['业务部门名称','业务部门XX编号',
'运营经理','信息分析','项目编号',
'涉及数据量大小(G)','项目平均超期(天)','项目名称','文库编号',]
index_name = '信息分析'
values_name = '涉及数据量大小(G)'
department_name = '业务部门名称'
new_department_name = '部门'
personnel_name = '人员'
values_name_over = '项目平均超期(天)'
over_name = '总体平均超期(天)'
crate_shuju_toushi(
sheet_name2=sheet_name2,
index_name=index_name,
values_name=values_name,
department_name=department_name,
new_department_name=new_department_name,
file_path = file_path,
personnel_name=personnel_name,
table_title=table_title,
over_name=over_name,
values_name_over=values_name_over,)
sheet_name2 = '未匹配人员数据量'
index_name = ['信息分析','业务部门XX编号']
file_path = report_dir + time_year_month_day + '.ProjInfo.XJpath.xls'
save_name = report_dir + 'XX-XJ-DATASHOW-' + sheet_name2 + '.html'
department_num = '部门编号'
data_total = '数据量(G)'
values_name = '涉及数据量大小(G)'
table_title = ['XX部门名称','XX部门利润编号',
'运营XX','XX信息分析','项目编号',
'涉及数据量大小(G)','项目平均超期(天)','项目名称','文库XX编号',]
sum_name = 'ALL'
values_name_over = '项目平均超期(天)'
over_name = '总体平均超期(天)'
create_space_toushi(index_name=index_name,
save_name=save_name,
department_num=department_num,
data_total=data_total,
file_path=file_path,
table_title=table_title,
values_name=values_name,
sheet_name2=sheet_name2,
sum_name=sum_name,
over_name=over_name,
values_name_over=values_name_over, )
sheet_name = '总体情况'
file_path = report_dir + time_year_month_day + '.ProjInfo.XJpath.xls'
save_name = report_dir + 'XX-XJ-DATASHOW-' + sheet_name + '.html'
table_title = ['XX部门名称','XX部门XX编号',
'运营XX','信息分析XX','XX项目编号',
'涉及数据量大小(G)','项目平均超期(天)','项目名称',]
proj_xls2_to_html(file_path=file_path,
sheet_name=sheet_name,
table_title=table_title,
save_name=save_name)
Generating the Excel workbook
This workbook contains all the data and fields shown on the web pages; because the detail sheet is too large to filter comfortably online, several forms of analysis and download are provided;
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import openpyxl as xl
import pandas as pd
import numpy as np
import os
import sys
import datetime
import re
from openpyxl.cell.cell import ILLEGAL_CHARACTERS_RE
report_dir = './report/'
cycle = float(sys.argv[1])
time_year_month_day = str(datetime.date.today()-datetime.timedelta(days=cycle))
def proj_xls_to_xlsx(file_path,sheet_name,save_name,tableTitle=None):
with open(file_path) as f:
data = f.readlines()
if not os.path.exists(save_name):
wb = xl.Workbook()
wb.save(save_name)
else:
wb = xl.load_workbook(save_name)
ws1 = wb.create_sheet(index=0)
ws1.title = sheet_name
if tableTitle != None:
for n in range(len(tableTitle)):
c = n + 1
ws1.cell(row=1, column=c).value = tableTitle[n]
for d in data:
d = ILLEGAL_CHARACTERS_RE.sub(r'', str(d))
value = d.split('\t')
try:
# print(value)
value[5]=float(value[5])
except:
pass
ws1.append(value)
wb.save(save_name)
def crate_shuju_toushi(save_name, sheet_name, sheet_name2, index_name, values_name, department_name,values_name_over,
new_department_name, personnel_name,over_name):
f = pd.read_excel(io=save_name, sheet_name=sheet_name)
res = pd.pivot_table(f, index=[index_name], values=[values_name,values_name_over], aggfunc={values_name:np.sum, values_name_over:np.mean},margins=True)
wb = xl.load_workbook(save_name)
old_title = wb.worksheets[0]
old_title.title = sheet_name2
#
all_xinxi_title = list(zip(list((res.index)), list(res[values_name]),list(res[values_name_over])))
# print(all_xinxi_title)
department_personnel = list(zip(f[department_name], f[index_name],f[values_name_over]))
# print(department_personnel)
department_personnel_size_list = []
#
for user in all_xinxi_title:
for department_personnel_list in department_personnel:
if user[0] in department_personnel_list:
if user[0] == '-':
continue
if user[0] == ' ':
continue
department_personnel_size_list.append((department_personnel_list[0], user[0], user[1],user[2]))
end_department_personnel_size_list = sorted(list(set(department_personnel_size_list)), key=lambda x: x[2],
reverse=True)
all_xinxi_title_end = all_xinxi_title[-1]
all_xinxi_title_end = list(all_xinxi_title_end)
all_xinxi_title_end.insert(0, '')
end_department_personnel_size_list.append(all_xinxi_title_end)
end_department_personnel_size_list.insert(0, [new_department_name, personnel_name, values_name,over_name])
end_department_personnel_size_list.pop()
user_volue = []
for user2 in end_department_personnel_size_list:
user2 = list(user2)
if user2[1] == user_volue:
continue
user_volue = user2[1]
old_title.append(user2)
wb.save(save_name)
def create_space_toushi(save_name, sheet_name, sheet_name2, department_num, sum_name, data_total, index_name,over_name,values_name_over,
values_name):
f = pd.read_excel(io=save_name, sheet_name=sheet_name, dtype={'业务部门利润编号': str})
res = pd.pivot_table(f, index=index_name, values=[values_name,values_name_over], aggfunc={values_name:np.sum,values_name_over:np.mean},margins=True)
all_list = list(zip(list(res.index), list(res[values_name]),list(res[values_name_over])))
wb = xl.load_workbook(save_name)
sheet = wb.create_sheet(sheet_name2)
space_list = []
for user in all_list:
user2 = list(user)[1]
user4 = list(user)[2]
user3 = list(list(user)[0])
if user3[0] == '-':
space_list.append((user3[1], user2,user4))
if user3[0] == ' ':
space_list.append((user3[1], user2,user4))
sum_list = []
space_data = pd.DataFrame(space_list)
space_data_group = space_data.groupby([0]).agg({1: 'sum', 2: 'mean'})
space_list_list = list(zip(space_data_group.index.tolist(), space_data_group.values.tolist()))
space_list = []
for space in space_list_list:
print(space)
space_index_value = space[0], space[1][0], space[1][1]
space_list.append(space_index_value)
for slist in space_list:
print(slist)
sum_list.append(slist[1])
sum_list = sum(sum_list)
sort_space_list = sorted(list(set(space_list)), key=lambda x: x[1], reverse=True)
sort_space_list.append((sum_name, sum_list))
sort_space_list.insert(0, (department_num, data_total,over_name))
for all_space_list in sort_space_list:
print(all_space_list)
sheet.append(all_space_list)
wb.save(save_name)
if __name__=='__main__':
sheet_name = '详情'
save_name = report_dir + 'XX-XJ-DATASHOW-' + time_year_month_day + '.xlsx'
file_path = report_dir + 'able.' + time_year_month_day +'.lib.xls'
table_title = ['数据大小B','文件生成时间','部门编号',
'文件属主','项目目录','文库编号','所属组','超期天数',
'项目编号','运营XX','信息分析','项目名称','分期号']
proj_xls_to_xlsx(file_path=file_path,
sheet_name=sheet_name,
save_name=save_name,
tableTitle=table_title,)
sheet_name = '总体情况'
save_name = report_dir + 'XX-XJ-DATASHOW-' + time_year_month_day + '.xlsx'
file_path = report_dir + time_year_month_day + '.ProjInfo.XJpath.xls'
table_title = ['业务部门名称','业务部门利润编号',
'运营经理','信息分析','项目编号',
'涉及数据量大小(G)','项目平均超期(天)','项目名称','文库编号',]
proj_xls_to_xlsx(file_path=file_path,
sheet_name=sheet_name,
tableTitle=table_title,
save_name=save_name)
save_name = report_dir + 'XX-XJ-DATASHOW-' + time_year_month_day + '.xlsx'
sheet_name = '总体情况'
sheet_name2 = '概况'
index_name = '信息分析'
values_name = '涉及数据量大小(G)'
department_name = '业务部门名称'
new_department_name = '部门'
personnel_name = '人员'
values_name_over = '项目平均超期(天)'
over_name = '总体平均超期(天)'
crate_shuju_toushi(save_name=save_name,
sheet_name=sheet_name,
sheet_name2=sheet_name2,
index_name=index_name,
values_name=values_name,
department_name=department_name,
new_department_name=new_department_name,
personnel_name=personnel_name,
values_name_over=values_name_over,
over_name=over_name
)
index_name = ['信息分析','业务部门利润编号']
save_name = report_dir + 'XX-XJ-DATASHOW-' + time_year_month_day + '.xlsx'
sheet_name = '总体情况'
department_num = '部门编号'
data_total = '数据量(G)'
values_name = '涉及数据量大小(G)'
sheet_name2 = '未匹配人员数据量'
sum_name = 'ALL'
values_name_over = '项目平均超期(天)'
over_name = '总体平均超期(天)'
create_space_toushi(index_name=index_name,
save_name=save_name,
sheet_name=sheet_name,
department_num=department_num,
data_total=data_total,
values_name=values_name,
sheet_name2=sheet_name2,
sum_name=sum_name,
values_name_over=values_name_over,
over_name=over_name)
Data submission and deletion-progress tracking
Multi-sheet vlookup progress analysis;
some variables reuse the main script's environment and are not listed again;
copy the data generated on publication day, then regenerate the tables according to deletion progress;
id_group_file='xj_id_class'
library_file='project_library.xls'
over_due_name='_overdue'
now_time=`date +"%Y-%m"`
report_dir='./report/'
send_dir='./send/'
[ -d $now_time ] || mkdir $now_time
other_storage_dir="$now_time/other"
[ -d $other_storage_dir ] || mkdir $other_storage_dir
other_storage_data="$other_storage_dir/data/"
[ -d $other_storage_data ] || mkdir $other_storage_data
other_storage_cache="$other_storage_dir/cache/"
[ -d $other_storage_cache ] || mkdir $other_storage_cache
other_storage_shell="$other_storage_dir/shell/"
[ -d $other_storage_shell ] || mkdir $other_storage_shell
cycle="$1"
time_year_month_day=`date -d -$cycle\day +'%Y-%m-%d'`
libproj="able_lib.xls"
libproj_location="able.$time_year_month_day.lib.xls"
overall="$time_year_month_day.ProjInfo.XJpath.xls"
other_file="$time_year_month_day.other.xls"
ntime=`date +%s`
vlook_up () {
if ! [ -f $send_dir/XX-XJ-DATASHOW-$time_year_month_day-old.xlsx ] ; then
cp -a $other_storage_cache$libproj_location $send_dir
cp -a $other_storage_cache$overall $send_dir
LANG='en_US.UTF-8' && source py3_new/bin/activate && python re_create.py $cycle
mv $send_dir/XX-XJ-DATASHOW-$time_year_month_day.xlsx $send_dir/XX-XJ-DATASHOW-$time_year_month_day-old.xlsx
rm -f $send_dir$libproj_location
rm -f $send_dir$overall
fi
if [ -f $send_dir/XX-XJ-DATASHOW-$time_year_month_day\.xlsx ] ; then
rm -f $send_dir/XX-XJ-DATASHOW-$time_year_month_day\.xlsx
fi
if ! [ -f $send_dir/XX-XJ-DATASHOW-$time_year_month_day\.xlsx ] ; then
cp -a $other_storage_cache$libproj_location $send_dir
(cd $send_dir && split $libproj_location ablelib-split -l 2000)
num=0
for i in `ls $send_dir\ablelib-split*`; do
awk '{cmd="ls -A " $5 " 2> /dev/null | wc -w"}{cmd| getline dir;close(cmd)}{if(dir>0){print $0}}' $i &> $project_dir$send_dir\_cache$num &
num=$[$num+1]
sleep 0.5
done
awk_ps_num=`ps aux | grep 'awk' | grep 'cmd' | grep 'getline' |wc -l`
while [ $awk_ps_num -gt 1 ] ; do
sleep 10
awk_ps_num=`ps aux | grep 'awk' | grep 'cmd' | grep 'getline' |wc -l`
done
cat $send_dir\_cache* | awk -F '\t' -v OFS='\t' '{print $1,$2,$3,$4,$5,$6,$7,$8}' &> $send_dir$libproj
rm -f $send_dir\ablelib-split*
rm -f $send_dir\_cache*
> $project_dir$send_dir$overall
awk -F '\t' -v OFS='\t' 'ARGIND==1{split($1,wenku,"/");num=length(wenku);s=wenku[num]""$5;t=wenku[num]"-"$2""$5;a[s]=$6"\t"$7"\t"$8"\t"$9;b[t]=$6"\t"$7"\t"$8"\t"$9}ARGIND==2{c=$6""$3;if(a[c]){print $0,a[c]} else if(b[c]){print $0,b[c]}}' $project_dir$library_file $project_dir$send_dir$libproj &> $project_dir$send_dir.$libproj_location
awk -F '\t' -v OFS='\t' 'ARGIND==1{split($1,wenku,"/");num=length(wenku);s=wenku[num]""$5;t=wenku[num]"-"$2""$5;a[s]=$6"\t"$7"\t"$8"\t"$9;b[t]=$6"\t"$7"\t"$8"\t"$9}ARGIND==2{c=$6""$3;if(!a[c] && ! b[c]){print $0,"-","-","-","-"}}' $project_dir$library_file $project_dir$send_dir$libproj &>> $project_dir$send_dir.$libproj_location
mv $project_dir$send_dir.$libproj_location $project_dir$send_dir$libproj_location
cat $project_dir$id_group_file | awk -F ',' -v OFS='\t' '{print $1,$2,$3,$4,$5}' &> $project_dir$send_dir.$id_group_file
awk -F '\t' -v OFS='\t' 'ARGIND==1{s=$1;a[s]=$4"-"$2"-"$3}ARGIND==2{c=$3;if(a[c]){print $0,a[c]} else {print $0,"-"}}' $project_dir$send_dir.$id_group_file $project_dir$send_dir$libproj_location | awk -F '\t' -v OFS='\t' '{print $14,$3,$6,$1,$9,$10,$11,$12,$8}' | awk -F'\t' -v OFS='\t' '{a[$1"\t"$2"\t"$6"\t"$7"\t"$8"\t"$5]+=$4;b[$1"\t"$2"\t"$6"\t"$7"\t"$8"\t"$5]=b[$1"\t"$2"\t"$6"\t"$7"\t"$8"\t"$5]$3",";c[$1"\t"$2"\t"$6"\t"$7"\t"$8"\t"$5]+=$9;d[$1"\t"$2"\t"$6"\t"$7"\t"$8"\t"$5]++}END{for (i in a)print i,a[i],c[i]/d[i],b[i]}' | sed 's/,$//g' | awk -F '\t' '{printf "%s\t%s\t%s\t%s\t%s\t%.2f\t%s\t%s\t%s\n",$1,$2,$3,$4,$6,$7/1024/1024/1024,$8,$5,$9}' &> $project_dir$send_dir.$overall
mv $project_dir$send_dir.$overall $project_dir$send_dir$overall
LANG='en_US.UTF-8' && source py3_new/bin/activate && python re_create.py $cycle
LANG='en_US.UTF-8' && source py3_new/bin/activate && python vlook_XXxc.py $cycle
fi
}
vlook_up
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pandas as pd
import sys
import openpyxl as xl
import datetime
report_dir = './send/'
difference_cycle=float(sys.argv[1])
time_year_month_day = str(datetime.date.today()-datetime.timedelta(days=difference_cycle))
def add_last_excel(old_sheet,new_sheet,gaikuang):
pd01 = pd.read_excel(old_sheet,sheet_name=gaikuang,encoding='utf-8')
pd02 = pd.read_excel(new_sheet,sheet_name=gaikuang,encoding='utf-8')
pd11 = pd.read_excel(old_sheet,sheet_name='未匹配人员数据量',encoding='utf-8')
pd12 = pd.read_excel(new_sheet,sheet_name='未匹配人员数据量',encoding='utf-8')
result = pd.merge(pd01,pd02[['人员','涉及数据量大小(G)']],on='人员')
result_list = list(zip(list((result['部门'])),list((result['人员'])),list((result['涉及数据量大小(G)_x'])),list((result['涉及数据量大小(G)_y']))))
not_match = pd.merge(pd11,pd12[['部门编号','数据量(G)']],left_on='部门编号',right_on='部门编号')
print(not_match)
not_match_list = list(zip(list((not_match['部门编号'])),list((not_match['数据量(G)_x'])),list((not_match['数据量(G)_y']))))
wb2 = xl.load_workbook(new_sheet)
remove_sheet1 = wb2[gaikuang]
remove_sheet2 = wb2['未匹配人员数据量']
wb2.remove(remove_sheet1)
wb2.remove(remove_sheet2)
wb2.save(new_sheet)
wb2 = xl.load_workbook(new_sheet)
sheet21 = wb2.create_sheet(gaikuang,0)
sheet22 = wb2.create_sheet('未匹配人员数据量')
result_head = ['部门','人员','涉及数据量(G)','第二次涉及数据量(G)','任务额差']
result_list.insert(0,result_head)
result_for_num = 1
for i in result_list:
result_i = list(i)
if result_for_num != 1:
result_chae=i[2]-i[3]-i[2]*0.2
result_i.append(float('%.2f'% result_chae))
result_for_num = result_for_num + 1
sheet21.append(result_i)
not_match_head = ['部门编号','数据量(G)','第二次数据量(G)','任务额差']
not_match_list.insert(0,not_match_head)
not_match_for_num = 1
for j in not_match_list:
not_match_j = list(j)
if not_match_for_num != 1:
not_match_chae=j[1]-j[2]-j[1]*0.2
not_match_j.append(float('%.2f'% not_match_chae))
not_match_for_num = not_match_for_num + 1
sheet22.append(not_match_j)
wb2.save(new_sheet)
if __name__ == '__main__':
old_sheet=report_dir + 'XX-XJ-DATASHOW-' + time_year_month_day + '-old.xlsx'
new_sheet=report_dir + 'XX-XJ-DATASHOW-' + time_year_month_day + '.xlsx'
gaikuang='概况'
add_last_excel(old_sheet=old_sheet,new_sheet=new_sheet,gaikuang=gaikuang)
This completes the main core of the system;
only one storage volume in one region is shown here, as a test; per production needs it can be deployed across regions and scan multiple storages.
Next, define the scheduled scan submission and run the update loop inside a screen session for automation.
Scheduling the monthly scan submission
Define the monthly scan submission and log its execution
# crontab -e
47 15 26,14,1 * * cd /test/data/xjscan && bash -x xjsan_other.sh delivery &> /test/data/xjscan/logfile
Define the consolidated analysis script with automatic result sync;
#!/bin/bash
id_group_file='xj_id_class'
library_file='project_library.xls'
over_due_name='_overdue'
now_time=`date +"%Y-%m"`
report_dir='./report/'
[ -d $now_time ] || mkdir $now_time
other_storage_dir="$now_time/other"
[ -d $other_storage_dir ] || mkdir $other_storage_dir
other_storage_data="$other_storage_dir/data/"
[ -d $other_storage_data ] || mkdir $other_storage_data
other_storage_cache="$other_storage_dir/cache/"
[ -d $other_storage_cache ] || mkdir $other_storage_cache
other_storage_shell="$other_storage_dir/shell/"
[ -d $other_storage_shell ] || mkdir $other_storage_shell
cycle="# "there is over days num""
time_year_month_day=`date -d -$cycle\day +'%Y-%m-%d'`
libproj="able_lib.xls"
libproj_location="able.$time_year_month_day.lib.xls "
overall="$time_year_month_day.ProjInfo.XJpath.xls"
other_file="$time_year_month_day.other.xls"
ntime=`date +%s`
cd $project_dir && bash -x xjsan_other.sh $cycle 100000 &> real_time.log
cd $project_dir && bash -x xjsan_other.sh create $cycle &>> real_time.log
sleep 5
cd $project_dir && cp -a $other_storage_cache$libproj_location $report_dir &>> real_time.log
cd $project_dir && cp -a $other_storage_cache$overall $report_dir &>> real_time.log
cd $project_dir && cp -a $other_storage_cache$libproj_location $xiangqing_cache &>> real_time.log
LANG='en_US.UTF-8' && source py3_new/bin/activate && python html_create.py $cycle &>> real_time.log
LANG='en_US.UTF-8' && source py3_new/bin/activate && python create.py $cycle &>> real_time.log
ssh root@xx.xx.xx.10 "sed -i 's/XX-XJ-DATASHOW-20.*xlsx/XX-XJ-DATASHOW-$time_year_month_day.xlsx/g' /var/www/html/public/index.html" &>> real_time.log
scp $report_dir\XX-XJ-DATASHOW* root@xx.xx.xx.10:/var/www/html/public &>> real_time.log
cycle=""there is over days num" "
time_year_month_day=`date -d -$cycle\day +'%Y-%m-%d'`
libproj="able_lib.xls"
libproj_location="able.$time_year_month_day.lib.xls "
overall="$time_year_month_day.ProjInfo.XJpath.xls"
other_file="$time_year_month_day.other.xls"
ntime=`date +%s`
cd $project_dir && bash -x xjsan_other.sh $cycle "there is over days num" &> real_time.log
cd $project_dir && bash -x xjsan_other.sh create $cycle &>> real_time.log
sleep 5
cd $project_dir && cp -a $other_storage_cache$libproj_location $report_dir &>> real_time.log
cd $project_dir && cp -a $other_storage_cache$overall $report_dir &>> real_time.log
LANG='en_US.UTF-8' && source py3_new/bin/activate && python add_7days_create.py $cycle &>> real_time.log
scp $report_dir\XX-XJ-DATASHOW* root@xx.xx.xx.10:/var/www/html/public &>> real_time.log
update_time=`date +"%Y年%m月%d日%H时%M分"`
ssh root@xx.xx.xx.10 "sed -i 's/##.*##/##$update_time##/g' /var/www/html/public/index.html" &>> real_time.log
ssh root@xx.xx.xx.10 "sed -i 's/XX-XJ-DATASHOW-INTERVAL-.*xlsx/XX-XJ-DATASHOW-INTERVAL-$time_year_month_day.xlsx/g' /var/www/html/public/index.html" &>> real_time.log
cd $project_dir && rm -f $report_dir*xls &>> real_time.log
cd $project_dir && rm -f $report_dir\XX-XJ-DATASHOW* &>> real_time.log
Pick a compute-class node for the background runs and result sync
Put the loop into a detached screen session
# screen -S xjscan
# while sleep 1 ; do bash -x run_create.sh ; done
# Ctrl-a d    # detach to the background
WebStackPage repository and link changes
After the web page is deployed under httpd (or nginx), just update the link addresses and the publication notice text;
the resulting pages contain department information, so they are not shown here.
Supplementary awk statistics
One command to classify by tag, sum sizes, and count the directories per tag
Input format:
254522 oss://gentype-hz/hangzhou_project/xxx.depth.stat.xls "nj_project" "1"
Columns: size dir tags tag_num
Goal: where the tags and tag_num columns match, sum the sizes and count the directories, with the individual directories comma-separated in the last column (a demonstration follows below)
awk '{b[$3" "$4]=a[$3" "$4]++?b[$3" "$4]","$2:$2;c[$3" "$4]+=$1}END{for(i in b){if(a[i]>1)print i,a[i],c[i],b[i]}}'
Final output: tags tag_num dir_count size_sum dir_detail,dir_detail,dir_detail…
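A quick demonstration with two hypothetical rows that share the same tag pair:
printf '100 oss://b/d1 "t" "1"\n200 oss://b/d2 "t" "1"\n' | awk '{b[$3" "$4]=a[$3" "$4]++?b[$3" "$4]","$2:$2;c[$3" "$4]+=$1}END{for(i in b){if(a[i]>1)print i,a[i],c[i],b[i]}}'
# "t" "1" 2 300 oss://b/d1,oss://b/d2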
Filtering and killing overdue processes
Filter same-named processes that were not destroyed automatically: keep those within a 30-day window and kill the older ones
Example: process name crontabServertest, window 30 days
ps axo pid,etimes,command | grep ' crontabServertest' | awk '{if($2>2592000) print $1}' | xargs -I {} kill -9 {}
Supplementary statistics, 2024-04-29
- Find the images on kubernetes container compute nodes that no container is currently using;
- a single command can be submitted to grab the difference directly, groundwork for alert self-healing in intelligent operations.
awk 'BEGIN{ system("crictl ps -a > /tmp/crictl-cache-ps.txt 2>&1"); system("crictl images --no-trunc > /tmp/crictl-cache-images.txt 2>&1") } ARGIND==1{s=$2;a[$2]=$3"-"$4}ARGIND==2{split($3,imid,":");num=length(imid);iimid=imid[num];c=substr(iimid,1,13);if(!a[c]){print $0}}' /tmp/crictl-cache-ps.txt /tmp/crictl-cache-images.txt
Supplementary statistics, 2024-06-25
- Merge the pre and pro statistics tables on a shared column and generate the config entries (a detailed walk-through follows)
awk -F '\t' 'ARGIND==1{s=$5;a[s]="pre\t"$7"\t"$19}ARGIND==2{c=$5;if(a[c]){print $2"\t"$3"\t"$5"\tpro\t"$7"\t"$13"\t"a[c]}else{print $2"\t"$3"\t"$5"\tpro\t"$7"\t"$13"\tpre\t"$7"\t"$13}}' pre.txt prod.txt | awk -F '\t' -v OFS='\t' '{ printf "%s\tcpu: {{if eq .Env.FLOW_DIST_ENV \"pre\"}}\"%sm\"{{else}}\"%sm\"{{end}}\t memory: {{if eq .Env.FLOW_DIST_ENV \"pre\"}}\"%dMi\"{{else}}\"%dMi\"{{end}}\n", $0, $9, $6, $8, $5 }' &> all.txt
# Add the pre-side complement (rows present only in pre)
awk -F '\t' 'ARGIND==1{s=$5;a[s]="pre\t"$7"\t"$13}ARGIND==2{c=$5;if(!a[c]){print $2"\t"$3"\t"$5"\tpro\t"$7"\t"$19"\tpre\t"$7"\t"$19}}' prod.txt pre.txt | awk -F '\t' -v OFS='\t' '{ printf "%s\tcpu: {{if eq .Env.FLOW_DIST_ENV \"pre\"}}\"%sm\"{{else}}\"%sm\"{{end}}\t memory: {{if eq .Env.FLOW_DIST_ENV \"pre\"}}\"%dMi\"{{else}}\"%dMi\"{{end}}\n", $0, $9, $6, $8, $5 }' &>> all.txt
- To deepen understanding, and for the record, here is a detailed walk-through.
Both commands use awk plus redirection to read data from two files (pre.txt and prod.txt), process it, and save the result to all.txt. To understand them in detail, we break each one into its parts.
The first command
awk -F '\t' 'ARGIND==1{s=$5;a[s]="pre\t"$7"\t"$19}ARGIND==2{c=$5;if(a[c]){print $2"\t"$3"\t"$5"\tpro\t"$7"\t"$13"\t"a[c]}else{print $2"\t"$3"\t"$5"\tpro\t"$7"\t"$13"\tpre\t"$7"\t"$13}}' pre.txt prod.txt | awk -F '\t' -v OFS='\t' '{ printf "%s\tcpu: {{if eq .Env.FLOW_DIST_ENV \"pre\"}}\"%sm\"{{else}}\"%sm\"{{end}}\t memory: {{if eq .Env.FLOW_DIST_ENV \"pre\"}}\"%dMi\"{{else}}\"%dMi\"{{end}}\n", $0, $9, $6, $8, $5 }' &> all.txt
Part one
awk -F '\t' 'ARGIND==1{s=$5;a[s]="pre\t"$7"\t"$19}ARGIND==2{c=$5;if(a[c]){print $2"\t"$3"\t"$5"\tpro\t"$7"\t"$13"\t"a[c]}else{print $2"\t"$3"\t"$5"\tpro\t"$7"\t"$13"\tpre\t"$7"\t"$13}}' pre.txt prod.txt
- -F '\t' : set the input field separator to a tab.
- ARGIND==1 : while reading the first file, pre.txt:
  - s=$5 : store column 5 in variable s.
  - a[s]="pre\t"$7"\t"$19 : in array a, under key s, store "pre", column 7, and column 19, tab-separated.
- ARGIND==2 : while reading the second file, prod.txt:
  - c=$5 : store column 5 in variable c.
  - if(a[c]){print ...} : if key c exists in a, print $2"\t"$3"\t"$5"\tpro\t"$7"\t"$13"\t"a[c]
  - else{print ...} : if key c is absent from a, print $2"\t"$3"\t"$5"\tpro\t"$7"\t"$13"\tpre\t"$7"\t"$13
Part two
awk -F '\t' -v OFS='\t' '{ printf "%s\tcpu: {{if eq .Env.FLOW_DIST_ENV \"pre\"}}\"%sm\"{{else}}\"%sm\"{{end}}\t memory: {{if eq .Env.FLOW_DIST_ENV \"pre\"}}\"%dMi\"{{else}}\"%dMi\"{{end}}\n", $0, $9, $6, $8, $5 }' &> all.txt
- -F '\t' : input field separator is a tab.
- -v OFS='\t' : output field separator is a tab.
- { printf ... } : format each line with a template string:
  - %s : the complete current line, $0.
  - cpu: {{if eq .Env.FLOW_DIST_ENV \"pre\"}}\"%sm\"{{else}}\"%sm\"{{end}} : a template conditional, filled with columns 9 and 6.
  - memory: {{if eq .Env.FLOW_DIST_ENV \"pre\"}}\"%dMi\"{{else}}\"%dMi\"{{end}} : filled with columns 8 and 5.
- &> all.txt : redirect the output to all.txt, overwriting any existing content.
The second command
awk -F '\t' 'ARGIND==1{s=$5;a[s]="pre\t"$7"\t"$13}ARGIND==2{c=$5;if(!a[c]){print $2"\t"$3"\t"$5"\tpro\t"$7"\t"$19"\tpre\t"$7"\t"$19}}' prod.txt pre.txt | awk -F '\t' -v OFS='\t' '{ printf "%s\tcpu: {{if eq .Env.FLOW_DIST_ENV \"pre\"}}\"%sm\"{{else}}\"%sm\"{{end}}\t memory: {{if eq .Env.FLOW_DIST_ENV \"pre\"}}\"%dMi\"{{else}}\"%dMi\"{{end}}\n", $0, $9, $6, $8, $5 }' &>> all.txt
Part one
awk -F '\t' 'ARGIND==1{s=$5;a[s]="pre\t"$7"\t"$13}ARGIND==2{c=$5;if(!a[c]){print $2"\t"$3"\t"$5"\tpro\t"$7"\t"$19"\tpre\t"$7"\t"$19}}' prod.txt pre.txt
- -F '\t' : input field separator is a tab.
- ARGIND==1 : while reading the first file, prod.txt:
  - s=$5 : store column 5 in variable s.
  - a[s]="pre\t"$7"\t"$13 : in array a, under key s, store "pre", column 7, and column 13, tab-separated.
- ARGIND==2 : while reading the second file, pre.txt:
  - c=$5 : store column 5 in variable c.
  - if(!a[c]){print ...} : if key c is absent from a, print $2"\t"$3"\t"$5"\tpro\t"$7"\t"$19"\tpre\t"$7"\t"$19
Part two
awk -F '\t' -v OFS='\t' '{ printf "%s\tcpu: {{if eq .Env.FLOW_DIST_ENV \"pre\"}}\"%sm\"{{else}}\"%sm\"{{end}}\t memory: {{if eq .Env.FLOW_DIST_ENV \"pre\"}}\"%dMi\"{{else}}\"%dMi\"{{end}}\n", $0, $9, $6, $8, $5 }' &>> all.txt
- -F '\t' : input field separator is a tab.
- -v OFS='\t' : output field separator is a tab.
- { printf ... } : format each line with the same template string:
  - %s : the complete current line, $0.
  - cpu: {{if eq .Env.FLOW_DIST_ENV \"pre\"}}\"%sm\"{{else}}\"%sm\"{{end}} : template conditional, filled with columns 9 and 6.
  - memory: {{if eq .Env.FLOW_DIST_ENV \"pre\"}}\"%dMi\"{{else}}\"%dMi\"{{end}} : filled with columns 8 and 5.
- &>> all.txt : append the output to all.txt.
Summary
Both commands read data from two files, join and process it, and save the result to all.txt. In each one, the first awk stage matches and merges the data on a key column, and the second awk stage formats the result and redirects it into the file.
Motto
The host's qi is ever still, the guest's qi ever moving; the guest's qi flourishes first and then fades, while the host's qi starts faint and then grows strong.