Exit codes and kill-job signals

The exit code from a batch job is a standard Unix termination status, the same sort of number you get in a shell script from checking the "$?" variable after executing a command.
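
As a quick illustration (a sketch for a POSIX shell; nothing here is BaBar-specific), you can reproduce this by running a command that exits with a chosen value and then reading "$?":

```shell
# Run a child shell that deliberately exits with code 64, then capture "$?".
# "$?" holds the exit code of the most recent command, so it is saved at once;
# any later command would overwrite it.
# (The "||" form keeps the script going even if "set -e" is in effect.)
status=0
sh -c 'exit 64' || status=$?
echo "exit code: $status"   # prints "exit code: 64"
```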

Typically, exit code 0 (zero) means successful completion. Codes 1-127 are typically generated by your job itself calling exit() with a non-zero value to terminate itself and indicate an error. In BaBar we don't make very much use of this. The most common such value you might see is 64, which is the value used by Framework to say that its event loop is being stopped before all the requested data have been read, typically because time ran out. In recent BaBar releases you might also see 125, which we use as a code for a generic "severe error"; the job log should contain a message stating what the error was.

Exit codes in the range 129-255 represent jobs terminated by Unix "signals". Each type of signal has a number, and what's reported as the job exit code is the signal number plus 128. Signals can arise from within the process itself (as for SEGV, see below) or be sent to the process by some external agent (such as the batch control system, or your using the "bkill" command).

By way of example, then, exit code 64 means that the job deliberately terminated its execution by calling "exit(64)", exit code 137 means that the job received a signal 9, and exit code 140 represents signal 12.
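
The ranges described above can be sketched as a small shell function. The name classify_exit_code is made up for this illustration; it is not a BaBar tool, and it only applies the "signal number plus 128" rule without looking up signal names:

```shell
# classify_exit_code CODE: hypothetical helper, not a BaBar tool.
# Prints how a job ended, using the ranges described in the text.
classify_exit_code() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -ge 1 ] && [ "$code" -le 127 ]; then
        echo "exited with code $code"
    elif [ "$code" -ge 129 ] && [ "$code" -le 255 ]; then
        echo "terminated by signal $((code - 128))"
    else
        # Values such as 128 are not covered by the text.
        echo "unclassified code $code"
    fi
}

classify_exit_code 64    # prints "exited with code 64"
classify_exit_code 137   # prints "terminated by signal 9"
classify_exit_code 140   # prints "terminated by signal 12"
```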

The specific meaning of the signal numbers is platform-dependent. If you are trying to figure out a problem that was seen on Linux, you have to run the following commands on Linux. We don't have Solaris or Mac OS batch resources in BaBar at the moment, but if we did, you would have to match platforms similarly when debugging.

terminationDecoder

BaBar provides a little program that will take your exit code and spit out an explanation. The program is called terminationDecoder. Examples:

[yakut] terminationDecoder 137
terminated by signal 9 (Killed)

[yakut] terminationDecoder 64
exited with code 64 (in Framework: stop requested, e.g., by CpuCheck)

More details

You can also look this up yourself. If you know the signal number, you can find out which signal killed the job using the command "kill -l":

[yakut] kill -l

HUP INT QUIT ILL TRAP ABRT BUS FPE KILL USR1 SEGV USR2 PIPE ALRM TERM STKFLT
CHLD CONT STOP TSTP TTIN TTOU URG XCPU XFSZ VTALRM PROF WINCH POLL PWR SYS
RTMIN RTMIN+1 RTMIN+2 RTMIN+3 RTMAX-3 RTMAX-2 RTMAX-1 RTMAX

So for example, if your job was killed by signal 6, then it got an "ABRT", which is short for ABORT.
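
Rather than counting words in the list by hand, you can give "kill -l" a number and let the shell print the matching name. The sketch below assumes a POSIX shell; the function name signal_name_for_exit_code is invented for this example. As noted earlier, the answer is platform-dependent, so run it on the same platform as the job:

```shell
# signal_name_for_exit_code CODE: hypothetical helper, not a BaBar tool.
# Subtracts 128 from a job exit code and asks the shell for the signal name.
signal_name_for_exit_code() {
    if [ "$1" -lt 129 ] || [ "$1" -gt 255 ]; then
        echo "exit code $1 does not correspond to a signal" >&2
        return 1
    fi
    kill -l "$(( $1 - 128 ))"
}

signal_name_for_exit_code 134   # signal 6: ABRT on Linux
signal_name_for_exit_code 137   # signal 9: KILL
```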

To find out what all the "kill -l" words mean, you can use the command:

man 7 signal

(or, on Solaris, "man -s 3HEAD signal"). This will give you the man page for SIGNAL(7). Scroll down a bit and you will get a list of the kill-signal words with a short explanation. Here is a sample:

Signal     Value  Action  Comment
SIGHUP       1    Term    Hangup detected on controlling terminal
                          or death of controlling process
SIGINT       2    Term    Interrupt from keyboard
SIGQUIT      3    Core    Quit from keyboard
SIGILL       4    Core    Illegal Instruction
SIGABRT      6    Core    Abort signal from abort(3)
SIGFPE       8    Core    Floating point exception
SIGKILL      9    Term    Kill signal
SIGSEGV     11    Core    Invalid memory reference
SIGPIPE     13    Term    Broken pipe: write to pipe with no readers
SIGALRM     14    Term    Timer signal from alarm(2)
SIGTERM     15    Term    Termination signal

(Obviously, these are just the "kill -l" words, but with a "SIG" in front of them.)

You may also find it useful to look at the file signal.h. On a Linux machine, the location is:

/usr/include/asm/signal.h

Hypernews examples

Here are some specific exit codes that came up in Hypernews. Here I have recorded the HN responses. However, they might not be correct in all cases. (Maybe the exit codes can mean other things, too.)

Exit code 9: Ran out of CPU time.

Exit code 64: The framework ended the job nicely for you, most likely because the job was running out of CPU time. But it means you did not go through all the data requested. The solution is to submit the job to a queue with more resources (bigger CPU time limit).

Exit code 125: An ErrMsg(severe) was reached in your job.

Exit code 127: Something wrong with the machine?

Exit code 130: The job ran out of CPU or swap time. If swap time is the culprit, check for memory leaks.

Exit code 131: The job ran out of CPU or swap time. If swap time is the culprit, check for memory leaks.

Exit code 134: The job is killed with an abort signal, and you probably got core dumped. Often this is caused either by an assert() or an ErrMsg(fatal) being hit in your job. There may be a run-time bug in your code. Use a debugger like gdb or dbx to find out what's wrong.

Exit code 137: The job was killed because it exceeded the time limit.

Exit code 139: Segmentation violation.

Exit code 140: The job exceeded the "wall clock" time limit (as opposed to the CPU time limit).

HOWTO's guide to job-kill signals

The following is copied from HOWTO-Basic-Debugging, which you should definitely consult to learn how to interpret, report, and deal with errors and crashes:

SEGV
A segmentation violation or segmentation fault typically means that something is trying to access memory that it shouldn't be accessing. One common example of this is trying to access memory through a NULL pointer, for example:
sunprompt> cat main.c

#include <iostream>

int main()
{
  int* bunk(0);                     // a NULL pointer
  std::cout << *bunk << std::endl;  // dereferencing it causes the SEGV
  return 0;
}
sunprompt> CC main.c
sunprompt> ./a.out
Segmentation fault (core dumped)
ABRT
Failed asserts are one common source of the "abort" signal, for example:
sunprompt> cat main.c

#include <assert.h>

int main()
{
  int i=0;
  assert(i!=0);   // fails, so assert() calls abort()
  return 0;
}
sunprompt> CC main.c
sunprompt> ./a.out
Assertion failed: i!=0, file main.c, line 5
Abort (core dumped)
Note that the assertion that failed and its location are also printed. An ABRT can also be generated from the BaBar ErrMsg(fatal) construct, in which case your job log should contain a message explaining the error.
FPE
A "Floating Point Error" usually indicates a numerical problem such as a division by zero or an overflow. Whether a given operation actually raises the signal depends on the platform and compiler settings. One example would be:
osfprompt> cat main.c
int main()
{
  float a = 1.;
  float b = 0.;
  float c = a/b;   // division by zero
  return 0;
}
osfprompt> g++ main.c
osfprompt> ./a.out
Floating exception (core dumped)
ILL
If you receive a signal like this ("Illegal Instruction"), it means that, while running, your program tried to execute a machine "instruction" that does not exist. This can happen for a variety of reasons, including:
a memory overwrite that happens to overwrite part of the program stored in memory. This may result in the program trying, for example, to execute data as if it is a machine instruction.
an attempt to take an executable compiled on one platform for use on another, for example on an earlier version of the same chip.
a truncated or corrupted executable is loaded for execution
incomplete recompilation of source code, i.e. you changed one C++ class and didn't recompile all other code affected by that change.
BUS
A "Bus Error" may come, for example, from accessing unaligned data (e.g. trying to access a 4-byte integer through a pointer to the middle of it). What this means will vary from platform to platform. (I haven't come up with a good example of this one yet.)
A "Bus Error" can also often indicate a memory overwrite, e.g. somebody wrote a number where a pointer is kept. Often caused by going past the end of an array and into the system pointers at the start of the next memory block.
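
To connect these signals back to the exit codes discussed earlier, here is a small shell sketch (not part of the HOWTO; the helper name exit_code_after_signal is invented). It starts a child shell that sends the named signal to itself and then reports the exit code the parent sees. On Linux this should give 139 for SEGV and 134 for ABRT, matching the Hypernews list, but the numbers are platform-dependent:

```shell
# exit_code_after_signal NAME: hypothetical demo helper.
# Starts a child shell that sends signal NAME to itself, then prints the
# exit code that the parent shell observes (signal number plus 128).
exit_code_after_signal() {
    status=0
    # "ulimit -c 0" keeps the demo from leaving a core file behind.
    sh -c "ulimit -c 0 2>/dev/null; kill -s $1 \$\$" 2>/dev/null || status=$?
    echo "$status"
}

for sig in SEGV ABRT FPE ILL BUS; do
    echo "$sig -> exit code $(exit_code_after_signal "$sig")"
done
```

Your shell may also print a message such as "Segmentation fault" for each child it sees killed.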

How do you know if you've exceeded your CPU time?

To find out whether your job has exceeded the CPU time limit, you have to do 3 things:

Look at your log file to get the job's CPU time.
Use the machine-dependent CPUF to convert the CPU time to SLAC time. The formula is: SLAC time = CPU time * CPUF.
Compare this to the time allowed by the queue in which the job was run.
Here is an example.

First, look at the end of your log file:

Job was submitted from host by user .
Job was executed on host(s) , in queue , as user .
</u/br/penguin> was used as the home directory.
</u/br/penguin/vubrecoil/vub30/workdir> was used as the working directory.
Started at Wed Feb 8 17:25:33 2006
Results reported at Wed Feb 8 19:27:28 2006

Your job looked like:


LSBATCH: User input

VubRecoilUserApp VubXlnu.tcl SP-1237-BSemiExcl-Run5-R18b-1 MC

Exited with exit code 134.

Resource usage summary:

CPU time   :   7058.71 sec.
Max Memory :      2863 MB
Max Swap   :      2968 MB

Max Processes  :         3
Max Threads    :         3

The job was run on the machine cob0313.

bhosts -l cob0313
This tells you (among other things) that the CPUF for cob0313 is 7.65.

The SLAC time for your job is thus:

SLAC time = (CPU time) * CPUF = (7058.71 sec) * 7.65 = 53999.1 sec = 900 min
The next step is to find out if this exceeds the CPU limit of the queue in which the job was run. In this example, the job was run in the xlong queue:

bqueues -l xlong
Among other things, this tells you the CPU limit for the queue:

CPULIMIT
2900.0 min of slac
The job used only 900 minutes of SLAC time, less than the 2900 allowed by the xlong queue. So the job did not exceed its CPU time limit. It must have crashed for some other reason.
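
The same arithmetic can be scripted. The sketch below uses awk for the floating-point multiplication; the function name slac_minutes is invented here, and the CPUF and queue limit are typed in by hand from the bhosts and bqueues output rather than parsed automatically:

```shell
# slac_minutes CPU_SECONDS CPUF: hypothetical helper.
# Prints the SLAC time in minutes, rounded to the nearest minute,
# using SLAC time = CPU time * CPUF.
slac_minutes() {
    awk -v cpu="$1" -v cpuf="$2" 'BEGIN { printf "%.0f\n", cpu * cpuf / 60 }'
}

used=$(slac_minutes 7058.71 7.65)   # values from the example: gives 900
limit=2900                          # CPULIMIT of the xlong queue, in minutes

if [ "$used" -gt "$limit" ]; then
    echo "used $used of $limit min: CPU limit exceeded"
else
    echo "used $used of $limit min: within the CPU limit"
fi
```

For the example job this prints "used 900 of 2900 min: within the CPU limit".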

Memory Leaks

Jobs can also crash because of memory leaks and related memory errors, such as dangling pointers or array overruns. The following links may be helpful for tracking down memory leaks:

Memory leaks webpage
Valgrind at BABAR

Author: Sheila Mclachlin
Created: Feb 09, 2006.
Last updated: Feb 13, 2006 by Gregory Dubois-Felsman
