[farmshare-discuss] trying to submit job, but it always goes to error state

André Filgueiras de Araujo afaraujo at stanford.edu
Mon Jul 9 16:57:06 PDT 2012


Hello,

This is my first time trying to use barley. My submitted jobs always go
into the error state and are never run.
I tried doing "qmod -cj JOBID" to remove the error, but I get:

"senpai1:~> qmod -cj 286885
afaraujo at senpai1.stanford.edu cleared error state of job 286885"

and then the job immediately goes back into the error state. (I've pasted
the qstat output below.)

I'm submitting a job from my home directory, by typing:

qsub -cwd precomputespm2012d_densesift.script

The script "precomputespm2012d_densesift.script" is very simple. It
consists only of:

"#!bin/bash

/afs/ir.stanford.edu/users/a/f/afaraujo/trecvid2012/code/svm/precompute-kerneltype
2 -numberthreads 24 16384 400289
/mnt/glusterfs/afaraujo/densesiftspm_2012d.bin 400289
/mnt/glusterfs/afaraujo/densesiftspm_2012d.bin"

That's it.

I thought this should work. When I check the status of the hosts with
"qhost -q", many precise.q queues appear not to be fully used, if I
understand correctly. So I don't know why it's not running.

Your help would be VERY much appreciated.

Here's what I get from "qstat -f -j JOBID":
"senpai1:~> qstat -f -j 286885
==============================================================
job_number:                 286885
exec_file:                  job_scripts/286885
submission_time:            Mon Jul  9 16:08:31 2012
owner:                      afaraujo
uid:                        30844
group:                      operator
gid:                        37
sge_o_home:                 /afs/ir/users/a/f/afaraujo
sge_o_log_name:             afaraujo
sge_o_path:
/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/sbin:/sbin:/usr/games:/usr/sweet/bin:/usr/pubsw/bin:/usr/pubsw/X/bin:/afs/ir/users/a/f/afaraujo/bin:/afs/ir/users/a/f/afaraujo
sge_o_shell:                /bin/tcsh
sge_o_workdir:              /afs/ir.stanford.edu/users/a/f/afaraujo
sge_o_host:                 senpai1.stanford.edu
account:                    sge
cwd:                        /afs/ir/users/a/f/afaraujo
mail_list:                  afaraujo at senpai1.stanford.edu
notify:                     FALSE
job_name:                   precomputespm2012d_densesift.script
jobshare:                   0
env_list:                   KRB5CCNAME=FILE:/tmp/krb5cc_30844
script_file:                precomputespm2012d_densesift.script
error reason    1:          07/09/2012 16:08:46 [30844:17844]:
execvp(/var/spool/gridengine/execd/barley20/job_scripts/286885, "
scheduling info:            queue instance "precise.q@barley04.stanford.edu" dropped because it is overloaded: np_load_avg=2.130000 (= 2.130000 + 0.50 * 0.000000 with nproc=24) >= 1.75
                            queue instance "precise.q@barley05.stanford.edu" dropped because it is overloaded: np_load_avg=1.790833 (= 1.590833 + 0.50 * 9.600000 with nproc=24) >= 1.75
                            queue instance "precise-long.q@barley04.stanford.edu" dropped because it is overloaded: np_load_avg=2.130000 (= 2.130000 + 0.50 * 0.000000 with nproc=24) >= 1.75
                            queue instance "precise-long.q@barley05.stanford.edu" dropped because it is overloaded: np_load_avg=1.790833 (= 1.590833 + 0.50 * 9.600000 with nproc=24) >= 1.75
                            queue instance "precise-long.q@barley02.stanford.edu" dropped because it is full
                            queue instance "precise-long.q@barley11.stanford.edu" dropped because it is full
                            queue instance "precise-long.q@barley12.stanford.edu" dropped because it is full
                            queue instance "precise-long.q@barley09.stanford.edu" dropped because it is full
                            queue instance "precise-long.q@barley10.stanford.edu" dropped because it is full
                            queue instance "precise-long.q@barley07.stanford.edu" dropped because it is full
                            queue instance "precise-long.q@barley03.stanford.edu" dropped because it is full
                            queue instance "precise-long.q@barley06.stanford.edu" dropped because it is full
                            queue instance "precise-long.q@barley01.stanford.edu" dropped because it is full
                            queue instance "precise-long.q@barley08.stanford.edu" dropped because it is full
                            Job is in error state
"

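As a sanity check on the "overloaded" lines in the scheduling info above, the
figures in parentheses seem to combine as the base load plus 0.50 times an
adjustment divided by nproc. That reading is inferred only from the numbers
printed in this output, not from the Grid Engine documentation, and the
function name below is hypothetical. A minimal Python sketch:

```python
# Assumption: the parenthesized terms combine as base + factor * adjustment / nproc.
# This is inferred from the qstat output in this message, not from SGE docs.
def adjusted_np_load(base, factor, adjustment, nproc):
    """Return the per-processor load after the scheduler's adjustment (hypothetical helper)."""
    return base + factor * adjustment / nproc

THRESHOLD = 1.75  # the ">= 1.75" limit shown in the scheduling info

# Values copied from the qstat output for the two overloaded hosts.
barley04 = adjusted_np_load(2.130000, 0.50, 0.000000, 24)
barley05 = adjusted_np_load(1.590833, 0.50, 9.600000, 24)

for host, load in (("barley04", barley04), ("barley05", barley05)):
    print(f"{host}: np_load_avg={load:.6f} overloaded={load >= THRESHOLD}")
```

Under that assumption, both computed values agree with the np_load_avg figures
reported by qstat and sit above the 1.75 limit.
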
Thanks

Andre Filgueiras de Araujo