[farmshare-discuss] Barley AFS dies in a job with large task array
louyang at stanford.edu
Tue Jan 10 10:11:54 PST 2012
Yes, my job errored out. According to the error log, all but three of the
errors were due to AFS on barley18.
My job array actually had closer to 40,000 tasks and I successfully got
through 15,000, which means that the remaining 25,000 were all assigned to
barley18. It makes sense that this would happen -- a node that quickly
halts with an error is going to have plenty of free resources.
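One way to keep a node in this state from draining the rest of an array is for each task to check its storage first and, if the check fails, wait before exiting. The following is a minimal Python sketch, assuming the failure surfaces as an OSError (as the "Connection timed out" on cd suggests). The function names and the 60-second delay are hypothetical, and an AFS access that hangs instead of failing would need a separate timeout.

```python
import os
import sys
import time


def unreachable_paths(paths):
    """Return the paths that cannot be listed (e.g. a timed-out AFS directory)."""
    failed = []
    for path in paths:
        try:
            os.listdir(path)
        except OSError:
            failed.append(path)
    return failed


def preflight(paths, delay_seconds=60):
    """Return 0 if every path is reachable; otherwise wait, then return 1.

    The wait is an assumption: it only slows how quickly a broken node
    can pick up and fail further tasks.
    """
    failed = unreachable_paths(paths)
    if not failed:
        return 0
    print("unreachable on this node: " + ", ".join(failed), file=sys.stderr)
    time.sleep(delay_seconds)
    return 1


# Hypothetical use at the top of a task script:
# exit_code = preflight([os.path.expanduser("~")])
```

The delay does not fix the node; it only limits how many tasks it can consume before someone notices.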
On Tue, Jan 10, 2012 at 10:03 AM, Alex Chekholko <chekh at stanford.edu> wrote:
> Hi Long,
> Yesterday, there was some AFS server congestion that affected barley18
> specifically, and caused AFS access to time out on that machine.
> It may be easiest for you to just copy your data and software to
> /mnt/glusterfs so you don't have the dependency on AFS for those jobs (and
> explicitly specify your job input/output files).
> I don't see any jobs under your name right now, did your job error out?
> On 01/10/2012 09:48 AM, Long Ouyang wrote:
>> Hi everyone,
>> I'm submitting a job with ~50,000 tasks to Barley but 15,000 tasks
>> through, it looks like AFS is breaking - even trying to cd to a folder
>> in my home directory gives a "Connection timed out" error. Is there
>> something I can do to fix this? Would just running aklog and kinit in my
>> script fix things?
>> farmshare-discuss mailing list
>> farmshare-discuss at lists.stanford.edu
> Alex Chekholko chekh at stanford.edu 347-401-4860
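The suggestion above, copying data to /mnt/glusterfs and naming input and output locations explicitly, could be sketched roughly as follows. This is an illustration only: the function name, the "input"/"output" layout, and the per-job directory are assumptions, not an established FarmShare convention.

```python
import os
import shutil


def stage_inputs(src_dir, scratch_root, job_name):
    """Copy src_dir to scratch_root/job_name/input and create an output directory.

    Returns (input_dir, output_dir) so the job can be given explicit paths
    instead of relying on the home directory.
    """
    job_dir = os.path.join(scratch_root, job_name)
    input_dir = os.path.join(job_dir, "input")
    output_dir = os.path.join(job_dir, "output")
    # copytree creates missing parent directories and raises an error
    # if input_dir already exists, so an earlier copy is never overwritten.
    shutil.copytree(src_dir, input_dir)
    os.makedirs(output_dir, exist_ok=True)
    return input_dir, output_dir


# Hypothetical use, run once before submitting the array:
# stage_inputs(os.path.expanduser("~/myjob"), "/mnt/glusterfs", "myjob")
```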