WU's failing with EXIT_TIME_LIMIT_EXCEEDED

Ruud van der Kroef

Joined: 16 Sep 21
Posts: 3
Credit: 1,705,584
RAC: 0
Message 2364 - Posted: 15 Feb 2023, 12:09:40 UTC
Last modified: 15 Feb 2023, 12:10:39 UTC

I have this system: https://escatter11.fullerton.edu/nfs/show_host_detail.php?hostid=7189872.
It is a VM using 18 out of 24 CPUs.
I have found that WUs of the application "16e Lattice Sieve for smaller numbers" fail with EXIT_TIME_LIMIT_EXCEEDED when CPU time goes over approximately 4,600 seconds; see this list:
https://escatter11.fullerton.edu/nfs/results.php?hostid=7189872&offset=0&show_names=0&state=6&appid=11

Does anybody else see this problem?
Ruud van der Kroef

Joined: 16 Sep 21
Posts: 3
Credit: 1,705,584
RAC: 0
Message 2365 - Posted: 16 Feb 2023, 8:31:53 UTC - in response to Message 2364.  

The count is now 99 failed WUs with the symptoms described above.
Gigacruncher [TSBTs Pirate]
Volunteer moderator

Joined: 26 Sep 09
Posts: 218
Credit: 22,841,893
RAC: 2
Message 2366 - Posted: 16 Feb 2023, 12:59:46 UTC

Hi.

Go to https://escatter11.fullerton.edu/nfs/prefs.php?subset=global and, in the memory section, set all values to 95%.
Ruud van der Kroef

Joined: 16 Sep 21
Posts: 3
Credit: 1,705,584
RAC: 0
Message 2367 - Posted: 17 Feb 2023, 10:48:31 UTC - in response to Message 2366.  

Thank you for your reply; seems OK now.
Gigacruncher [TSBTs Pirate]
Volunteer moderator

Joined: 26 Sep 09
Posts: 218
Credit: 22,841,893
RAC: 2
Message 2368 - Posted: 17 Feb 2023, 11:13:07 UTC

Please let me know if the problem persists; if it does, I will have to get the admin, Greg, involved.
joeybuddy96

Joined: 22 Feb 23
Posts: 2
Credit: 150,998
RAC: 0
Message 2369 - Posted: 23 Feb 2023, 2:27:16 UTC
Last modified: 23 Feb 2023, 3:06:43 UTC

I was getting a lot of these with 15e Lattice Sieve v1.08 (notphenomiix6) windows_x86_64, with exit code 197. I had CPU and RAM usage set to 100%. I lowered the RAM, page/swap, CPU usage, and CPU time settings to 96%. Later, after tasks started completing with "no new tasks" set in BOINC Manager, usage dropped further, but I was still seeing compute errors.

I suspect the page/swap file more than total RAM usage. I have enough tasks running to hit 100% CPU usage, but enough RAM that it only reaches 53% of capacity. The swap file on my system is set to 8 GB, and the peak swap size mentioned in the errors is about 1 GB.

From what I can tell, it happens with every application except 16e Lattice Sieve V5. I'm new to the project, but I wonder whether having my preferences set to "only get work from 16e Lattice Sieve V5" has anything to do with it. I changed that setting after the units had already downloaded and started running.

The BOINC log has several entries like this one: "2023-02-22 9:36:36 PM | NFS@Home | Aborting task 7m3_317_129760_0: exceeded elapsed time limit 4315.13 (86400.00G/20.02G)". Once the number of NFS tasks dropped below 100% usage (I'm not sure of the exact percentage at that point), a few of the "15e smaller number" tasks started to complete successfully, but not many. The run time is consistently about 4,300 seconds, and it is consistently longer than the CPU time, which averages about 3,700 seconds.

https://escatter11.fullerton.edu/nfs/result.php?resultid=306760861
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759018
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759259
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759322
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759353
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759363
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759371

15e for smaller numbers:
https://escatter11.fullerton.edu/nfs/result.php?resultid=306757866

16e for smaller numbers:
https://escatter11.fullerton.edu/nfs/result.php?resultid=306749053
joeybuddy96

Joined: 22 Feb 23
Posts: 2
Credit: 150,998
RAC: 0
Message 2370 - Posted: 23 Feb 2023, 4:37:16 UTC - in response to Message 2369.  

The Gridcoin server had a very experienced user, Keith, who said, "@joeybuddy96 The NFS admins have misconfigured the work unit generator script with too low a rsc_fpops_est value. Have them increase it by 10-100X."
Greg
Project administrator

Joined: 26 Jun 08
Posts: 645
Credit: 473,009,118
RAC: 261,026
Message 2371 - Posted: 26 Feb 2023, 2:28:06 UTC - in response to Message 2370.  

Certainly possible. We are stretching these further than I originally intended when I set those values. I have increased rsc_fpops_est by a factor of 10 for newly generated WUs.
STE\/E

Joined: 9 Oct 09
Posts: 30
Credit: 26,809,482
RAC: 0
Message 2372 - Posted: 27 Feb 2023, 4:12:57 UTC
Last modified: 27 Feb 2023, 4:13:21 UTC

I'm getting mostly time-limit computation errors on this host:

AWareM17R5AMD1

45651 NFS@Home 2/26/2023 11:11:38 PM Aborting task C203_147_104_315687_1: exceeded elapsed time limit 369.72 (86400.00G/233.69G)
45652 NFS@Home 2/26/2023 11:11:39 PM Computation for task C203_147_104_315687_1 finished
45653 NFS@Home 2/26/2023 11:11:39 PM Starting task C203_147_104_315964_1
45654 NFS@Home 2/26/2023 11:11:41 PM Started upload of C203_147_104_315687_1_r774756475_0
45655 NFS@Home 2/26/2023 11:11:45 PM Finished upload of C203_147_104_315687_1_r774756475_0
45656 NFS@Home 2/26/2023 11:11:52 PM Aborting task C203_147_104_315795_1: exceeded elapsed time limit 369.72 (86400.00G/233.69G)
45657 NFS@Home 2/26/2023 11:11:53 PM Computation for task C203_147_104_315795_1 finished
45658 NFS@Home 2/26/2023 11:11:53 PM Starting task C203_147_104_318535_0
45659 NFS@Home 2/26/2023 11:11:55 PM Started upload of C203_147_104_315795_1_r331725269_0
45660 NFS@Home 2/26/2023 11:11:58 PM Finished upload of C203_147_104_315795_1_r331725269_0
45661 NFS@Home 2/26/2023 11:11:59 PM Aborting task C203_147_104_318404_0: exceeded elapsed time limit 369.72 (86400.00G/233.69G)
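
The two numbers in parentheses in these abort messages look like a bound and a speed estimate whose quotient is the reported limit: 86400.00G divided by 233.69G is about 369.72 seconds. The following is a minimal Python sketch of that check, assuming the message format is exactly as pasted above. The helper name `parse_abort` and the interpretation of the two values as "bound" and "flops" are assumptions for illustration, not something confirmed by the project.

```python
import re

# Pattern for the abort message as it appears in the log excerpt above.
# The group names "bound" and "flops" are an assumed interpretation.
ABORT_RE = re.compile(
    r"Aborting task (?P<task>\S+): exceeded elapsed time limit "
    r"(?P<limit>[\d.]+) \((?P<bound>[\d.]+)G/(?P<flops>[\d.]+)G\)"
)


def parse_abort(line):
    """Return (task, limit_s, bound_g, flops_g), or None if no match."""
    m = ABORT_RE.search(line)
    if m is None:
        return None
    return (m["task"], float(m["limit"]), float(m["bound"]), float(m["flops"]))


line = (
    "45651 NFS@Home 2/26/2023 11:11:38 PM Aborting task "
    "C203_147_104_315687_1: exceeded elapsed time limit 369.72 "
    "(86400.00G/233.69G)"
)

task, limit, bound, flops = parse_abort(line)
ratio = bound / flops  # quotient of the two values in parentheses
print(f"{task}: reported limit {limit} s, bound/flops = {ratio:.2f} s")
```

If the quotient matches the reported limit, as it does for this line, then the tasks are being cut off by a limit derived from those two values rather than by memory or swap pressure.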
Gigacruncher [TSBTs Pirate]
Volunteer moderator

Joined: 26 Sep 09
Posts: 218
Credit: 22,841,893
RAC: 2
Message 2374 - Posted: 28 Feb 2023, 7:40:25 UTC

Now I'm getting the same issue. It happens with all WUs whose estimated completion time is higher than before; in my case the estimate changed from 28 minutes to 2 hours, so my queue has a mix of the two. Do we need to rerun the computer benchmark?
Gibson Praise

Joined: 14 Sep 16
Posts: 3
Credit: 52,095,424
RAC: 37,649
Message 2375 - Posted: 28 Feb 2023, 11:35:35 UTC

I'll add my two cents. For me it is only happening on 16es units.
Gigacruncher [TSBTs Pirate]
Volunteer moderator

Joined: 26 Sep 09
Posts: 218
Credit: 22,841,893
RAC: 2
Message 2376 - Posted: 28 Feb 2023, 12:47:28 UTC

If you have spare memory per thread, I recommend allowing only 16e Lattice Sieve V5 tasks until Greg sorts this out on his end for 16e Lattice Sieve for smaller numbers (aka lasievef_small).

Under preferences:

lasieved - very small numbers, uses less than 0.5 GB memory, work may be infrequently available: no
lasievee_small - small numbers, uses up to 0.8 GB memory: no
lasievee - medium numbers, uses up to 1 GB memory: no
lasievef_small - large numbers, uses up to 1 GB memory: no
lasieve5f - huge numbers, uses up to 1.25 GB memory: yes
Greg
Project administrator

Joined: 26 Jun 08
Posts: 645
Credit: 473,009,118
RAC: 261,026
Message 2377 - Posted: 1 Mar 2023, 0:06:01 UTC - in response to Message 2376.  

I'm a bit confused why increasing rsc_fpops_est leads to a shorter time limit. But I've changed it to slightly above the previous value and added a much larger rsc_fpops_bound. I hope that helps!
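
One possible explanation, offered as a sketch rather than a confirmed diagnosis: the abort messages in this thread show the limit as a fixed 86400.00G divided by a per-host speed figure. If that speed figure is itself derived from rsc_fpops_est divided by observed runtimes, then raising rsc_fpops_est while leaving the bound unchanged inflates the speed figure and shrinks the limit. The Python below only illustrates that arithmetic. The function names, the 1800-second runtime, and the rsc_fpops_est values are hypothetical, chosen to give round numbers.

```python
def flops_estimate(fpops_est, typical_runtime_s):
    # ASSUMPTION: the speed figure tracks fpops_est / observed runtime.
    return fpops_est / typical_runtime_s


def elapsed_limit(fpops_bound, flops):
    # Matches the shape of the log message: limit = bound / flops.
    return fpops_bound / flops


BOUND = 86400e9      # the "86400.00G" seen in the abort messages
RUNTIME_S = 1800.0   # hypothetical typical task runtime
EST_OLD = 3.6e13     # hypothetical original rsc_fpops_est
EST_NEW = 10 * EST_OLD  # the factor-of-10 increase

old_limit = elapsed_limit(BOUND, flops_estimate(EST_OLD, RUNTIME_S))
new_limit = elapsed_limit(BOUND, flops_estimate(EST_NEW, RUNTIME_S))

# Raising the bound along with the estimate restores the headroom.
fixed_limit = elapsed_limit(100 * BOUND, flops_estimate(EST_NEW, RUNTIME_S))

print(f"old limit:   {old_limit:.0f} s")
print(f"new limit:   {new_limit:.0f} s (below the {RUNTIME_S:.0f} s runtime)")
print(f"fixed limit: {fixed_limit:.0f} s")
```

Under these assumptions, a 10x larger estimate with an unchanged bound gives a 10x shorter limit, which would be consistent with adding a much larger rsc_fpops_bound as the fix.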
Jesse Viviano

Joined: 1 Jan 15
Posts: 18
Credit: 10,902,664
RAC: 0
Message 2378 - Posted: 1 Mar 2023, 19:04:52 UTC

I noticed a pattern on all of the work units that error out on my computer with TIME_LIMIT_EXCEEDED errors. Their work unit names all start with "C203_147_104_". See https://escatter11.fullerton.edu/nfs/results.php?hostid=7191920&offset=0&show_names=1&state=6&appid= to see that all of these bad tasks are for "C203_147_104". Is this the case with everyone who has this error? Maybe that run needs to be cancelled, reconfigured, and sent out again after its parameters are fixed.
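
For anyone who wants to check their own failed tasks for the same pattern, here is a small Python sketch that groups task names by run prefix. It assumes the last two underscore-separated fields of a task name are a per-task number and a replica index, which fits the names quoted in this thread but is not documented here. The helper name `batch_prefix` is hypothetical.

```python
from collections import Counter


def batch_prefix(task_name):
    # ASSUMPTION: the last two "_" fields are a task number and a replica
    # index, so everything before them identifies the run.
    return task_name.rsplit("_", 2)[0]


# Task names copied from the abort messages quoted in this thread.
failed = [
    "C203_147_104_315687_1",
    "C203_147_104_315795_1",
    "C203_147_104_318404_0",
    "7m3_317_129760_0",
]

counts = Counter(batch_prefix(name) for name in failed)
for prefix, n in counts.most_common():
    print(prefix, n)
```

Pasting the names from a host's error list into `failed` shows at a glance whether one run dominates.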
Gibson Praise

Joined: 14 Sep 16
Posts: 3
Credit: 52,095,424
RAC: 37,649
Message 2379 - Posted: 1 Mar 2023, 22:37:09 UTC

I have been running a few of the lasievef_small tasks since Greg posted. Most (9) have succeeded; one has failed with the time-exceeded error. Strangely, the failure occurred at 29 1/2 minutes, while most of the successful units ran longer than that, some considerably longer. And yes, all were "C203_147_104_".

Hope this helps
g
Gibson Praise

Joined: 14 Sep 16
Posts: 3
Credit: 52,095,424
RAC: 37,649
Message 2380 - Posted: 3 Mar 2023, 13:22:40 UTC
Last modified: 3 Mar 2023, 13:23:22 UTC

The problem continues. I ran 137 units: 12 failed and 125 succeeded, so roughly a ten percent error rate.
Every error was an exceeded time limit.

Messages:
exceeded elapsed time limit 2010.19 (86400.00G/42.98G) - This was the exact message on the last 4 or 5. Earlier messages had varied time limits, such as:

exceeded elapsed time limit 1164.97 (86400.00G/74.17G)
exceeded elapsed time limit 2043.35 (86400.00G/43.00G)

I've stopped running lasievef_small while this gets sorted.

Cheers
Gigacruncher [TSBTs Pirate]
Volunteer moderator

Joined: 26 Sep 09
Posts: 218
Credit: 22,841,893
RAC: 2
Message 2381 - Posted: 4 Mar 2023, 7:57:48 UTC

Back to normal here; I am just aborting the tasks with the highest estimated time. I'm back to 100% on 16e small. The background feeders need a push; there are lots of integers to factor.
