WU's failing with EXIT_TIME_LIMIT_EXCEEDED
Author | Message |
---|---|
Send message Joined: 16 Sep 21 Posts: 3 Credit: 1,705,584 RAC: 0 |
I have this system: https://escatter11.fullerton.edu/nfs/show_host_detail.php?hostid=7189872. It is a VM using 18 out of 24 CPUs. I have found that WUs of the application 16e Lattice Sieve for smaller numbers fail with EXIT_TIME_LIMIT_EXCEEDED when CPU time goes over approximately 4,600 seconds; see this list: https://escatter11.fullerton.edu/nfs/results.php?hostid=7189872&offset=0&show_names=0&state=6&appid=11 Does anybody else see this problem? |
Send message Joined: 16 Sep 21 Posts: 3 Credit: 1,705,584 RAC: 0 |
The count is now 99 failing WUs with the above symptoms. |
Send message Joined: 26 Sep 09 Posts: 218 Credit: 22,841,893 RAC: 2 |
Hi. Go to https://escatter11.fullerton.edu/nfs/prefs.php?subset=global and, in the memory section, set all values to 95%. |
Send message Joined: 16 Sep 21 Posts: 3 Credit: 1,705,584 RAC: 0 |
Thank you for your reply; seems OK now. |
Send message Joined: 26 Sep 09 Posts: 218 Credit: 22,841,893 RAC: 2 |
Please let me know if the problem persists; if it does, we will have to get the admin, Greg, involved. |
Send message Joined: 22 Feb 23 Posts: 2 Credit: 150,998 RAC: 0 |
I was getting a lot of these with 15e Lattice Sieve v1.08 (notphenomiix6) windows_x86_64, with code 197. I had CPU and RAM usage set to 100%. I lowered RAM and page swap and CPU usage and time down to 96%, though later after tasks started completing with "no new tasks" set in BOINC manager, the usage went lower and I was seeing compute errors anyway. I'd suspect that it's more from the page/swap file than total RAM usage. I've got enough tasks going to hit 100% usage on my CPU, but enough RAM that it's only hitting 53% capacity. The swap file on my system is set to 8GB. The peak swap size mentioned in the errors is about 1GB. It's doing it for every application but 16e Lattice Sieve V5, from what I can tell. I'm new to the project, but I wonder if it's possible that having my preferences set to "only get work from 16e Lattice Sieve V5" has anything to do with it. I'd changed that setting after the units had already downloaded and started running. The BOINC log has several entries like this one with "2023-02-22 9:36:36 PM | NFS@Home | Aborting task 7m3_317_129760_0: exceeded elapsed time limit 4315.13 (86400.00G/20.02G)". Once the number of NFS tasks died down from 100% usage (and I'm not sure the exact usage percentage when this event happened), a few of the "15e smaller number" tasks started to complete successfully, but not many. The run time is consistently 4300 seconds, and is consistently longer than the CPU time which averages about 3700 seconds. 
https://escatter11.fullerton.edu/nfs/result.php?resultid=306760861
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759018
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759259
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759322
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759353
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759363
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759371
15e for smaller numbers: https://escatter11.fullerton.edu/nfs/result.php?resultid=306757866
16e for smaller numbers: https://escatter11.fullerton.edu/nfs/result.php?resultid=306749053 |
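For anyone comparing their own logs: in the abort message quoted above, the reported limit is close to the first figure in parentheses divided by the second. That fits the reading that the first figure is the work unit's rsc_fpops_bound and the second is the client's speed estimate for the application, both in units of 1e9. The Python sketch below works under that assumption; the function name `parse_abort` and the dictionary keys are hypothetical, not part of BOINC.

```python
import re

# Matches the abort message format quoted in this thread.
ABORT_RE = re.compile(
    r"Aborting task (?P<task>\S+): exceeded elapsed time limit "
    r"(?P<limit>[\d.]+) \((?P<bound>[\d.]+)G/(?P<speed>[\d.]+)G\)"
)

def parse_abort(line):
    """Return the fields of an abort message, or None if the line does not match."""
    m = ABORT_RE.search(line)
    if m is None:
        return None
    bound = float(m["bound"])   # assumed: rsc_fpops_bound / 1e9
    speed = float(m["speed"])   # assumed: estimated speed in GFLOPS
    return {
        "task": m["task"],
        "limit_s": float(m["limit"]),
        "bound_gflop": bound,
        "speed_gflops": speed,
        "ratio_s": bound / speed,  # should be close to limit_s if the assumption holds
    }

line = ("2023-02-22 9:36:36 PM | NFS@Home | Aborting task 7m3_317_129760_0: "
        "exceeded elapsed time limit 4315.13 (86400.00G/20.02G)")
info = parse_abort(line)
print(info)
```

The small gap between `ratio_s` and `limit_s` is expected here, because the log prints the speed rounded to two decimals.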
Send message Joined: 22 Feb 23 Posts: 2 Credit: 150,998 RAC: 0 |
The Gridcoin server had a very experienced user, Keith, who said, "@joeybuddy96 The NFS admins have misconfigured the work unit generator script with too low a rsc_fpops_est value. Have them increase it by 10-100X." |
Send message Joined: 26 Jun 08 Posts: 645 Credit: 473,009,118 RAC: 261,026 |
Certainly possible. We are stretching these further than I originally intended when I set those values. I have increased it by a factor of 10 for newly generated wus. |
Send message Joined: 9 Oct 09 Posts: 30 Credit: 26,809,482 RAC: 0 |
I'm getting mostly time limit computation errors ...
AWareM17R5AMD1
45651 NFS@Home 2/26/2023 11:11:38 PM Aborting task C203_147_104_315687_1: exceeded elapsed time limit 369.72 (86400.00G/233.69G)
45652 NFS@Home 2/26/2023 11:11:39 PM Computation for task C203_147_104_315687_1 finished
45653 NFS@Home 2/26/2023 11:11:39 PM Starting task C203_147_104_315964_1
45654 NFS@Home 2/26/2023 11:11:41 PM Started upload of C203_147_104_315687_1_r774756475_0
45655 NFS@Home 2/26/2023 11:11:45 PM Finished upload of C203_147_104_315687_1_r774756475_0
45656 NFS@Home 2/26/2023 11:11:52 PM Aborting task C203_147_104_315795_1: exceeded elapsed time limit 369.72 (86400.00G/233.69G)
45657 NFS@Home 2/26/2023 11:11:53 PM Computation for task C203_147_104_315795_1 finished
45658 NFS@Home 2/26/2023 11:11:53 PM Starting task C203_147_104_318535_0
45659 NFS@Home 2/26/2023 11:11:55 PM Started upload of C203_147_104_315795_1_r331725269_0
45660 NFS@Home 2/26/2023 11:11:58 PM Finished upload of C203_147_104_315795_1_r331725269_0
45661 NFS@Home 2/26/2023 11:11:59 PM Aborting task C203_147_104_318404_0: exceeded elapsed time limit 369.72 (86400.00G/233.69G) |
Send message Joined: 26 Sep 09 Posts: 218 Credit: 22,841,893 RAC: 2 |
Now I'm getting the same issue. It happens with all WUs whose estimated completion time is higher than before; in my case the estimate changed from 28 minutes to 2 hours, so my queue has a mix of the two. Do we need to rerun the computer benchmark? |
Send message Joined: 14 Sep 16 Posts: 3 Credit: 52,098,324 RAC: 37,651 |
I'll add my two cents. For me it is only happening on 16es units. |
Send message Joined: 26 Sep 09 Posts: 218 Credit: 22,841,893 RAC: 2 |
If you have spare memory per thread/core, I recommend allowing only 16e Lattice Sieve V5 tasks until Greg sorts out the 16e Lattice Sieve for smaller numbers (aka lasievef_small) on his end. Under preferences:
lasieved - very small numbers, uses less than 0.5 GB memory, work may be infrequently available: no
lasievee_small - small numbers, uses up to 0.8 GB memory: no
lasievee - medium numbers, uses up to 1 GB memory: no
lasievef_small - large numbers, uses up to 1 GB memory: no
lasieve5f - huge numbers, uses up to 1.25 GB memory: yes |
Send message Joined: 26 Jun 08 Posts: 645 Credit: 473,009,118 RAC: 261,026 |
I'm a bit confused why increasing rsc_fpops_est leads to a shorter time limit. But I've changed it to slightly above the previous value and added a much larger rsc_fpops_bound. I hope that helps! |
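One observation from the log lines posted in this thread: the first figure in parentheses stayed at 86400.00G while the second rose from 20.02G to 233.69G, and the reported limit fell from about 4315 s to about 370 s. The Python sketch below assumes the limit is simply rsc_fpops_bound divided by the client's speed estimate. Why the speed estimate rose is not established in this thread, and both function names are hypothetical.

```python
def time_limit_s(bound_gflop, speed_gflops):
    """Elapsed-time limit in seconds, assuming limit = bound / estimated speed."""
    return bound_gflop / speed_gflops

def bound_for_limit(target_s, speed_gflops):
    """Bound (in units of 1e9 operations) needed to allow target_s seconds at a given speed."""
    return target_s * speed_gflops

# Figures taken from the two abort messages quoted earlier in the thread.
before = time_limit_s(86400.0, 20.02)    # about 4315 s
after = time_limit_s(86400.0, 233.69)    # about 370 s

# Bound that would restore the earlier limit at the higher speed estimate.
needed = bound_for_limit(before, 233.69)

print(f"limit before: {before:.2f} s, after: {after:.2f} s")
print(f"bound needed to restore the old limit: {needed:.0f}G "
      f"({needed / 86400.0:.1f}x the logged bound)")
```

Under this reading, a bound that does not grow along with the speed estimate shortens the limit, which would be consistent with a much larger rsc_fpops_bound helping.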
Send message Joined: 1 Jan 15 Posts: 18 Credit: 10,902,664 RAC: 0 |
I noticed a pattern on all of the work units that error out on my computer with TIME_LIMIT_EXCEEDED errors. Their work unit names all start with "C203_147_104_". See https://escatter11.fullerton.edu/nfs/results.php?hostid=7191920&offset=0&show_names=1&state=6&appid= to see that all of these bad tasks are for "C203_147_104". Is this the case with everyone who has this error? Maybe that run needs to be cancelled, reconfigured, and sent out again after its parameters are fixed. |
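A quick way to check for this kind of pattern is to group the failed task names by their leading fields. The Python sketch below assumes a task name ends with two underscore-separated fields (a range or sequence number and a replica index) and treats everything before them as the batch prefix. That naming assumption and the helper `batch_prefix` are hypothetical; the task names are the ones quoted in this thread.

```python
from collections import Counter

def batch_prefix(task_name):
    """Drop the last two underscore-separated fields (assumed: range and replica index)."""
    return task_name.rsplit("_", 2)[0]

# Task names taken from abort messages posted in this thread.
failed = [
    "C203_147_104_315687_1",
    "C203_147_104_315795_1",
    "C203_147_104_318404_0",
    "7m3_317_129760_0",
]

counts = Counter(batch_prefix(name) for name in failed)
for prefix, n in counts.most_common():
    print(prefix, n)
```

Note that the earlier abort message in this thread was for a task whose name starts with "7m3_317", so at least one reported failure falls outside the "C203_147_104" batch.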
Send message Joined: 14 Sep 16 Posts: 3 Credit: 52,098,324 RAC: 37,651 |
I have run, and am still running, a few of the lasievef_small tasks since Greg posted. Most (9) have succeeded; one has failed with the time exceeded error. Strangely, the failure occurred at 29 1/2 minutes, while most of the successful units ran longer than that, some considerably longer. And yes, all were "C203_147_104_". Hope this helps. g |
Send message Joined: 14 Sep 16 Posts: 3 Credit: 52,098,324 RAC: 37,651 |
The problem continues. I ran 137 units: 12 failed and 125 succeeded, so about a ten percent error rate. Every error was an exceeded time limit. The exact message on the last 4 or 5 was:
exceeded elapsed time limit 2010.19 (86400.00G/42.98G)
Earlier messages had varied time limits, such as:
exceeded elapsed time limit 1164.97 (86400.00G/74.17G)
exceeded elapsed time limit 2043.35 (86400.00G/43.00G)
I've stopped running lasievef_small while this gets sorted. Cheers |
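For reference, the failure rate from the counts in this post works out as follows. This is plain arithmetic on the posted numbers in Python, with nothing assumed beyond them.

```python
failed, succeeded = 12, 125
total = failed + succeeded   # 137 units in all
rate = failed / total

print(f"{failed}/{total} = {rate:.1%}")  # prints "12/137 = 8.8%"
```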
Send message Joined: 26 Sep 09 Posts: 218 Credit: 22,841,893 RAC: 2 |
Back to normal here; I'm just aborting the ones with the highest estimated time. Back to 100% on 16e small. The background feeders need some push in there; lots of integers to factor. |