Thread 'WU's failing with EXIT_TIME_LIMIT

Author	Message
Ruud van der Kroef Joined: 16 Sep 21 Posts: 3 Credit: 1,705,714 RAC: 0	Message 2364 - Posted: 15 Feb 2023, 12:09:40 UTC Last modified: 15 Feb 2023, 12:10:39 UTC I have this system: https://escatter11.fullerton.edu/nfs/show_host_detail.php?hostid=7189872. It is a VM using 18 of 24 CPUs. I have found that WUs of the application "16e Lattice Sieve for smaller numbers" fail with EXIT_TIME_LIMIT_EXCEEDED when CPU time goes over approx. 4,600 seconds; see this list: https://escatter11.fullerton.edu/nfs/results.php?hostid=7189872&offset=0&show_names=0&state=6&appid=11 Does anybody else see this problem?

Ruud van der Kroef Joined: 16 Sep 21 Posts: 3 Credit: 1,705,714 RAC: 0	Message 2365 - Posted: 16 Feb 2023, 8:31:53 UTC - in response to Message 2364. The score is now 99 failing WUs with the above symptoms.

Carlos Pinho Volunteer moderator Joined: 26 Sep 09 Posts: 235 Credit: 27,645,213 RAC: 0	Message 2366 - Posted: 16 Feb 2023, 12:59:46 UTC Hi. Go to https://escatter11.fullerton.edu/nfs/prefs.php?subset=global and, in the memory section, set all values to 95%.

Ruud van der Kroef Joined: 16 Sep 21 Posts: 3 Credit: 1,705,714 RAC: 0	Message 2367 - Posted: 17 Feb 2023, 10:48:31 UTC - in response to Message 2366. Thank you for your reply; seems OK now.

Carlos Pinho Volunteer moderator Joined: 26 Sep 09 Posts: 235 Credit: 27,645,213 RAC: 0	Message 2368 - Posted: 17 Feb 2023, 11:13:07 UTC Please let me know if the problem persists; if it does, I will have to get admin Greg involved.

joeybuddy96 Joined: 22 Feb 23 Posts: 2 Credit: 150,998 RAC: 0	Message 2369 - Posted: 23 Feb 2023, 2:27:16 UTC Last modified: 23 Feb 2023, 3:06:43 UTC I was getting a lot of these with 15e Lattice Sieve v1.08 (notphenomiix6) windows_x86_64, with code 197. I had CPU and RAM usage set to 100%. I lowered the RAM, page/swap, CPU usage and CPU time settings to 96%, though later, after tasks started completing with "no new tasks" set in BOINC Manager, the usage went lower and I was seeing compute errors anyway. I suspect it has more to do with the page/swap file than with total RAM usage. I've got enough tasks going to hit 100% usage on my CPU, but enough RAM that it's only hitting 53% capacity. The swap file on my system is set to 8 GB. The peak swap size mentioned in the errors is about 1 GB. It's happening for every application but 16e Lattice Sieve V5, from what I can tell. I'm new to the project, but I wonder whether having my preferences set to "only get work from 16e Lattice Sieve V5" has anything to do with it. I changed that setting after the units had already downloaded and started running. The BOINC log has several entries like this one: "2023-02-22 9:36:36 PM | NFS@Home | Aborting task 7m3_317_129760_0: exceeded elapsed time limit 4315.13 (86400.00G/20.02G)". Once the number of NFS tasks died down from 100% usage (I'm not sure of the exact usage percentage when this happened), a few of the "15e smaller number" tasks started to complete successfully, but not many. The run time is consistently 4300 seconds, and is consistently longer than the CPU time, which averages about 3700 seconds.
https://escatter11.fullerton.edu/nfs/result.php?resultid=306760861
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759018
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759259
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759322
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759353
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759363
https://escatter11.fullerton.edu/nfs/result.php?resultid=306759371
15e for smaller numbers: https://escatter11.fullerton.edu/nfs/result.php?resultid=306757866
16e for smaller numbers: https://escatter11.fullerton.edu/nfs/result.php?resultid=306749053

joeybuddy96 Joined: 22 Feb 23 Posts: 2 Credit: 150,998 RAC: 0	Message 2370 - Posted: 23 Feb 2023, 4:37:16 UTC - in response to Message 2369. A very experienced user on the Gridcoin server, Keith, said, "@joeybuddy96 The NFS admins have misconfigured the work unit generator script with too low a rsc_fpops_est value. Have them increase it by 10-100X."

Greg Project administrator Joined: 26 Jun 08 Posts: 655 Credit: 543,562,430 RAC: 257,300	Message 2371 - Posted: 26 Feb 2023, 2:28:06 UTC - in response to Message 2370. Certainly possible. We are stretching these further than I originally intended when I set those values. I have increased it by a factor of 10 for newly generated WUs.

STE\/E Joined: 9 Oct 09 Posts: 30 Credit: 42,718,914 RAC: 12,423	Message 2372 - Posted: 27 Feb 2023, 4:12:57 UTC Last modified: 27 Feb 2023, 4:13:21 UTC I'm getting mostly time limit computation errors ...
AWareM17R5AMD1
45651 NFS@Home 2/26/2023 11:11:38 PM Aborting task C203_147_104_315687_1: exceeded elapsed time limit 369.72 (86400.00G/233.69G)
45652 NFS@Home 2/26/2023 11:11:39 PM Computation for task C203_147_104_315687_1 finished
45653 NFS@Home 2/26/2023 11:11:39 PM Starting task C203_147_104_315964_1
45654 NFS@Home 2/26/2023 11:11:41 PM Started upload of C203_147_104_315687_1_r774756475_0
45655 NFS@Home 2/26/2023 11:11:45 PM Finished upload of C203_147_104_315687_1_r774756475_0
45656 NFS@Home 2/26/2023 11:11:52 PM Aborting task C203_147_104_315795_1: exceeded elapsed time limit 369.72 (86400.00G/233.69G)
45657 NFS@Home 2/26/2023 11:11:53 PM Computation for task C203_147_104_315795_1 finished
45658 NFS@Home 2/26/2023 11:11:53 PM Starting task C203_147_104_318535_0
45659 NFS@Home 2/26/2023 11:11:55 PM Started upload of C203_147_104_315795_1_r331725269_0
45660 NFS@Home 2/26/2023 11:11:58 PM Finished upload of C203_147_104_315795_1_r331725269_0
45661 NFS@Home 2/26/2023 11:11:59 PM Aborting task C203_147_104_318404_0: exceeded elapsed time limit 369.72 (86400.00G/233.69G)
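The abort messages quoted in this thread share one shape: a limit in seconds followed by two values in parentheses, both suffixed with G. In the message above, the limit equals the first value divided by the second. The sketch below checks that arithmetic. It assumes, without confirmation from this thread, that the first value is the task's rsc_fpops_bound and the second is the client's speed estimate for the application; the name parse_abort is hypothetical.

```python
import re

# Matches the tail of the abort message as it appears in the quoted logs.
ABORT_RE = re.compile(
    r"exceeded elapsed time limit (?P<limit>[\d.]+) "
    r"\((?P<bound>[\d.]+)G/(?P<speed>[\d.]+)G\)"
)


def parse_abort(line):
    """Return (limit_s, bound_g, speed_g, bound_g / speed_g), or None if no match.

    Assumption: 'bound' is rsc_fpops_bound and 'speed' is the client's
    speed estimate, both in units of 1e9. The log only shows the numbers.
    """
    m = ABORT_RE.search(line)
    if m is None:
        return None
    limit = float(m.group("limit"))
    bound = float(m.group("bound"))
    speed = float(m.group("speed"))
    return limit, bound, speed, bound / speed


line = ("Aborting task C203_147_104_315687_1: "
        "exceeded elapsed time limit 369.72 (86400.00G/233.69G)")
limit, bound, speed, ratio = parse_abort(line)
print(f"reported limit: {limit:.2f} s, bound/speed: {ratio:.2f} s")
```

The logged values are rounded to two decimals, so on other messages the recomputed ratio may differ slightly from the reported limit.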

Carlos Pinho Volunteer moderator Joined: 26 Sep 09 Posts: 235 Credit: 27,645,213 RAC: 0	Message 2374 - Posted: 28 Feb 2023, 7:40:25 UTC Now I'm getting the same issue. It happens with all WUs whose completion forecast is higher than before. In my case the forecast changed from 28 mins to 2 h, so my queue has a mix of the two. Do we need to rerun the computer benchmark?

Gibson Praise Joined: 14 Sep 16 Posts: 3 Credit: 64,934,654 RAC: 18,527	Message 2375 - Posted: 28 Feb 2023, 11:35:35 UTC I'll add my two cents. For me it is only happening on 16es units.

Carlos Pinho Volunteer moderator Joined: 26 Sep 09 Posts: 235 Credit: 27,645,213 RAC: 0	Message 2376 - Posted: 28 Feb 2023, 12:47:28 UTC If you have spare memory per thread, I recommend allowing only 16e Lattice Sieve V5 tasks until Greg sorts this out on his end for the 16e Lattice Sieve for smaller numbers, aka lasievef_small. Under preferences:
lasieved - very small numbers, uses less than 0.5 GB memory, work may be infrequently available: no
lasievee_small - small numbers, uses up to 0.8 GB memory: no
lasievee - medium numbers, uses up to 1 GB memory: no
lasievef_small - large numbers, uses up to 1 GB memory: no
lasieve5f - huge numbers, uses up to 1.25 GB memory: yes

Greg Project administrator Joined: 26 Jun 08 Posts: 655 Credit: 543,562,430 RAC: 257,300	Message 2377 - Posted: 1 Mar 2023, 0:06:01 UTC - in response to Message 2376. I'm a bit confused why increasing rsc_fpops_est leads to a shorter time limit. But I've changed it to slightly above the previous value and added a much larger rsc_fpops_bound. I hope that helps!
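One possible reading of the effect Greg describes, offered as an assumption and not confirmed by anyone in this thread: if the speed estimate used for a host is derived from rsc_fpops_est divided by how long recent tasks actually took, then raising rsc_fpops_est alone inflates that estimate, and a limit computed as rsc_fpops_bound divided by the estimate shrinks. The sketch below is a toy model of that idea, not BOINC code. The function name, the 3000-second run time and the starting rsc_fpops_est are illustrative values chosen only to land near the limits quoted in this thread.

```python
def elapsed_time_limit(rsc_fpops_bound, rsc_fpops_est, typical_elapsed_s):
    """Toy model (assumption, not BOINC source).

    speed estimate = rsc_fpops_est / typical elapsed time of past tasks
    time limit     = rsc_fpops_bound / speed estimate
    """
    speed_estimate = rsc_fpops_est / typical_elapsed_s
    return rsc_fpops_bound / speed_estimate


BOUND = 86400e9      # numerator seen in the quoted abort messages
OLD_EST = 6e13       # illustrative starting rsc_fpops_est
ELAPSED = 3000.0     # illustrative real run time per task, in seconds

before = elapsed_time_limit(BOUND, OLD_EST, ELAPSED)
est_times_10 = elapsed_time_limit(BOUND, 10 * OLD_EST, ELAPSED)
bound_times_100 = elapsed_time_limit(100 * BOUND, 10 * OLD_EST, ELAPSED)

print(f"original settings:        {before:.0f} s")
print(f"rsc_fpops_est x10 only:   {est_times_10:.0f} s")
print(f"plus rsc_fpops_bound x100: {bound_times_100:.0f} s")
```

Under this model the limit falls from 4320 s to 432 s when only rsc_fpops_est is raised tenfold, below the 3000 s the tasks need, and rises to 43200 s once rsc_fpops_bound is raised as well. That would be consistent with raising the bound being the lever that lengthens the limit.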

Jesse Viviano Joined: 1 Jan 15 Posts: 18 Credit: 10,902,664 RAC: 0	Message 2378 - Posted: 1 Mar 2023, 19:04:52 UTC I noticed a pattern on all of the work units that error out on my computer with TIME_LIMIT_EXCEEDED errors. Their work unit names all start with "C203_147_104_". See https://escatter11.fullerton.edu/nfs/results.php?hostid=7191920&offset=0&show_names=1&state=6&appid= to see that all of these bad tasks are for "C203_147_104". Is this the case with everyone who has this error? Maybe that run needs to be cancelled, reconfigured, and sent out again after its parameters are fixed.

Gibson Praise Joined: 14 Sep 16 Posts: 3 Credit: 64,934,654 RAC: 18,527	Message 2379 - Posted: 1 Mar 2023, 22:37:09 UTC I have been running a few of the lasievef_small tasks since Greg posted. Most (9) have succeeded; one has failed with the time exceeded error. Strangely, the failure errored at 29 1/2 minutes, and most of the successful units ran longer than that, some considerably longer. And yes, all were "C203_147_104_". Hope this helps. g

Gibson Praise Joined: 14 Sep 16 Posts: 3 Credit: 64,934,654 RAC: 18,527	Message 2380 - Posted: 3 Mar 2023, 13:22:40 UTC Last modified: 3 Mar 2023, 13:23:22 UTC The problem continues. Ran 137 units: 12 failed, 125 succeeded, so about a ten percent error rate. Every error was an exceeded time limit. Messages:
exceeded elapsed time limit 2010.19 (86400.00G/42.98G)
This was the exact message on the last 4 or 5. Earlier messages had varied time limits, such as:
exceeded elapsed time limit 1164.97 (86400.00G/74.17G)
exceeded elapsed time limit 2043.35 (86400.00G/43.00G)
I've stopped running lasievef_small while this gets sorted. Cheers

Carlos Pinho Volunteer moderator Joined: 26 Sep 09 Posts: 235 Credit: 27,645,213 RAC: 0	Message 2381 - Posted: 4 Mar 2023, 7:57:48 UTC Back to normal here; I am just aborting the ones with the highest estimated time. Back to 100% on 16e small. The background feeders need some push in there; there are lots of integers to factor.