All 16e tasks named "G6p*" error out on maximum elapsed time?
Author | Message |
---|---|
Joined: 29 Nov 09 Posts: 1 Credit: 2,786,853 RAC: 0 |
As of today, all 16e tasks whose names start with "G6p" fail with a computation error and the message "maximum elapsed time exceeded" after about 2000 seconds. Is anybody else seeing this? Are there any solutions other than refusing new work or accepting only 14e/15e tasks? |
Joined: 16 May 12 Posts: 1 Credit: 36,651,725 RAC: 0 |
I'm getting this too, though it doesn't appear to affect every `Gp49b**` work unit; some are completing. |
Joined: 3 Apr 15 Posts: 2 Credit: 6,237,746 RAC: 0 |
The same happens on every G2m1285b task when running on very fast (i7) machines. |
Joined: 26 Sep 09 Posts: 218 Credit: 22,841,893 RAC: 2 |
Greg has been notified of these errors. |
Joined: 3 Apr 15 Posts: 2 Credit: 6,237,746 RAC: 0 |
Tried again; it happens to all newly generated WUs. I'll have to stop running NFS until this issue is fixed. |
Joined: 19 Mar 10 Posts: 3 Credit: 1,828,730 RAC: 0 |
I'm having the same issue. Stderr output: `<core_client_version>7.6.9</core_client_version> <![CDATA[ <message> exceeded elapsed time limit 1969.46 (86400.00G/43.87G) </message> <stderr_txt> 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED` In my BOINC client, the estimated time to completion for these workunits is 05:19. That seems rather small, since 14e and 15e estimates are in the 40-50 minute range. Perhaps increasing the estimated time to completion (is that done via the FLOPS count estimate?) will fix things? Would that be rsc_fpops_est? But it's the same for 15e and 16e, so I don't know. Is anyone looking into this? I can help debug, if that's possible. |
Joined: 19 Mar 10 Posts: 3 Credit: 1,828,730 RAC: 0 |
Doing some more digging (I have an Intel Core i5-2400): in `(86400.00G/43.87G)`, 86400 refers to `<rsc_fpops_bound>` in init_data.xml, and 43.87G matches the `<flops>` value in client_state.xml: `<app_version> <app_name>lasieve5f</app_name> <version_num>111</version_num> <platform>windows_x86_64</platform> <avg_ncpus>1.000000</avg_ncpus> <max_ncpus>1.000000</max_ncpus> <flops>43869921209.976120</flops> <api_version>7.1.0</api_version> <file_ref> <file_name>lasieve5f_1.11_windows_x86_64.exe</file_name> <main_program/> </file_ref> </app_version>` Maybe flops needs to be smaller? Currently 86400 / 43.87 = 1969 seconds. For comparison, the 14e and 15e entries alongside 16e in client_state.xml (snipped for clarity): `<app_version> <app_name>lasieved</app_name> <flops>3877241497.655855</flops> </app_version> <app_version> <app_name>lasievee</app_name> <flops>6054631322.103277</flops> </app_version> <app_version> <app_name>lasieve5f</app_name> <flops>43869921209.976120</flops> </app_version>` I don't know why the flops value is so large for lasieve5f. Can this be easily modified? |
Joined: 26 Jun 08 Posts: 645 Credit: 473,009,118 RAC: 261,026 |
You can close BOINC and manually edit client_state.xml to reduce the flops value, perhaps to match that of the other apps. It is likely a side effect of the short WUs. In any case, to try to mitigate this, I reset the server statistics for that app and regenerated all workunits for 1290000 and higher using a larger bound. Hopefully that will help in the meantime. |
Joined: 19 Mar 10 Posts: 3 Credit: 1,828,730 RAC: 0 |
Thanks Greg, this seems to have worked. I shut down the BOINC client, then changed the `<flops>` value for 16e in client_state.xml to be the same as for 15e: `<app_version> <app_name>lasieve5f</app_name> <version_num>111</version_num> <platform>windows_x86_64</platform> <avg_ncpus>1.000000</avg_ncpus> <max_ncpus>1.000000</max_ncpus> <flops>3395101616.607331</flops> <==== changed to same as lasievee <api_version>7.1.0</api_version> <file_ref> <file_name>lasieve5f_1.11_windows_x86_64.exe</file_name> <main_program/> </file_ref> </app_version>` Then I restarted my client. This appears to have worked: although BOINC seems to have since changed the value to a different, smaller one, the flops value is nowhere near as high as it originally was. Things are working, and my workunits take around 45 minutes on an Intel Core i5-2400. Thanks. |
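The arithmetic discussed in this thread can be sketched in Python. Going by the stderr line `(86400.00G/43.87G)`, the elapsed-time limit is the task's `rsc_fpops_bound` divided by the app version's `<flops>` estimate, so an inflated `<flops>` shrinks the limit. This minimal sketch assumes that relation and uses only the trimmed `<app_version>` values quoted above; the helper names (`flops_by_app`, `elapsed_limit_seconds`, `set_flops`) are hypothetical, and a real client_state.xml holds far more data. ElementTree does not preserve the original file's formatting, so treat `set_flops` as an illustration of the edit rather than a replacement for a careful hand edit made while BOINC is shut down.

```python
import xml.etree.ElementTree as ET

# Trimmed <app_version> entries as quoted in this thread (assumption: the real
# client_state.xml wraps these in a <client_state> root with many more fields).
SNIPPET = """
<client_state>
  <app_version><app_name>lasieved</app_name><flops>3877241497.655855</flops></app_version>
  <app_version><app_name>lasievee</app_name><flops>6054631322.103277</flops></app_version>
  <app_version><app_name>lasieve5f</app_name><flops>43869921209.976120</flops></app_version>
</client_state>
"""

# rsc_fpops_bound of the failing 16e tasks, per the stderr "(86400.00G/43.87G)".
FPOPS_BOUND = 86400e9


def flops_by_app(xml_text: str) -> dict:
    """Map each app_name to the <flops> estimate of its <app_version>."""
    root = ET.fromstring(xml_text)
    return {
        av.findtext("app_name"): float(av.findtext("flops"))
        for av in root.iter("app_version")
    }


def elapsed_limit_seconds(fpops_bound: float, flops: float) -> float:
    """Elapsed-time limit as reported in the stderr: rsc_fpops_bound / flops."""
    return fpops_bound / flops


def set_flops(xml_text: str, app_name: str, new_flops: float) -> str:
    """Return the XML with <flops> replaced for every <app_version> of app_name."""
    root = ET.fromstring(xml_text)
    for av in root.iter("app_version"):
        if av.findtext("app_name") == app_name:
            av.find("flops").text = f"{new_flops:f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    rates = flops_by_app(SNIPPET)
    for app, flops in rates.items():
        limit = elapsed_limit_seconds(FPOPS_BOUND, flops)
        print(f"{app}: limit {limit:.0f} s")
    # Workaround from the thread: give lasieve5f a much smaller flops value.
    patched = set_flops(SNIPPET, "lasieve5f", 3395101616.607331)
    new_flops = flops_by_app(patched)["lasieve5f"]
    print(f"lasieve5f after edit: limit {elapsed_limit_seconds(FPOPS_BOUND, new_flops):.0f} s")
```

With the original lasieve5f estimate, the computed limit is about 1969 seconds, matching the error message; after lowering `<flops>`, the limit grows to several hours, which is consistent with the roughly 45-minute runtimes reported once the edit was made.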