Thread 'Tasks that Error while computing'

Author	Message
Speedy51 Joined: 16 Oct 09 Posts: 46 Credit: 918,752 RAC: 4	Message 522 - Posted: 11 Aug 2010, 7:06:30 UTC 6866403 (3rd result unsent), 6933291 (other result unsent) and 6933288 (other result unsent). I think these errors were caused after a restart, but I'm not 100% sure.

Greg Project administrator Joined: 26 Jun 08 Posts: 656 Credit: 553,872,080 RAC: 331,889	Message 523 - Posted: 12 Aug 2010, 0:39:20 UTC - in response to Message 522. Yes, it appears the checkpoint file was corrupt. It's not a problem. The server will reissue the workunits shortly.

Speedy51 Joined: 16 Oct 09 Posts: 46 Credit: 918,752 RAC: 4	Message 524 - Posted: 12 Aug 2010, 3:25:04 UTC Thanks for clearing that up.

Crystal Pellet Joined: 10 Oct 09 Posts: 2 Credit: 1,000,031 RAC: 0	Message 525 - Posted: 13 Aug 2010, 7:09:36 UTC - in response to Message 524. Last modified: 13 Aug 2010, 7:11:04 UTC All tasks of the new 5,409- batch are erroring out with SIGSEGV: segmentation violation - exit code 193, and only on that machine. http://escatter11.fullerton.edu/nfs/results.php?hostid=10656&offset=0&show_names=1&state=5 Very strange, because hundreds of tasks from the former 12,254+ batch were all valid; even the repair jobs I still get now and then are valid too. http://escatter11.fullerton.edu/nfs/results.php?hostid=10656&offset=0&show_names=1&state=3 So please send me only repairs from the 12,254+ ;) Regards CP

Greg Project administrator Joined: 26 Jun 08 Posts: 656 Credit: 553,872,080 RAC: 331,889	Message 526 - Posted: 14 Aug 2010, 16:04:26 UTC - in response to Message 525. That is strange. You have other computers, both AMD and Intel, running the same operating system and completing 5,409- tasks just fine. And that is not an old processor. I would normally suspect a flaky computer, but it has successfully completed MANY 12,254+ tasks, and the 3,563+ tasks are running fine. If it is a bug in the client, I'm not sure why it is getting triggered on only that one computer. I'm puzzled...

Crystal Pellet Joined: 10 Oct 09 Posts: 2 Credit: 1,000,031 RAC: 0	Message 527 - Posted: 14 Aug 2010, 19:59:02 UTC - in response to Message 526. Don't make it a problem. I just tried 2 of those tasks again on that machine, and yes: computation error again. It must have something to do with the kind of task. All those tasks errored out after 118-120 seconds. Perhaps it has something to do with the OS: it's 32-bit Linux.

Jeff Blank Joined: 9 Mar 10 Posts: 3 Credit: 2,529,966 RAC: 0	Message 532 - Posted: 29 Aug 2010, 6:16:48 UTC I just noticed that I'm having the same problem, same WUs and application. Also 32-bit Linux. (My Linux is actually emulated by FreeBSD, which I assumed to be the problem, but maybe it's not.) 15e v1.07 WUs run fine. http://escatter11.fullerton.edu/nfs/results.php?userid=3990&offset=20&show_names=1&state=5 (scroll past all the aborts) FWIW, the time of the crash seems to be about when the first checkpoint would be made.

Greg Project administrator Joined: 26 Jun 08 Posts: 656 Credit: 553,872,080 RAC: 331,889	Message 533 - Posted: 29 Aug 2010, 6:35:08 UTC - in response to Message 532. A batch of bad workunits went out that are causing problems on Windows and 32-bit Linux, although they run fine with 64-bit Linux. (Guess which system I tested them on. Ugh!) I have canceled all that weren't already handed out when I noticed the problem, but those that had already been assigned need to work their way through the system. On Windows and 32-bit Linux systems only, you can abort any workunits that begin with S5m409b. S5m409 and S5m409c workunits will run fine; only those that have the letter "b" are causing problems. I apologize for the inconvenience!
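The abort rule in Message 533 can be sketched as a small filter. This is only an illustration, not part of BOINC or the NFS@Home applications: the function name, the platform labels and the example workunit names are hypothetical; only the S5m409b prefix and the list of affected systems come from the post.

```python
# Hypothetical helper illustrating the rule from Message 533.
# Only the "S5m409b" prefix and the affected systems come from the post;
# the platform labels below are assumptions made for this sketch.
AFFECTED_PREFIX = "S5m409b"
AFFECTED_PLATFORMS = {"windows", "linux32"}


def should_abort(workunit_name: str, platform: str) -> bool:
    """Return True if the workunit is from the bad batch on an affected system."""
    return platform in AFFECTED_PLATFORMS and workunit_name.startswith(AFFECTED_PREFIX)


if __name__ == "__main__":
    # Made-up workunit names, used only to exercise the prefix check.
    for name in ("S5m409b_0001", "S5m409c_0001", "S5m409_0001"):
        for platform in ("windows", "linux32", "linux64"):
            print(name, platform, should_abort(name, platform))
```

A plain prefix test is enough here because S5m409 and S5m409c names do not start with S5m409b, so they are left alone on every platform.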

Greg Project administrator Joined: 26 Jun 08 Posts: 656 Credit: 553,872,080 RAC: 331,889	Message 534 - Posted: 30 Aug 2010, 19:09:58 UTC - in response to Message 533. I just updated the 32-bit applications to hopefully fix this problem.

Sorceress Joined: 25 Oct 09 Posts: 7 Credit: 48,843 RAC: 0	Message 568 - Posted: 7 Sep 2010, 4:39:42 UTC Last modified: 7 Sep 2010, 4:47:33 UTC I am having a lot of my WUs erroring out. In fact, most of them. Could someone help me with this?
Task ID | Work unit ID | Sent | Time reported | Status | Run time (sec) | CPU time (sec) | Claimed credit | Granted credit | Application
8190457 | 7487246 | 7 Sep 2010 2:23:40 UTC | 10 Sep 2010 14:23:40 UTC | In progress | --- | --- | --- | --- | 15e Lattice Sieve v1.08
8190285 | 7487075 | 7 Sep 2010 1:24:49 UTC | 10 Sep 2010 13:24:49 UTC | In progress | --- | --- | --- | --- | 15e Lattice Sieve v1.08
8172584 | 7469388 | 7 Sep 2010 0:36:10 UTC | 7 Sep 2010 0:42:24 UTC | Error while computing | 143.70 | 133.89 | 0.49 | --- | 16e Lattice Sieve v1.09
8172513 | 7469317 | 7 Sep 2010 0:42:24 UTC | 7 Sep 2010 1:24:49 UTC | Error while computing | 262.61 | 142.69 | 0.52 | --- | 16e Lattice Sieve v1.09
8172370 | 7469174 | 7 Sep 2010 0:20:59 UTC | 7 Sep 2010 0:36:10 UTC | Error while computing | 239.31 | 140.75 | 0.51 | --- | 16e Lattice Sieve v1.09
8172225 | 7469029 | 6 Sep 2010 23:55:02 UTC | 7 Sep 2010 0:20:59 UTC | Error while computing | 194.33 | 141.69 | 0.52 | --- | 16e Lattice Sieve v1.09
8168805 | 7362830 | 6 Sep 2010 20:23:42 UTC | 7 Sep 2010 3:43:34 UTC | Completed and validated | 6,559.67 | 6,448.20 | 17.62 | 44.00 | 15e Lattice Sieve v1.08
8168468 | 7452841 | 6 Sep 2010 15:07:37 UTC | 6 Sep 2010 23:55:02 UTC | Error while computing | 169.41 | 139.67 | 0.51 | --- | 16e Lattice Sieve v1.09
8167099 | 7365208 | 6 Sep 2010 16:23:52 UTC | 6 Sep 2010 20:23:42 UTC | Error while computing | 2.52 | 0.02 | 0.00 | --- | 15e Lattice Sieve v1.08
8165174 | 7346316 | 6 Sep 2010 12:05:59 UTC | 6 Sep 2010 21:39:13 UTC | Completed and validated | 8,131.03 | 6,691.11 | 18.29 | 44.00 | 15e Lattice Sieve v1.08
8162101 | 7465840 | 5 Sep 2010 23:33:07 UTC | 5 Sep 2010 23:57:02 UTC | Error while computing | 148.08 | 135.98 | 0.49 | --- | 16e Lattice Sieve v1.09
8161937 | 7465676 | 5 Sep 2010 23:05:40 UTC | 5 Sep 2010 23:33:07 UTC | Error while computing | 157.77 | 135.41 | 0.49 | --- | 16e Lattice Sieve v1.09
8161531 | 7465271 | 5 Sep 2010 21:56:41 UTC | 5 Sep 2010 22:26:30 UTC | Error while computing | 135.91 | 132.52 | 0.48 | --- | 16e Lattice Sieve v1.09
8161477 | 7465219 | 5 Sep 2010 22:26:30 UTC | 5 Sep 2010 23:05:40 UTC | Error while computing | 150.64 | 133.50 | 0.49 | --- | 16e Lattice Sieve v1.09
8161288 | 7465030 | 5 Sep 2010 21:03:02 UTC | 5 Sep 2010 21:48:15 UTC | Error while computing | 148.48 | 131.13 | 0.48 | --- | 16e Lattice Sieve v1.09
8161234 | 7464976 | 5 Sep 2010 21:48:16 UTC | 5 Sep 2010 21:56:41 UTC | Error while computing | 152.95 | 134.77 | 0.49 | --- | 16e Lattice Sieve v1.09
8157355 | 7461103 | 5 Sep 2010 9:02:17 UTC | 5 Sep 2010 21:03:02 UTC | Error while computing | 135.98 | 131.89 | 0.48 | --- | 16e Lattice Sieve v1.09
8152505 | 7456259 | 4 Sep 2010 18:35:09 UTC | 5 Sep 2010 7:13:46 UTC | Error while computing | 152.11 | 130.31 | 0.47 | --- | 16e Lattice Sieve v1.09
8146960 | 7450749 | 4 Sep 2010 2:08:04 UTC | 4 Sep 2010 3:26:47 UTC | Error while computing | 135.36 | 131.83 | 0.49 | --- | 16e Lattice Sieve v1.09
8146615 | 7450404 | 4 Sep 2010 0:28:04 UTC | 4 Sep 2010 2:08:04 UTC | Error while computing | 153.53 | 134.47 | 0.50 | --- | 16e Lattice Sieve v1.09
What the heck is going on? I just started up my laptop and it's doing the same thing.
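The pattern in the task list from Message 568 is easier to see with a quick tally by status and application. The sketch below is only a reading aid: the (status, application) pairs are transcribed from the 20 rows in the post, and the function names are made up for this example.

```python
from collections import Counter

# (status, application) pairs transcribed from the 20 rows in Message 568.
ROWS = (
    [("In progress", "15e Lattice Sieve v1.08")] * 2
    + [("Completed and validated", "15e Lattice Sieve v1.08")] * 2
    + [("Error while computing", "15e Lattice Sieve v1.08")] * 1
    + [("Error while computing", "16e Lattice Sieve v1.09")] * 15
)


def tally(rows):
    """Count tasks per (status, application) pair."""
    return Counter(rows)


def error_share(rows, application):
    """Fraction of reported (not in-progress) tasks of one application that errored."""
    finished = [status for status, app in rows
                if app == application and status != "In progress"]
    if not finished:
        return None
    errors = sum(status == "Error while computing" for status in finished)
    return errors / len(finished)


if __name__ == "__main__":
    for (status, app), count in sorted(tally(ROWS).items()):
        print(f"{app:26} {status:24} {count}")
```

In this listing every reported 16e task errored, while the 15e tasks mostly completed, so the failures appear tied to one application rather than spread evenly over the host's work.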

Greg Project administrator Joined: 26 Jun 08 Posts: 656 Credit: 553,872,080 RAC: 331,889	Message 569 - Posted: 7 Sep 2010, 5:46:19 UTC - in response to Message 568. The computer has insufficient memory (for Windows XP at least) for the current large memory lasievef tasks. Disable the lasievef application in your project preferences. The lasievee application will run well on that computer.

Sorceress Joined: 25 Oct 09 Posts: 7 Credit: 48,843 RAC: 0	Message 570 - Posted: 7 Sep 2010, 13:38:41 UTC - in response to Message 569. Last modified: 7 Sep 2010, 13:54:11 UTC "The computer has insufficient memory (for Windows XP at least) for the current large memory lasievef tasks. Disable the lasievef application in your project preferences. The lasievee application will run well on that computer." Hmmm... it says for apps up to 1GB. I have 4GB on the main box but only 2 in my laptop. Anyway, I have disabled it. But looking at the other failed WUs, I see that there are lasievee apps as well, with a lot of compute time before they failed. I have wasted a lot of time on your WUs. It can't all be down to low memory, can it? I am having no problems with the other projects I'm running. Please advise.

Sorceress Joined: 25 Oct 09 Posts: 7 Credit: 48,843 RAC: 0	Message 571 - Posted: 7 Sep 2010, 17:30:56 UTC Why would this computer have a problem with the lasievef apps? The preferences say it uses 'up to 1GB' of memory. I have plenty of memory to run this app. Is the app using more than 3.5GB? If so, you need to change the project preferences to reflect a higher memory requirement than 'up to 1GB' for the lasievef app.
CPU type: GenuineIntel Intel(R) Core(TM)2 Duo CPU E4500 @ 2.20GHz [Family 6 Model 15 Stepping 13]
Number of processors: 2
Coprocessors: NVIDIA GeForce 8600 GTS (255MB) driver: 25896
Operating System: Microsoft Windows XP Professional x86 Edition, Service Pack 3, (05.01.2600.00)
BOINC client version: 6.10.56
Memory: 3455.17 MB
Cache: 2048 KB
Swap space: 6752.14 MB
Total disk space: 698.63 GB
Free Disk Space: 531.97 GB
Measured floating point speed: 2124.6 million ops/sec
Measured integer speed: 4157.82 million ops/sec

Nelson Joined: 17 Jun 10 Posts: 2 Credit: 1,815 RAC: 0	Message 580 - Posted: 17 Sep 2010, 22:22:44 UTC - in response to Message 569. I have also had several work units error out, and I think I know what the problem is. I have 1.5GB of memory, and the only time I get an error processing an NFS work unit is when I have another project's work unit using 600MB or more of memory, either running or in memory waiting to run. (I have BOINC set to keep work in memory.) This happens when reporting tasks and downloading new work: if I get an NFS work unit, it just takes over without regard to the current memory usage or the tasks that are running. NFS doesn't wait for other tasks to checkpoint, and if the other tasks are using more than 500 or 600MB, the NFS work unit errors out within a couple of minutes. It appears that if the NFS software were more polite and aware, this problem would go away. BTW: the Lattice Project typically uses over 1.2GB on my machine without any errors. Hope this helps

Sorceress Joined: 25 Oct 09 Posts: 7 Credit: 48,843 RAC: 0	Message 581 - Posted: 17 Sep 2010, 22:46:46 UTC - in response to Message 580. Nelson, the following was sent to me regarding the NFS lasievef WU: "The computer has insufficient memory (for Windows XP at least) for the current large memory lasievef tasks. Disable the lasievef application in your project preferences. The lasievee application will run well on that computer." Once I disabled the lasievef WU, I've had no problems. With the low memory you have, that will fix it. The lasievef WU can take up to 1GB of memory or more per core! 2 cores is 2GB usage. The NFS lasievef WU does not work well with other WUs and can cause them to error out. I lost 18 other projects' WUs because of the NFS WU. Hope this helps.
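The per-core arithmetic in Message 581 can be written out as a rough headroom check. This is a back-of-the-envelope sketch under stated assumptions: 1024 MB per lasievef task is the "up to 1GB or more per core" figure from the post, treated here as a lower bound, and the function names are hypothetical. Passing the check does not guarantee a task will run, since the thread suggests real usage and operating system limits can be tighter.

```python
def estimated_peak_mb(per_task_mb: int, concurrent_tasks: int) -> int:
    """Memory needed if every concurrent task reaches its per-task peak."""
    return per_task_mb * concurrent_tasks


def has_headroom(total_ram_mb: int, other_usage_mb: int,
                 per_task_mb: int, concurrent_tasks: int) -> bool:
    """Rough check: do the tasks plus other resident work fit in physical RAM?"""
    needed = other_usage_mb + estimated_peak_mb(per_task_mb, concurrent_tasks)
    return needed <= total_ram_mb


if __name__ == "__main__":
    # Assumed figure: 1024 MB per task, taken from "up to 1GB or more per core".
    print(estimated_peak_mb(1024, 2))           # two cores, one task each
    # Nelson's case from Message 580: 1.5GB of RAM (taken as 1536 MB) with
    # another project's task holding 600 MB, plus one lasievef task.
    print(has_headroom(1536, 600, 1024, 1))
```

With those assumed numbers, Nelson's machine would need 1624 MB against 1536 MB of RAM, which is consistent with his observation that errors appear once another task is holding about 600MB.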

Nelson Joined: 17 Jun 10 Posts: 2 Credit: 1,815 RAC: 0	Message 582 - Posted: 21 Sep 2010, 2:44:46 UTC - in response to Message 581. Sorceress, I have not lost work from any other project as a result of the NFS lasievef memory usage. When a lasievef WU is downloaded while another project's WU is running and using a lot of memory, the lasievef WU simply starts running immediately, which stops any other WUs (greedy little bugger). The lasievef WU starts off using little memory, then increases memory usage after a couple of minutes and errors out when there is not enough. After that, my other WUs start running again from where they left off, without issue. The lasievef WUs should NOT be forcing other WUs to stop like that; they need to wait in line like everybody else. The NFS software needs some work to fix this issue. Losing 2-3 minutes per errored lasievef WU isn't so bad. It's intolerable if it causes another project's WU to error out. That needs to be fixed, not ignored!

Sorceress Joined: 25 Oct 09 Posts: 7 Credit: 48,843 RAC: 0	Message 583 - Posted: 21 Sep 2010, 4:57:38 UTC I agree that there is a problem with the NFS WUs. From what several other admins told me, they may be having a memory leak issue, which caused the failures. I have had a WU run in high mem before and put the other WUs in a 'waiting for memory' position, but this one caused computation errors in 9 other projects' WUs when it kicked in. They were running on my laptop, which only has 1GB of mem. I switched to the lasievee WUs and have had no more problems since. I could set separate preferences for each 'puter, but since there is only a 12-20 credit difference between running the lasievef and the lasievee on my systems, I just left them as they are. I feel it was a fluke WU, but I'm not taking any more chances. Since you are losing lasievef WUs, it might be prudent to switch to the lasievee instead. Just a thought. Sorceress

bdodson* Joined: 2 Oct 09 Posts: 50 Credit: 111,128,218 RAC: 0	Message 584 - Posted: 21 Sep 2010, 14:48:53 UTC - in response to Message 582. "Sorceress, I have not lost work from any other project as a result of the NFS lasievef memory usage. When a lasievef WU is downloaded while another project's WU is running and using a lot of memory, the lasievef WU simply starts running immediately, which stops any other WUs (greedy little bugger). The lasievef WU starts off using little memory, then increases memory usage after a couple of minutes and errors out when there is not enough. After that, my other WUs start running again from where they left off, without issue. The lasievef WUs should NOT be forcing other WUs to stop like that; they need to wait in line like everybody else. The NFS software needs some work to fix this issue. Losing 2-3 minutes per errored lasievef WU isn't so bad. It's intolerable if it causes another project's WU to error out. That needs to be fixed, not ignored!" I'm having a _very_ difficult time understanding why people with memory issues won't switch _and_stay_switched_ to lasievee. I run lasievee on my lower memory Linux machines, even though I'm reasonably sure that they would be able to run lasievef. Perhaps it's a credit issue? If so, then maybe the differential credit ought to be adjusted. I do 100% agree that distributed computing jobs that are intended to make use of idle cycles should not be interfering with other user jobs. One point that hasn't been emphasized is that the lasievee jobs (referred to as using the "15e siever", rather than the "16e siever", among non-BOINC users) are also working on a very interesting project that's different from the large memory jobs. The latter are heading towards record-sized "snfs" numbers, for a public project, and are expected to make use of top-of-the-line national computing resources (for the postprocessing).
But the lower memory jobs will soon be working on a sequence of "gnfs" numbers, for which the pre-computation uses state-of-the-art GPU computing. If one is thinking about breaking RSA keys while running these WUs, that's gnfs, not snfs. Greg's relying upon Linux/NVIDIA Tesla GPU cards to find the so-called "gnfs polynomial" sent with the lasievee jobs, found using a massively parallel search and state-of-the-art CUDA programming. If there's even the slightest possibility of a lasievef job being in the way on your machine, _please_, switch already! And happy computing. -bdodson