Last check: 3.1.2012

== March 2012 ==
 * 01.03.2012: tcx090, tcx110, tcx130 rebooted by IT: CVMFS installation

== February 2012 ==
 * 28.02.2012: tcx080, tcx120, tcx110 rebooted by IT: CVMFS installation
 * 10.02.2012: tcx120 rebooted by IT: [09:06] memory problem (janetd)
 * 09.02.2012: tcx110 rebooted by IT: [11:43] memory problem (janetd?)
 * 09.02.2012: tcx120 rebooted by IT: [08:48] memory problem (janetd)

== January 2012 ==
 * 18.01.2012: all HH rebooted by IT: ???
 * 12.01.2012: tcx090 rebooted by IT: [08:52] maintenance
 * 10.01.2012: all rebooted by IT: [18:00] maintenance
 * 09.01.2012: tcx130 rebooted by IT: [20:03] lustre errors, no login possible
 * 06.01.2012: tcx080 rebooted by IT: [15:26]

== December 2011 ==
 * 16.12.2011: tcx080 rebooted by IT: [09:45]
 * 15.12.2011: tcx080 rebooted by IT: memory problems
 * 05.12.2011: tcx110 rebooted by IT [15:08]: lustre errors, no login possible
 * 05.12.2011: tcx120 rebooted by IT [11:14]: lustre errors, no login possible

== November 2011 ==
 * 04.11.2011: tcx110 rebooted by IT [13:29]: out of memory (swapping) (janetd?, proof master)

== September 2011 ==
 * 01.09.2011: tcx090 rebooted by IT: Zeuthen downtime (cooling)

== August 2011 ==
 * 25.08.2011: all rebooted by IT: downtime
 * 02.08.2011: tcx100 rebooted by IT [14:37]: out of memory (luzgomez, sframe)

== June 2011 ==
 * 23.06.2011: tcx090 rebooted by IT [08:33]: problems with installation (AFS affected)
 * 03.06.2011: tcx080 rebooted by IT [07:24]: out of memory?
 * 07.06.2011: tcx120 rebooted by IT [20:09]: out of memory?
 * 09.06.2011: tcx110 rebooted by IT [15:32]: out of memory?

== May 2011 ==
 * 10.05.2011: tcx080,tcx090: AFS client upgrade
 * 11.05.2011: tcx100,tcx110: AFS client upgrade
 * 12.05.2011: tcx120,tcx130: AFS client upgrade
 * 20.05.2011: tcx110 reboot by IT [12:16]: Lustre(?) problems in the night

== March 2011 ==
 * 28.03.2011: tcx100 reboot by IT [04:28] sar log stopped at 11:02pm of 15.3.2011, memory problem, gfischer?
{{{
gfischer pts/6        tcsh16-vm1.naf.d Fri Mar 25 22:14 - crash (2+07:13)
luzgomez pts/3        tcsh15-vm1.naf.d Fri Mar 25 19:37 - crash (2+09:49)
luzgomez pts/1        tcsh15-vm1.naf.d Fri Mar 25 19:25 - crash (2+10:01)
katzy    pts/5        localhost:14.0   Thu Mar 24 09:37 - crash (3+19:50)
katzy    pts/4        tcsh16-vm1.naf.d Thu Mar 24 09:36 - crash (3+19:50)
leffhalm pts/0        tcsh16-vm1.naf.d Tue Mar 22 09:02 - crash (5+20:24)
}}}
 * 22.3.2011: all interactive by IT [06:20] maintenance slot
 * 21.3.2011: tcx130 reboot by IT [06:58] sar log stopped at 8:50pm of 19.3.2011, memory problem, gfischer?
{{{
gfischer pts/21       tcsh5-vm1.naf.de Sat Mar 19 20:26 - crash (1+10:31)
wildt    pts/19       tcsh6-vm1.naf.de Sat Mar 19 20:12 - crash (1+10:46)
wasicki  pts/18       tcsh6-vm1.naf.de Sat Mar 19 20:02 - crash (1+10:55)
glazov   pts/7        tcsh5-vm1.naf.de Sat Mar 19 15:45 - crash (1+15:12)
luzgomez pts/4        tcsh5-vm1.naf.de Sat Mar 19 13:39 - crash (1+17:19)
glazov   pts/0        tcsh5-vm1.naf.de Sat Mar 19 08:23 - crash (1+22:35)
leyton   pts/2        tcsh5-vm1.naf.de Sat Mar 19 07:14 - crash (1+23:44)
leyton   pts/1        tcsh5-vm1.naf.de Sat Mar 19 07:14 - crash (1+23:44)
leyton   pts/22       tcsh6-vm1.naf.de Fri Mar 18 10:06 - crash (2+20:52)
leffhalm pts/17       tcsh6-vm1.naf.de Fri Mar 18 09:50 - crash (2+21:08)
mbecking pts/43       tcsh5-vm1.naf.de Thu Mar 17 16:03 - crash (3+14:55)
mbecking pts/24       tcsh6-vm1.naf.de Thu Mar 17 11:40 - crash (3+19:18)
tkohno   pts/11       tcsh5-vm1.naf.de Thu Mar 17 10:00 - crash (3+20:58)
tkohno   pts/13       tcsh6-vm1.naf.de Tue Mar 15 14:30 - crash (5+16:28)
warsinsk pts/5        tcsh5-vm1.naf.de Fri Mar 11 21:33 - crash (9+09:25)
}}}
 * 18.3.2011: tcx100 reboot by IT [06:46] sar log stopped at 10pm of 17.3.2011, memory problem, gfischer?
{{{
luzgomez pts/27       tcsh6-vm1.naf.de Thu Mar 17 21:56 - crash (08:50)
gfischer pts/42       tcsh5-vm1.naf.de Thu Mar 17 21:41 - crash (09:05)
stanescu pts/41       tcsh6-vm1.naf.de Thu Mar 17 21:21 - crash (09:25)
wasicki  pts/2        tcsh5-vm1.naf.de Thu Mar 17 20:10 - crash (10:36)
wildt    pts/36       :pts/29:S.0      Thu Mar 17 15:23 - crash (15:23)
wildt    pts/29       tcx130.naf.desy. Thu Mar 17 15:23 - crash (15:23)
efeld    pts/26       tcsh5-vm1.naf.de Thu Mar 17 13:05 - crash (17:41)
almutp   pts/40       tcsh5-vm1.naf.de Thu Mar 17 12:23 - crash (18:23)
almutp   pts/39       tcsh5-vm1.naf.de Thu Mar 17 12:23 - crash (18:23)
almutp   pts/38       tcsh5-vm1.naf.de Thu Mar 17 12:22 - crash (18:23)
almutp   pts/37       tcsh5-vm1.naf.de Thu Mar 17 12:22 - crash (18:24)
efeld    pts/33       tcsh5-vm1.naf.de Thu Mar 17 10:47 - crash (19:59)
mzvolsky pts/10       tcsh5-vm1.naf.de Thu Mar 17 09:19 - crash (21:27)
efeld    pts/7        tcsh5-vm1.naf.de Wed Mar 16 15:38 - crash (1+15:08)
efeld    pts/34       tcsh5-vm1.naf.de Wed Mar 16 12:02 - crash (1+18:44)
wolter   pts/16       tcsh5-vm1.naf.de Wed Mar 16 10:55 - crash (1+19:51)
efeld    pts/17       tcsh5-vm1.naf.de Mon Mar 14 17:18 - crash (3+13:28)
efeld    pts/1        tcsh5-vm1.naf.de Mon Mar 14 17:12 - crash (3+13:34)
mijovic  pts/4        tcsh6-vm1.naf.de Sat Mar 12 19:36 - crash (5+11:10)
warsinsk pts/3        tcsh6-vm1.naf.de Fri Mar 11 19:37 - crash (6+11:09)
boehler  pts/9        tcsh16-vm1.naf.d Fri Mar 11 10:16 - crash (6+20:29)
katzy    pts/21       tcsh5-vm1.naf.de Wed Mar  9 16:14 - crash (8+14:32)
finnern  pts/11       tcsh6-vm5.naf.de Tue Mar  8 10:33 - crash (9+20:13)
}}}

== February 2011 ==
 * 28.2.2011: tcx080 reboot [18:29] lustre problems (IT asked for advice)
 * 24.2.2011: tcx100 reboot [13:35] had problems
 * 1.2.2011: all work group servers [morning] update kernels in downtime (firewall upgrade)

== January 2011 ==
 * 17.1.2011: tcx080 reboot [17 15:44] kswap problem (high load) ATLAS not notified
 * 6.1.2011: tcx040 reboot [10:39] kswap problem (high load) WE reported
 * 6.1.2011: tcx080 reboot [10:26] kswap problem (high load) WE reported
 * 6.1.2011: tcx120 reboot [11:21] kswap problem (high load) WE reported

== December 2010 ==
 * 15.12.2010: tcx080 reboot [11:37] kswap problem
 * 14.12.2010: tcx120 reboot [07:54] ??? (sar: normal load)
 * 13.12.2010: tcx060 reboot [08:58] no ssh login possible, nothing obvious from IT (sar: high load -> kswapd)
 * 9.12.2010: tcx080 reboot [10:58] kswap problem
 * 8.12.2010: tcx120 reboot [18:04] nothing obvious (sar: high load -> kswapd)

== November 2010 ==
 * 17.11.2010: tcx080 reboot [14:27] under investigation
 * 17.11.2010: tcx120 reboot [09:49] two long running jobs, not killable
 * 11.11.2010: tcx040, tcx060 reboot [09:04] reboot for AFS client update (reduce deadlock)
 * 8.11.2010: tcx080, tcx120 [09:11] reboot for AFS client update (reduce deadlock)

== October 2010 ==
 * 11.10.2010: tcx040 reboot [6:27]
 * 6.10.2010: tcx040 reboot [8:34] (offline due to hardware problems since 23.9.2010, partly replaced by tutorial machines)
 * 5.10.2010: dCache problems: crash on one pool - same as 1.10.2010 (LOCALGROUPDISK ())
 * 4.10.2010: dCache problems: problems with dcache-atlas17-02, DATADISK
 * 1.10.2010: dCache problems: crash on two pools (LOCALGROUPDISK (13597 recovered, 8+18+2697 lost), DATADISK (13901 recovered, 40 lost (SAM test, ...)))

== September 2010 ==
 * 23.9.2010:
  * all work group servers rebooted due to downtime
 * 18.9.2010:
  * reboot of all work group servers due to kernel update [14:45]
 * 9.9.2010:
  * various logging/AFS problems (10:00)
  * tcx120 reboot [11:28]
 * 1.9.2010:
  * various logging/AFS problems (13:40)

== August 2010 ==
 * 30.8.2010:
  * tcx060 reboot [07:06] ? (29.8.2010)
  * various logging/AFS problems (11:05, 11:35, 17:21)
 * 18.8.2010:
  * tcx120 reboot [11:30] lustre problems
  * tcx080 reboot [12:27], memory full (started around 11:14)
  * login problem (around 11:14)
 * 16.8.2010:
  * load balancing [around 11:00] (two machines off, all to tcx060 with load of 6, tcx080 load of 0.9)
  * tcx120 reboot (13:08), off since 11.8.2010
  * tcx040 reboot (12:59), off since 12.8.2010
  * tcx080 reboot (15:11), see 7.8.2010
  * Lustre problem: /scratch/hh/lustre/atlas/users/mwildt/FirstData/SFrame_Output/DATA_7TeV (14:16)
 * 11.8.2010:
  * Various short login problems (whole day), AFS instabilities
 * 10.8.2010:
  * Various short login problems, AFS instabilities
 * 7.8.2010:
  * tcx080 needs reboot due to InfiniBand problems (blocked for users in load balancing only), reboot on 16.8.2010 (missing ATLAS support feedback due to vacation)
 * 5.8.2010:
  * reboot tcx060 (19:03)
 * 2.8.2010:
  * reboot tcx040 (13:37, 12:50) drained, IB problems
  * reboot tcx080 (15:20) 100% swap

== July 2010 ==
 * 15.7.2010:
  * maintenance with kernel upgrade
 * 27.7.2010:
  * load balancing problem [10:30-10:45]
  * load balancing problem [13:40-14:10]
  * reboot of tcx080 (10:47, 14:18), tcx060 (11:03), tcx120 (10:48, 14:21)
  * problem with PROOF lite and kswapd
 * 28.7.2010:
  * reboot tcx060 (12:31), tcx080 (12:39)
  * problem with PROOF lite and kswapd