[00:02:28] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:37] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:07:52] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.678 seconds
[00:14:28] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:22:16] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.671 seconds
[00:23:46] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.185 seconds
[00:26:10] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:26:16] New patchset: Bhartshorne; "adding etag awareness to abort failed puts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2598
[00:27:31] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[00:27:40] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:27:56] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2598
[00:28:52] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:34:07] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:35:10] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[00:36:40] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.238 seconds
[00:43:25] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:43:59] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2598
[00:43:59] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2598
[00:44:46] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.537 seconds
[00:49:13] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:50:43] New patchset: Bhartshorne; "typoed semicolon should be comma" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2599
[00:51:15] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2599
[00:51:16] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2599
[00:51:37] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.341 seconds
[00:53:04] New patchset: Bhartshorne; "yay more typos boo no lint checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2600
[00:53:25] too bad gerrit.2600 doesn't have more subversive content.
[00:53:32] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2600
[00:53:33] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2600
[00:53:35] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.549 seconds
[00:55:40] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:58:58] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:00:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.275 seconds
[01:08:07] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.740 seconds
[01:08:52] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:13:31] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:14:43] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.769 seconds
[01:16:35] New patchset: Asher; "my fork of gdash from git://github.com/asher/gdash.git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2601
[01:18:46] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:19:43] New patchset: Asher; "my fork of gdash from git://github.com/asher/gdash.git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2601
[01:20:11] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2601
[01:20:12] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2601
[01:20:43] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.528 seconds
[01:21:28] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 606s
[01:21:55] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 635s
[01:22:31] PROBLEM - MySQL replication status on db1025 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 673s
[01:24:46] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:25:22] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.128 seconds
[01:28:31] PROBLEM - Puppet freshness on carbon is CRITICAL: Puppet has not run in the last 10 hours
[01:28:40] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.021 seconds
[01:29:16] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:39:37] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:42:37] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.487 seconds
[01:42:37] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[01:42:55] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[01:46:13] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.094 seconds
[01:46:31] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:47:35] RECOVERY - MySQL replication status on db1025 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[01:49:04] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.561 seconds
[01:54:10] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:54:37] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 609s
[01:54:46] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 617s
[01:56:43] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.534 seconds
[01:57:10] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:00:46] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:02:25] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.873 seconds
[02:04:40] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.704 seconds
[02:08:43] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:10:22] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:13:58] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.664 seconds
[02:17:52] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:24:28] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 243 seconds
[02:25:49] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.344 seconds
[02:29:53] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:30:55] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds
[02:35:19] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 217 seconds
[02:41:55] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[02:42:22] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[02:47:10] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds
[02:54:13] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.201 seconds
[02:54:58] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 239 seconds
[02:58:07] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:12:45] PROBLEM - Host amssq50 is DOWN: PING CRITICAL - Packet loss = 100%
[03:12:45] PROBLEM - Host amssq59 is DOWN: PING CRITICAL - Packet loss = 100%
[03:12:45] PROBLEM - Host amssq57 is DOWN: PING CRITICAL - Packet loss = 100%
[03:12:45] PROBLEM - Host amssq56 is DOWN: PING CRITICAL - Packet loss = 100%
[03:12:45] PROBLEM - Host amssq60 is DOWN: PING CRITICAL - Packet loss = 100%
[03:12:46] PROBLEM - Host amssq45 is DOWN: PING CRITICAL - Packet loss = 100%
[03:12:46] PROBLEM - Host amssq62 is DOWN: PING CRITICAL - Packet loss = 100%
[03:12:47] PROBLEM - Host amssq51 is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:03] PROBLEM - Host amssq49 is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:03] PROBLEM - Host amssq58 is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:03] PROBLEM - Host amssq61 is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:03] PROBLEM - Host amssq52 is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:03] PROBLEM - Host bits.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:12] PROBLEM - Host cp3002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:39] PROBLEM - Host cp3001 is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:39] PROBLEM - Host br1-knams is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:48] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:57] PROBLEM - Host bits.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:15] PROBLEM - Host knsq17 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:15] PROBLEM - Host knsq20 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:15] PROBLEM - Host knsq27 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:15] PROBLEM - Host knsq29 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:15] PROBLEM - Host knsq21 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:16] PROBLEM - Host knsq23 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:16] PROBLEM - Host knsq26 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:16] who killed bits ?
[03:14:17] PROBLEM - Host knsq18 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:17] PROBLEM - Host csw2-esams is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:18] PROBLEM - Host csw1-esams is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:24] PROBLEM - Host hooft is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:24] PROBLEM - Host foundation-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:42] PROBLEM - Host knsq28 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:51] PROBLEM - Host knsq19 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:51] PROBLEM - Host knsq22 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:51] PROBLEM - Host knsq16 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:51] PROBLEM - Host knsq25 is DOWN: PING CRITICAL - Packet loss = 100%
[03:15:09] PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100%
[03:15:09] PROBLEM - Host foundation-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100%
[03:15:15] hah, all of us come online
[03:15:18] PROBLEM - Host ms6 is DOWN: PING CRITICAL - Packet loss = 100%
[03:15:18] PROBLEM - Host mediawiki-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[03:15:18] PROBLEM - Host lily is DOWN: PING CRITICAL - Packet loss = 100%
[03:15:27] RECOVERY - Host bits.esams.wikimedia.org is UP: PING WARNING - Packet loss = 73%, RTA = 115.55 ms
[03:15:27] RECOVERY - Host cp3001 is UP: PING WARNING - Packet loss = 73%, RTA = 121.22 ms
[03:15:27] RECOVERY - Host knsq24 is UP: PING WARNING - Packet loss = 28%, RTA = 120.16 ms
[03:15:27] RECOVERY - Host knsq27 is UP: PING WARNING - Packet loss = 28%, RTA = 117.23 ms
[03:15:27] RECOVERY - Host amssq52 is UP: PING WARNING - Packet loss = 28%, RTA = 119.56 ms
[03:15:27] RECOVERY - Host amssq58 is UP: PING WARNING - Packet loss = 28%, RTA = 119.56 ms
[03:15:27] RECOVERY - Host amssq49 is UP: PING WARNING - Packet loss = 28%, RTA = 125.82 ms
[03:15:28] RECOVERY - Host maerlant is UP: PING WARNING - Packet loss = 66%, RTA = 115.94 ms
[03:15:28] RECOVERY - Host ms6 is UP: PING OK - Packet loss = 16%, RTA = 116.82 ms
[03:15:36] RECOVERY - Host amssq60 is UP: PING OK - Packet loss = 0%, RTA = 123.57 ms
[03:15:36] RECOVERY - Host amssq59 is UP: PING OK - Packet loss = 0%, RTA = 117.85 ms
[03:15:36] RECOVERY - Host cp3002 is UP: PING OK - Packet loss = 0%, RTA = 117.40 ms
[03:15:36] RECOVERY - Host hooft is UP: PING OK - Packet loss = 0%, RTA = 123.60 ms
[03:15:36] RECOVERY - Host knsq21 is UP: PING OK - Packet loss = 0%, RTA = 117.64 ms
[03:15:36] RECOVERY - Host amssq62 is UP: PING OK - Packet loss = 0%, RTA = 117.62 ms
[03:15:37] RECOVERY - Host amssq56 is UP: PING OK - Packet loss = 0%, RTA = 117.93 ms
[03:15:37] RECOVERY - Host amssq50 is UP: PING OK - Packet loss = 0%, RTA = 123.56 ms
[03:15:38] RECOVERY - Host amssq61 is UP: PING OK - Packet loss = 0%, RTA = 123.70 ms
[03:15:38] RECOVERY - Host knsq20 is UP: PING OK - Packet loss = 0%, RTA = 117.63 ms
[03:15:39] RECOVERY - Host knsq26 is UP: PING OK - Packet loss = 0%, RTA = 117.54 ms
[03:15:39] RECOVERY - Host knsq23 is UP: PING OK - Packet loss = 0%, RTA = 117.87 ms
[03:15:40] RECOVERY - Host knsq29 is UP: PING OK - Packet loss = 0%, RTA = 117.70 ms
[03:15:40] RECOVERY - Host knsq18 is UP: PING OK - Packet loss = 0%, RTA = 123.21 ms
[03:15:41] RECOVERY - Host knsq17 is UP: PING OK - Packet loss = 0%, RTA = 123.59 ms
[03:15:41] RECOVERY - Host foundation-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 123.50 ms
[03:15:42] RECOVERY - Host csw2-esams is UP: PING OK - Packet loss = 0%, RTA = 125.50 ms
[03:15:54] RECOVERY - Host knsq28 is UP: PING OK - Packet loss = 0%, RTA = 126.84 ms
[03:15:54] RECOVERY - Host amssq57 is UP: PING OK - Packet loss = 0%, RTA = 117.08 ms
[03:15:54] RECOVERY - Host amssq51 is UP: PING OK - Packet loss = 0%, RTA = 120.42 ms
[03:15:54] RECOVERY - Host amssq45 is UP: PING OK - Packet loss = 0%, RTA = 120.76 ms
[03:16:03] RECOVERY - Host knsq19 is UP: PING OK - Packet loss = 0%, RTA = 114.12 ms
[03:16:03] RECOVERY - Host knsq22 is UP: PING OK - Packet loss = 0%, RTA = 113.45 ms
[03:16:03] RECOVERY - Host knsq16 is UP: PING OK - Packet loss = 0%, RTA = 120.75 ms
[03:16:03] RECOVERY - Host knsq25 is UP: PING OK - Packet loss = 0%, RTA = 113.59 ms
[03:16:12] RECOVERY - Host br1-knams is UP: PING OK - Packet loss = 0%, RTA = 114.67 ms
[03:16:30] RECOVERY - Host mediawiki-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 114.38 ms
[03:16:39] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 114.41 ms
[03:18:09] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.458 seconds
[03:19:21] RECOVERY - Host bits.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 113.46 ms
[03:20:33] RECOVERY - Host foundation-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 119.19 ms
[03:20:42] RECOVERY - Host lily is UP: PING OK - Packet loss = 0%, RTA = 113.58 ms
[03:22:21] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:23:42] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.909 seconds
[03:24:18] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds
[03:27:45] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:28:21] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 231 seconds
[03:36:09] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 194 seconds
[03:38:24] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.004 seconds
[03:38:24] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.259 seconds
[03:42:27] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:42:36] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:57:00] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 29 seconds
[04:00:09] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.278 seconds
[04:01:03] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 272 seconds
[04:05:21] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 3.772 seconds
[04:08:30] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds
[04:08:48] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:11:39] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:14:57] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 253 seconds
[04:17:03] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.660 seconds
[04:22:27] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:26:12] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.452 seconds
[04:30:24] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:38:30] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.982 seconds
[04:40:00] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.239 seconds
[04:42:24] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:48:06] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:49:00] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.081 seconds
[04:54:51] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.464 seconds
[04:55:54] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:58:54] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:00:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.695 seconds
[05:04:58] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:11:25] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 4.957 seconds
[05:11:34] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.087 seconds
[05:20:43] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:21:01] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:21:55] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.551 seconds
[05:22:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.029 seconds
[05:27:46] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:28:49] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:29:07] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.780 seconds
[05:31:22] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.928 seconds
[05:52:49] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 1 seconds
[05:58:16] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 234 seconds
[06:37:07] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:37:07] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:38:19] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 0.960 seconds
[06:38:19] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.933 seconds
[06:56:19] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 5 seconds
[07:00:13] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 239 seconds
[07:17:01] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds
[07:23:37] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 260 seconds
[07:47:28] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:50:01] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[07:51:04] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[07:53:46] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection refused
[07:53:54] yeah it sure is
[07:54:58] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[07:56:55] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[07:57:13] PROBLEM - Lucene on search6 is CRITICAL: Connection refused
[07:59:09] stale NFS file handle
[08:00:18] ah
[08:02:47] apergos: started
[08:02:55] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.003 second response time on port 8123
[08:03:00] !log remounted /home on search6, started lsearchd
[08:03:04] Logged the message, Master
[08:03:09] how did you get the remount to work?
[08:03:15] that's what I was looking for how to do
[08:03:22] just umount /home and mount /home
[08:03:28] cause its in fstab
[08:03:40] ok
[08:03:49] RECOVERY - Lucene on search6 is OK: TCP OK - 0.014 second response time on port 8123
[08:03:53] I was still looking in the irc logs to figure out the workaround
[08:03:54] thanks
[08:04:08] yw
[08:05:50] I wonder if we'll have that on all the other search boxes :-/
[08:06:11] I hate nfs
[08:06:19] is there dsh group search boxes?
[08:06:40] I dunno, let's look
[08:06:46] lets just do an "ls /home"
[08:07:05] won't it hang?
[08:07:28] /usr/local/dsh/node_groups/search
[08:07:36] hmm, didnt for me, well just "cd" then
[08:07:38] -bash: cd: /home: Stale NFS file handle
[08:07:43] k
[08:07:45] /usr/local/dsh/node_groups/searchidx
[08:08:07] the last one has only one member of course
[08:08:26] hmm, search boxes ask for password..
[08:08:50] they do?
[08:09:04] if you can ssh into them that seems odd
[08:09:16] what's one that asked for a password?
[08:09:19] i should be root on fenari first :p
[08:10:28] they all have the Stale NFS file handle :p
[08:10:32] no, just have the dsh go as root
[08:10:50] search9,20,14,7,16,15,2,12,17...
[08:11:24] dsh -cM -g search -- "cd /home"
[08:11:52] it won't bite us til someone has to restart on those
[08:11:54] then, boom
[08:13:02] it only needs it at the beginning for startup, seems like it doesn't actually have anything on /home open after that
[08:13:07] !log all search boxes had /home: Stale NFS file handle.. remounting
[08:13:09] Logged the message, Master
[08:13:44] apergos: better now
[08:13:55] how's the cd look?
[08:14:09] it does not return anythin, so good:)
[08:14:13] yay!
[08:14:18] so........
[08:14:21] well, there was one...
[08:14:26] ?
[08:14:29] searchidx2: mount.nfs: /home is busy or already mounted
[08:14:38] ah, that could be the exception
[08:14:42] searchidx2: umount.nfs: /home: device is busy
[08:14:49] (to having things open)
[08:14:54] yep, looks like it
[08:15:36] the rest have it remounted, example search9, can list /home now..ack
[08:16:22] oh
[08:16:23] heh
[08:16:40] stuff runs as rainman with things in his directory
[08:16:54] it's already mounted
[08:17:00] so yay for that
[08:17:03] k;)
[08:17:17] so... what do you think about another stab at https and office?
[08:17:20] :-)
[08:17:27] ah,heh;)
[08:17:35] after i made coffee?:)
[08:17:39] people in the us are asleep,
[08:17:42] we hardly use it
[08:17:43] sure!
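The stale-NFS remount procedure worked out in the conversation above can be sketched as two dsh invocations. This is a hypothetical reconstruction from the log, not the exact commands that were run: the `search` node group and the fstab-driven bare `mount /home` are taken from the discussion, while the combined umount-and-mount one-liner is an assumption.

```shell
# Sketch of the stale-NFS workaround discussed above (assumptions: the dsh
# node group "search" defined in /usr/local/dsh/node_groups/search, /home
# listed in /etc/fstab so a bare "mount /home" works, and root access on
# the host running dsh).

# Remount /home on every search box:
dsh -cM -g search -- 'umount /home && mount /home'

# Verify: "cd /home" prints nothing when the mount is healthy; a still-stale
# mount prints "cd: /home: Stale NFS file handle" instead.
dsh -cM -g search -- 'cd /home'
```

As the log shows, a host that still has files open on /home (searchidx2 here) fails the umount with `umount.nfs: /home: device is busy` and has to be handled separately.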
[08:17:48] ah it's an hour later for you
[08:18:12] ok, cool, be back soon
[08:18:45] ok
[08:21:04] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours
[08:22:16] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 6 seconds
[08:28:43] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 235 seconds
[08:28:43] If the client provided a Host: header field the list is searched for a matching vhost and the first hit on a ServerName or ServerAlias is taken and the request is served from that vhost.
[08:28:57] so that's what we have to work with ( http://httpd.apache.org/docs/2.2/vhosts/details.html )
[08:29:01] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours
[08:29:08] * apergos goes to finish fixing their oatmeal
[08:39:04] PROBLEM - Puppet freshness on gilman is CRITICAL: Puppet has not run in the last 10 hours
[08:39:04] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours
[08:39:39] re
[08:40:04] ok, so we need to, quoting Ryan: "you need to configure the redirect to only redirect if X-Forwarded-Proto is http"
[08:40:17] uh huh
[08:40:31] RewriteCond %{HTTP:X-Forwarded-Proto} !https
[08:40:32] so
[08:40:43] :)
[08:40:45] if we set up a separate vhost stanza just for office.wikimedia.org
[08:40:57] maybe we can get away with using what's in remnants.conf
[08:41:06] I'm gonna look at what's there now.
[08:41:52] had you tried putting something in that stanza earlier?
[08:42:15] before the first rewrite rule I guess
[08:44:16] nah, not really, i was looking at a slightly different way to rewrite
[08:44:20] RewriteCond %{HTTPS} off
[08:44:51] but you got the right thing already i think
[08:45:09] well, we put it on one server, we try it from fenari, etc
[08:45:54] ok, srv250 is the guinea pig?
[08:46:07] suer
[08:46:10] sure
[08:46:39] morning hashar
[08:46:51] hello :)
[08:47:15] apergos: do you like shared screen?
[08:47:33] I can do that, whatever you like
[08:47:53] then "screen -x" on srv250 pls
[08:47:59] or you can just tell me when you have made changes
[08:48:04] since I'm on there I can just look at them
[08:48:20] it's not like the process of making them is so special...
[08:48:58] PROBLEM - Puppet freshness on ganglia1001 is CRITICAL: Puppet has not run in the last 10 hours
[08:49:24] it just combines the editing and chatting ;) poor man's etherpad.. but ok ..just editing
[08:49:47] I have my irc window and my terminal window on the same desktop so....
[08:50:10] mind if I attach to srv250 ? Always wondered how it looks :-D
[08:50:20] (and yes, one day I will have to learn how to use screen)
[08:50:26] go ahead
[08:51:11] root only feature 8-))
[08:51:19] will try out on my comp
[08:55:46] apergos: saved redirects.conf on srv250
[08:56:10] oh, you put it there and not remnants.conf?
[08:56:13] lemme look
[08:56:36] i added it back like the circular one.but just added one more condition
[08:56:41] see, because we have the "firstmatch only"
[08:56:42] the !https one
[08:56:52] thing, this means that the stanza in remnants.conf won't ever get used
[08:57:17] and I think we want it (docroot, the math rewrites, all the rest)
[08:58:18] I'm not 100% sure
[08:58:26] gotcha, about "The first matching path on the list .."
[08:58:43] but afaict that's what would have been active til now
[09:00:37] having fun at work is important: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/111525#c30978
[09:01:52] apergos: alright, moved it to remnant, same thing
[09:02:08] apergos: before the other standard mediawiki rewrite rules
[09:02:11] lemme stare at it some
[09:02:55] RewriteCond %{HTTP_HOST} office.wikimedia.org that line can go now
[09:03:14] by definition we already match, right?
[09:03:19] ServerName office.wikimedia.org
[09:03:34] true
[09:04:51] hashar: hah, is that "wiki love"?
[09:04:59] eh, "code love"
[09:04:59] somehow!
[09:05:56] there's the "technical barnstar ";)
[09:09:02] since you're in the file are you taking out that line? :-P
[09:09:21] apergos: i just did, gracefulled, and wgot it from fenari..
[09:09:29] ok
[09:09:30] HTTP request sent, awaiting response... HTTP/1.1 301 Moved Permanently
[09:09:32] "wgot" nice
[09:09:35] Location: https://office.wikimedia.org/wiki/ [following]
[09:09:38] :)
[09:10:06] I checked with redirect=0, redirect=1, fior mainpage
[09:10:10] now let me try a few other variants
[09:12:20] hmmm
[09:12:34] wget --header="Host: office.wikimedia.org" --max-redirect=2 -S 'http://srv250/'
[09:12:39] this saves a copy of "index.html"
[09:12:47] I wonder why it saves that and not Main_Page
[09:13:52] also....
[09:14:01] RewriteRule ^/math/(.*) http://upload.wikimedia.org/math/$1 [R=301]
[09:14:08] that's probably going to be a problem with https
[09:14:31] it should be https I guess
[09:14:38] and here that $wg Mediawiki setting also comes into play again, doesnt it
[09:14:51] I dunno about that
[09:15:12] apergos: index.html is probably a wget default whenever you ask for an url ending with /
[09:15:30] let me try it for some other wiki
[09:15:55] index.html contains Main Page content
[09:15:57] hashar: bingo, same behavior for meta (which we didn't touch)
[09:15:59] so that's ok.
[09:16:12] class="firstHeading">Main Page
[09:16:18] yeah, I saw the content was ok, just wanted to make sure we weren't changing the behavior
[09:17:20] but we can just use https:// on upload.. it seems..
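The vhost being assembled above can be sketched roughly as follows. Only the two directives quoted in the log (the `X-Forwarded-Proto` condition and the `/math/` rewrite, switched to https as just agreed) are taken from the conversation; the surrounding stanza shape, rule pattern, and flags are assumptions for illustration, not the actual contents of remnant.conf.

```apache
# Hypothetical sketch of the office.wikimedia.org redirect discussed above.
# SSL is terminated in front of Apache, so the backend must decide "already
# https?" from the X-Forwarded-Proto request header, not from %{HTTPS}.
<VirtualHost *:80>
    ServerName office.wikimedia.org

    RewriteEngine On

    # Redirect only plain-http requests; requests that arrived via the SSL
    # terminator carry "X-Forwarded-Proto: https" and must pass through,
    # otherwise the redirect loops forever.
    RewriteCond %{HTTP:X-Forwarded-Proto} !https
    RewriteRule ^/(.*)$ https://office.wikimedia.org/$1 [R=301,L]

    # Serve old /math/ URLs from upload over https too, so https pages do
    # not trigger the browser's mixed-content warning.
    RewriteRule ^/math/(.*) https://upload.wikimedia.org/math/$1 [R=301]
</VirtualHost>
```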
[09:17:32] good
[09:17:44] hashar: re:screen, actually sharing is not really root-only, it's just "same user", so you can also login as "foobar" multiple times, first one does "screen", and the following ones "screen -x" (for different users you'd have to mess with tty permissions)
[09:17:44] I didn't see "wgMediawiki"
[09:17:51] in the usual config files
[09:18:28] ah, it was this:
[09:18:30] If you'd like to force the wiki to be SSL-only, set $wgServer = 'https://example.com'; (whatever your site is, do NOT include the path to the wiki here), along with .htaccess rewriterules to redirect people from the http site to the https site
[09:18:48] mutante: yeah figured that out on my local comp. Looks like the perfect tool for "Xtreme operating"
[09:18:52] but that does not apply to us, cause we do SSL termination..?
[09:19:27] I don't think you have to set it
[09:20:11] we already have a special stanza for this in CommonSettings
[09:20:18] } elseif ( isset( $_SERVER['HTTP_X_FORWARDED_PROTO'] ) && $_SERVER['HTTP_X_FORWARDED_PROTO'] == 'https' ) {
[09:20:26] $wgServer = preg_replace( '/^http:/', 'https:', $wgServer );
[09:20:38] ah:)
[09:21:19] so just https://upload I guess and we see
[09:21:55] i did, and added another RewriteCond %{HTTP:X-Forwarded-Proto} !https before that one
[09:22:46] and..repeat it again. with the opposite condition
[09:23:40] I think that's superfluous
[09:23:42] nah, nevermind..
[09:23:44] ack
[09:24:32] ah, there's another one down below.. "UseMod comp. URLs.."
[09:24:45] what is interesting to me is that we have had this work for people going to https://office with images before now
[09:25:14] I'm trying to think about how that has worked (without getting the "mixed secure and insecure content" warning
[09:25:15] )
[09:25:58] what about the usemod urls?
[09:27:13] do we need to worry about those?
[09:27:16] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 1 seconds
[09:27:33] no, we don't , i was just in the wrong vhost
[09:27:37] ok
[09:28:44] ok,yeah, is there more we can test ?
[09:28:50] like the math redirect now
[09:29:03] lemme think about that
[09:30:15] this is going to take me a minute, I don't remember how formulas go in
[09:30:23] if it works we could as well fix "chair.wm"
[09:31:35] well I just did a grep on pages-articles for officewiki and there are no math tags :-D
[09:31:52] hmm I
[09:31:58] they probably will be & or something
[09:32:06] heh, good idea to just grep the articles;)
[09:32:22] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 264 seconds
[09:32:47] don't find amp;math either
[09:32:54] so I'm going to give up on testing that
[09:33:00] as for the images...
[09:33:57] hard to test without being logged in :-/
[09:34:00] so
[09:34:11] wanna try making it live...? :-D
[09:35:02] hmmm...yes:)
[09:35:18] ok
[09:35:21] oh boy
[09:35:22] :-D
[09:37:09] at least we know how to purge it quicker now:)
[09:37:15] yes indeed
[09:40:37] svn commited it... syncing
[09:40:46] ok
[09:41:24] does the sync no longer log??
[09:41:42] oh, yea, i noticed that yesterday
[09:41:51] i was sure i didnt have to log them manually in the past.ack
[09:41:54] damn it, broken
[09:41:57] (the logging)
[09:42:04] (not necessarily the redirect)
[09:42:27] !log made a new change to remnant.conf and synced apaches in a fresh attempt to fix office.wm redirect
[09:42:30] Logged the message, Master
[09:43:11] so the sync completed?
[09:43:20] i.e. should I fire up a fresh browser?
[09:43:20] yes, checking if it arrived on 231
[09:43:29] it did
[09:43:37] dum dum dee dum
[09:43:59] apache-graceful-all
[09:44:15] oh yeah :-D
[09:44:49] done
[09:45:05] wow... it looks ..like ...
[09:45:06] it works:)
[09:45:35] hashar: wanna test as well:)
[09:45:59] I checked a File: page
[09:46:01] images show up
[09:46:20] i logged in.. no problems
[09:46:23] what should I test ?
[09:46:28] PROBLEM - check_gcsip on payments4 is CRITICAL: CRITICAL - Socket timeout after 61 seconds
[09:46:35] hashar: http on office wm
[09:46:53] hashar: eh, it should force https now
[09:47:08] and of course not break things as yesterday
[09:47:24] right, all the urls in the page are relative anyways
[09:47:33] for the images...
[09:47:59] uh oh
[09:48:01] let me ask my testing puppet
[09:48:08] https://office.wikimedia.org/wiki/Business_Plan
[09:48:11] * hashar gives kitty a test case
[09:48:29] actually no, it's a chrome issue
[09:48:55] trying to actually upload a file
[09:49:00] this looks like the mixed content warning
[09:49:04] result: http://www.quickmeme.com/meme/364ua9/
[09:49:24] hashar: :))
[09:50:09] apergos: yep, no warning in iceweasel/firefox
[09:50:22] RECOVERY - check_gcsip on payments4 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.164 second response time
[09:50:49] I wish chrome would tell me which elements it doesn't like
[09:51:07] yeah -1 for Safari that does not let me see the cookies :/
[09:51:25] * apergos tries it from firefox
[09:51:51] -1 for Firefox that FU***NG hide the protocol from the URL
[09:51:55] seriously
[09:52:00] that browser used to be a great one
[09:54:26] when I look at page info from firefox, I see it's getting from bits via https and from upload via https
[09:54:33] so that seems ok
[09:54:43] re: uploads, ok, it uses commons only
[09:54:55] let's keep it
[09:55:08] if there are any edge cases someone will turn them up when the sf crew comes on line
[09:56:35] tried to upload "SSL Symbol.png", but This file is a duplicate of the following file: ;9
[09:56:42] :-P
[09:57:29] apergos: awesome, and thanks a lot for help and making me fix it right away..
[09:57:42] actually today my vacation starts.. and this way it feels a lot better!
[09:57:47] congrats on getting it working
[09:57:59] I didn't know this was your vacation or I would have not mentioned it at all!
[09:58:29] its the last (half) day of work..its perfect.. this morning or it wouldnt have worked out [09:58:35] nice!! [09:59:55] yeah, this was top on the list "before vacation" [10:00:06] updating RT:)) [10:00:09] yay! [10:00:13] hmm [10:00:19] looks like the cookie is sent securely [10:00:36] err [10:00:41] sent with the "secure" flag [10:01:17] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 26 seconds [10:01:36] secure flag? [10:01:53] that is a flag you can set when sending a cookie [10:02:02] that instructs the browser to only send the cookie over HTTPS [10:02:05] ok [10:02:08] well good [10:02:14] we don't want it to go over http [10:02:20] else someone connecting to http://office will have his browser send the cookie [10:02:27] (that is what happens with jenkins :-( ) [10:02:34] oh really [10:02:36] "The Secure attribute is meant to keep cookie communication limited to encrypted transmission, directing browsers to use cookies only via secure/encrypted connections. Naturally, web servers should set Secure cookies via secure/encrypted connections, lest the cookie information be transmitted in a way that allows eavesdropping when first sent to the web browser." [10:02:43] if ( isset( $_SERVER['HTTP_X_FORWARDED_PROTO'] ) && $_SERVER['HTTP_X_FORWARDED_PROTO'] == 'https' ) { [10:02:44] $wgCookieSecure = true; [10:02:44] $_SERVER['HTTPS'] = 'on'; // Fake this so MW goes into HTTPS mode [10:02:44] } [10:02:47] OH YEAH [10:02:57] yet another hack in our configuration :-D [10:03:10] I was gonna say our configuration is full of hacks [10:03:16] but really our configuration *is* hacks [10:04:04] anyway that hack makes office send a secure cookie which is great :-D [10:04:17] yay us! [10:04:26] you will probably be able to proceed with the other pirates wikis :-) [10:04:26] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 230 seconds [10:05:33] pirate wikis! [10:05:36] that is what we need!!
[10:06:53] I am totally going to find a reason that the next lab project has to be called "pirate" [10:06:54] !log office.wm now forces https (in a less broken way;) (remnant.conf) [10:06:56] Logged the message, Master [10:09:26] we won't run out of these .. ;) there is still "https://mediawiki.org redirects to http://www.mediawiki.org/" [10:09:41] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds [10:09:54] I seeeeee [10:10:26] just make mediawiki.org https only :-) [10:10:38] er no [10:11:30] RewriteCond %{HTTPS} off [10:11:30] RewriteCond %{HTTP_HOST} mediawiki.org [10:11:30] RewriteRule ^/(.*)$ http://www.mediawiki.org/$1 [R=301,L] [10:11:35] RewriteCond %{HTTPS} on [10:11:35] RewriteCond %{HTTP_HOST} mediawiki.org [10:11:35] RewriteRule ^/(.*)$ https://www.mediawiki.org/$1 [R=301,L] [10:11:49] oh you are on a roll today [10:12:05] that was a suggestion in October :p [10:12:17] meh [10:12:38] !rt 1668 [10:12:38] https://rt.wikimedia.org/Ticket/Display.html?id=1668 [10:13:28] * apergos is returning to their regularly scheduled pile o' cr^H^Hwork for the day... [10:13:29] wasnt used right away mainly cause of "%{HTTPS} might be best... [10:13:32] but it's going to result in an awful lot of code duplication for one letter." [10:14:20] yea, i didnt mean to start another one right now [10:14:47] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 231 seconds [10:15:08] you can... I'm not gonna though :-D [10:16:29] ok,but there's enough "scheduled pile" for me too:) ttyl [10:16:38] see ya :-D [10:16:53] * mutante waves [10:32:02] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds [10:35:56] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 190 seconds [11:09:08] New patchset: Dzahn; "fix pubkey for aengels" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2602 [11:09:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2602 [11:11:22] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds [11:14:17] New review: Dzahn; "wrong one, user did not have matching private key. yes, not leaving the old one in as "absent", it w..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2602 [11:14:18] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2602 [11:14:58] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 205 seconds [11:21:29] apergos: btw, re: search / stale NFS.. it would have been broken by cron soon .heh :P -_> "lsearchd is currently restarted when its weekly logrotate run.""#2449: change lsearchd logrotate script to not restart lsearchd" [11:21:39] oh joy [11:21:47] looks like peter will fix it:) [11:21:47] good catch [11:21:56] yay for that! [11:26:49] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 0 seconds [11:29:40] PROBLEM - Puppet freshness on carbon is CRITICAL: Puppet has not run in the last 10 hours [11:30:43] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 198 seconds [11:32:39] !log sync-apache / graceful not logged anymore by logmsgbot ? [11:32:42] Logged the message, Master [11:50:13] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 0 seconds [11:58:01] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 212 seconds [12:35:33] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:36:36] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:58:22] I don't see why nagios is whining [12:58:30] ekrem is busily serving requests, seems to be ok [13:03:21] <^demon|zzz> apergos: Want to switch to git? [13:03:29] I sure as heck do! 
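Going back to the pair of mediawiki.org rewrite blocks pasted earlier: the two blocks differ only by one letter in the target scheme, which was the objection quoted from the RT ticket. A common mod_rewrite idiom collapses them by appending a literal "s" to %{HTTPS} and capturing it. This is a sketch of that idiom, not what was (or necessarily should be) deployed:

```apache
# One-block variant of the mediawiki.org -> www.mediawiki.org redirect.
# The %{HTTPS}s trick: with HTTPS on, the test string is "ons" and the
# alternation captures the trailing "s"; with HTTPS off it is "offs",
# nothing is captured, and %1 stays empty -- so http%1:// expands to
# http:// or https:// as appropriate. Host match is anchored here so it
# does not also match www.mediawiki.org and loop. A sketch only.
RewriteCond %{HTTP_HOST} ^mediawiki\.org$
RewriteCond %{HTTPS}s ^on(s)|offs$
RewriteRule ^/(.*)$ http%1://www.mediawiki.org/$1 [R=301,L]
```

Note the HTTPS condition must come last: %1 in the rule target refers to the capture groups of the last matched RewriteCond.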
[13:03:39] * apergos does a little happy dance [13:03:51] you'll have to let me know what the commit/review/etc path is for branches [13:03:53] and for core [13:04:06] and for personal branches :-P [13:06:55] <^demon|zzz> apergos: git clone ssh://ariel@gerrit.wikimedia.org/operations/dumps.git -b ariel [13:07:03] <^demon|zzz> Will give you a clone and switch you to your ariel branch [13:07:14] <^demon|zzz> .org:29418 [13:07:24] <^demon|zzz> Stupid port. I keep forgetting it [13:07:36] operations? [13:08:03] <^demon|zzz> Made the most sense to me to put it there. [13:08:08] huh [13:08:09] :-D [13:08:25] <^demon|zzz> And renames/deletes are impossible. Guess I shoulda asked first :p [13:09:10] I'll switch into my branch later [13:09:16] doing a fresh clone, let's see how it looks [13:10:17] <^demon|zzz> Argghhhh, why is it missing the Notes: [13:10:42] what Notes [13:10:44] <^demon|zzz> Hrm, it shows up on the original copy on formey but not my clone :\ [13:10:56] damn demon polluting our namespace [13:11:00] <^demon|zzz> http://p.defau.lt/?lhKxG2Y_D7EgyOUz0bOgmw [13:11:07] * mark deletes [13:11:26] ah [13:11:29] <^demon|zzz> mark: It's Ariel's stuff. At least I put all the mediawiki stuff in mediawiki/ ;-) [13:11:40] soooo [13:12:01] then at least it should be under operations/software/ [13:12:03] you want to figure out about the Notes? I can toss my clone and redo when it's happy [13:12:17] <^demon|zzz> Does `git log` show the Notes: for you? [13:12:38] no chance [13:15:11] <^demon|zzz> I'm baffled why this disappears when cloning. [13:15:48] <^demon|zzz> Hrm, they're missing on the repo copy too. Must get dropped during push. [13:16:50] <^demon|zzz> Ah, you have to push refs/notes/* [13:17:01] <^demon|zzz> Annoying you have to push that separately. [13:17:17] * apergos waits for it [13:20:01] <^demon|zzz> Ok, they're in the repo now. [13:20:20] getting... [13:21:24] <^demon|zzz> Hopefully they'll show up on clone now and you don't have to pull them explicitly.
[13:21:26] <^demon|zzz> That'd be annoying too [13:21:36] <^demon|zzz> Argggghhhh [13:21:37] <^demon|zzz> Stupid git [13:21:40] I'll find out in a minute [13:21:42] oh? [13:21:45] <^demon|zzz> A clone didn't pull the notes. [13:21:49] baaaahhhhh [13:22:00] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.711 seconds [13:23:12] so... what incantation do I need to get them? [13:24:43] <^demon|zzz> Looking... [13:26:03] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:27:11] ^demon|zzz: wake up :) /nick ^demon [13:27:58] dear ops, what would it take to have someone run a pear update on gallium (the cont int server). Should I file a bugzilla / rt ticket ? :) [13:28:10] pear / PHPUnit does not seem to be puppetized [13:29:03] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.883 seconds [13:29:42] <^demon> Hrm, well `git clone --config push='+refs/heads/*:refs/notes/*' ssh://demon@gerrit.wikimedia.org:29418/operations/dumps.git` didn't work [13:29:49] <^demon> There's *got* to be an easier way [13:29:58] <^demon> s/push/fetch/ [13:32:21] ^demon: git clone --mirror [13:32:30] <^demon> That gives you a --bare repo [13:32:32] that should copy everything [13:32:32] <^demon> But with notes :p [13:32:52] <^demon> You don't have a working tree in a --mirror repo [13:35:25] <^demon> http://progit.org/2010/08/25/notes.html#sharing_notes [13:35:48] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:48] I love the first line of this page [13:38:30] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.965 seconds [13:38:58] <^demon> After cloning, you can `git fetch origin refs/notes/*:refs/notes/*` [13:39:00] <^demon> Should work [13:39:06] yeah I am looking at that [13:39:13] what kind of icky reference string is that??
[13:39:30] * apergos tries it anyways [13:39:45] <^demon> That's obnoxious, but at least we know now and can document it :) [13:40:07] * [new branch] refs/notes/commits -> refs/notes/commits [13:40:10] that's bizarre [13:40:11] but otoh [13:40:28] git log shows the fricking Notes now [13:42:33] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:10] <^demon> Hrm, git review's freezing on me :\ [13:46:29] so when I check something into my branch do I need to go through gerrit or whatever? [13:48:21] <^demon> Yeah, like we do with puppet. We could change the permissions for your branch though if you'd like. [13:48:38] <^demon> Per-branch permissions are possible in gerrit which is cool :) [13:49:30] in order for it to go to the "production" copy of my branch it has to get reviewed etc first eh? [13:49:48] <^demon> Well you work mainly off your 'ariel' branch, right? [13:49:52] uh huh [13:50:23] things do not get automatically synced from there to the snapshot hosts [13:50:52] <^demon> *nod* [13:51:03] updates are done manually after testing [13:51:08] <^demon> Well right now master & ariel are both using the "review to merge." [13:51:10] I'm happy to have folks review things [13:52:07] <^demon> hashar: Mind cloning operations/dumps.git and seeing if you can use git-review? I'm having trouble. [13:52:10] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 0 seconds [13:52:10] I would be irritated if I had to wait for someone to find time to review something that I was trying to get into testing (a number of admittedly large commits in the past have been deferred, which is fine by me, [13:52:16] I just don't want it to hold up the work) [13:52:23] <^demon> I made a .gitreview file but git review -s is freezing.
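The notes behaviour being debugged above reproduces with any local repository: notes live under refs/notes/* and a plain clone does not fetch them, so one explicit fetch (optionally made permanent with an extra fetch refspec) brings them over. A self-contained local demo, using a throwaway repo rather than operations/dumps:

```shell
# Demo of the git-notes behaviour discussed above: notes are stored under
# refs/notes/* and a plain clone does not fetch them. Throwaway repo paths.
tmp=$(mktemp -d)
cd "$tmp"

git init -q origin
cd origin
git config user.email demo@example.org
git config user.name demo
git commit -q --allow-empty -m "initial commit"
git notes add -m "reviewed-by: demo"    # stored under refs/notes/commits
cd ..

git clone -q origin clone
cd clone
git log -1 | grep -q 'Notes:' || echo "no notes after plain clone"

# The one-off fix from the channel:
git fetch -q origin 'refs/notes/*:refs/notes/*'
git log -1 | grep -q 'Notes:' && echo "notes visible after explicit fetch"

# Make future 'git fetch' runs pick notes up automatically:
git config --add remote.origin.fetch '+refs/notes/*:refs/notes/*'
```

The extra fetch refspec is what the progit notes article linked above suggests for keeping notes in sync without remembering the explicit fetch.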
[13:52:40] * hashar cloning in [13:52:41] <^demon> apergos: Well you have permission to review your own stuff just like with puppet :) [13:52:47] hahaha [13:52:53] ok but hmm [13:53:01] what I would like ideally is this [13:53:07] I can merge my stuff out. fine [13:53:14] ^demon: That's weird [13:53:15] no way!!! [13:53:15] but other people can review before or after [13:53:20] you must face peer review! [13:53:43] without taking into account my merges [13:53:52] <^demon> apergos: There's no such thing as post-merge review in gerrit :\ If you bypass gerrit then there's no code review changeset. [13:54:02] meh [13:54:04] ^demon: you should move it in operations/software/dumps.git IMHO [13:54:07] well I don't want to bypass it [13:54:20] <^demon> hashar: Oh well, too late now. [13:54:22] and yeah we already noted the path preference [13:54:35] ^demon: na too late is not argument :-)) [13:54:53] anyway I have cloned the .git repo and can't see notes [13:55:00] no you won't see them [13:55:01] til you do [13:55:03] <^demon> apergos: As a practical matter, if it's something you need to go ahead and merge you can do so since you're ops. [13:55:19] git fetch origin refs/notes/*:refs/notes/* [13:55:40] New patchset: Catrope; "Add .gitreview file" [operations/dumps] (master) - https://gerrit.wikimedia.org/r/2603 [13:56:02] ^demon: ----^^ [13:56:03] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 204 seconds [13:56:04] Worked just fine for me [13:56:22] <^demon> odd. [13:56:22] I didn't use -s but just committed and ran git-review [13:56:39] <^demon> Tried that too. [13:56:44] oh, that's right, we are going to use that tool now (maybe) [13:56:47] <^demon> I was working on the branch. Wonder if that's a bug [13:56:48] The ariel branch is managed separately, right? [13:57:03] <^demon> apergos: git-review's just an addon. It makes it simpler to use gerrit.
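The .gitreview bodies from changes 2603/2604/2605 are not quoted in the log; for reference, git-review expects a small INI file like the following at the repository root. Host, port, and project match the clone URL ^demon gave earlier; the defaultbranch line for the ariel branch is an assumption about what change 2605 contains.

```ini
# Assumed shape of the .gitreview files discussed above (the actual
# patchset bodies are not shown in the log). git-review reads this from
# the repo root to know where to push changes for review.
[gerrit]
host=gerrit.wikimedia.org
port=29418
project=operations/dumps.git
# For change 2605 on the ariel branch, presumably also:
defaultbranch=ariel
```

As hashar later observes, the trailing .git on the project value appears to be optional.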
[13:57:08] <^demon> But it's not a requirement :) [13:57:40] well "managed" = there's no automated sync out to the snap hosts, and thank god [13:58:02] well I can just push to operations/dumps.git :D [13:58:29] at some point very soon I ought to fold my stuff into "trunk" [13:58:45] <^demon> apergos: master is the new trunk ;-) [13:58:47] heh [13:58:56] then I can do what makes sense: [13:59:01] test code in my branch [13:59:07] test it [13:59:13] New patchset: Hashar; "adding in .gitreview" [operations/dumps] (master) - https://gerrit.wikimedia.org/r/2604 [13:59:17] (actually running some dumps on it) [13:59:18] \O/ [13:59:21] merge to master [13:59:23] ^demon: WFM [13:59:32] <^demon> Hrm.... [13:59:38] where by merge I mean it goes to gerrit [13:59:40] <^demon> Wonder if it's a bug in trying to use it on a branch [13:59:41] I don't think I should be allowed to push to that repo [13:59:43] <^demon> Which would be annoying. [13:59:44] then people review that [13:59:48] * RoanKattouw notices there's a lot of divergence between master and ariel [13:59:54] that's right [13:59:57] <^demon> hashar: Anyone can push to any repo. That's part of the joy of git. [14:00:18] <^demon> But with a gated repo, we can easily say DENIED if you make stupid changes ;-) [14:00:41] Change abandoned: Catrope; "Already done correctly (repo path is wrong, missing .git) in https://gerrit.wikimedia.org/r/#change,..." [operations/dumps] (master) - https://gerrit.wikimedia.org/r/2604 [14:00:55] can we get email notifications about commits to [14:00:56] hmm [14:01:04] I wonder how that would possibly work [14:01:04] Yes [14:01:44] basically I'd love it if I got email notification for anything to operations/dumps master [14:02:14] New patchset: Catrope; "Add .gitreview file for the ariel branch as well" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/2605 [14:02:21] That can be done, although I don't know how it works [14:02:25] ok [14:02:27] New review: gerrit2; "Lint check passed."
[operations/dumps] (ariel); V: 1 - https://gerrit.wikimedia.org/r/2605 [14:02:51] 2605 adds a separate .gitreview file for operations/dumps/ariel , assuming that you'll want work on the ariel branch to be pushed into ariel, not into master directly [14:02:56] well for now I'll just go back to my oppy things (I have some other stuff which is not code still on my plate, probably be several days before an issue of a commit comes up) [14:03:06] yes, not into master directly [14:03:10] correct. [14:03:12] Good [14:03:18] never into master directly [14:03:30] Well your first exercise could be approving 2603 & 2605 so they get merged :) [14:03:40] :-D [14:03:50] <^demon> apergos: If we're happy with dumps in git now, I can go ahead and make the current dumps code read-only in svn. [14:04:30] Hmph, and of course gerrit is doing puppet lint checks for this repo, lol [14:04:40] We desperately need to move all that stuff into Jenkins [14:04:58] <^demon> No no no. [14:05:06] <^demon> Lint checks should remain in gerrit, not move to jenkins [14:05:26] <^demon> But we should be able to define per-repo, "do these lints" => php, python, ruby [14:05:28] Why? [14:05:29] <^demon> etc. [14:05:42] Doing per-repo definitions is so much easier in Jenkins [14:05:51] <^demon> But adding new jobs per-repo is annoying :\ [14:06:00] <^demon> Especially once we get ~500 extensions in there. [14:06:12] Well, 1) you can edit the repo filter on existing jobs [14:06:40] and 2) Diederik had this idea of writing a universal lint script that just traverses the entire tree and invokes the correct linter for each file based on the extension [14:06:51] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 134 MB (1% inode=62%): /var/lib/ureadahead/debugfs 134 MB (1% inode=62%): [14:07:03] yes, a short spot check looks ok [14:07:04] read only it [14:07:20] puppet lint checks? [14:07:27] <^demon> RoanKattouw: That's how we do it now for jenkins and it's slow as molasses.
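The "universal lint script" idea mentioned above (walk the whole tree, pick the right linter per file extension) could be sketched in shell as below. The extension-to-linter table is an illustrative assumption, and a real version would batch PHP files rather than invoke php -l once per file, which is the slowness complained about in the channel.

```shell
# Sketch of the universal-lint dispatcher idea above: traverse a tree and
# choose a lint command per file extension. The extension->linter table is
# illustrative; a real run would batch PHP files instead of one php -l each.
lint_command() {
    case "$1" in
        *.php) echo "php -l" ;;
        *.py)  echo "python -m py_compile" ;;
        *.rb)  echo "ruby -c" ;;
        *.pp)  echo "puppet parser validate" ;;  # puppet >= 2.7 syntax check
        *)     echo "" ;;                        # no linter known, skip
    esac
}

# Dry run: print the lint invocation for every file under a directory.
run_lints() {
    find "$1" -type f | while read -r f; do
        cmd=$(lint_command "$f")
        [ -n "$cmd" ] && echo "$cmd $f"    # echo instead of exec: dry run
    done
    return 0
}
```

Swapping `echo` for an actual invocation (and collecting failures) turns the dry run into a per-repo lint job without needing one Jenkins job per repository.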
[14:07:29] on c and python and bash scripts? [14:07:42] <^demon> Doing php -l for thousands of files is sloowwwww [14:07:54] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:07:59] Hmm yeah at least for PHP linting you need parsekit [14:08:12] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 170 MB (2% inode=62%): /var/lib/ureadahead/debugfs 170 MB (2% inode=62%): [14:08:51] Other linters aren't that slow though, are they? [14:09:18] apergos: mutante: thank you for working out the page this morning. I do indeed have a ticket to take care of that, and the plan is to do it today :) [14:09:38] apergos: The puppet lint check skips all the non-puppet files, which is why it's half-useful for the puppet repo and useless for your repo [14:09:40] ah yeah [14:09:47] nice :-D [14:10:09] ^demon: can't you deny git push by default ? [14:10:22] I mean pushing without review sounds like an issue to me [14:10:27] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [14:10:45] RECOVERY - Disk space on srv223 is OK: DISK OK [14:12:56] New review: ArielGlenn; "(no comment)" [operations/dumps] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2603 [14:12:57] Change merged: ArielGlenn; [operations/dumps] (master) - https://gerrit.wikimedia.org/r/2603 [14:13:11] <^demon> Whoops. [14:13:22] <^demon> RoanKattouw: So like I was about to say.... [14:13:30] <^demon> Parsekit is intermittently flakey on 5.3 [14:14:06] So we're stuck with php -l? [14:14:12] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 0 seconds [14:14:18] whoops what? [14:14:19] <^demon> I don't trust parsekit enough in 5.3 [14:14:21] Then doesn't it make sense to make the lint job more asynchronous by putting it in Jenkins? [14:14:27] <^demon> apergos: Wasn't paying attention and my battery died.
[14:14:32] ohhh [14:14:42] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2605 [14:14:43] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/2605 [14:14:47] there's those [14:14:51] <^demon> RoanKattouw: Isn't the lint check async anyway? Pushing to operations/puppet doesn't wait for a lint check to complete. [14:15:14] True, but it seems to be flaky sometimes [14:15:22] Sometimes random commits don't get linted [14:15:59] <^demon> Well that's worth fixing. [14:16:07] I personally trust Jenkins's Gerrit plugin more than I trust Gerrit's "hook system" (if you can call it that), especially when concurrency is involved [14:16:10] RECOVERY - Disk space on srv224 is OK: DISK OK [14:16:17] <^demon> *shrug* [14:16:41] <^demon> As long as we don't have to create a new repo for every frickin' extension and it's reliable...you and hashar can do what you want :p [14:16:51] Since you obviously don't really care and I do, can we compromise on this and just let me do it? :) [14:17:08] <^demon> As long as the bikeshed can be orange :p [14:17:10] as long as I don't break stuff and you don't have to touch it [14:17:26] * RoanKattouw hands ^demon some orange paint to distract him while he executes his evil plans [14:17:34] ah a flaw in the implementation... of course. not that we've ever had problems with random number generation before :-/ [14:17:56] well anyway we are going to drop git / gerrit tonight [14:18:06] we migrate to bazaar / launchpad [14:18:34] seriously, I think the linting check will be made async [14:18:39] by using a jenkins build [14:18:54] the issue I have is that I need jenkins to ssh to the gerrit host. Need to catch up with Ryan about it tonight [14:19:27] <^demon> hashar: I read your e-mail. As long as we lock down the account to only do the couple of things we want it to I think we'll be fine.
[14:19:27] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 227 seconds [14:19:42] apergos: check the sec channel for a new problem in the world related to random number generation ;) [14:20:21] that's what I was reacting to [14:20:26] ah [14:21:16] I find it really sad, actually, that when I, say, go for a walk down by the stream near my house, I can see enough entropy to keep the entire world safe and random forever. and yet, harnessing that is somehow hard [14:22:03] it's not that we don't have decent sources of entropy [14:22:13] it's always that implementation sucks (and it's hard) [14:22:15] hashar: We can just create an account for Jenkins in Gerrit, right? That's what I did in my labs project [14:23:55] <^demon> RoanKattouw: git review also does automatic rebasing before pushing-for-review? [14:24:03] <^demon> <3 [14:24:06] Yes [14:24:17] Oh and it uses the local branch name as the topic [14:24:20] that's a nice feature, I have to admit [14:24:20] Unless you override with -t [14:24:27] the auto rebase [14:24:45] <^demon> That's so awesomely useful. Keeps useless patchset 2s due to a failed merge. [14:25:41] 2s? [14:26:09] <^demon> Having to submit a patch #2 because it won't merge :p [14:26:23] ah [14:28:53] * ^demon twiddles thumbs while git-review seems to do its thing all slow-like [14:30:56] <^demon> The hell...it's just...hanging here.... [14:32:11] <^demon> It's hanging on the rebase... | [14:34:50] awwwww [14:35:51] <^demon> And svn is now r/o for dumps stuff :) [14:36:15] <^demon> apergos: Might want to let qchris know :p [14:37:09] I will, he'll want to create a branch [14:37:25] It's hanging on the rebase? That's strange [14:37:39] RoanKattouw: I need jenkins to ssh to formey which hosts gerrit [14:37:48] Why do you need to SSH there? [14:37:56] so jenkins can use the gerrit CLI on formey [14:38:05] And why does it need to use the Gerrit CLI?
[14:38:19] sorry should have made a complete sentence [14:38:37] Don't worry, I'll get it out of you eventually :) [14:38:37] the jenkins plugin uses the gerrit cli to figure out which changes have been added [14:38:43] * RoanKattouw puts on interrogator hat [14:38:48] and to submit comments such as "Lint passed" [14:38:53] and that's not the gerrit cli over gerrit's ssh? [14:38:54] Oh, you mean the Gerrit Trigger Plugin? [14:38:58] Yes, it is [14:39:05] yeah Gerrit Trigger Plugin [14:39:07] that's easy then [14:39:08] I think you just need to create a Gerrit account for Jenkins [14:39:11] That's what I did in labs [14:39:23] <^demon> Swap that. A jenkins account for gerrit. [14:39:32] Eh? [14:39:39] <^demon> Oh wait, am I confused now? [14:39:41] * ^demon gives up [14:39:50] You need to create an account in the Gerrit system called 'jenkins' [14:40:00] * RoanKattouw hopes that's clearer [14:40:10] ^demon: go back to Talk: namespaces removal :-]]] [14:40:13] <^demon> Yeah, we're saying the same thing. I misread you. [14:40:22] <^demon> hashar: Go write unit tests. [14:40:33] :) [14:41:01] if we have a jenkins user on gerrit, we need an ssh key pair for jenkins@gallium [14:41:09] Yes [14:41:51] <^demon> hashar: Well you can sudo as jenkins so that shouldn't take more than...5 seconds :p [14:41:51] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 19 seconds [14:44:02] RoanKattouw: I forwarded you the mail I sent to Ryan [14:46:26] New review: Hashar; "Looks like the .git is optional since I have pushed that change without it."
[operations/dumps] (master) - https://gerrit.wikimedia.org/r/2604 [14:46:44] hashar: Thanks, replied [14:47:06] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 249 seconds [14:47:21] might work [14:48:02] That's how I did it, it gives Jenkins the same trust level as any random person that can push stuff into Gerrit [14:48:34] You would need to set permissions on the account carefully, of course, it would be able to do a little bit more than random people, like V+1 and stuff [14:48:36] I guess I was confused at some point [14:48:44] SSH always tick like "shell" access to me [14:50:26] <^demon> Also remember that gerrit's ssh daemon isn't interactive. You can only issue gerrit commands :) [14:51:00] Yeah SSH != SSH here, I can see how that's confusing [14:51:05] yeah that SSH passphrase field in Jenkins ticked like "Are you sure you want to give anyone shell access to production cluster [y/N]?" [14:51:28] so that locked me in a syndrome of "don't do anything or you will open a serious security breach" [14:52:04] This is why I set this up in labs first :) [14:52:09] So now that you're both here, anyway [14:52:15] What's the story with the two Jenkins puppetizations? [14:52:59] One on gallium (misc::contint::test::jenkins) and one on gilman (misc::jenkins) [14:54:02] The gilman one is incomplete and the gallium one is broken (at least for setting up new installs) [14:54:25] <^demon> Ugh, I want fulltext searching in gerrit. [14:54:51] Welcome to the "I want $foo in Gerrit" club [14:54:54] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.546 seconds [14:56:04] ^demon, hashar: Would either of you care to comment on the double Jenkins thing above? 
[14:56:24] misc::jenkins is for mobile / fundraising / whatever project [14:56:52] what I suspect is that the gallium one was installed with misc::jenkins then the class was copy/pasted in puppet later on [14:57:24] afaik any fundraising part of misc::jenkins does not work [14:57:39] Well misc::jenkins doesn't contain any fundraising-specific parts [14:57:44] All it does is install Jenkins [14:57:48] ok [14:57:56] misc::contint::test::jenkins does a lot of things but what it does *not* do is install Jenkins [14:57:59] that's commented out [14:58:16] so that class is broken when trying to install a new machine [14:58:18] this is one of those things that I need to get to eventually--figuring out how fundraising-jenkins should be set up [14:58:40] Well it would make sense to have a shared class that installs Jenkins [14:58:45] <^demon> Also automating that from package install would be nice :) [14:58:48] totally [14:58:49] <^demon> So we can stop doing it manually. [14:58:49] misc::jenkins does so, but it installs it from a 3rd party PPA [14:59:00] So we need to put the package in our own APT repo [14:59:06] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:59:20] # FIXME: third party repository [14:59:21] # This needs to removed, and changed to use Jenkins from our own WMF repository instead. 
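The shared install class being argued for could look roughly like this once the jenkins package has been copied into the WMF apt repository, replacing the third-party PPA that misc::jenkins currently adds. Class name and service handling here are illustrative assumptions, not deployed puppet:

```puppet
# Rough sketch of a shared Jenkins install class, assuming the jenkins
# package has been mirrored into the WMF apt repository so the upstream
# PPA is no longer needed. Class name and details are illustrative only.
class jenkins::install {
    package { "jenkins":
        ensure => present,   # pulled from our own repo, not the PPA
    }

    service { "jenkins":
        ensure  => running,
        enable  => true,
        require => Package["jenkins"],
    }
}
```

Both misc::jenkins and misc::contint::test::jenkins could then include this class instead of each half-implementing (or commenting out) the package install.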
[15:01:39] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.971 seconds [15:05:42] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:57] RoanKattouw: looks like jenkins was installed on gallium from the WMF Ubuntu mirror [15:06:24] Strange [15:06:28] It's not in puppet anywhere [15:06:49] it is commented out in misc::contint class jenkins [15:06:54] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 9 seconds [15:07:01] # first had code here to add the jenkins repo and key, but this package should be added to our own repo instead [15:07:01] # package { "jenkins": ensure=> present ... [15:07:38] and """ $ apt-cache policy jenkins """ gives out Installed: 1.431 and references the apt.wm.org repo [15:07:43] so maybe we can put a new package there [15:07:53] err [15:08:08] we could uncomment the package{ "jenkins" : ensure=> present } [15:09:09] PROBLEM - Lucene on search1002 is CRITICAL: Connection refused [15:09:36] hi robh [15:09:42] hello [15:09:49] sorry for bugging you [15:09:53] :) [15:09:56] or :( [15:10:11] do you have an ETA for the proxy server on locke? [15:10:28] nope, sorry, i have not touched it. [15:10:46] i thought someone else had more knowledge on that, wasnt there a discussion about this? [15:10:48] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 243 seconds [15:11:04] yes, there has been a discussion about this [15:11:15] or is this something i should ask mark? [15:11:32] i would ensure that how its planned to do things passes his review [15:11:39] then he would be able to advise who best to implement as well [15:11:51] (i asked you because you were doing the transition of locke) [15:12:03] ahh, yea i just did the allocation of the server and the OS install [15:12:07] hi mark, are you around?
[15:12:34] New patchset: Hashar; "misc::contint::jenkins now install jenkins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2606 [15:15:05] RobH: So what's the story with fluorine (RT #2350)? [15:15:40] sorry, in middle of swift order, lemme finish this then i check it out [15:16:25] Sure no rush [15:18:57] New patchset: Catrope; "misc::contint::jenkins now install jenkins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2606 [15:19:15] hashar: ---^^ With spelling fixes in the comments [15:19:35] taking english lessons is on my todo list [15:19:45] once my daughter will stop crying constantly 8-)))) [15:19:51] hehe no worries [15:20:13] New review: Catrope; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2606 [15:20:21] I +1ed it, I don't have +2 powers [15:20:23] New review: Hashar; "Thanks Roan!" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2606 [15:20:40] I am +1 restricted too [15:20:49] which is fine since I don't want to mess up with ops [15:21:36] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.747 seconds [15:21:49] I will send that one to leslie [15:22:11] another question: it turns out that there are almost 0 referrals in the mobile log files, which surprises me. is this correct behavior or should we expect a decent number of referrals? [15:23:24] PROBLEM - Disk space on db40 is CRITICAL: DISK CRITICAL - free space: /a 91029 MB (3% inode=99%): [15:24:09] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.848 seconds [15:24:09] PROBLEM - MySQL disk space on db40 is CRITICAL: DISK CRITICAL - free space: /a 90973 MB (3% inode=99%): [15:24:45] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.027 second response time on port 8123 [15:27:54] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:28:12] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:12] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:04] hashar: who should I talk to about getting a proxy for my bugzilla gadget like Tim suggested? [15:29:06] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [15:30:27] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 15 seconds [15:31:45] mutante: around? [15:32:23] * hexmode is tempted to start randomly pinging people [15:33:42] <^demon> hexmode: Why not just file it in rt? [15:35:34] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 262 seconds [15:35:35] ^demon: because "just file it in rt" is the way to get things lost. But, yes, it is a start. [15:35:44] I should've done that first [15:35:51] * hexmode goes to fix that now [15:36:36] PROBLEM - Lucene on search1002 is CRITICAL: Connection refused [15:38:26] hexmode: ponf [15:38:33] oh men incorrect implementation [15:39:01] hashar: ponf? [15:39:16] I meant pong [15:39:28] heh [15:39:46] so, hashar, is there just a proxy available right now for my gadget to use? [15:40:02] I have no idea, that sounds like an op thing [15:40:08] maybe he was referring to the front caches ? [15:40:33] and adding a sub domain like dos-me.bugzilla.wikimedia.org that will point to them [15:40:48] I assumed he meant something like that subdomain [15:40:59] thus maybe we could ban any requests made to "dos-me.bugzilla.wikimedia.org" whenever there is an issue [15:41:32] were did he comment about that ? Was it on IRC / a bugzilla / RT ? 
[15:41:38] s/were/where/ [15:41:41] private-l [15:42:37] maybe ask Tim by replying so :/ [15:42:38] or maybe he sent me private email, double checking [15:42:44] I am not sure what he meant [15:42:52] hexmode: I got the email in private-l [15:43:36] hashar: We just want a way to be able to say "these requests are coming from the gadget. Flick this switch to turn them off" [15:43:40] that reminds me I need to open a ticket to get bzapi installed [15:43:55] it is, isn't it? [15:44:02] or do you mean something else? [15:44:14] I mean the REST API to query bugzilla [15:44:33] https://wiki.mozilla.org/Bugzilla:REST_API [15:44:34] the JSON API isn't good enough? [15:44:47] where is it at ? [15:45:06] https://bugzilla.wikimedia.org/jsonrpc.cgi [15:46:34] I've been scripting it with php [15:46:46] my code is in tools/bugzilla/client [15:47:51] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.482 seconds [15:49:41] https://rt.wikimedia.org/Ticket/Display.html?id=2452 [15:49:46] !log updating dns for cadmium [15:49:48] Logged the message, RobH [15:51:01] hexmode: maybe the json interface would do it. Thanks [15:52:44] The dell tech stood me up. 
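The JSON-RPC endpoint linked above (https://bugzilla.wikimedia.org/jsonrpc.cgi) accepts read-only calls over plain GET, with `params` passed as a URL-encoded JSON array. A minimal sketch of building such a request URL — the endpoint is from the log, `Bug.get` is the standard Bugzilla WebService method, and the bug id is purely illustrative:

```python
import json
from urllib.parse import urlencode


def bug_get_url(endpoint: str, bug_id: int) -> str:
    """Build a GET URL for Bugzilla's JSON-RPC Bug.get method.

    Read-only JSON-RPC methods can be invoked via GET; `params` is a
    JSON-encoded array of argument objects, URL-encoded into the query.
    """
    query = {
        "method": "Bug.get",
        "params": json.dumps([{"ids": [bug_id]}]),
    }
    return endpoint + "?" + urlencode(query)


# Hypothetical bug id, used only to show the shape of the request:
url = bug_get_url("https://bugzilla.wikimedia.org/jsonrpc.cgi", 12345)
```

Fetching the resulting URL with any HTTP client returns a JSON object whose `result.bugs` array holds the requested bugs; write methods go through POST instead.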
[15:53:04] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:53:18] !log os install on cadmium [15:53:19] Logged the message, RobH [15:59:04] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 0 seconds [16:01:01] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 84, down: 3, dormant: 0, excluded: 0, unused: 0BRae3.1019: down - Subnet private1-c-eqiadBRae3.32767: down - BRae3.1003: down - Subnet public1-c-eqiadBR [16:01:32] hashar: fwiw, the REST-API is probably enabled similarly, too [16:02:02] hexmode: as I understood it, it is an extension that need to be installed [16:02:31] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.098 seconds [16:02:47] hexmode: anyway, the json interface is probably enough [16:02:58] hashar: ok, I see XML-RPC and JSON-RPC... so you're right [16:03:04] http://www.bugzilla.org/docs/tip/en/html/api/Bugzilla/WebService.html [16:10:28] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:46] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 87, down: 3, dormant: 0, excluded: 0, unused: 0BRae3.1003: down - Subnet public1-c-eqiadBRae3.1019: down - Subnet private1-c-eqiadBRae3.32767: down - BR [16:18:13] New review: Mark Bergsma; "Can you please move this out of the puppet repository, and put it in operations/software instead? Or..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/2601 [16:23:27] !log carbon halted, allows login and freezes on password entry, rebooting [16:23:29] Logged the message, RobH [16:23:40] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.1409206087 (gt 8.0) [16:23:49] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.227 seconds [16:25:55] RECOVERY - Puppet freshness on carbon is OK: puppet ran at Wed Feb 15 16:25:43 UTC 2012 [16:26:04] RECOVERY - SSH on carbon is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:26:13] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.85601438596 [16:27:40] New patchset: Mark Bergsma; "Prepare oxygen for multicast relaying" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2607 [16:28:01] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:29] New patchset: Mark Bergsma; "Comment again, until I have time to look at it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2608 [16:37:28] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.005 seconds [16:37:47] !log forgot to log, carbon resumed service normally [16:37:49] Logged the message, RobH [16:41:31] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:47:20] New review: Dzahn; "enhanced page_all SMS script (the one for manual use, does not affect nagios)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2264 [16:47:22] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2264 [16:54:43] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.558 seconds [16:58:37] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:01:22] !log cadmium setup for wikimania video transcoding [17:01:24] Logged the message, RobH 
[17:03:52] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.982 seconds [17:07:55] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:25] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.382 seconds [17:15:52] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.209 seconds [17:17:22] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.031 second response time on port 8123 [17:19:37] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:04] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.923 seconds [17:26:58] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:27:52] video transcoding? [17:28:13] ah temp [17:32:16] can we just let it sit there unused for a couple of years? :D [17:32:52] Well since I'm the one that's gonna be using it, it's gonna sit unused for at least a week probably [17:33:26] heh. I was referencing the old transcode boxes [17:34:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.000 seconds [17:35:04] there's only transcode1 left [17:35:10] and that's for the dc cameras [17:35:47] totally temp [17:35:56] and roan and i will tear it down once he finishes with it. 
[17:36:58] New patchset: Demon; "Adding .gitreview" [mediawiki/tools/mwdumper] (master) - https://gerrit.wikimedia.org/r/2609 [17:37:16] New review: Demon; "(no comment)" [mediawiki/tools/mwdumper] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2609 [17:37:16] Change merged: Demon; [mediawiki/tools/mwdumper] (master) - https://gerrit.wikimedia.org/r/2609 [17:38:49] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:56] New patchset: RobH; "added candium and roan to access it to site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2610 [17:40:10] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.813 seconds [17:44:04] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:31] New patchset: RobH; "added candium and roan to access it to site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2610 [17:45:53] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2610 [17:47:14] New review: RobH; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2610 [17:47:15] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2610 [17:47:40] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.667 seconds [17:51:52] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:51:52] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [17:55:55] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [17:57:52] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [17:59:50] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 241 seconds [18:07:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.464 seconds [18:08:58] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:10:10] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [18:14:31] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.701 seconds [18:15:29] New patchset: Demon; "Adding redirect for easier finding of gitweb urls" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2611 [18:15:52] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:17:13] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.479 seconds [18:18:16] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 2 seconds [18:21:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:17] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:52] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [18:22:10] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 237 seconds [18:24:52] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 0 seconds [18:28:46] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 224 seconds [18:29:49] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [18:35:47] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 20.9923882609 (gt 8.0) [18:38:02] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.743 seconds [18:38:08] cmjohnson1: I'm sure mark will have better suggestions, but IIRC he was saying something about there being too much RAM in the box? maybe take half of it out and try again? [18:38:24] * maplebed stabs in the dark... 
[18:38:42] New review: Demon; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2575 [18:38:42] Change merged: Demon; [operations/software] (master) - https://gerrit.wikimedia.org/r/2575 [18:39:09] maplebed: ms4? [18:39:15] yeah. [18:39:18] (hey, context!) [18:39:49] worth a shot [18:39:57] New patchset: Pyoungmeister; "perms: they matter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2612 [18:40:08] PROBLEM - Puppet freshness on gilman is CRITICAL: Puppet has not run in the last 10 hours [18:40:08] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours [18:40:49] what's up? [18:42:05] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:09] mark: ms4 rt 885 [18:42:41] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 2 seconds [18:43:05] so the memory in there is either broken or doesn't work well with the motherboard [18:43:39] there should be both the new memory, and the original memory [18:43:47] the original memory had issues and we wanted to upgrade [18:43:54] the new memory is what crucial said would work in the system. [18:44:27] i would reduce to the minimum memory pair and see if it posts and tests. [18:44:53] then work on adding a pair at a time and testing post [18:45:15] offtopic: dell tech has gotten mw1103 to start posting, it only took since noon. [18:45:18] still working on it =P [18:45:20] oh joy [18:46:35] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 228 seconds [18:46:44] I am starving [18:46:52] * RobH foolishly did not eat breakfast, or pack a lunch [18:47:04] i cannot even go to the vending machines, must watch dell tech. [18:47:53] New patchset: Pyoungmeister; "and let's append to a log file, shall we?" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/2613 [18:48:18] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2612 [18:48:19] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2612 [18:50:11] PROBLEM - Puppet freshness on ganglia1001 is CRITICAL: Puppet has not run in the last 10 hours [18:51:24] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2613 [18:51:25] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2613 [18:56:38] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.01438956522 (gt 8.0) [18:58:27] get the dell tech to buy you a lunch from a vending machine [18:58:33] you can tell him which one to buy :-P [19:04:26] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.421 seconds [19:04:35] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.525 seconds [19:04:53] apergos: that means its that much longer until he finishes. [19:05:01] plus he cannot walk out ther, he cannot get back in ;] [19:07:49] if you go with him he can! [19:07:49] RobH: Bah, cadmium won't let me ssh in [19:07:59] that's how you tell him which one to get: "buy that one" [19:08:00] RoanKattouw: do it as root? [19:08:11] that works [19:08:18] good enough =] [19:08:20] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:08:29] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:08:36] RoanKattouw: it would be a problem if that host was going to be around in three months [19:08:43] but it has to die when we finish, as its not fully puppetized. 
[19:09:01] ZOMG, all sorts of Windows garbage on that disk [19:09:06] Run ls /wd for laughs [19:09:12] Yeah [19:09:29] I'm just gonna copy the files from /wd (removable HD) to /a (LVM) [19:11:11] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 1.82914284483 [19:12:21] RoanKattouw: its ntfs, i had to look up how to mount that shit [19:12:23] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.716 seconds [19:12:26] i didnt recall the flags [19:12:31] heh [19:13:03] !log Copying all Wikimania files from the removable HD to cadmium's HD [19:13:05] Logged the message, Mr. Obvious [19:13:46] RoanKattouw: careful though, that removable disk is larger than the ones in the system [19:13:52] may run out of room if you copy then transcode [19:14:00] rather than transcode from WD to internal. [19:14:10] Well crap, you're right [19:14:19] There's 2T of data on there, only 1.7T of room [19:14:27] yea, that was my fear [19:14:30] I need to du -sh those dirs, see what's actual video and what's now [19:14:31] *not [19:14:41] some may be repeated video as well [19:14:46] !log Aborted copy operation on cadmium, data won't fit [19:14:48] its not organized really. [19:14:48] Logged the message, Mr. Obvious [19:14:50] Yup [19:15:00] I was just gonna copy it all cause ls is so slow [19:15:05] yea =/ [19:15:22] 1.8T Wikimania-Source [19:15:23] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 2 seconds [19:15:26] we dont really have spare hosts with ton of disk space [19:15:30] =/ [19:16:12] I think Wikimania Edited is what I want [19:16:14] I'll copy that over [19:16:23] But please leave the external HD mounted for now, I might need more later [19:16:33] I was hoping to just copy everything so you could unmount it, but meh [19:17:16] !log Let's try that again: copying /wd/Wikimania\ Edited to /a on cadmium [19:17:19] Logged the message, Mr. 
Obvious [19:19:17] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 204 seconds [19:20:11] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:50] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.602 seconds [19:33:41] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 0 seconds [19:34:44] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:53] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.716 seconds [19:37:44] PROBLEM - MySQL Slave Delay on db31 is CRITICAL: CRIT replication delay 253 seconds [19:38:47] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:08] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.770 seconds [19:40:46] mark, when you have a minute, I would love some help on the htcp stuff. [19:41:21] ok [19:41:25] I've currently got ~/htcp.php on fenari that I pulled from the file Tim mentioned [19:41:42] I pulled out mediawiki-specific stuff so that it runs without a full MW install [19:41:59] I don't think it's actually sending the packet though (my test is tcpdump listening on a squid) [19:42:34] ok [19:42:37] did you run tcpdump on the sending host? [19:43:23] not yet. [19:43:38] do you set a ttl > 1 ? [19:43:42] a multicast ttl [19:43:45] if you don't that's probably why [19:44:02] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:05] nada. [19:44:14] I set it to 3 initially, [19:44:21] then looked in srv221's config and changed it to 1 [19:44:23] that's not enough to reach eqiad in all cases [19:44:29] you should set it to 10 or so [19:44:40] done [19:44:41] hmm really? I thought we set mediawiki to 10 or something like that [19:44:55] * mark looks at htcp.php [19:45:05] I might have been looking in the wrong place. 
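The aborted copy earlier (2T of data onto 1.7T of room) is the kind of failure a pre-flight free-space check catches; a minimal sketch using only the standard library, where the paths and the headroom fraction are illustrative rather than anything from the log:

```python
import shutil


def fits(src_bytes: int, dest_path: str, headroom: float = 0.05) -> bool:
    """Return True if dest_path's filesystem can hold src_bytes.

    A small headroom fraction is reserved so a copy never fills the
    destination to the last byte (leaving room for transcoding output).
    """
    free = shutil.disk_usage(dest_path).free
    return src_bytes * (1 + headroom) <= free


# e.g. a 2 TB copy onto a filesystem with only 1.7 TB free:
# fits(2 * 10**12, "/a") would return False in that situation.
```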
[19:45:11] (and seen the default rather than the actual config) [19:45:35] huh. tcpdump saw "HEAD htcp.php HTTP/1.0" [19:45:42] so it's sending something, but not yet the right thing. [19:46:02] yay! [19:46:16] I see it sent from fenari and recieved by sq86. [19:46:22] cool :) [19:46:25] so it was the ttl? [19:46:25] thanks, I think the TTL was it. [19:46:35] yeah [19:46:38] then it won't leave the subnet [19:46:41] that and I was pulling argv[0] instead of argv[1] [19:46:48] and squids are in a different subnet than everything else [19:47:23] thanks for the suggestion to just use the php; way easier than trying to rewrite it. [19:47:56] and now I have a generic 'send a URL by htcp' I can run from the command line. [19:48:47] robh: i cannot use b1 for new ms-be...i have room in b3 [19:49:29] maplebed: but that already exists [19:49:35] I'm sure Tim wrote one [19:49:44] he even optimized it at some point, nonblocking and all that iirc [19:50:01] I'm just not entirely sure where it would be, I -think - in mediawiki's maintenance/ dir [19:50:08] ah well. 
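The fix debugged above — raising the multicast TTL from the default of 1 so purge packets can cross subnet boundaries — looks roughly the same in any language with a sockets API. A Python sketch; the group address in the comment is a made-up example, while the port (4827, HTCP's registered UDP port) and the TTL behavior reflect the discussion:

```python
import socket


def make_htcp_socket(ttl: int = 10) -> socket.socket:
    """Create a UDP socket suitable for sending HTCP purges to a
    multicast group.

    IP_MULTICAST_TTL defaults to 1, so without setting it the first
    router drops the datagram and it never leaves the local subnet —
    exactly the symptom seen above before the TTL was raised.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


sock = make_htcp_socket(ttl=10)
# sock.sendto(htcp_payload, ("239.0.0.1", 4827))  # hypothetical group
```

A TTL of 10 comfortably covers a couple of routed hops between datacenters while still bounding how far the datagram can propagate.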
[19:54:59] RECOVERY - MySQL Slave Delay on db31 is OK: OK replication delay 0 seconds [19:56:40] cmjohnson1: checkin [20:11:10] PROBLEM - MySQL Slave Delay on db31 is CRITICAL: CRIT replication delay 203 seconds [20:11:19] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.796 seconds [20:15:31] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:41] New patchset: Pyoungmeister; "should make sure we don't get openjdk in there too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2614 [20:20:46] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.095 seconds [20:21:55] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2614 [20:21:56] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2614 [20:26:28] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.786 seconds [20:31:34] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:52] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.697 seconds [20:37:07] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:54:13] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.560 seconds [20:54:13] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.572 seconds [20:57:55] yay wikimania videos! 
:) [20:58:14] Yeah don't get too excited just yet :) [20:58:19] I need to transcode them to OGG first [20:58:25] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:58:25] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:58:33] And I need to get the 6th floor people to give me file description pages [21:02:19] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.242 seconds [21:02:19] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.240 seconds [21:03:30] New patchset: Jgreen; "mystery solved, stupid typo on new apache vhost config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2615 [21:04:22] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2615 [21:04:23] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2615 [21:06:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:06:22] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:06:58] RECOVERY - Puppet freshness on grosley is OK: puppet ran at Wed Feb 15 21:06:36 UTC 2012 [21:07:24] New patchset: Pyoungmeister; "search hosts now can use xfs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2616 [21:08:28] RECOVERY - Puppet freshness on aluminium is OK: puppet ran at Wed Feb 15 21:08:06 UTC 2012 [21:13:27] robh: did you get a chance to look into b3 as a possible location? 
[21:13:53] cmjohnson1: checkin now [21:14:08] i got distracted, they were getting ready to start doing shit in the cage and i wanted to clear out of their way [21:14:26] yep..no worries...gave me a chance to tinker with ms4 [21:14:38] btw: still sux [21:14:44] cmjohnson1: I wanna leave space in b3 for some things [21:14:51] ok, we are going to be adding 6 more search nodes to there [21:15:08] so you can use b3, but put it in U29/30 [21:15:16] that leaves 6u below it for the rest of the search servers [21:15:23] okay [21:15:42] but otherwise yep thats cool [21:15:55] on a2 and a4. the u's you suggested are filled. I can go higher and power is ok [21:15:59] cmjohnson1: glad to know how to balance power before you rack the heavy server eh? [21:16:17] ok, so i typo'd or racktables is wrong, checking [21:16:28] yep...makes life easier [21:17:17] cmjohnson1: no idea wtf i did with the U space in there [21:17:25] you're distributing the swift cluster over 5 racks? [21:17:26] im not used to putting it in there, and i messed up, but yea the racks are right [21:17:37] ben wanted it on at least 3 [21:17:44] i have a bunch of racks that can take one or two servers most [21:17:49] so we are just filling those in [21:18:09] mark: sound good? [21:18:14] i hate it [21:18:28] you want them in a dedicated rack, but ben insisted that he needed 3 minimum. [21:18:31] ? [21:18:32] but it's tampa [21:18:33] I won't care [21:18:37] the first part is a question, the second a statement [21:18:42] do it :P [21:19:06] fyi: new row c in eqiad will wire slightly differently [21:19:13] as in how i layout the access switch and such for cable mgmt [21:19:17] ...no love for Tampa [21:19:29] * cmjohnson1 sheds a tear or two [21:19:33] top u access switch, then 1u cable mgmt, then msw, then 1u cable mgmt [21:19:53] maybe just one cable mgmt [21:20:05] but something, cuz in the denser racks, wiring is a pain at the top. 
[21:20:35] also once we have the proper fiber trays, I wanna drop the plastic tube/raceways from the downspouts into each rack [21:20:45] i dislike seeing naked fiber. [21:21:08] since we will have to migrate over to new fibers in trays and push traffic from one router to the other [21:21:22] why not cable mgmt above the switch instead of below? [21:21:24] its simple enough to improve the racks in the meantime. [21:21:58] RECOVERY - Puppet freshness on gilman is OK: puppet ran at Wed Feb 15 21:21:36 UTC 2012 [21:22:01] well if i can get by with a single 1u cable manager, between the switches works [21:22:27] i rather not take up the entire rack top with mgmt, so i wanna try one first, but if we need two [21:22:36] i guess they can go above the switches, do you find that makes it easier? [21:22:49] if we have the space, why not go with two ? if we leave one empty, it wouldn't hurt anything [21:23:09] well, i meant in rack go switch, cable mgmt, switch, cable mgmt [21:23:18] but perhaps we would be better with switch, 2u cable mgmt, mgmt switch. [21:23:30] the 2U cable managers are very roomy. [21:24:01] New patchset: Hashar; "dumb test of gerrit / jenkins integration" [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2617 [21:24:04] but 1U would be much cleaner then [21:24:06] heh, can even retrofit existing racks slowly, since a rack can lose all mgmt for an hour or two. [21:24:16] yeah sure [21:24:26] yea the 1u look tighter. 
[21:24:29] looks nicer in rack i think [21:24:47] I would do (from top) 1u cable mgmt, 1u prod switch, 1u cable mgmt, 1u mgmt switch [21:24:49] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.230 seconds [21:24:49] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.261 seconds [21:24:53] OR [21:24:57] even switch prod and mgmt around [21:25:05] since now production cords always cross the mgmt switch [21:25:09] and are more at risk that way ;) [21:25:18] but there's also the stacking cables and all [21:25:22] meh, let's keep production at the top I guess [21:25:25] yea [21:25:30] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/24/ (1/2)" [test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2617 [21:25:31] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/25/ (2/2)" [test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2617 [21:25:58] well, i can just order the cable mgmt and rearrange them and see which looks/works best [21:26:16] I suppose if the cables route from above to the switch, it puts even less strain on the actual switch ports. [21:26:23] though the cable mgmt will eliminate a lot of that anyhow [21:26:45] now I just use velcro to do strain relief, but it loosens over time. [21:27:06] had to go around eqiad on friday and redo half the racks wiring [21:28:29] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/26/ (2/2)" [test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2617 [21:28:34] PROBLEM - Disk space on mw44 is CRITICAL: DISK CRITICAL - free space: /tmp 9 MB (0% inode=87%): [21:28:38] hashar: Wait, are we duplicating work here? 
[21:28:43] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:48] You set up the gerrit trigger plugin in production [21:28:50] ? [21:28:52] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:29:31] RoanKattouw: yes [21:29:47] and of course it does not work like on my local computer :) [21:29:58] * RoanKattouw is a little sad [21:30:08] I did exactly the same in labs at the SF hackathon [21:30:34] ahhhh [21:30:42] I was hoping to puppetize it there, then move it over [21:30:48] But I guess it's getting done now at lesat [21:30:57] And if the trigger plugin works, I can just put in jobs [21:31:08] what have you worked on ? [21:31:09] [21:31:16] RECOVERY - Disk space on mw44 is OK: DISK OK [21:31:29] Just a basic lint job so far [21:31:43] And a job that I stole from OpenStack that implements their test-the-merged-state-not-the-submitted-state stuff [21:31:54] But I can put all of those things in easily [21:32:27] drdee: Hey you were talking about a universal lint script at some point, did that ever get anywhere? [21:32:34] Cause if not I'll write a basic one next week [21:32:42] Or maybe I'll get hashar to do it :D [21:32:46] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.652 seconds [21:32:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.961 seconds [21:37:19] RoanKattouw: to get a clean state and test the merged state, I took a script from stack overflow https://gerrit.wikimedia.org/r/#change,2513 [21:37:43] I am not sure it is needed though [21:38:19] +# Script extracted from the OpenStack project v2012.02.08 [21:38:23] That's exactly what I used [21:38:28] \o/ [21:38:40] Did you put it in as a separate job? 
[21:38:56] I am not sure yet how to organize the various jobs [21:39:21] Ideally you'd put that script in a job of its own, and you can have one job run another one [21:39:32] what I want is to fetch the changes then trigger a job that run some lints [21:39:36] yeah [21:39:40] I have that in my labs project [21:39:45] Let me give you access [21:40:03] I don't have access to labs [21:40:10] for some reason my account is cursed there [21:40:19] Meh [21:40:22] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/27/ (2/2)" [test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2617 [21:40:33] but go ahead :) [21:40:38] maybe it will work [21:40:43] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:40:52] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:41:44] labs-home-wm 02/15/2012 - 21:41:18 - Creating a home directory for hashar at /export/home/jenkins/hashar [21:41:58] hashar: OK, you should now be able to ssh to bastion.wmflabs.org and then from there to jenkins2 [21:42:14] It doesn't have a public IP so you can't access the Jenkins install in your browser unless you set up FoxyProxy or something [21:43:34] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.363 seconds [21:44:02] OH MY GOD [21:44:06] IT WORKED!!!!!!!!!!!!!!!!!!!! 
[21:44:15] seriously Roan you are blessed by something
[21:44:23] you should start a new religion together with Ryan
[21:44:32] you will be successful (and I will be your first follower)
[21:45:16] updating the keys solved the issue it seems
[21:47:37] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:51:22] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/28/ (2/2)" [test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2617
[21:51:31] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.333 seconds
[21:59:28] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:02:55] RECOVERY - MySQL Slave Delay on db31 is OK: OK replication delay 0 seconds
[22:05:01] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: CRIT replication delay 188 seconds
[22:05:19] PROBLEM - MySQL Slave Delay on db50 is CRITICAL: CRIT replication delay 203 seconds
[22:07:44] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (18966)
[22:29:46] New review: Hashar; "(no comment)" [test/mediawiki/core2] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2617
[22:29:47] Change merged: Hashar; [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2617
[22:43:14] Ouch, enwiki picked up 20k jobs at 21:55
[22:43:48] mmm, refreshlinks
[22:44:19] binasher: what's the state of the db updates? Is it only s7 stuff we need to "worry" about for today's deployments?
[22:45:01] s7 is done, so everything is clear for today
[22:45:15] great :)
[22:45:19] and commons is done, so i think everything is clear for next week too
[22:45:49] yup, looks to be
[22:45:53] robla: ^
[22:46:16] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.885 seconds
[22:47:44] sweet
[22:47:48] thanks binasher
[22:48:09] New patchset: Andre Engels; "My files; current status" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2618
[22:48:11] binasher: what's *not* done at this point?
[22:48:27] enwiki? ;)
[22:48:42] s1, s5, and s6
[22:49:03] enwiki, dewiki, fr ja ru wiki
[22:49:05] en, de, fr, ja, ru
[22:49:08] yup
[22:50:10] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:51:00] ok....so we'll be able to do all of the little wikis next week
[22:51:16] (little traffic that is)
[22:51:46] "all projects except for Wikipedia" on Thursday
[22:51:55] I'm assuming the bulk of those are on S3
[22:52:52] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.113 seconds
[22:57:09] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.994 seconds
[22:57:09] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:57:36] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.910 seconds
[23:01:30] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:01:39] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:06:22] !log installed pagecache-management on all search nodes
[23:06:24] Logged the message, Master
[23:06:36] !log updated /etc/lsearch.conf:Rsync.path to "/usr/local/bin/rsync-no-pagecache" on all search nodes
[23:06:39] Logged the message, Master
[23:09:27] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.232 seconds
[23:10:24] New patchset: Pyoungmeister; "new logrotate for search nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2619
[23:10:57] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.134 seconds
[23:14:51] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:20:24] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:23:42] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 602s
[23:24:09] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 631s
[23:24:54] PROBLEM - MySQL replication status on db1025 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 674s
[23:25:39] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.223 seconds
[23:30:45] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.415 seconds
[23:38:51] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:38:51] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:40:39] RECOVERY - MySQL replication status on db1025 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[23:44:51] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[23:45:27] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[23:46:09] !log running a slow staggered restart of lsearchd
[23:46:12] Logged the message, Master
[23:50:51] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.032 seconds
[23:50:51] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.043 seconds
[23:52:03] PROBLEM - RAID on search10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
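Change 2619 ("new logrotate for search nodes") isn't shown here, but a logrotate stanza for lsearchd would typically look something like the sketch below; the log path and retention are assumptions, not the actual contents of the change.

```
/var/log/lsearchd/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
```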
[23:56:42] RECOVERY - MySQL Slave Delay on db50 is OK: OK replication delay 0 seconds
[23:57:18] RECOVERY - RAID on search10 is OK: OK: 1 logical device(s) checked
[23:57:27] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:57:27] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:57:45] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds
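The replication alerts above compare `Seconds_Behind_Master` from `SHOW SLAVE STATUS` against a threshold (the storage3 checks fire around 600s). A hedged sketch of that classification logic follows; the actual Nagios plugin's thresholds and internals are assumed, not known.

```shell
# Map a Seconds_Behind_Master value to a Nagios-style status line.
# A real check would obtain the value with something like:
#   mysql -N -e 'SHOW SLAVE STATUS\G' | awk '/Seconds_Behind_Master:/ {print $2}'
classify_lag() {
    lag="$1"
    if [ -z "$lag" ] || [ "$lag" = "NULL" ]; then
        # SHOW SLAVE STATUS reports NULL when the SQL thread is stopped
        echo "CRITICAL - replication not running"
    elif [ "$lag" -ge 600 ]; then
        echo "CRITICAL - Seconds_Behind_Master : ${lag}s"
    else
        echo "OK - Seconds_Behind_Master : ${lag}s"
    fi
}
```

Note that `Seconds_Behind_Master` measures how far the SQL thread trails the relay log, so a broken IO thread can report 0 while replication is actually stalled; that is why the NULL case is treated as critical rather than ignored.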