[00:07:04] what is the sense of having the ops maillist archive private when the mails in it are being archived by public archivers? [00:07:26] well it would delay things [00:07:50] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri Feb 1 00:07:43 UTC 2013 [00:08:10] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:08:30] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri Feb 1 00:08:24 UTC 2013 [00:09:10] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:09:11] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri Feb 1 00:09:09 UTC 2013 [00:10:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:10:38] Prodego: how do you mean that? [00:10:40] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [00:11:10] Danny_B: I assume the archivers don't archive instantly [00:16:20] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [00:16:48] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [00:18:11] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 7.483 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [00:18:26] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.049 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [00:20:12] Prodego: eg. mail-archive has up to date wikitech-l [00:27:18] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [00:29:09] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 5.015 seconds response time. nagiostest.beta.wmflabs.org returns [00:34:19] RECOVERY - Parsoid on cerium is OK: HTTP OK: HTTP/1.1 200 OK - 1308 bytes in 0.007 second response time [00:34:53] RECOVERY - Parsoid on cerium is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.057 seconds [00:38:12] Danny_B, it is? link? [00:45:26] RobH: Seeing as you're on duty this week, would you be able to find someone who could review https://gerrit.wikimedia.org/r/#/c/44164/ ? 
Antoine and I have +1ed it but we don't have +2 in puppet [00:45:58] I can give it a shot, but most folks are traveling to Fosdem [00:46:03] Yeah, I know [00:46:38] RoanKattouw, solution, Mediawiki:Gerrit project ownership or whatever it is :) [00:46:53] I don't necessarily want +2 on puppet :) [00:47:25] heh [00:51:46] +2 on puppet cannot get anyone anything but trouble =P [00:56:06] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [01:26:36] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [01:27:07] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [01:27:13] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [01:27:22] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [01:33:59] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [01:34:04] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [01:34:45] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [01:34:53] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [01:47:55] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [01:56:14] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [01:56:45] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [02:04:55] New patchset: Dereckson; "Maintenance for http://fr.planet.wikimedia.org/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47047 [02:05:13] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [02:09:43] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [02:13:14] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [02:16:04] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 3.692 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [02:20:19] New patchset: Dereckson; "Fixed a small typo in Planet config files." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47048 [02:27:10] !log LocalisationUpdate completed (1.21wmf8) at Fri Feb 1 02:27:09 UTC 2013 [02:27:11] Logged the message, Master [02:50:13] PROBLEM - HTTP on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:51:02] RECOVERY - HTTP on formey is OK: HTTP OK: HTTP/1.1 200 OK - 3596 bytes in 0.056 second response time [02:58:58] secure.wm.org no longer works? [02:59:41] works for me [02:59:59] hmm, can't create the url for wikimania 2013 [03:00:09] could you suggest pls? [03:00:40] why would you want to make links to it? it only exists for legacy URLs, as a redirect service [03:00:57] just use https://wikimania2013.wikimedia.org/ [03:01:21] enough for me to know it is redir only [03:01:34] so i can delete some stuff from common.js [03:03:15] https://secure.wikimedia.org/wikipedia/wikimania2013/ [03:15:44] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [03:16:19] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:17:58] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 247 bytes in 0.054 seconds [03:36:17] bad kibble [03:36:22] :o ! [03:36:27] * kibble is good kibble. [03:41:33] ha, Reedy! [03:41:43] You're no better!! [03:42:25] how's the batch move, Reedy ? 
[03:42:30] Dunno [03:46:15] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [03:48:11] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [03:50:06] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 5.439 seconds response time. nagiostest.beta.wmflabs.org returns [03:51:38] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.041 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [03:54:11] PROBLEM - HTTP on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:55:50] RECOVERY - HTTP on formey is OK: HTTP OK HTTP/1.1 200 OK - 3596 bytes in 0.006 seconds [04:15:11] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [04:16:50] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 6.947 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [04:23:41] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [04:29:54] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [04:31:33] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 7.471 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [05:06:05] paravoid: andrewbogott_afk: labs pmtpa bastion's broken. refusing my loging [05:06:08] login* [05:06:14] Permission denied (publickey). [05:06:19] (see also labs-l) [05:09:21] ssmollett: ^ [05:20:11] andrewbogott: ping [05:20:30] jeremyb: broken! [05:20:41] andrewbogott: yah :) [05:20:54] I spent a while on this earlier but didn't come up with much… having another go now. [05:21:10] i see there was some scrollback earlier in #-labs [05:21:20] anyway, wanted to make sure someone knew at least [05:21:25] * jeremyb heads to sleep :) [05:21:37] Thanks. Most ops are traveling today, unfortunately. [05:21:41] orly [05:21:43] allhands? [05:22:11] fosdem [05:22:32] ohhhhh [05:22:41] i didn't realize so many people were going [05:22:41] I have a reasonably good idea of what's broken, just not of how to fix it. [05:36:38] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [05:38:28] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 3.570 seconds response time. nagiostest.beta.wmflabs.org returns [05:39:56] paravoid, are you awake by chance? [05:47:06] LeslieCarr, how about you? 
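The "Auth DNS on labs-ns1.wikimedia.org" check above keeps flapping between plugin timeouts and lookups that take several seconds against nagiostest.beta.wmflabs.org. A rough way to reproduce the check by hand is a timed, non-recursive query aimed straight at that nameserver. This is only a sketch: the server and record names are taken from the alert text, the 10-second cap just mirrors the plugin timeout, and the exact check_dns arguments Icinga uses are not shown in the log.

    # Query the authoritative server the monitor is complaining about,
    # timing the lookup (names from the alert text; check options assumed).
    time dig @labs-ns1.wikimedia.org nagiostest.beta.wmflabs.org A +norecurse +time=10 +tries=1

Answers that hover near the 10-second mark would point at an overloaded or slow-to-respond nameserver rather than one that is actually down, which matches the immediate RECOVERY messages that follow each CRITICAL.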
[05:54:29] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 181 seconds [05:54:38] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 183 seconds [05:55:42] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 197 seconds [05:56:00] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 201 seconds [05:56:44] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:06:24] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [06:06:54] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [06:11:46] New patchset: Andrew Bogott; "Up ulimit for glusterd again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47054 [06:12:22] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47054 [06:20:55] PROBLEM - NTP on mw1085 is CRITICAL: NTP CRITICAL: Offset unknown [06:25:54] RECOVERY - NTP on mw1085 is OK: NTP OK: Offset 0.001241207123 secs [06:28:34] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri Feb 1 06:28:30 UTC 2013 [06:28:45] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:28:54] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri Feb 1 06:28:44 UTC 2013 [06:29:45] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:37:31] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [06:39:39] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [06:40:10] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [06:41:01] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 3.671 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [06:45:21] PROBLEM - HTTP on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:47:00] RECOVERY - HTTP on formey is OK: HTTP OK HTTP/1.1 200 OK - 3596 bytes in 0.011 seconds [06:47:31] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [06:47:40] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [06:47:45] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [06:48:04] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [06:56:10] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [06:57:00] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 5.008 seconds response time. 
nagiostest.beta.wmflabs.org returns 208.80.153.219 [07:02:50] PROBLEM - HTTP on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:03:41] RECOVERY - HTTP on formey is OK: HTTP OK: HTTP/1.1 200 OK - 3596 bytes in 0.055 second response time [07:04:30] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [07:06:57] PROBLEM - HTTP on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:08:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [07:08:36] RECOVERY - HTTP on formey is OK: HTTP OK HTTP/1.1 200 OK - 3596 bytes in 0.005 seconds [07:15:59] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [07:16:49] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:16:59] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 6.520 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [07:17:39] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.002 second response time [07:22:29] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 182 seconds [07:22:33] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 182 seconds [07:22:39] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 183 seconds [07:22:42] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 183 seconds [07:24:21] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [07:24:29] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [07:24:39] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [07:25:00] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [07:25:49] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:26:00] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [07:26:39] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [07:26:59] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 9.123 seconds response time. 
nagiostest.beta.wmflabs.org returns 208.80.153.219 [07:33:09] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 183 seconds [07:33:12] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 184 seconds [07:33:19] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 190 seconds [07:34:15] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 209 seconds [07:35:09] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [07:35:24] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [07:35:54] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [07:36:39] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 1 seconds [07:45:39] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [07:45:39] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [07:45:39] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [07:45:39] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [07:45:39] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [07:47:36] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [08:20:13] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri Feb 1 08:07:39 UTC 2013 [08:20:13] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:20:13] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri Feb 1 08:08:04 UTC 2013 [08:20:13] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:20:14] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [08:21:21] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri Feb 1 08:21:12 UTC 2013 [08:22:50] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:27:21] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [08:29:00] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.048 seconds response time. 
nagiostest.beta.wmflabs.org returns [08:38:02] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 181 seconds [08:38:02] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 181 seconds [08:38:45] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 205 seconds [08:38:54] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 207 seconds [08:40:01] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [08:40:02] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [08:40:33] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [08:40:33] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [09:24:22] helllo [09:25:24] New patchset: Hashar; "contint: install mercurial package on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46931 [09:28:33] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [09:33:08] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [09:38:18] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 191 seconds [09:39:17] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [10:00:54] New patchset: Hashar; "insert 'realm' in role::cache::configuration::active_nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47067 [10:01:25] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [10:01:42] New review: Hashar; "Rebased on top of https://gerrit.wikimedia.org/r/47067" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [10:03:16] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [10:04:01] mark: hi :-] So the role::cache::configuration::active_nodes missed the realm. 
I have added it in with https://gerrit.wikimedia.org/r/47067 [10:04:05] andrebased my infamous patchset [10:05:03] New review: Hashar; "Makes puppet happier for the wikimedia frontend configuration:" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47067 [10:10:20] bahhh [10:11:03] I lost my instance [10:11:34] deployment-varnish-t3 login: Feb 1 10:07:14 deployment-varnish-t3 nslcd[1074]: [a1deaa] error writing to client: Broken pipe [10:11:35] Feb 1 10:07:14 deployment-varnish-t3 nslcd[1074]: [c6c33a] error writing to client: Broken pipe [10:11:36] Feb 1 10:07:14 deployment-varnish-t3 nslcd[1074]: [e685fb] error writing to client: Broken pipe [10:11:36] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [10:11:37] youhouu [10:13:46] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:36] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [10:21:53] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out [10:21:54] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [10:46:44] New patchset: Hashar; "insert 'realm' in role::cache::configuration::active_nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47067 [10:46:44] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [10:47:44] New review: Hashar; "* removed an unrelated template (labs-upload.conf)" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47067 [10:59:21] ah [10:59:37] the varnish backends use an LVS entry as a backend and there is none in labs :-D [10:59:39] *grin* [11:12:07] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [11:35:48] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out [11:36:37] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [11:44:47] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:35] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.002 second response time [11:49:23] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [12:04:38] !log authdns update adding db1051-60 to zone files [12:04:41] Logged the message, Master [12:07:45] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri Feb 1 12:07:38 UTC 2013 [12:08:06] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:08:15] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri Feb 1 12:08:12 UTC 2013 [12:09:05] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:09:15] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri Feb 1 12:09:10 UTC 2013 [12:10:05] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:10:35] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [12:46:20] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [12:47:22] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [13:06:20] New patchset: JanZerebecki; "replace the ugly HTML 
redirect from the old planet with a proper HTTP redirect ( RT-4410 )" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47073 [13:11:52] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:12:42] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [13:17:05] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47073 [13:20:55] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:40] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.004 second response time [13:22:12] New patchset: Hashar; "(bug 44251) hardcode $wgDBuser = 'wikiuser'" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47074 [13:29:51] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out [13:30:40] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [13:33:09] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 189 seconds [13:33:36] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 201 seconds [13:33:40] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 203 seconds [13:33:50] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 210 seconds [13:45:50] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 19 seconds [13:45:54] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 22 seconds [13:46:40] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [13:47:15] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [13:56:33] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [14:00:27] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 188 seconds [14:00:43] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 198 seconds [14:00:45] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 196 seconds [14:01:03] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 208 seconds [14:09:04] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 19 seconds [14:09:09] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 19 seconds [14:09:43] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 5 seconds [14:09:45] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 9 seconds [14:24:33] jzerebecki: congrats on resolving an RT ticket:) thanks [14:25:35] yay! thank you. [14:33:46] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:34:35] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [14:45:12] !log authdns update "adding mw1161-1200 to eqiad mgmt and production zone files [14:45:15] Logged the message, Master [14:47:46] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:48:17] apergos: ^^ ? [14:48:32] * jeremyb isn't really up to date... 
not sure if that's important [14:48:36] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [14:48:40] I see those, and I have no idea about them [14:49:55] hey [14:50:05] just logged in, don't have much time [14:50:10] ceph cluster crapped itself out again [14:50:11] sigh [14:50:18] ah hello [14:50:25] shall be safe to ignore [14:50:26] what do I need to do for these cases (and how can you tell)? [14:50:50] ignore until I send an email explaining our architecture, basic debugging steps etc. [14:50:54] heh [14:50:55] ok then [14:51:00] which should be before we put it into production [14:51:10] good idea :-D [14:51:28] are you in brussels then? [14:51:33] if we put into prod and I haven't done that, feel free to call me at ungodly hours and scream at me :) [14:51:37] yes [14:52:04] ah great, how was the trip? [14:52:19] enjoy the beer faidon :-] [14:52:38] tiring [14:52:42] too early [14:52:45] ugh [14:52:55] hope you get some rest time [14:53:10] that's what I plan to do now [14:53:26] see you [14:53:43] so, how do we know it's not swift that broke? [14:53:56] or can we? (from just reading the msg above) [14:55:11] sleep well! [14:59:25] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [15:00:15] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [15:05:02] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [15:07:23] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [15:08:23] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:17:02] hiiiiii morning cmjohnson1 [15:17:11] ho ottomata [15:17:12] pIInnnnnggggggg on analytisc1007 [15:17:14] that is all! [15:17:21] heh [16:42:23] ok another q [16:42:23] cmjohnson1 (and maybe mark) [16:42:23] I just asked on this RT: [16:42:23] https://rt.wikimedia.org/Ticket/Display.html?id=4328 [16:42:23] analytics could do with a crappy machine for a bastion host [16:42:23] analytics1000 [16:42:23] or osmething [16:42:23] right now our public IP is on analytics1001 [16:42:23] which is a beefy cisco, [16:42:23] can we use db42 for that? [16:42:23] I need to reinstall OS on analytics1001 soon (probably today) [16:42:23] so I could move the IP as part of that pocess [16:42:23] process [16:42:24] i suggest to rename db42 to 'lair' if it becomes a bastion box :) [16:42:24] <^demon> Or maybe bastNNNN like bast1001. Similar names are nice :) [16:42:24] ottomata: i don't see why we couldn't but let's wait for mark or robh to confirm that it is ok [16:42:24] cool [16:42:24] their one off ibm servers [16:42:24] drdee: do you know dkg? [16:42:24] yeah anything will do [16:42:24] he IRCs from a box named lair [16:42:24] iirc [16:42:24] jeremby: nope [16:42:24] * jeremyb wonders who jeremby is :P [16:42:24] your twin brother :D [16:42:25] <^demon> An evil twin? 
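The question raised a little earlier, "how do we know it's not swift that broke?", invites a concrete first step given how often LVS HTTP IPv4 on ms-fe.eqiad.wmnet flaps in this log. A minimal sketch: compare the LVS service address with a frontend queried directly. The backend host name here is illustrative and the URL path the Icinga check actually fetches is not shown in the log, so plain "/" is assumed.

    # Compare the LVS VIP with one proxy queried directly
    # (ms-fe1001 is an example name, not confirmed from the log).
    for host in ms-fe.eqiad.wmnet ms-fe1001.eqiad.wmnet; do
        curl -s -o /dev/null --max-time 10 -w "$host %{http_code} %{time_total}s\n" "http://$host/"
    done

If the VIP times out while individual frontends answer quickly, suspicion shifts to LVS/PyBal or to whichever storage backend (swift, or the ceph cluster mentioned above) the pooled proxies are waiting on.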
* jeremyb will bbl [16:42:25] bizarro jb [16:42:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:42:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:42:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.570 second response time [16:42:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.201 seconds [16:50:54] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 187 seconds [16:50:54] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 190 seconds [16:50:54] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 191 seconds [16:50:55] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 196 seconds [16:50:56] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 21 seconds [16:50:56] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:56] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [16:50:56] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [16:50:56] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [16:50:56] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [16:51:11] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [16:53:02] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [16:53:07] Change abandoned: Alex Monk; "This event is over..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46547 [17:04:31] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [17:06:41] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out [17:07:40] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [17:20:50] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:21:39] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [17:25:59] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Puppet has not run in the last 10 hours [17:35:49] PROBLEM - MySQL Recent Restart on db1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:36:38] PROBLEM - MySQL Recent Restart on db1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:36:40] RECOVERY - MySQL Recent Restart on db1011 is OK: OK 370 seconds since restart [17:38:17] RECOVERY - MySQL Recent Restart on db1011 is OK: OK 460 seconds since restart [17:43:33] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 188 seconds [17:43:40] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 192 seconds [17:43:53] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 198 seconds [17:44:02] New patchset: Michał
Łazowik; "Wikidata language code subdomain redirect to ItemByTitle special page" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/47088 [17:44:52] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 218 seconds [17:45:10] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:07] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [17:47:07] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:47:07] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [17:47:07] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [17:47:08] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [17:48:37] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 247 bytes in 0.054 seconds [17:49:00] .... [17:49:04] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [17:49:09] New patchset: Michał Łazowik; "Wikidata language code subdomain redirect to ItemByTitle special page" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/47088 [17:49:11] come to check about page, get clear page. [17:49:16] easiest alert ever. [17:50:29] New patchset: Michał Łazowik; "Wikidata language code subdomain redirect to ItemByTitle special page" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/47088 [17:51:33] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [17:51:54] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [17:51:55] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [17:51:59] New patchset: Michał Łazowik; "Wikidata language code subdomain redirect to ItemByTitle special page" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/47088 [17:52:40] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [18:03:03] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:03:53] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [18:08:53] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Puppet has not run in the last 10 hours [18:08:58] New review: Denny Vrandecic; "This really means "Looks good to me", i.e. it seems to do what it should, i.e. rewriting http://en.w..."
[operations/apache-config] (master) C: 1; - https://gerrit.wikimedia.org/r/47088 [18:12:03] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:12:53] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [18:12:54] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [18:13:50] RoanKattouw: So I tested your apache config locally [18:13:53] and it does indeed work [18:13:58] so im gonna +2/merge your shit [18:15:22] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44164 [18:15:40] New review: RobH; "tested apache config stuff, works, reviewed rest, seems legit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44164 [18:15:56] RobH: Thanks man [18:16:44] its merged, want me to push a puppet update on the CI server now so we can see if it breaks it? [18:17:08] ....gallium is slow. [18:17:43] and puppet is processing on gallium right now, explains why its slow. [18:19:24] <^demon> gallium needs a reboot too :\ [18:20:09] ^demon: does it? [18:20:13] i can reboot it now if it needs it. [18:20:19] <^demon> *** System restart required *** [18:20:27] meh. [18:20:36] <^demon> Antoine's not around, so I'm leery of doing it w/o him. [18:20:38] I read that 'guillaume needa a reboot too' [18:20:40] <^demon> Afraid Zuul will freak out. [18:20:43] guess it's time for a break [18:20:48] yea we will wait then. [18:20:55] puppet update applying for the ci update. [18:21:06] And right there, I'm happy that no one has root on me. [18:21:25] eh yup [18:21:27] <^demon> sudo -u guillom reboot [18:21:47] s/reboot/make me a sandwich/ ? [18:22:02] <^demon> make[1]: No target "me" found. [18:22:05] guillom:~# sudo /etc/init.d/network stop [18:22:09] * RobH watches guillom go offline [18:22:27] I probably should, considering it's 7:30pm here [18:22:30] hehe [18:22:37] i gave you a perfect exit line! [18:22:38] yeah [18:22:38] ;] [18:22:45] * apergos goes foraging for food [18:23:49] RoanKattouw: those changes are live on gallium (atleast the apache changes and such are live, i didnt test the actual tests) [18:25:08] <^demon> RobH: 18:24:50 up 99 days, 2:41, 2 users, load average: 1.12, 2.47, 2.79 [18:26:48] ^demon: thats never not good right? ;P [18:35:04] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out [18:35:52] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [18:40:26] hey guys, labs question - the kripke instance seems to be down, ssh-ing into it doesn't work because I can't reach bastion [18:41:32] and more importantly, something odd seems to be happening to DNS because http://reportcard.wmflabs.org isn't accessible from there but http://208.80.153.208/ is [18:42:09] milimetric, ask in #wikimedia-labs maybe [18:42:12] they care more over there [18:42:14] heheh [18:42:33] thank you :) [18:52:00] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:45] woosters: who is Ryan's backup on labs stuff? [18:52:50] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [18:53:22] robla: andrewbogott handled some fun disasters yesterday [18:54:11] milimetric: I see the dns problem too; that's next on my list after the problem I'm troubleshooting now :) [18:54:38] whee! 
thanks Andrew [18:56:02] andrewbogott: This isn't the same problem I had yesterday, where my IP address got disassociated...is it? [18:56:14] doubt it, but may have a common underlying cause [18:56:33] Hm, maybe it's returned. [19:06:19] heya woosters, if you got a sec, whatcha think? [19:06:19] https://rt.wikimedia.org/Ticket/Display.html?id=4469 [19:06:21] s'ok [19:06:21] ? [19:16:00] PROBLEM - Parsoid Varnish on titanium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:01] PROBLEM - Parsoid on mexia is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:38] PROBLEM - Parsoid Varnish on titanium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:32] PROBLEM - Parsoid on mexia is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:07] robla - andrewbogott and mikewang will provide support [19:20:32] ottomata - will review it and get back to u [19:21:50] RECOVERY - Parsoid on mexia is OK: HTTP OK: HTTP/1.1 200 OK - 1308 bytes in 0.054 second response time [19:22:48] RECOVERY - Parsoid on mexia is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.005 seconds [19:24:01] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out [19:24:57] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.002 second response time [19:25:00] k danke [19:25:36] New patchset: Catrope; "One more s/praseodymium/titanium/ for Parsoid deployment stuff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47097 [19:25:44] RobH: ---^^ [19:29:23] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [19:31:56] PROBLEM - Parsoid Varnish on cerium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable [19:31:56] PROBLEM - Parsoid Varnish on cerium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 676 bytes in 0.028 second response time [19:32:24] PROBLEM - Parsoid Varnish on titanium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable [19:32:27] PROBLEM - Parsoid Varnish on titanium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 674 bytes in 0.029 second response time [19:32:56] PROBLEM - Parsoid Varnish on celsus is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 676 bytes in 1.088 second response time [19:33:27] PROBLEM - Parsoid on tola is CRITICAL: Connection refused [19:33:35] PROBLEM - Parsoid Varnish on celsus is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable [19:33:35] PROBLEM - Parsoid Varnish on constable is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable [19:33:57] PROBLEM - Parsoid Varnish on constable is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 674 bytes in 0.053 second response time [19:33:57] PROBLEM - Parsoid on mexia is CRITICAL: Connection refused [19:33:57] PROBLEM - Parsoid on cerium is CRITICAL: Connection refused [19:33:57] PROBLEM - Parsoid on constable is CRITICAL: Connection refused [19:33:57] PROBLEM - LVS HTTP IPv4 on parsoid.svc.pmtpa.wmnet is CRITICAL: Connection refused [19:33:57] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused [19:33:57] PROBLEM - Parsoid on kuo is CRITICAL: Connection refused [19:34:06] PROBLEM - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is CRITICAL: Connection refused [19:34:07] PROBLEM - Parsoid on celsus is CRITICAL: Connection refused [19:34:07] PROBLEM - Parsoid on lardner is CRITICAL: Connection refused [19:34:16] PROBLEM - Parsoid on xenon is CRITICAL: Connection refused [19:34:17] PROBLEM - Parsoid on 
wtp1 is CRITICAL: Connection refused [19:34:26] PROBLEM - Parsoid on caesium is CRITICAL: Connection refused [19:34:56] RECOVERY - Parsoid on mexia is OK: HTTP OK: HTTP/1.1 200 OK - 1308 bytes in 0.058 second response time [19:34:57] RECOVERY - Parsoid on cerium is OK: HTTP OK: HTTP/1.1 200 OK - 1308 bytes in 0.006 second response time [19:34:57] RECOVERY - Parsoid on constable is OK: HTTP OK: HTTP/1.1 200 OK - 1308 bytes in 0.057 second response time [19:34:57] RECOVERY - LVS HTTP IPv4 on parsoid.svc.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1308 bytes in 0.056 second response time [19:34:57] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1308 bytes in 0.006 second response time [19:34:57] RECOVERY - Parsoid on kuo is OK: HTTP OK: HTTP/1.1 200 OK - 1308 bytes in 0.057 second response time [19:35:06] RECOVERY - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1308 bytes in 0.002 second response time [19:35:07] RECOVERY - Parsoid on lardner is OK: HTTP OK: HTTP/1.1 200 OK - 1308 bytes in 0.055 second response time [19:35:17] RECOVERY - Parsoid on xenon is OK: HTTP OK: HTTP/1.1 200 OK - 1308 bytes in 0.003 second response time [19:35:17] RECOVERY - Parsoid on wtp1 is OK: HTTP OK: HTTP/1.1 200 OK - 1308 bytes in 0.055 second response time [19:35:24] PROBLEM - Parsoid on celsus is CRITICAL: Connection refused [19:35:27] RECOVERY - Parsoid on caesium is OK: HTTP OK: HTTP/1.1 200 OK - 1308 bytes in 0.002 second response time [19:35:27] RECOVERY - Parsoid on tola is OK: HTTP OK: HTTP/1.1 200 OK - 1308 bytes in 0.055 second response time [19:36:06] RECOVERY - Parsoid on celsus is OK: HTTP OK: HTTP/1.1 200 OK - 1308 bytes in 2.432 second response time [19:37:11] RECOVERY - Parsoid on celsus is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.047 seconds [19:37:57] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out [19:38:56] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [19:42:58] RoanKattouw: indeed it didn't [19:43:20] !log authdns-update [19:43:22] Logged the message, RobH [19:44:05] ottomata: So the DNS is updated, analytics1001.eqiad.wmnet and analytics1026.wikimedia.org [19:44:23] after the dhcpd files are updated and merged, you are good to pxe boot. [19:44:34] lemme know if you want me to merge the change on sockpuppet [19:45:02] RobH: Do you have any idea why the Parsoid Varnishes don't seem to be in Ganglia? I'm trying to figure out if I made a mistake in puppet, but it all looks the same as the Parsoid stuff, and that stuff works [19:45:31] Oh nm [19:45:33] I think I see it [19:45:35] D'oh [19:45:43] what was it? 
[19:45:54] cuz i noticed it earlier when i had to find a spare server [19:46:14] It wasn't actually in ganglia.pp [19:46:15] Patch inbound [19:47:06] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:57] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [19:48:35] New patchset: Catrope; "Add Parsoid Varnish clusters to $data_sources as well as $ganglia_clusters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47101 [19:48:53] New patchset: Cmjohnson; "updating mac address for db1052 and db1055" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47102 [19:49:24] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47102 [19:56:50] RobH: https://gerrit.wikimedia.org/r/47101 should fix the Ganglia thing [19:57:16] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:58:02] New patchset: Catrope; "Give the mortals group shell access to the Parsoid machine" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47103 [19:58:26] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:00:08] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:00:26] !log authdns-update [20:00:27] Logged the message, RobH [20:01:02] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47101 [20:02:49] RECOVERY - Host analytics1001 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [20:03:05] RoanKattouw: change is live ganglia shows them now [20:03:07] RECOVERY - Host analytics1001 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [20:03:20] thx for patch [20:04:06] Welcome [20:04:13] Thanks for deploying [20:07:38] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri Feb 1 20:07:32 UTC 2013 [20:08:08] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:08:09] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:08:18] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri Feb 1 20:08:15 UTC 2013 [20:08:55] RoanKattouw: So now that i can see the caches in ganglia [20:08:59] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out [20:09:00] it seems celsus is the only one doing work ;] [20:09:03] Yes, it is [20:09:07] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:09:08] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:09:09] Because we don't have LVS for the Varnishes yet [20:09:12] I have to work on that [20:09:18] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri Feb 1 20:09:12 UTC 2013 [20:09:24] are we going to roll the lvs for this into our normal lvs servers? [20:09:30] or will these require dedicated lvs servers? 
[20:09:32] It's already in there [20:09:35] ahh, cool [20:09:41] just wondering if i needed to look for more servers =] [20:09:59] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [20:10:08] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:10:15] RobH: You do, for Parsoid itself :) [20:10:22] I'm gonna put something together for that soon [20:10:36] yep, we got in the dual cpu servers now in both sites [20:10:40] Gabriel wants to do some benchmarking on one of the hot spares in pmtpa [20:10:40] and they are being racked this week and next [20:10:46] Oh? [20:10:48] so you'll have some beefier parsoid nodes soon [20:11:02] ohh, nice ;) [20:11:04] dual cpu, double memory [20:11:22] we don't need much memory, just CPU [20:11:41] but it won't hurt of course.. [20:11:44] Are these gonna be wtp1002-wtp10NN? [20:11:46] the memory was doubled just so we dont have less per core, heh [20:11:48] And how many of them are there? [20:11:53] RoanKattouw: that was my plan, uhhh [20:11:59] how many you need? [20:12:12] I assumed you would want at least 3 per site minimum [20:12:19] with a note that I may have to give you up to 5 [20:12:22] I think gwicke said he was gonna do some benchmarks on the pmtpa machines (which are cold spares) to figure out how many we'll need [20:12:26] from our very early conversations [20:12:28] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [20:12:32] cool [20:12:32] how many cores are there per socket? [20:12:39] gwicke: 8 [20:12:46] so 16 total [20:12:55] Yeah, well one thing that changed since our earlier convos is that they're now planning to run this thing as the default editor starting in the summer [20:13:11] Which suggests we may be looking for a tiny capacity increase ;) [20:13:20] well, we have 10 of the new HPM servers [20:13:21] my rough and conservative guess would be that we'd need around 5-7 machines to be safe [20:13:23] (high performacne misc) [20:13:41] and i have not allocated any of them for anythign but parsoid yet [20:13:47] that is based on very extrapolated data though [20:13:53] (cuz i got some additoinal ones with ssds for ashers stuff) [20:14:20] so, understanding that Im speaking purely in 'what do we have on site now' and not as a manager who approves project allocations [20:14:24] we have the servers. [20:14:56] gwicke: we can handle that with on site spares now (as of this week) so we should be fine [20:15:08] just let me know how many we really need once you do some testing and we'll get it done [20:15:14] I'd say we need at least five. 
We can then benchmark those to see if the performance is sufficient to keep up with peak edit rates [20:15:31] sounds reasonable, once we have them all racked and ready i'll pass them along to you guys [20:15:33] *nod* [20:16:20] Cool [20:16:50] RobH: It might be a week or so before I set them up, I need Ryan to be back so I can figure out a deployment strategy thingy before I spin up these boxes [20:17:59] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 187 seconds [20:18:08] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:16] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 190 seconds [20:18:16] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 189 seconds [20:18:39] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 198 seconds [20:18:59] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [20:19:08] RoanKattouw: sounds good, chris is workign on racking apaches over the new misc stuff anyhow [20:19:15] OK ood [20:19:16] *good [20:19:16] we need the apache capacity in eqiad [20:19:29] So no real rush on racking them, until Ryan is back I won't be able to do anything with the new boxes anyways [20:20:17] RobH, am I duuummmbbbbb or sumpin? [20:20:24] to get analytics1001 (cisco) to reinstall [20:20:29] i should just set boot-order pxe [20:20:31] commit [20:20:33] and reboot, right? [20:20:39] (I'm also looking in BIOS now manually) [20:20:59] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [20:21:38] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [20:21:43] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [20:22:01] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 1 seconds [20:22:20] ottomata: yep, since i didnt change anythign on the switch [20:22:25] pxe should launch the installer [20:22:36] i think we've seen this before [20:22:38] don't remember outcome [20:22:41] it looks like it tries to netboot [20:22:44] but then just regular boot [20:23:08] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:59] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [20:24:57] RobH, maybe if I set the first boot option to [20:24:57] MBA v6.0.11 Slot 0100 [20:24:58] ? [20:32:08] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:58] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [20:33:42] hmm, rats, nope [20:33:55] growl, why are the ciscos so stubborn [20:34:07] you still around RobH? 
[20:34:10] RECOVERY - Host analytics1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [20:34:11] PROBLEM - NTP on analytics1001 is CRITICAL: NTP CRITICAL: Offset unknown [20:34:28] RECOVERY - Host analytics1001 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [20:35:50] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 189 seconds [20:35:58] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 190 seconds [20:36:17] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 195 seconds [20:36:31] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 201 seconds [20:36:42] ottomata: i am [20:36:49] its a stubborn ox [20:36:54] i say netboot! [20:36:56] it says: NO [20:37:30] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 2 seconds [20:37:31] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out [20:37:46] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [20:37:50] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [20:38:04] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [20:38:10] RECOVERY - NTP on analytics1001 is OK: NTP OK: Offset 0.0008429288864 secs [20:38:20] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [20:38:46] hrmm, lemme finish wolfing down this food and i take a gander at it [20:39:58] ok danke [20:51:30] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:52:20] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [21:06:10] New patchset: MaxSem; "WIP: advanced Solr monitoring script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47111 [21:20:01] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [21:20:52] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [21:21:21] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:21:31] poor mark [21:22:07] New review: Hashar; "Patchset 24 hack up the lvs configuration IPs to point to the beta Apaches." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [21:23:00] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 247 bytes in 0.058 seconds [21:24:41] Change merged: Andrew Bogott; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/47032 [21:45:06] New patchset: Hashar; ".pep8 , ignore tabs!" 
[operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/47115 [21:51:03] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [21:53:59] New review: Hashar; "recheck" [operations/debs/adminbot] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47115 [21:57:50] Change merged: Andrew Bogott; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/47115 [22:10:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [22:11:13] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [22:20:12] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Puppet has not run in the last 10 hours [22:20:38] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Puppet has not run in the last 10 hours [22:21:51] New patchset: Lcarr; "ganglios requires gmetad.conf -- seeing if it is happy with a mostly empty one" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47175 [22:26:32] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out [22:27:32] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [22:29:05] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47175 [22:29:32] i hope ganglios doesn't actually require a full gmetad installation [22:33:01] I got a ganglia bug for you Leslie : -D [22:33:08] add in disk I/O reporting! https://bugzilla.wikimedia.org/show_bug.cgi?id=36994 [22:33:08] what ? [22:33:09] ;-d [22:33:13] hehe [22:33:17] in labs [22:33:34] ahh yeah maybe it is already in production [22:33:39] I need to poke Ryan about it [22:33:51] on a different subject, do you happen to know if LABS supports LVS ? [22:34:27] not yet [22:34:32] it should! [22:35:37] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out [22:35:44] ah that clarify it, thanks! [22:36:04] andrewbogott: "no LVS support in labs", Leslie, just a minute ago, above. [22:36:21] one day it'll all work ;) [22:36:27] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [22:36:35] yeah I am sure [22:40:37] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out [22:40:45] so hmm [22:40:47] bed time for me [22:40:53] have a good fun and nice week end [22:41:28] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.017 second response time [22:46:25] New patchset: Andrew Bogott; "Update changelog + many pep8 and pyflakes fixes." [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/47178 [22:54:37] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out [22:55:28] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.002 second response time [22:56:47] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 186 seconds [22:57:27] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 203 seconds [22:57:28] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 205 seconds [22:57:56] New patchset: Andrew Bogott; "Update changelog + many pep8 and pyflakes fixes." 
[operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/47178 [22:58:17] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 223 seconds [22:59:56] New review: Andrew Bogott; "With all the pep8 changes this patch is pretty much un-reviewable by a human." [operations/debs/adminbot] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/47178 [22:59:56] Change merged: Andrew Bogott; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/47178 [23:11:54] New patchset: Lcarr; "Revert "ganglios requires gmetad.conf -- seeing if it is happy with a mostly empty one"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47180 [23:12:07] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47180 [23:13:37] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:28] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.003 second response time [23:17:11] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [23:17:27] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [23:17:28] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [23:17:29] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [23:27:37] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:28:28] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.005 second response time [23:36:37] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out [23:37:28] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.003 second response time [23:37:52] New patchset: Lcarr; "try 2 for getting ganglios working" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47186 [23:40:03] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47186 [23:49:43] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [23:52:32] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out [23:53:22] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [23:57:35] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out [23:58:25] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time
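The adminbot cleanup threaded through the end of the log (".pep8 , ignore tabs!" followed by a pep8/pyflakes sweep that Andrew calls "pretty much un-reviewable by a human") can be checked locally before review. A minimal sketch, assuming the stock pep8 and pyflakes command-line tools of the time; the file name is illustrative and W191 ("indentation contains tabs") only stands in for whatever the repository's actual .pep8 file suppresses, since its contents are not shown here.

    # Style and static checks matching the cleanup change
    # (file name and ignored code are illustrative, not the repo's real config).
    pep8 --ignore=W191 adminlogbot.py
    pyflakes adminlogbot.py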