[00:21:25] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[00:21:43] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[00:25:28] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[00:27:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:40:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.249 seconds
[00:47:22] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 181 seconds
[00:47:50] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 183 seconds
[00:48:43] PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 198 seconds
[00:49:01] PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 206 seconds
[00:54:17] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds
[00:54:43] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[00:55:37] RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 3 seconds
[00:55:55] RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 9 seconds
[01:08:13] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.051 second response time
[01:13:28] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 280 seconds
[01:14:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:15:16] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds
[01:25:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.538 seconds
[01:26:40] PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours
[01:58:38] PROBLEM - Puppet freshness on cp1044 is CRITICAL: Puppet has not run in the last 10 hours
[01:59:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:05:45] !log LocalisationUpdate failed: git pull of extensions failed
[02:05:49] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 184 seconds
[02:06:43] PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 219 seconds
[02:06:44] PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 220 seconds
[02:06:44] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 212 seconds
[02:12:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.590 seconds
[02:20:04] PROBLEM - MySQL disk space on db78 is CRITICAL: DISK CRITICAL - free space: /a 117551 MB (3% inode=99%):
[02:27:16] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[02:32:40] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[02:35:42] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[02:38:40] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[02:39:16] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.002 second response time on port 11000
[02:39:25] RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds
[02:39:34] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds
[02:39:34] RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds
[02:44:40] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[03:20:40] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[03:27:43] RECOVERY - MySQL disk space on db78 is OK: DISK OK
[03:54:25] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[03:58:01] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer:
[04:02:58] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.062 second response time
[04:03:16] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[06:42:09] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[06:51:20] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[06:51:20] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[06:51:21] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[06:51:21] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[06:51:32] Nagios spam...
[08:12:27] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[08:12:28] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[08:28:30] PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 192 seconds
[08:29:51] PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 244 seconds
[08:30:19] RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds
[08:31:30] RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds
[09:32:26] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:26] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:26] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[09:32:26] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:26] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:27] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours
[09:52:26] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 191 seconds
[09:52:33] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 196 seconds
[09:55:51] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[09:56:00] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[10:33:13] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 192 seconds
[10:33:39] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 204 seconds
[10:34:15] PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 183 seconds
[10:35:36] PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 196 seconds
[10:39:03] RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds
[10:39:21] RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds
[10:41:54] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds
[10:42:21] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[11:28:25] PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours
[11:32:47] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 208 seconds
[11:33:13] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 224 seconds
[11:57:04] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[11:57:22] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[11:59:19] PROBLEM - Puppet freshness on cp1044 is CRITICAL: Puppet has not run in the last 10 hours
[12:34:05] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[12:37:05] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[12:46:05] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[13:08:17] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[13:22:05] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[13:22:41] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.010 second response time on port 11000
[13:47:10] PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 205 seconds
[13:47:37] PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 220 seconds
[13:54:04] RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds
[13:54:31] RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds
[14:38:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:46:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds
[15:13:00] New patchset: Dereckson; "(bug 43411) Import sources for se.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40561
[15:20:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:35:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds
[15:57:57] New patchset: Dereckson; "(bug 43310) Import sources for it.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40563
[16:07:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:18:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.954 seconds
[16:43:31] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[16:48:06] New patchset: Jgreen; "attempting to add passive check for activemq cclimbo queue" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40564
[16:48:24] New patchset: Dereckson; "(bug 43310) Import sources for it.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40565
[16:48:24] New patchset: Dereckson; "(bug 42933) Initial configuration for es.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38054
[16:48:37] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40564
[16:52:31] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[16:52:31] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[16:52:31] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[16:52:31] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[16:52:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:10:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.038 seconds
[17:42:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:55:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds
[18:14:08] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[18:14:08] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[18:16:57] New patchset: Jgreen; "wrong check_command for passive test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40568
[18:18:32] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40568
[18:20:35] maxsem: i just updated ticket solr1-3 in tampa are all yours...solr1002 in eqiad is still down until next week
[18:20:50] cmjohnson1, thanks
[18:21:03] yw
[18:22:04] cmjohnson1, so I wrote https://gerrit.wikimedia.org/r/#/c/39739/ :)
[18:22:59] cool
[18:24:16] i am going to take it out of decom list now so we know we can use it again
[18:24:46] wouldn't it result in Nagios complaints?
[18:25:22] no, nagios stops reporting it about 24 hours after it's in decom
[18:25:53] it will stay that way
[18:28:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:40:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.694 seconds
[18:57:29] RECOVERY - Varnish HTCP daemon on cp1044 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker
[18:57:56] RECOVERY - Varnish HTTP mobile-backend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 698 bytes in 0.487 seconds
[18:58:05] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Wed Dec 26 18:57:51 UTC 2012
[18:58:23] RECOVERY - Varnish HTTP mobile-frontend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.054 seconds
[18:58:32] RECOVERY - NTP on cp1044 is OK: NTP OK: Offset 0.001098632812 secs
[18:58:50] RECOVERY - SSH on cp1044 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[19:06:20] Robla wanted me to subscribe to the ops mailing list, and he said I should mention it here when I did so someone can approve the subscription. So, I'm mentioning it. Thanks.
[19:06:57] New patchset: MaxSem; "Use FQDN for Solr replication" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40569
[19:07:23] anomie, slowly becoming a half-op? :)
[19:07:41] MaxSem- For the Eqiad migration
[19:11:50] New patchset: MaxSem; "Add PageImages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40570
[19:14:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:23:29] New patchset: Andrew Bogott; "Added install_path param for wikidata.pp, files are affected by (https://gerrit.wikimedia.org/r/#/c/35313/2)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36353
[19:24:10] PROBLEM - SSH on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:24:28] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:24:55] PROBLEM - Varnish HTCP daemon on cp1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:25:19] New review: Andrew Bogott; "Crap, I totally missed this when it came in. Sorry for the delay in review... feel free to nag me vi..." [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/36353
[19:25:19] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36353
[19:25:23] PROBLEM - Varnish HTTP mobile-backend on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:25:31] PROBLEM - Varnish HTTP mobile-frontend on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:27:37] RECOVERY - SSH on cp1044 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[19:28:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.063 seconds
[19:28:23] RECOVERY - Varnish HTCP daemon on cp1044 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker
[19:28:49] RECOVERY - Varnish HTTP mobile-backend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 696 bytes in 0.053 seconds
[19:28:58] RECOVERY - Varnish HTTP mobile-frontend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.053 seconds
[19:33:55] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[19:33:56] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[19:33:56] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[19:33:56] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours
[19:33:56] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[19:33:56] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours
[19:49:16] New patchset: Asher; "redis replication topology and mc eqiad node defs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40573
[19:57:54] New patchset: Jgreen; "added passive check_donations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40576
[20:00:33] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40576
[20:01:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:08:17] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40573
[20:17:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.048 seconds
[20:35:28] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40570
[20:36:28] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:40:05] PROBLEM - Varnish HTTP mobile-frontend on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:40:05] PROBLEM - Varnish HTCP daemon on cp1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:40:23] PROBLEM - Varnish HTTP mobile-backend on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:40:40] PROBLEM - SSH on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:42:23] RECOVERY - SSH on cp1044 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[20:42:32] RECOVERY - Varnish HTTP mobile-frontend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds
[20:43:26] RECOVERY - Varnish HTCP daemon on cp1044 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker
[20:43:53] RECOVERY - Varnish HTTP mobile-backend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 696 bytes in 0.053 seconds
[20:45:59] apergos, still on RT duty?
[20:46:35] PROBLEM - Memcached on mc1001 is CRITICAL: Connection refused
[20:46:54] a good q
[20:46:56] in thory no
[20:47:00] theory too
[20:47:08] in practice I dunno if anyone else is around to take it
[20:47:29] but also in practice I'm going to sleep soon, so I'll check the queue tomorrow
[20:47:42] heh, this is actually just a merge request, can you merge 4202 into 4201?
[20:48:25] not right now but I can do it tomorrow morning
[20:48:28] I didn't realise RT wasn't quite as clever as OTRS in that replying to an email with the ops-.. email would create a new ticket, instead of putting it in the old one
[20:48:32] sure, thanks
[20:48:38] ok
[20:48:57] have a good reast of the day
[20:49:03] rest
[20:49:41] and you :)
[20:50:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:03:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.703 seconds
[21:10:17] PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 189 seconds
[21:10:34] !log awjrichards synchronized wmf-config/CommonSettings.php 'Enabling PageImages on testwiki'
[21:11:03] !log awjrichards synchronized wmf-config/InitialiseSettings.php 'Enabling PageImages on testwiki'
[21:11:47] PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 231 seconds
[21:14:07] New patchset: MaxSem; "zomgfail" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40682
[21:14:26] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40682
[21:17:47] New patchset: Asher; "removing depooled dbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40713
[21:17:56] PROBLEM - Varnish HTTP mobile-frontend on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:18:05] PROBLEM - SSH on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:18:07] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40713
[21:19:35] RECOVERY - Varnish HTTP mobile-frontend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds
[21:19:44] RECOVERY - SSH on cp1044 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[21:23:39] !log maxsem synchronized wmf-config/CommonSettings.php 'Fix PageImages inclusion'
[21:25:43] * Jeff_Green inadvertently downed nagios by restarting it mid-puppet-run, fixing as soon as puppet is done...
[21:26:37] anyone here following this bug? https://bugzilla.wikimedia.org/show_bug.cgi?id=41130 (purge requests failing for north america)
[21:27:03] New patchset: Jgreen; "adding frtech user to nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40740
[21:27:45] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40740
[21:29:43] MaxSem: I fixed a small typo: https://gerrit.wikimedia.org/r/#/c/40741/
[21:33:05] PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours
[21:33:14] RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds
[21:33:55] Raymond_, thanks! Is there an analog to {{optional}} that says don't translate?
[21:34:33] RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds
[21:34:34] MaxSem: I add the "don't translate" to our twn settings in a minute
[21:34:53] cool, thanks
[21:36:10] merged
[21:37:51] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:38:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:39:13] done with 40744
[21:40:19] PROBLEM - Varnish HTTP mobile-backend on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:40:37] PROBLEM - NTP on cp1044 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:40:37] PROBLEM - Varnish HTTP mobile-frontend on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:40:50] MaxSem: it's documented in https://www.mediawiki.org/wiki/Manual:System_messages#Creating_new_messages by the way, feedback appreciated about where you'd expect the info to be, if it's unclear etc.
[21:41:04] PROBLEM - SSH on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:41:31] PROBLEM - Varnish HTCP daemon on cp1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:46:17] New patchset: Jgreen; "fixing puppet+nagios+activemq+fundraising config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40747
[21:47:49] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors -
[21:49:13] New review: Asher; "shm_workspace defaults to 8k, and should be > than shm_reclen. also see:" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/40554
[21:49:42] New patchset: Jgreen; "fixing puppet+nagios+activemq+fundraising config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40747
[21:50:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.070 seconds
[21:50:36] * Jeff_Green garg.
[21:50:38] New patchset: Jgreen; "fixing puppet+nagios+activemq+fundraising config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40747
[21:51:12] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40747
[21:59:46] !log awjrichards Started syncing Wikimedia installation... : MobileFrontend updates per https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2012-12-26
[22:10:39] New review: Ori.livneh; "@binasher: Works for me. I'll update the patch." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/40554
[22:22:39] New patchset: Ori.livneh; "(RT 4094) Increase varnish SHM defaults" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40554
[22:22:46] ^ binasher
[22:23:19] !log awjrichards Finished syncing Wikimedia installation... : MobileFrontend updates per https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2012-12-26
[22:23:21] i also specified the values in units of kilobytes rather than bytes; it seemed easier to sanity-check at a glance.
[22:24:17] New review: Anomie; "Notes for the future: We should add a "$wmfActiveDatacenter" value set to whichever data center (pmt..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/32167
[22:25:53] !log awjrichards synchronized php-1.21wmf6/extensions/MobileFrontend/ 'touch files'
[22:27:48] PROBLEM - Memcached on mc1002 is CRITICAL: Connection refused
[22:28:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:29:34] New patchset: MaxSem; "Enable PageImages on test2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40751
[22:29:36] RECOVERY - Memcached on mc1001 is OK: TCP OK - 0.027 second response time on port 11211
[22:29:36] RECOVERY - Memcached on mc1002 is OK: TCP OK - 0.028 second response time on port 11211
[22:30:05] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40751
[22:31:37] !log maxsem synchronized wmf-config/InitialiseSettings.php 'Enable PageImages on test2'
[22:35:18] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[22:38:18] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[22:38:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds
[22:47:18] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[22:49:30] New patchset: MaxSem; "Enable PageImages everywhere" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40753
[22:50:21] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40753
[22:51:58] !log maxsem synchronized wmf-config/InitialiseSettings.php 'Enable PageImages everywhere'
[22:59:12] New patchset: Jgreen; "check_dummy != nsca-fail" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40754
[23:00:07] now that the ganglia web interface wants a login and password… do I /have/ a login and password?
[23:03:13] PM'd
[23:12:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:17:13] hi TimStarling, looks like binasher may have fixed the problem while you were investigating it
[23:19:18] !log !logging what morebots missed
[23:19:27] Logged the message, Master
[23:19:49] !log 2012-12-24 13:28:34 <+logmsgbot> !log reedy synchronized live-1.5/
[23:19:52] !log 2012-12-24 14:49:25 < apergos> !log rebooting cp1043 which had fallen over
[23:19:55] !log 2012-12-25 02:23:37 <+logmsgbot> !log LocalisationUpdate completed (1.21wmf6) at Tue Dec 25 02:23:37 UTC 2012
[23:19:58] Logged the message, Master
[23:19:59] !log 2012-12-25 05:45:08 < Ryan_Lane> !log restarting gerrit. also blocked a bad bot
[23:20:01] !log 2012-12-25 10:58:38 <+logmsgbot> !log maxsem synchronized wmf-config/CommonSettings.php 'Captchas broken on every wiki'
[23:20:08] Logged the message, Master
[23:20:16] Logged the message, Master
[23:20:25] Logged the message, Master
[23:20:33] Logged the message, Master
[23:20:45] !log 2012-12-25 11:03:46 <+logmsgbot> !log maxsem synchronized wmf-config/CommonSettings.php 'Disable broken Swift captchas on mw.o too'
[23:20:49] !log 2012-12-26 02:05:45 <+logmsgbot> !log LocalisationUpdate failed: git pull of extensions failed
[23:20:51] !log 2012-12-26 21:10:34 <+logmsgbot> !log awjrichards synchronized wmf-config/CommonSettings.php 'Enabling PageImages on testwiki'
[23:20:54] Logged the message, Master
[23:20:54] !log 2012-12-26 21:11:03 <+logmsgbot> !log awjrichards synchronized wmf-config/InitialiseSettings.php 'Enabling PageImages on testwiki'
[23:20:58] !log 2012-12-26 21:23:38 <+logmsgbot> !log maxsem synchronized wmf-config/CommonSettings.php 'Fix PageImages inclusion'
[23:21:02] Logged the message, Master
[23:21:11] Logged the message, Master
[23:21:19] Logged the message, Master
[23:21:27] Logged the message, Master
[23:21:29] !log 2012-12-26 21:59:47 <+logmsgbot> !log awjrichards Started syncing Wikimedia installation... : MobileFrontend updates per https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2012-12-26
[23:21:33] !log 2012-12-26 22:23:19 <+logmsgbot> !log awjrichards Finished syncing Wikimedia installation... : MobileFrontend updates per https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2012-12-26
[23:21:37] Logged the message, Master
[23:21:37] !log 2012-12-26 22:25:53 <+logmsgbot> !log awjrichards synchronized php-1.21wmf6/extensions/MobileFrontend/ 'touch files'
[23:21:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.938 seconds
[23:21:41] !log 2012-12-26 22:31:37 <+logmsgbot> !log maxsem synchronized wmf-config/InitialiseSettings.php 'Enable PageImages on test2'
[23:21:42] New patchset: Jgreen; "fix check_command for fundraising tests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40759
[23:21:43] !log 2012-12-26 22:51:58 <+logmsgbot> !log maxsem synchronized wmf-config/InitialiseSettings.php 'Enable PageImages everywhere'
[23:21:45] Logged the message, Master
[23:21:53] Logged the message, Master
[23:22:01] Logged the message, Master
[23:22:09] Logged the message, Master
[23:22:59] binasher: did you change something to do with HTCP purging?
[23:23:18] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[23:23:24] TimStarling: yes he did, per bug 41130
[23:23:41] TimStarling: i restarted varnishhtcpd on all eqiad upload varnish hosts, see email on ops@
[23:24:37] were you going to add that to the server admin log?
[23:25:48] oh, that
[23:26:22] the most recent LocalisationUpdate failed so there's a decent chance tonight's will fail too. idk who may have a chance to look at it. Reedy ?
[23:26:40] * jeremyb really runs away
[23:26:58] Change abandoned: Jgreen; "checkout got mangled. starting over." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40754
[23:27:06] !log restarted varnishhtcpd on all eqiad upload varnish hosts; looking back fondly on vacation
[23:27:07] New patchset: Ori.livneh; "wmgUseEventLogging: default => true" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40761
[23:27:15] Logged the message, Master
[23:28:07] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40759
[23:30:30] PROBLEM - SSH on ms1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:30:39] * Reedy runs l10nupdate manually
[23:30:51] * Susan smiles at binasher.
[23:30:55] New patchset: Asher; "cache 4xx ttl in wikimedia.vcl.erb was being inadvertantly overridden, with all ttl's set to 30 days regardless of response code" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40762
[23:31:36] Reedy, we've just scapped
[23:31:39] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40761
[23:31:47] so no difference
[23:31:53] Well, yes
[23:32:17] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40762
[23:32:37] If the problem is in the lu update directory while updating the clones...
[23:34:40] binasher: you're good :)
[23:35:46] :)
[23:56:24] Well, that explains what is up with localisation update then...
[23:56:30] mw18: Permission denied (publickey,password).
[23:56:31] mw7: Permission denied (publickey,password).
[23:56:31] etc etc
[23:56:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds