[00:21:25] <nagios-wm>	 PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[00:21:43] <nagios-wm>	 RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[00:25:28] <nagios-wm>	 PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[00:27:52] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:40:01] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.249 seconds
[00:47:22] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 181 seconds
[00:47:50] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 183 seconds
[00:48:43] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 198 seconds
[00:49:01] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 206 seconds
[00:54:17] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds
[00:54:43] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[00:55:37] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 3 seconds
[00:55:55] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 9 seconds
[01:08:13] <nagios-wm>	 RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.051 second response time
[01:13:28] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 280 seconds
[01:14:40] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:15:16] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds
[01:25:01] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.538 seconds
[01:26:40] <nagios-wm>	 PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours
[01:58:38] <nagios-wm>	 PROBLEM - Puppet freshness on cp1044 is CRITICAL: Puppet has not run in the last 10 hours
[01:59:58] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:05:45] <logmsgbot>	 !log LocalisationUpdate failed: git pull of extensions failed
[02:05:49] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 184 seconds
[02:06:43] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 219 seconds
[02:06:44] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 220 seconds
[02:06:44] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 212 seconds
[02:12:16] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.590 seconds
[02:20:04] <nagios-wm>	 PROBLEM - MySQL disk space on db78 is CRITICAL: DISK CRITICAL - free space: /a 117551 MB (3% inode=99%):
[02:27:16] <nagios-wm>	 PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[02:32:40] <nagios-wm>	 PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[02:35:42] <nagios-wm>	 PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[02:38:40] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[02:39:16] <nagios-wm>	 RECOVERY - Memcached on virt0 is OK: TCP OK - 0.002 second response time on port 11000
[02:39:25] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds
[02:39:34] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds
[02:39:34] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds
[02:44:40] <nagios-wm>	 PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[03:20:40] <nagios-wm>	 PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[03:27:43] <nagios-wm>	 RECOVERY - MySQL disk space on db78 is OK: DISK OK
[03:54:25] <nagios-wm>	 PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[03:58:01] <nagios-wm>	 PROBLEM - SSH on lvs1001 is CRITICAL: Server answer:
[04:02:58] <nagios-wm>	 RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.062 second response time
[04:03:16] <nagios-wm>	 RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[06:42:09] <nagios-wm>	 PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[06:51:20] <nagios-wm>	 PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[06:51:20] <nagios-wm>	 PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[06:51:21] <nagios-wm>	 PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[06:51:21] <nagios-wm>	 PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[06:51:32] <Matthew_>	 Nagios spam...
[08:12:27] <nagios-wm>	 PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[08:12:28] <nagios-wm>	 PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[08:28:30] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 192 seconds
[08:29:51] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 244 seconds
[08:30:19] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds
[08:31:30] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds
[09:32:26] <nagios-wm>	 PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:26] <nagios-wm>	 PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:26] <nagios-wm>	 PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[09:32:26] <nagios-wm>	 PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:26] <nagios-wm>	 PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:27] <nagios-wm>	 PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours
[09:52:26] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 191 seconds
[09:52:33] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 196 seconds
[09:55:51] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[09:56:00] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[10:33:13] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 192 seconds
[10:33:39] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 204 seconds
[10:34:15] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 183 seconds
[10:35:36] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 196 seconds
[10:39:03] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds
[10:39:21] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds
[10:41:54] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds
[10:42:21] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[11:28:25] <nagios-wm>	 PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours
[11:32:47] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 208 seconds
[11:33:13] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 224 seconds
[11:57:04] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[11:57:22] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[11:59:19] <nagios-wm>	 PROBLEM - Puppet freshness on cp1044 is CRITICAL: Puppet has not run in the last 10 hours
[12:34:05] <nagios-wm>	 PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[12:37:05] <nagios-wm>	 PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[12:46:05] <nagios-wm>	 PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[13:08:17] <nagios-wm>	 PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[13:22:05] <nagios-wm>	 PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[13:22:41] <nagios-wm>	 RECOVERY - Memcached on virt0 is OK: TCP OK - 0.010 second response time on port 11000
[13:47:10] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 205 seconds
[13:47:37] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 220 seconds
[13:54:04] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds
[13:54:31] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds
[14:38:01] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:46:52] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds
[15:13:00] <gerrit-wm>	 New patchset: Dereckson; "(bug 43411) Import sources for se.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40561
[15:20:28] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:35:10] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds
[15:57:57] <gerrit-wm>	 New patchset: Dereckson; "(bug 43310) Import sources for it.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40563
[16:07:32] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:18:10] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.954 seconds
[16:43:31] <nagios-wm>	 PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[16:48:06] <gerrit-wm>	 New patchset: Jgreen; "attempting to add passive check for activemq cclimbo queue" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40564
[16:48:24] <gerrit-wm>	 New patchset: Dereckson; "(bug 43310) Import sources for it.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40565
[16:48:24] <gerrit-wm>	 New patchset: Dereckson; "(bug 42933) Initial configuration for es.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38054
[16:48:37] <gerrit-wm>	 Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40564
[16:52:31] <nagios-wm>	 PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[16:52:31] <nagios-wm>	 PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[16:52:31] <nagios-wm>	 PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[16:52:31] <nagios-wm>	 PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[16:52:58] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:10:22] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.038 seconds
[17:42:47] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:55:14] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds
[18:14:08] <nagios-wm>	 PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[18:14:08] <nagios-wm>	 PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[18:16:57] <gerrit-wm>	 New patchset: Jgreen; "wrong check_command for passive test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40568
[18:18:32] <gerrit-wm>	 Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40568
[18:20:35] <cmjohnson1>	 maxsem: i just updated the ticket; solr1-3 in tampa are all yours... solr1002 in eqiad is still down until next week
[18:20:50] <MaxSem>	 cmjohnson1, thanks
[18:21:03] <cmjohnson1>	 yw
[18:22:04] <MaxSem>	 cmjohnson1, so I wrote https://gerrit.wikimedia.org/r/#/c/39739/ :)
[18:22:59] <cmjohnson1>	 cool
[18:24:16] <cmjohnson1>	 i am going to take it out of decom list now so we know we can use it again
[18:24:46] <MaxSem>	 wouldn't it result in Nagios complaints?
[18:25:22] <cmjohnson1>	 no, nagios stops reporting it about 24 hours after it's in decom
[18:25:53] <cmjohnson1>	 it will stay that way
[18:28:32] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:40:51] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.694 seconds
[18:57:29] <nagios-wm>	 RECOVERY - Varnish HTCP daemon on cp1044 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker
[18:57:56] <nagios-wm>	 RECOVERY - Varnish HTTP mobile-backend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 698 bytes in 0.487 seconds
[18:58:05] <nagios-wm>	 RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Wed Dec 26 18:57:51 UTC 2012
[18:58:23] <nagios-wm>	 RECOVERY - Varnish HTTP mobile-frontend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.054 seconds
[18:58:32] <nagios-wm>	 RECOVERY - NTP on cp1044 is OK: NTP OK: Offset 0.001098632812 secs
[18:58:50] <nagios-wm>	 RECOVERY - SSH on cp1044 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[19:06:20] <anomie>	 Robla wanted me to subscribe to the ops mailing list, and he said I should mention it here when I did so someone can approve the subscription. So, I'm mentioning it. Thanks.
[19:06:57] <gerrit-wm>	 New patchset: MaxSem; "Use FQDN for Solr replication" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40569
[19:07:23] <MaxSem>	 anomie, slowly becoming a half-op? :)
[19:07:41] <anomie>	 MaxSem- For the Eqiad migration
[19:11:50] <gerrit-wm>	 New patchset: MaxSem; "Add PageImages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40570
[19:14:08] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:23:29] <gerrit-wm>	 New patchset: Andrew Bogott; "Added install_path param for wikidata.pp, files are affected by (https://gerrit.wikimedia.org/r/#/c/35313/2)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36353
[19:24:10] <nagios-wm>	 PROBLEM - SSH on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:24:28] <nagios-wm>	 PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:24:55] <nagios-wm>	 PROBLEM - Varnish HTCP daemon on cp1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:25:19] <gerrit-wm>	 New review: Andrew Bogott; "Crap, I totally missed this when it came in. Sorry for the delay in review... feel free to nag me vi..." [operations/puppet] (production); V: 2 C: 2;  - https://gerrit.wikimedia.org/r/36353
[19:25:19] <gerrit-wm>	 Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36353
[19:25:23] <nagios-wm>	 PROBLEM - Varnish HTTP mobile-backend on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:25:31] <nagios-wm>	 PROBLEM - Varnish HTTP mobile-frontend on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:27:37] <nagios-wm>	 RECOVERY - SSH on cp1044 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[19:28:13] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.063 seconds
[19:28:23] <nagios-wm>	 RECOVERY - Varnish HTCP daemon on cp1044 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker
[19:28:49] <nagios-wm>	 RECOVERY - Varnish HTTP mobile-backend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 696 bytes in 0.053 seconds
[19:28:58] <nagios-wm>	 RECOVERY - Varnish HTTP mobile-frontend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.053 seconds
[19:33:55] <nagios-wm>	 PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[19:33:56] <nagios-wm>	 PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[19:33:56] <nagios-wm>	 PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[19:33:56] <nagios-wm>	 PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours
[19:33:56] <nagios-wm>	 PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[19:33:56] <nagios-wm>	 PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours
[19:49:16] <gerrit-wm>	 New patchset: Asher; "redis replication topology and mc eqiad node defs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40573
[19:57:54] <gerrit-wm>	 New patchset: Jgreen; "added passive check_donations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40576
[20:00:33] <gerrit-wm>	 Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40576
[20:01:31] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:08:17] <gerrit-wm>	 Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40573
[20:17:34] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.048 seconds
[20:35:28] <gerrit-wm>	 Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40570
[20:36:28] <nagios-wm>	 PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:40:05] <nagios-wm>	 PROBLEM - Varnish HTTP mobile-frontend on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:40:05] <nagios-wm>	 PROBLEM - Varnish HTCP daemon on cp1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:40:23] <nagios-wm>	 PROBLEM - Varnish HTTP mobile-backend on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:40:40] <nagios-wm>	 PROBLEM - SSH on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:42:23] <nagios-wm>	 RECOVERY - SSH on cp1044 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[20:42:32] <nagios-wm>	 RECOVERY - Varnish HTTP mobile-frontend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds
[20:43:26] <nagios-wm>	 RECOVERY - Varnish HTCP daemon on cp1044 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker
[20:43:53] <nagios-wm>	 RECOVERY - Varnish HTTP mobile-backend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 696 bytes in 0.053 seconds
[20:45:59] <Thehelpfulone>	 apergos, still on RT duty?
[20:46:35] <nagios-wm>	 PROBLEM - Memcached on mc1001 is CRITICAL: Connection refused
[20:46:54] <apergos>	 a good q
[20:46:56] <apergos>	 in thory no
[20:47:00] <apergos>	 theory too
[20:47:08] <apergos>	 in practice I dunno if anyone else is around to take it
[20:47:29] <apergos>	 but also in practice I'm going to sleep soon, so I'll check the queue tomorrow
[20:47:42] <Thehelpfulone>	 heh, this is actually just a merge request, can you merge 4202 into 4201?
[20:48:25] <apergos>	 not right now but I can do it tomorrow morning
[20:48:28] <Thehelpfulone>	 I didn't realise RT wasn't quite as clever as OTRS in that replying to an email with the ops-.. email would create a new ticket, instead of putting it in the old one
[20:48:32] <Thehelpfulone>	 sure, thanks
[20:48:38] <apergos>	 ok
[20:48:57] <apergos>	 have a good reast of the day
[20:49:03] <apergos>	 rest
[20:49:41] <Thehelpfulone>	 and you :)
[20:50:38] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:03:05] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.703 seconds
[21:10:17] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 189 seconds
[21:10:34] <logmsgbot>	 !log awjrichards synchronized wmf-config/CommonSettings.php  'Enabling PageImages on testwiki'
[21:11:03] <logmsgbot>	 !log awjrichards synchronized wmf-config/InitialiseSettings.php  'Enabling PageImages on testwiki'
[21:11:47] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 231 seconds
[21:14:07] <gerrit-wm>	 New patchset: MaxSem; "zomgfail" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40682
[21:14:26] <gerrit-wm>	 Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40682
[21:17:47] <gerrit-wm>	 New patchset: Asher; "removing depooled dbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40713
[21:17:56] <nagios-wm>	 PROBLEM - Varnish HTTP mobile-frontend on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:18:05] <nagios-wm>	 PROBLEM - SSH on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:18:07] <gerrit-wm>	 Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40713
[21:19:35] <nagios-wm>	 RECOVERY - Varnish HTTP mobile-frontend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds
[21:19:44] <nagios-wm>	 RECOVERY - SSH on cp1044 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[21:23:39] <logmsgbot>	 !log maxsem synchronized wmf-config/CommonSettings.php  'Fix PageImages inclusion'
[21:25:43] * Jeff_Green  inadvertently downed nagios by restarting it mid-puppet-run, fixing as soon as puppet is done...
[21:26:37] <robla>	 anyone here following this bug?  https://bugzilla.wikimedia.org/show_bug.cgi?id=41130  (purge requests failing for north america)
[21:27:03] <gerrit-wm>	 New patchset: Jgreen; "adding frtech user to nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40740
[21:27:45] <gerrit-wm>	 Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40740
[21:29:43] <Raymond_>	 MaxSem: I fixed a small typo: https://gerrit.wikimedia.org/r/#/c/40741/
[21:33:05] <nagios-wm>	 PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours
[21:33:14] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds
[21:33:55] <MaxSem>	 Raymond_, thanks! Is there an analog to {{optional}} that says don't translate?
[21:34:33] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds
[21:34:34] <Raymond_>	 MaxSem: I'll add the "don't translate" to our twn settings in a minute
[21:34:53] <MaxSem>	 cool, thanks
[21:36:10] <MaxSem>	 merged
[21:37:51] <nagios-wm>	 PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:38:09] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:39:13] <Raymond_>	 done with 40744
[21:40:19] <nagios-wm>	 PROBLEM - Varnish HTTP mobile-backend on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:40:37] <nagios-wm>	 PROBLEM - NTP on cp1044 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:40:37] <nagios-wm>	 PROBLEM - Varnish HTTP mobile-frontend on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:40:50] <Nemo_bis>	 MaxSem: it's documented in https://www.mediawiki.org/wiki/Manual:System_messages#Creating_new_messages by the way, feedback appreciated about where you'd expect the info to be, if it's unclear etc.
[21:41:04] <nagios-wm>	 PROBLEM - SSH on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:41:31] <nagios-wm>	 PROBLEM - Varnish HTCP daemon on cp1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:46:17] <gerrit-wm>	 New patchset: Jgreen; "fixing puppet+nagios+activemq+fundraising config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40747
[21:47:49] <nagios-wm>	 PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors -
[21:49:13] <gerrit-wm>	 New review: Asher; "shm_workspace defaults to 8k, and should be > than shm_reclen. also see:" [operations/puppet] (production); V: 0 C: -1;  - https://gerrit.wikimedia.org/r/40554
[21:49:42] <gerrit-wm>	 New patchset: Jgreen; "fixing puppet+nagios+activemq+fundraising config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40747
[21:50:13] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.070 seconds
[21:50:36] * Jeff_Green  garg.
[21:50:38] <gerrit-wm>	 New patchset: Jgreen; "fixing puppet+nagios+activemq+fundraising config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40747
[21:51:12] <gerrit-wm>	 Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40747
[21:59:46] <logmsgbot>	 !log awjrichards Started syncing Wikimedia installation... : MobileFrontend updates per https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2012-12-26
[22:10:39] <gerrit-wm>	 New review: Ori.livneh; "@binasher: Works for me. I'll update the patch." [operations/puppet] (production) C: 0;  - https://gerrit.wikimedia.org/r/40554
[22:22:39] <gerrit-wm>	 New patchset: Ori.livneh; "(RT 4094) Increase varnish SHM defaults" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40554
[22:22:46] <ori-l>	 ^ binasher
[22:23:19] <logmsgbot>	 !log awjrichards Finished syncing Wikimedia installation... : MobileFrontend updates per https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2012-12-26
[22:23:21] <ori-l>	 i also specified the values in units of kilobytes rather than bytes; it seemed easier to sanity-check at a glance.
[22:24:17] <gerrit-wm>	 New review: Anomie; "Notes for the future: We should add a "$wmfActiveDatacenter" value set to whichever data center (pmt..." [operations/mediawiki-config] (master); V: 0 C: 0;  - https://gerrit.wikimedia.org/r/32167
[22:25:53] <logmsgbot>	 !log awjrichards synchronized php-1.21wmf6/extensions/MobileFrontend/  'touch files'
[22:27:48] <nagios-wm>	 PROBLEM - Memcached on mc1002 is CRITICAL: Connection refused
[22:28:15] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:29:34] <gerrit-wm>	 New patchset: MaxSem; "Enable PageImages on test2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40751
[22:29:36] <nagios-wm>	 RECOVERY - Memcached on mc1001 is OK: TCP OK - 0.027 second response time on port 11211
[22:29:36] <nagios-wm>	 RECOVERY - Memcached on mc1002 is OK: TCP OK - 0.028 second response time on port 11211
[22:30:05] <gerrit-wm>	 Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40751
[22:31:37] <logmsgbot>	 !log maxsem synchronized wmf-config/InitialiseSettings.php  'Enable PageImages on test2'
[22:35:18] <nagios-wm>	 PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[22:38:18] <nagios-wm>	 PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[22:38:54] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds
[22:47:18] <nagios-wm>	 PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[22:49:30] <gerrit-wm>	 New patchset: MaxSem; "Enable PageImages everywhere" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40753
[22:50:21] <gerrit-wm>	 Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40753
[22:51:58] <logmsgbot>	 !log maxsem synchronized wmf-config/InitialiseSettings.php  'Enable PageImages everywhere'
[22:59:12] <gerrit-wm>	 New patchset: Jgreen; "check_dummy != nsca-fail" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40754
[23:00:07] <andrewbogott>	 now that the ganglia web interface wants a login and password… do I /have/ a login and password?
[23:03:13] <MaxSem>	 PM'd
[23:12:39] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:17:13] <robla>	 hi TimStarling, looks like binasher may have fixed the problem while you were investigating it
[23:19:18] <jeremyb>	 !log !logging what morebots missed
[23:19:27] <morebots>	 Logged the message, Master
[23:19:49] <jeremyb>	 !log 2012-12-24 13:28:34 <+logmsgbot> !log reedy synchronized live-1.5/
[23:19:52] <jeremyb>	 !log 2012-12-24 14:49:25 < apergos> !log rebooting cp1043 which had fallen over
[23:19:55] <jeremyb>	 !log 2012-12-25 02:23:37 <+logmsgbot> !log LocalisationUpdate completed (1.21wmf6) at Tue Dec 25 02:23:37 UTC 2012
[23:19:58] <morebots>	 Logged the message, Master
[23:19:59] <jeremyb>	 !log 2012-12-25 05:45:08 < Ryan_Lane> !log restarting gerrit. also blocked a bad bot
[23:20:01] <jeremyb>	 !log 2012-12-25 10:58:38 <+logmsgbot> !log maxsem synchronized wmf-config/CommonSettings.php  'Captchas broken on every wiki'
[23:20:08] <morebots>	 Logged the message, Master
[23:20:16] <morebots>	 Logged the message, Master
[23:20:25] <morebots>	 Logged the message, Master
[23:20:33] <morebots>	 Logged the message, Master
[23:20:45] <jeremyb>	 !log 2012-12-25 11:03:46 <+logmsgbot> !log maxsem synchronized wmf-config/CommonSettings.php  'Disable broken Swift captchas on mw.o too'
[23:20:49] <jeremyb>	 !log 2012-12-26 02:05:45 <+logmsgbot> !log LocalisationUpdate failed: git pull of extensions failed
[23:20:51] <jeremyb>	 !log 2012-12-26 21:10:34 <+logmsgbot> !log awjrichards synchronized wmf-config/CommonSettings.php  'Enabling PageImages on testwiki'
[23:20:54] <morebots>	 Logged the message, Master
[23:20:54] <jeremyb>	 !log 2012-12-26 21:11:03 <+logmsgbot> !log awjrichards synchronized wmf-config/InitialiseSettings.php  'Enabling PageImages on testwiki'
[23:20:58] <jeremyb>	 !log 2012-12-26 21:23:38 <+logmsgbot> !log maxsem synchronized wmf-config/CommonSettings.php  'Fix PageImages inclusion'
[23:21:02] <morebots>	 Logged the message, Master
[23:21:11] <morebots>	 Logged the message, Master
[23:21:19] <morebots>	 Logged the message, Master
[23:21:27] <morebots>	 Logged the message, Master
[23:21:29] <jeremyb>	 !log 2012-12-26 21:59:47 <+logmsgbot> !log awjrichards Started syncing Wikimedia installation... : MobileFrontend updates per https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2012-12-26
[23:21:33] <jeremyb>	 !log 2012-12-26 22:23:19 <+logmsgbot> !log awjrichards Finished syncing Wikimedia installation... : MobileFrontend updates per https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2012-12-26
[23:21:37] <morebots>	 Logged the message, Master
[23:21:37] <jeremyb>	 !log 2012-12-26 22:25:53 <+logmsgbot> !log awjrichards synchronized php-1.21wmf6/extensions/MobileFrontend/  'touch files'
[23:21:39] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.938 seconds
[23:21:41] <jeremyb>	 !log 2012-12-26 22:31:37 <+logmsgbot> !log maxsem synchronized wmf-config/InitialiseSettings.php  'Enable PageImages on test2'
[23:21:42] <gerrit-wm>	 New patchset: Jgreen; "fix check_command for fundraising tests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40759
[23:21:43] <jeremyb>	 !log 2012-12-26 22:51:58 <+logmsgbot> !log maxsem synchronized wmf-config/InitialiseSettings.php  'Enable PageImages everywhere'
[23:21:45] <morebots>	 Logged the message, Master
[23:21:53] <morebots>	 Logged the message, Master
[23:22:01] <morebots>	 Logged the message, Master
[23:22:09] <morebots>	 Logged the message, Master
[23:22:59] <TimStarling>	 binasher: did you change something to do with HTCP purging?
[23:23:18] <nagios-wm>	 PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[23:23:24] <LeslieCarr>	 TimStarling: yes he did , per bug 41130
[23:23:41] <binasher>	 TimStarling: i restarted varnishhtcpd on all eqiad upload varnish hosts, see email on ops@
[23:24:37] <TimStarling>	 were you going to add that to the server admin log?
[23:25:48] <binasher>	 oh, that
[23:26:22] <jeremyb>	 the most recent LocalisationUpdate failed so there's a decent chance tonight's will fail too. idk who may have a chance to look at it. Reedy ?
[23:26:40] * jeremyb  really runs away
[23:26:58] <gerrit-wm>	 Change abandoned: Jgreen; "checkout got mangled. starting over." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40754
[23:27:06] <binasher>	 !log restarted varnishhtcpd on all eqiad upload varnish hosts; looking back fondly on vacation
[23:27:07] <gerrit-wm>	 New patchset: Ori.livneh; "wmgUseEventLogging: default => true" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40761
[23:27:15] <morebots>	 Logged the message, Master
[23:28:07] <gerrit-wm>	 Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40759
[23:30:30] <nagios-wm>	 PROBLEM - SSH on ms1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:30:39] * Reedy  runs l10nupdate manually
[23:30:51] * Susan  smiles at binasher.
[23:30:55] <gerrit-wm>	 New patchset: Asher; "cache 4xx ttl in wikimedia.vcl.erb was being inadvertantly overridden, with all ttl's set to 30 days regardless of response code" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40762
[23:31:36] <MaxSem>	 Reedy, we've just scapped
[23:31:39] <gerrit-wm>	 Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40761
[23:31:47] <MaxSem>	 so no difference
[23:31:53] <Reedy>	 Well, yes
[23:32:17] <gerrit-wm>	 Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40762
[23:32:37] <Reedy>	 If the problem is in the lu update directory while updating the clones...
[23:34:40] <LeslieCarr>	 binasher: you're good :)
[23:35:46] <binasher>	 :)
[23:56:24] <Reedy>	 Well, that explains what is up with localisation update then...
[23:56:30] <Reedy>	 mw18: Permission denied (publickey,password).
[23:56:31] <Reedy>	 mw7: Permission denied (publickey,password).
[23:56:31] <Reedy>	 etc etc
[23:56:54] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds