[00:19:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:23:26] Thehelpfulone: http://wikitech.wikimedia.org/view/Password_reset [00:24:33] thanks Tim [00:27:29] oh god that looks nasty [00:29:19] I guess that In your browser, go to Special:ResetPassword on the user's main wiki. [00:29:19] requires them to be part of the global sysadmin group? [00:30:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.669 seconds [00:32:02] Thehelpfulone: errr, isn't that something anyone can do? [00:32:20] jeremyb, well I just tried it, I can't enter someone else's username apparently [00:32:50] anyone can do that [00:33:10] normally anons do it, it wouldn't really make sense to restrict it, would it? [00:33:27] oh I see, it's Special:PasswordReset [00:33:45] Special:ResetPassword redirects to Special:ChangePassword [00:33:57] fixed the doc [00:34:10] hmm, and I thought I had copy/pasted that [00:34:26] guess not [00:34:37] * jeremyb just figured that out at the same time [00:34:45] why aren't those synonymous?!?!?? [00:35:29] well, usually you just follow the link [00:36:31] hmm, If you are certain of your e-mail, but not your username, only enter your e-mail. -> so if you have two accounts with the same email (my account and my bot account), will it send two emails? (I can't test it now because I just did one) [00:40:19] !bug 34386 | Thehelpfulone [00:40:19] Thehelpfulone: https://bugzilla.wikimedia.org/34386 [00:40:44] in particular i guess comment 17 [00:41:09] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [00:41:46] thanks, heh Umherirrender was persistent [00:42:21] next week, next week, next week! [00:42:32] Susan: ^ :) [00:42:58] Heh. [00:43:04] Being persistent is often helpful. 
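
The exchange above turns on which special page actually does the reset: Special:PasswordReset is the form anyone (including anons) can use, while Special:ResetPassword turned out to be a redirect to Special:ChangePassword, which is why the wikitech doc needed fixing. A quick, hedged way to check where a special-page shortcut actually lands before linking it from documentation is to follow the redirect and print the final URL; this is only a sketch, the example wiki URL is arbitrary, and redirect behaviour can differ between wikis and MediaWiki versions.

    import requests

    # Print where each special-page title ends up, so docs like the
    # Password_reset page above link the right one. The base URL is just an
    # example; behaviour may vary by wiki and MediaWiki version.
    base = 'https://en.wikipedia.org/wiki/'
    for title in ('Special:ResetPassword', 'Special:PasswordReset'):
        r = requests.get(base + title, allow_redirects=True, timeout=10)
        print('%s -> %s' % (title, r.url))
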
[00:55:06] PROBLEM - Puppet freshness on solr2 is CRITICAL: Puppet has not run in the last 10 hours [00:57:03] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [01:05:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:07:06] PROBLEM - Puppet freshness on solr1003 is CRITICAL: Puppet has not run in the last 10 hours [01:07:06] PROBLEM - Puppet freshness on solr3 is CRITICAL: Puppet has not run in the last 10 hours [01:08:09] PROBLEM - Puppet freshness on solr1001 is CRITICAL: Puppet has not run in the last 10 hours [01:13:42] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 294 seconds [01:15:30] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 11 seconds [01:18:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [01:33:03] PROBLEM - Puppet freshness on tin is CRITICAL: Puppet has not run in the last 10 hours [01:35:18] PROBLEM - Host lvs1001 is DOWN: PING CRITICAL - Packet loss = 100% [01:36:12] RECOVERY - Host lvs1001 is UP: PING WARNING - Packet loss = 28%, RTA = 41.58 ms [01:37:15] PROBLEM - Host wiktionary-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:37:24] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:37:24] PROBLEM - LVS HTTP IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:37:24] PROBLEM - LVS HTTP IPv4 on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:37:42] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:37:51] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:37:52] PROBLEM - LVS HTTP IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:37:52] PROBLEM - LVS HTTP IPv4 on wikibooks-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:37:52] PROBLEM - LVS HTTPS IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:37:52] PROBLEM - LVS HTTP IPv4 on mediawiki-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:37:52] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:38:00] PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:38:00] PROBLEM - LVS HTTPS IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:38:09] PROBLEM - LVS HTTP IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:38:09] PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:38:26] ok, who unplugged the servers [01:38:27] PROBLEM - LVS HTTP IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:38:27] PROBLEM - LVS HTTP IPv6 on wikibooks-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:38:36] PROBLEM - LVS HTTPS IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [01:39:03] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 66649 bytes in 0.145 seconds [01:39:03] RECOVERY - LVS HTTP IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 66270 bytes in 0.136 seconds [01:39:04] RECOVERY - LVS HTTP IPv4 on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 93182 bytes in 0.165 seconds [01:39:22] RECOVERY - LVS HTTPS IPv6 on wiktionary-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 66270 bytes in 0.196 seconds [01:39:30] RECOVERY - LVS HTTP IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 92797 bytes in 0.136 seconds [01:39:30] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 92797 bytes in 0.192 seconds [01:39:31] RECOVERY - LVS HTTP IPv4 on wikibooks-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 66649 bytes in 0.136 seconds [01:39:32] RECOVERY - LVS HTTP IPv4 on mediawiki-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 66651 bytes in 0.134 seconds [01:39:32] RECOVERY - LVS HTTPS IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 66270 bytes in 0.196 seconds [01:39:32] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 92797 bytes in 0.222 seconds [01:39:39] RECOVERY - LVS HTTPS IPv6 on mediawiki-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 66270 bytes in 0.193 seconds [01:39:39] RECOVERY - LVS HTTPS IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 66270 bytes in 0.193 seconds [01:39:48] RECOVERY - LVS HTTP IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 66272 bytes in 0.137 seconds [01:39:48] RECOVERY - LVS HTTPS IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 66272 bytes in 0.193 seconds [01:40:06] RECOVERY - LVS HTTP IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 66270 bytes in 0.138 seconds [01:40:06] RECOVERY - LVS HTTP IPv6 on wikibooks-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 66270 bytes in 0.138 seconds [01:40:06] RECOVERY - LVS HTTPS IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 66270 bytes in 0.193 seconds [01:40:15] RECOVERY - Host wiktionary-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 27.03 ms [01:41:58] lesliecarr - recovered? 
[01:42:04] i think so [01:42:39] someone needs to feed the hamsters [01:45:14] PROBLEM - Host wikimedia-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:45:58] doh, i fed the hamster poison instead of hamster food [01:46:04] i knew i shouldn't have put them next to each other [01:46:05] ;) [01:46:25] RECOVERY - Host wikimedia-lb.pmtpa.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [01:52:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:54:30] RECOVERY - Memcached on mc1011 is OK: TCP OK - 0.027 second response time on port 11211 [01:54:30] RECOVERY - Memcached on mc1003 is OK: TCP OK - 0.028 second response time on port 11211 [01:54:39] RECOVERY - Memcached on mc1005 is OK: TCP OK - 0.028 second response time on port 11211 [01:54:48] RECOVERY - Memcached on mc1004 is OK: TCP OK - 0.030 second response time on port 11211 [01:54:48] RECOVERY - Memcached on mc1007 is OK: TCP OK - 0.028 second response time on port 11211 [01:54:57] RECOVERY - Memcached on mc1009 is OK: TCP OK - 0.027 second response time on port 11211 [01:55:06] RECOVERY - Memcached on mc1015 is OK: TCP OK - 0.027 second response time on port 11211 [01:55:06] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours [01:55:15] RECOVERY - Memcached on mc1013 is OK: TCP OK - 0.027 second response time on port 11211 [01:55:15] RECOVERY - Memcached on mc1016 is OK: TCP OK - 0.030 second response time on port 11211 [01:55:33] RECOVERY - Memcached on mc1014 is OK: TCP OK - 0.027 second response time on port 11211 [01:55:33] RECOVERY - Memcached on mc1010 is OK: TCP OK - 0.027 second response time on port 11211 [01:55:51] RECOVERY - Memcached on mc1006 is OK: TCP OK - 0.027 second response time on port 11211 [01:55:51] RECOVERY - Memcached on mc1008 is OK: TCP OK - 0.027 second response time on port 11211 [01:58:35] wee, that's enough work for now. unless the pager tells me otherwise. [02:01:06] PROBLEM - Puppet freshness on sockpuppet is CRITICAL: Puppet has not run in the last 10 hours [02:06:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [02:19:55] did you !log the poison? 
[02:26:12] !log LocalisationUpdate completed (1.21wmf6) at Mon Dec 31 02:26:11 UTC 2012 [02:26:22] Logged the message, Master [02:50:26] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [02:53:35] RECOVERY - Puppet freshness on neon is OK: puppet ran at Mon Dec 31 02:53:03 UTC 2012 [02:54:20] RECOVERY - Puppet freshness on sockpuppet is OK: puppet ran at Mon Dec 31 02:54:07 UTC 2012 [02:59:26] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [03:06:47] RECOVERY - Puppet freshness on tin is OK: puppet ran at Mon Dec 31 03:06:37 UTC 2012 [03:36:20] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [03:46:05] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [03:46:25] Ryan_Lane: virt0 memcache still bouncing ^ [04:02:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:06:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.582 seconds [04:07:05] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.002 second response time on port 11000 [04:22:23] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [04:40:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:54:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.522 seconds [05:02:44] PROBLEM - Memcached on mc1012 is CRITICAL: Connection refused [05:08:26] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [05:08:26] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [05:29:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:40:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.140 seconds [06:15:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:26:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.858 seconds [06:57:14] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [07:01:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:05:20] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [07:05:20] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [07:05:20] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [07:05:20] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [07:13:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.042 seconds [07:47:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:55:17] PROBLEM - Puppet freshness on mw55 is CRITICAL: Puppet has not run in the last 10 hours [08:00:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [08:28:17] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [08:28:17] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [08:33:41] PROBLEM - 
Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:39:14] PROBLEM - Puppet freshness on cp1028 is CRITICAL: Puppet has not run in the last 10 hours [08:44:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.454 seconds [08:51:22] PROBLEM - Apache HTTP on mw35 is CRITICAL: Connection refused [09:17:55] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.064 second response time [09:19:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:29:10] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 183 seconds [09:30:31] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 186 seconds [09:32:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.500 seconds [09:33:04] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 192 seconds [09:34:16] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 233 seconds [09:34:25] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 183 seconds [09:36:04] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 3 seconds [09:37:43] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [09:38:28] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [09:39:22] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [09:49:16] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [09:49:16] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [09:49:16] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [09:49:16] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [09:49:16] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [09:49:16] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [10:05:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:18:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.764 seconds [10:48:57] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 188 seconds [10:48:57] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 189 seconds [10:53:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:56:09] PROBLEM - Puppet freshness on solr2 is CRITICAL: Puppet has not run in the last 10 hours [10:58:06] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [11:03:03] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [11:03:12] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [11:03:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.834 seconds [11:08:09] PROBLEM - Puppet freshness on solr3 is CRITICAL: Puppet has not run in the last 10 hours [11:08:09] PROBLEM - Puppet freshness on solr1003 is CRITICAL: Puppet has not run in the last 10 hours [11:09:03] PROBLEM - Puppet freshness on solr1001 is CRITICAL: Puppet has not 
run in the last 10 hours [11:39:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:52:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.132 seconds [11:56:53] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours [12:27:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:37:50] PROBLEM - Puppet freshness on sq81 is CRITICAL: Puppet has not run in the last 10 hours [12:39:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.601 seconds [12:51:17] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [12:54:16] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [13:00:16] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [13:14:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:26:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [13:37:19] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [13:37:47] New patchset: Faidon; "partman: new generation ceph-ssd recipe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41574 [13:38:43] New review: Faidon; "Painfully iterated and tested." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/41574 [13:38:44] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41574 [13:43:55] RECOVERY - Host ms-be1004 is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms [13:47:40] PROBLEM - SSH on ms-be1004 is CRITICAL: Connection refused [13:52:55] RECOVERY - SSH on ms-be1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [13:58:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:09:34] PROBLEM - NTP on ms-be1004 is CRITICAL: NTP CRITICAL: No response from NTP server [14:09:52] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:12:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.081 seconds [14:15:43] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [14:19:55] PROBLEM - SSH on ms-be1003 is CRITICAL: Connection refused [14:23:13] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [14:23:40] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:25:19] RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:25:28] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [14:29:31] PROBLEM - SSH on ms-be1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:31:10] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:46:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:58:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.278 seconds [15:09:16] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [15:09:16] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [15:18:15] New 
patchset: Alex Monk; "(bug 43517) Change testwiki permissions" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41579 [15:32:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:43:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.788 seconds [16:18:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:10] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41579 [16:27:06] !log demon synchronized wmf-config/InitialiseSettings.php 'Deploying I9be1e7ac, testwiki permission changes' [16:27:18] Logged the message, Master [16:31:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [16:45:10] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [16:52:04] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 189 seconds [16:52:58] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 181 seconds [16:54:46] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [16:55:31] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [16:58:22] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [17:04:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:19] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [17:06:19] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [17:06:19] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [17:06:19] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [17:17:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.125 seconds [17:36:19] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.002 second response time on port 11000 [17:51:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:16] PROBLEM - Puppet freshness on mw55 is CRITICAL: Puppet has not run in the last 10 hours [18:01:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.023 seconds [18:29:30] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [18:29:30] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [18:37:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:36] PROBLEM - Puppet freshness on cp1028 is CRITICAL: Puppet has not run in the last 10 hours [18:48:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.744 seconds [19:11:39] mo [19:24:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:18] RECOVERY - Memcached on mc1012 is OK: TCP OK - 0.027 second response time on port 11211 [19:38:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.402 seconds [19:40:42] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41194 [19:50:30] PROBLEM - Puppet freshness on db1047 is CRITICAL: 
Puppet has not run in the last 10 hours [19:50:30] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [19:50:31] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [19:50:31] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [19:50:31] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [19:50:31] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [19:51:40] New patchset: Asher; "missing class parameter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41593 [19:51:58] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41593 [19:57:35] New patchset: Asher; "fixing incorrect syntax in merged change Id4362fdc, jenkins failed to -1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41594 [19:58:05] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41594 [20:03:10] New patchset: Asher; "dbtree: cache ganglia xml data every minute" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41596 [20:04:34] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41596 [20:11:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:14] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40554 [20:22:26] binasher: :D :D :D :D [20:22:34] thank you! [20:22:46] do you know when it'll go live? [20:24:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.020 seconds [20:27:41] ori-l: i was just looking at the puppet manifest, and unfortunately every frontend varnish instance will need a full manual restart. the service only subscribes to the package version, not the /etc/default file. i think i'll wait til 1/2 to do that, after puppet has updated the file everywhere [20:28:00] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 206 seconds [20:28:09] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 211 seconds [20:28:30] binasher: sounds good to me [20:29:21] if i don't hear from you, is it ok if i check in with you on 1/3? [20:30:17] ori-l: please do, i might need the reminder :) [20:30:44] will do. thanks again for your help with this. [20:32:57] i did update and restart cp1044 just to make sure it's ok, and it is [20:34:27] sweet [20:34:54] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [20:35:21] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [20:57:33] PROBLEM - Puppet freshness on solr2 is CRITICAL: Puppet has not run in the last 10 hours [20:57:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:58:42] notpeter: were the Lucene restarts on December 13 on behalf of Patrick? 
http://wikitech.wikimedia.org/view/Server_admin_log#December_13 [20:59:30] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [21:00:28] * robla is trying to figure out when the stuff that Patrick did on Tim's behalf went out [21:02:33] so....I'm guessing LeslieCarr isn't the on call person today, since she's taking the day off [21:02:56] the Search boxes might need to be kicked [21:04:10] https://bugzilla.wikimedia.org/show_bug.cgi?id=42423  Wikimedia wiki search is broken (outputting inconsistent results) [21:05:08] also, http://wikitech.wikimedia.org/view/Server_admin_log from yesterday: "09:05 Nemo_bis: Search reported broken with no results at all returned on en.wikt, (en|ru).source etc. "Lucene on search14 is CRITICAL" since 3h ago." [21:06:19] btw MaxSem noted that search14 is the index for en.wiki so the two things are supposedly unrelated [21:06:24] iirc [21:09:33] PROBLEM - Puppet freshness on solr3 is CRITICAL: Puppet has not run in the last 10 hours [21:09:33] PROBLEM - Puppet freshness on solr1003 is CRITICAL: Puppet has not run in the last 10 hours [21:10:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds [21:10:36] PROBLEM - Puppet freshness on solr1001 is CRITICAL: Puppet has not run in the last 10 hours [21:10:52] search14 is fine now [21:11:07] and commons fulltext search is returning results quickly [21:11:27] it probably sucks during index sync [21:15:59] thanks binasher. I'm having a hard time reproducing any problems, actually. I was responding to a ping from andre__ on the subject [21:16:57] yeah, we had the problem of search not working reliably mentioned a few weeks ago in bug 42423, and it came up again yesterday [21:17:23] robla: that bugzilla ticket should probably be closed unless it's worth having a ticket to report new search problems as they happen. search was broken over the thanksgiving holiday, which was when mzmcbride opened it [21:17:38] Which bug? [21:17:49] https://bugzilla.wikimedia.org/show_bug.cgi?id=42423 [21:18:00] sumanah could reproduce it yesterday and somebody could reproduce it two hours ago [21:18:12] though recent issues were about search on mediawiki.org, old issue was about commons [21:18:44] comment 17 might be worth acting on from the mediawiki side [21:18:45] wfm [21:18:46] Just out of curiosity, do search results go in the HTML Squid cache? Does anyone know? [21:19:02] They probably wouldn't, right... [21:19:32] I've become more wary of getting up-to-date data from en.wikipedia.org, heh. [21:19:41] binasher: interesting point. someone want to file comment 17 as a separate bug? [21:19:42] binasher: hence not sure why to close the bugzilla ticket if it could still be reproduced a few hours ago, but maybe I missed something in the backlog here [21:20:13] Susan: Special:Search does not get squid cached [21:20:23] Okay, right. [21:20:25] I think I knew that. [21:20:31] <^demon> Special pages aren't squid cached generally, iirc. [21:20:38] Most Special pages are booted from the cache, except for the ones that use their own cache system. [21:20:44] it's possible that api.php?opensearch queries do get api cached [21:20:45] querycache [21:21:08] andre__: if search is broken *right now*, then that's a relatively urgent thing that might mean someone needs to kick a box somewhere [21:21:09] I'm not able to reproduce that bug any longer. Though the larger issue seemed to be search support.
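
Given the "no results at all returned" reports above, a minimal external check along the lines suggested later in this discussion would query the search API for a term that should always match many pages and treat timeouts, HTTP errors, and implausibly low hit counts as failures. This is only a sketch using Nagios-style exit codes; the target wiki, search term, and threshold are placeholders, not values taken from this log.

    import sys
    import requests

    API = 'https://en.wikipedia.org/w/api.php'  # placeholder target wiki
    TERM = 'wikipedia'                          # a term expected to match many pages
    MIN_HITS = 1000                             # arbitrary "reasonably large" threshold

    try:
        r = requests.get(API, params={
            'action': 'query', 'list': 'search',
            'srsearch': TERM, 'format': 'json',
        }, timeout=10)
        r.raise_for_status()
        hits = r.json()['query']['searchinfo']['totalhits']
    except Exception as err:
        # Timeouts, HTTP errors, and malformed responses indicate a broken
        # backend rather than a genuinely empty result set.
        print('CRITICAL - search query failed: %s' % err)
        sys.exit(2)

    if hits < MIN_HITS:
        print('CRITICAL - only %d hits for %r' % (hits, TERM))
        sys.exit(2)
    print('OK - %d hits for %r' % (hits, TERM))
    sys.exit(0)
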
[21:21:10] but that's mostly just relevant to type-ahead suggestions which are based only on page title match [21:21:12] Has a Lucene person been found? [21:21:42] I saw an RFP or something. [21:21:48] andre__: if, on the other hand, it's general malaise about search, that's something we have two job postings for, as well as something Tim is more active on now [21:21:48] we've interviewed a few candidates but not yet [21:21:52] Lame. [21:22:01] Anyway, that seems like the real resolution to that bug. ;-) [21:22:13] definitely [21:22:15] ah, nice [21:22:27] rainman-sr served us well. But all good things pass. [21:23:08] so....I guess I'll file comment 17 as a bug [21:23:50] robla: i think that should go against the MWSearch extension [21:24:59] You can probably mark it "easy". [21:25:33] Though I'm not sure how easy it actually is to distinguish 0 results due to 0 results or 0 results due to a broken search box. [21:27:47] there should either be a 10s timeout on the lucene query getting hit, or a non-ok return code [21:28:15] Susan: no...."easy" is for stuff that is easy for someone new to the codebase [21:28:30] robla: I'm familiar with the keyword. [21:28:49] I guess anything search-related isn't easy these days, though. [21:28:56] Susan: then why do you keep marking/suggesting non-easy bugs as "easy"? :-P [21:29:05] Heh. [21:29:18] Someone suggested I was doing it as a taunt. [21:29:27] robla: i think everyone actively taking eng tickets is new to MWSearch :) [21:29:39] That it would agitate a developer to fixing the bug sooner, because it was marked easy. [21:29:57] Though really I do it because I think the bug is easy, and bugs marked as easy seem to get attention sooner. [21:30:13] but this should be an easy task for getting somewhat acquainted to it [21:33:23] binasher: do you know if it's possible to tell outside of Lucene whether zero results is due to Lucene indexing failure or due to there actually being zero results? [21:43:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:22] New review: Ori.livneh; "Because the current EventLogging architecture makes it easy to subscribe to the event stream and stu..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/41206 [21:50:15] could i bug someone to merge? https://gerrit.wikimedia.org/r/#/c/41206/ & https://gerrit.wikimedia.org/r/#/c/41204/ (both are one- or two-line changes) [21:56:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.522 seconds [21:58:16] robla: a lot of the erroneous 0 result responses are coming from query timeouts, that case should be straightforward to catch [21:58:36] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours [21:58:52] yup, makes sense [22:00:04] I've asked Tim to look at it first. https://bugzilla.wikimedia.org/show_bug.cgi?id=43544 [22:00:39] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41204 [22:01:15] it may also be helpful to have some application-level monitoring if we don't have any yet (e.g.
actually query Lucene for something that should return >x results, where x is a reasonably large number) [22:02:11] New review: Asher; "http://mcfunley.com/why-mongodb-never-worked-out-at-etsy" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/41206 [22:02:13] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41206 [22:02:42] binasher: thanks :) [22:03:00] i forwarded that blog post to the e3 list! [22:03:18] oh! hehe [22:03:22] because people had been bugging me about various forms of NewHotnessDB [22:03:44] i also enjoyed http://lucumr.pocoo.org/2012/12/29/sql-is-agile/ [22:04:04] from mitsuhiko / armin ronacher (the guy who wrote flask, jinja2, etc.) [22:06:06] binasher: lots of 'Error connecting to 10.0.6.73: User 'wikiadmin' has exceeded the 'max_user_connections' resource (current value: 80)' [22:06:16] there are 16 boxes with 12 processes each right? [22:06:29] yep [22:06:44] so that limit is kind of low [22:06:52] yep [22:07:23] i don't want 16*20 job runners working on enwiki at once [22:08:12] but a way to manage max concurrent jobs per wiki that doesn't involve db limits would be welcome [22:08:48] * 16*12 [22:20:29] ori-l: i think projects either map best to a column oriented db (aka analytics putting structured log data in cassandra if they deploy it) or they should use mysql, if for no other reasons than what's argued in those two blog posts. for the first case, we should probably select and standardize on one db [22:22:34] binasher: yes, i'm strongly committed to mysql; the change set is mostly to satisfy my personal curiosity [22:23:19] New patchset: Dereckson; "(bug 40879) Enable Collection on ba.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41809 [22:23:43] there's a slight impedance mismatch when converting from JSON event data to SQL, which is really a variant of the standard object-relational impedance mismatch [22:23:52] i.e., how do you model nested objects, things like that [22:24:24] but because we use JSON schema to specify constraints on the structure of events, it's been possible for us to identify a subset that translates cleanly [22:28:09] and with that in place, it's kind of a no-brainer: putting the data in mysql means being able to easily join it with production data, not having to enforce constraints by hand, plus the ability for analysts to use their SQL knowledge for working with the data [22:28:57] at some point, there will probably be a mongodb vs. cassandra throwdown [22:29:28] can i watch? [22:31:22] i think people often don't realize that a lot of NoSQL simply means offloading work from computers onto people (relational query planner -> writing mapreduce jobs, guaranteed constraints vs. having to write a ton of error-handling code to deal w/inconsistencies, etc.)
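
To make the JSON-to-SQL point above concrete, here is a toy sketch of flattening a nested event into a single flat row, which works when the schema constrains every leaf to a scalar value. The event shape and the underscore column-naming convention are invented for illustration; this is not EventLogging's actual mapping.

    import sqlite3

    def flatten(obj, prefix=''):
        """Collapse nested dicts into one dict of scalar-valued columns."""
        row = {}
        for key, value in obj.items():
            name = prefix + key
            if isinstance(value, dict):
                row.update(flatten(value, name + '_'))
            else:
                row[name] = value
        return row

    # A made-up event; real schemas would also dictate column types.
    event = {'wiki': 'enwiki', 'action': 'edit',
             'user': {'id': 42, 'anon': False}}
    row = flatten(event)  # {'wiki': ..., 'action': ..., 'user_id': 42, 'user_anon': False}

    db = sqlite3.connect(':memory:')
    cols = ', '.join(row)
    db.execute('CREATE TABLE event (%s)' % cols)
    db.execute('INSERT INTO event (%s) VALUES (%s)' % (cols, ', '.join('?' * len(row))),
               list(row.values()))
    print(db.execute('SELECT * FROM event').fetchone())
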
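
On the earlier point about capping concurrent jobs per wiki without leaning on max_user_connections: one commonly used alternative is an atomic counter in memcached, roughly as sketched below. The key scheme, limit, and memcached address are made up for the example, and counters like this can drift if a runner dies without releasing its slot, which is one reason this is only a sketch.

    import memcache

    MAX_PER_WIKI = 80  # illustrative ceiling, not a production value
    mc = memcache.Client(['127.0.0.1:11211'])

    def try_acquire(wiki):
        key = 'jobrunner:running:%s' % wiki   # hypothetical key scheme
        mc.add(key, 0)                        # create the counter if missing
        current = mc.incr(key)                # atomically claim a slot
        if current is None or current > MAX_PER_WIKI:
            if current is not None:
                mc.decr(key)                  # over the cap: give the slot back
            return False
        return True

    def release(wiki):
        mc.decr('jobrunner:running:%s' % wiki)

    if try_acquire('enwiki'):
        try:
            pass  # run one job here
        finally:
            release('enwiki')
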
[22:31:25] anyways, [22:31:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:33:06] it may be a rant but it's true [22:35:23] AaronSchulz: i unfortunately realized all this after spending two months writing all data into redis and then showing it to analysts who were a) uniformly very impressed, and b) never went on to do anything with it [22:36:01] i looked around for a document store with a strongly relational model but it seems like the current crop is strongly focussed on making CAP tradeoffs to scale [22:38:45] it's a rant that more people need to hear :) [22:39:33] PROBLEM - Puppet freshness on sq81 is CRITICAL: Puppet has not run in the last 10 hours [22:39:52] binasher: http://browsertoolkit.com/fault-tolerance.png [22:40:36] love that! [22:42:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.602 seconds [22:47:57] i don't want 16*20 job runners working on enwiki at once # I'm not sure I understand why. That's too many? [22:52:36] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [22:55:14] Susan: some job types are brokenly resource intense and having a ton of them running at once was impacting the site. i.e. jobs calling BacklinkCache::getLinks running select with order by on 10+ mil row sets with no limit. the limit on concurrent job execution is a symptom of jobs behaving badly. [22:55:36] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [22:55:56] Hmm, right. [22:56:24] Is there better job queue monitoring these days? [22:56:35] I remember the biggest issue used to be even figuring out what the oldest job was. [22:56:48] Or even getting an accurate count of the jobs in the queue... [22:57:08] New patchset: Dereckson; "(bug 43532) Enable PostEdit on ml.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41812 [22:57:31] 7. [22:59:02] https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=jsonfm [22:59:22] Seems it's around 200K right now. [22:59:39] I was close. [23:00:25] it's better than it was, but not good enough to point out jobs behaving badly [23:01:31] https://bugzilla.wikimedia.org/show_bug.cgi?id=9518 [23:01:36] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [23:01:39] the need to limit concurrency is unfortunate, especially since the change to use refreshLinks instead of just refreshLinks2 greatly increased the number of jobs that get created [23:01:49] "Woefully" is a Rob Church word to use, heh. [23:02:57] we don't use the data behind Special:Statistics internally [23:03:11] It was removed from Special:Statistics. [23:03:16] What's used internally? [23:03:43] select count(*) from job :) [23:03:48] Heh. [23:06:03] siteinfo statistics are from innodb table stats which are just rough inaccurate estimates, pulled from a random slave [23:06:12] Right. [23:07:11] So I guess there's a job_id column that be used to approximately job age. Still no timestamp field, though? [23:07:20] Hm. [23:07:43] the json should probably include "this is all a wild guess": 1 [23:07:48] Susan: That sentence makes no sense. [23:07:56] Though I did include most of the needed words. [23:08:04] Heh. [23:08:29] I'm not sure whether the value still makes sense in the API. [23:08:39] The use-cases are still unclear. Much as they were in the GUI. 
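
The siteinfo URL pasted above is enough for a crude backlog check; as noted in the discussion, the number comes from InnoDB table stats on a random slave, so it is only a rough signal. A minimal sketch, with an arbitrary threshold standing in for "jobs behaving badly":

    import requests

    r = requests.get('https://en.wikipedia.org/w/api.php', params={
        'action': 'query', 'meta': 'siteinfo',
        'siprop': 'statistics', 'format': 'json',
    }, timeout=10)
    jobs = r.json()['query']['statistics']['jobs']
    print('estimated job backlog: %d' % jobs)
    if jobs > 500000:  # arbitrary threshold for an unusually large queue
        print('WARNING - job queue looks unusually large')
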
[23:09:29] mysql:wikiadmin@db63 [enwiki]> select min(job_timestamp) from job where job_attempts < 3; [23:09:30] +--------------------+ [23:09:31] | min(job_timestamp) | [23:09:34] +--------------------+ [23:09:34] | 20121213101221 | [23:09:35] +--------------------+ [23:09:36] 1 row in set (0.12 sec) [23:09:43] Hmmm. [23:10:04] https://www.mediawiki.org/wiki/Manual:Job_table So it is. [23:10:08] I should learn to read. [23:10:30] the very old stuff currently in enwiki.job are webVideoTranscode jobs [23:10:39] AaronSchulz: is webVideoTranscode broken? [23:11:14] 20121213 is much less happy than 20121231. [23:12:00] binasher: what does job_attempts say? [23:13:13] AaronSchulz: 1 [23:13:19] mysql:wikiadmin@db63 [enwiki]> select min(job_timestamp) from job where job_attempts = 0 and job_cmd != "webVideoTranscode"; [23:13:19] +--------------------+ [23:13:21] | min(job_timestamp) | [23:13:22] +--------------------+ [23:13:23] | 20121231221020 | [23:13:24] +--------------------+ [23:13:46] the oldest job that hasn't been attempted yet is about an hour old [23:14:33] New patchset: Dereckson; "(bug 42721) tr.wikisource has been renamed into Vikikaynak" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41813 [23:14:35] I hadn't realized job_timestamp and job_attempts had been added. So there's good progress being made. That's nice. :-) [23:15:26] AaronSchulz: is JobQueueDB::claimRandom generally grabbing the last or the first now, or is it fairly random? i forget where we left that [23:15:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:17:56] https://bugzilla.wikimedia.org/show_bug.cgi?id=42614 ? [23:18:31] binasher: it's last and first [23:19:52] https://gerrit.wikimedia.org/r/#/c/38466/4 is still pending, though Tim and I had doubts about that and the current state [23:20:19] AaronSchulz: i liked your idea of just pulling the first with a random offset.. if number of jobs mapped to a memcached counter, it could be ok [23:22:05] AaronSchulz: i disagree with tim's last comment on the change [23:22:59] yeah, it doesn't need an order, mysql will pick an index [23:23:05] in this case the cmd_token_id one [23:23:30] though I wonder what other rdbms will do, since it is not well defined in the strict sense [23:29:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [23:38:30] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [23:44:17] robla: yep, those were to push out a new version of lucene [23:44:31] robla: well, the restarts were because they were acting funky [23:45:41] * robla smiles at the thought of a search engine acting funky :-) [23:45:57] they were in dire need of a shower, I tell you [23:46:28] 24 results, fo' sho' [23:46:34] hahaha [23:48:36] thanks for the update. I think, after asking you about that, we established that things are (as of this instant) in ok shape, but something we could use a little more monitoring of
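
For reference, a toy version of the "first row at a random offset" claim strategy discussed above, run against an in-memory SQLite table so it is self-contained. The schema is trimmed to the columns that matter here and this is not the actual JobQueueDB::claimRandom code; in production the unclaimed count would come from something cheap such as a memcached counter rather than a COUNT(*) over a large table.

    import random
    import sqlite3

    db = sqlite3.connect(':memory:')
    db.execute('CREATE TABLE job (job_id INTEGER PRIMARY KEY, job_cmd TEXT, '
               'job_token TEXT, job_attempts INTEGER DEFAULT 0)')
    db.executemany("INSERT INTO job (job_cmd, job_token) VALUES (?, '')",
                   [('refreshLinks',)] * 50)

    def claim_random(db, cmd, token):
        # Count unclaimed jobs; in production this would be an approximate,
        # cheaply maintained figure (e.g. a memcached counter).
        (count,) = db.execute(
            "SELECT COUNT(*) FROM job WHERE job_cmd = ? AND job_token = ''",
            (cmd,)).fetchone()
        if not count:
            return None
        offset = random.randrange(count)
        row = db.execute(
            "SELECT job_id FROM job WHERE job_cmd = ? AND job_token = '' "
            "LIMIT 1 OFFSET ?", (cmd, offset)).fetchone()
        if row is None:
            return None
        # Claim it; the job_token = '' guard protects against a racing runner.
        cur = db.execute(
            "UPDATE job SET job_token = ?, job_attempts = job_attempts + 1 "
            "WHERE job_id = ? AND job_token = ''", (token, row[0]))
        return row[0] if cur.rowcount else None

    print(claim_random(db, 'refreshLinks', 'runner-1'))
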