[00:49:00] <wikibugs>	 6operations, 10MediaWiki-Special-pages: Query pages that use categorylinks tables have not been updated since nov 1 - https://phabricator.wikimedia.org/T119276#1823390 (10zhuyifei1999)
[00:49:02] <wikibugs>	 6operations, 10Wikimedia-General-or-Unknown, 7user-notice: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1823391 (10zhuyifei1999)
[01:11:39] <icinga-wm>	 PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: puppet fail
[01:38:08] <icinga-wm>	 RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[01:56:59] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1042 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:02:49] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1042 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:16:07] <logmsgbot>	 !log l10nupdate@tin Synchronized php-1.27.0-wmf.7/cache/l10n: l10nupdate for 1.27.0-wmf.7 (duration: 05m 22s)
[02:16:14] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:16:23] <refffg>	 hi
[02:47:08] <icinga-wm>	 PROBLEM - puppet last run on mw2064 is CRITICAL: CRITICAL: puppet fail
[03:05:40] <icinga-wm>	 PROBLEM - puppet last run on cp2009 is CRITICAL: CRITICAL: puppet fail
[03:15:19] <icinga-wm>	 RECOVERY - puppet last run on mw2064 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[03:32:08] <icinga-wm>	 RECOVERY - puppet last run on cp2009 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[03:35:28] <icinga-wm>	 PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:37:08] <icinga-wm>	 PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:46:58] <liangent>	 @seen jynus
[03:46:58] <wm-bot>	 liangent: Last time I saw jynus they were quitting the network with reason: Quit: Leaving N/A at 11/20/2015 10:29:14 AM (1d17h17m43s ago)
[04:01:48] <icinga-wm>	 RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[04:03:28] <icinga-wm>	 RECOVERY - puppet last run on mw1248 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[04:06:59] <icinga-wm>	 PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 12.00% of data above the critical threshold [100000000.0]
[04:31:59] <grrrit-wm>	 (03CR) 10Dereckson: Namespace configuration for ur.wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254665 (https://phabricator.wikimedia.org/T119308) (owner: 10Dereckson)
[04:42:49] <icinga-wm>	 RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[05:10:20] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1045 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:28:59] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1045 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[06:12:18] <icinga-wm>	 PROBLEM - MySQL Processlist on db1033 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 456 statistics
[06:14:08] <icinga-wm>	 RECOVERY - MySQL Processlist on db1033 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[06:30:49] <icinga-wm>	 PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:29] <icinga-wm>	 PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:39] <icinga-wm>	 PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:58] <icinga-wm>	 PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:58] <icinga-wm>	 PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:32:28] <icinga-wm>	 PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:30] <icinga-wm>	 PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:58] <icinga-wm>	 PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:29] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[06:55:28] <icinga-wm>	 RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:56:09] <icinga-wm>	 RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:56:19] <icinga-wm>	 RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:56:38] <icinga-wm>	 RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:56:59] <icinga-wm>	 RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:58:08] <icinga-wm>	 RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:29] <icinga-wm>	 RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:29] <icinga-wm>	 RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:01:09] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[07:06:49] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[07:10:38] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 5 below the confidence bounds
[08:05:08] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[08:18:09] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[08:31:40] <icinga-wm>	 PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail
[08:47:29] <wllm>	 Hey, are you people aware of the bad behavior in #wikipedia and #wikipedia-en?
[08:48:19] <wllm>	 The ops there just seem to be too in to the whole petty tyrannt thing.
[08:48:43] <wllm>	 Now, I know it’s a *bit* off topic here. . .
[08:48:54] <p858snake>	 #wikimedia-ops is where you want
[08:48:55] <wllm>	 But why shoud we put up with it?
[08:48:59] <wllm>	 Naw.
[08:49:00] <p858snake>	 completely off topic for here
[08:49:13] <wllm>	 They don’t want to hear it there.
[08:49:22] <wllm>	 So I’ll say it here.
[08:49:48] <wllm>	 Y’all want IRC to be effective for all Wikipedians, right?
[08:49:53] <wllm>	 So do I.
[08:49:54] <p858snake>	 you can always email otrs, but it is completely off topic for this channel
[08:50:01] <wllm>	 OK.
[08:50:03] <wllm>	 Ban me?
[08:50:36] <wllm>	 Cause that’s what the rest of them do.
[08:50:43] <p858snake>	 why would we ban you from this channel, you havn't acted inappropriately in this channel yet
[08:50:51] <wllm>	 Great.
[08:51:00] <wllm>	 Good to hear it.
[08:51:48] <wllm>	 I’m really sorry for crashing the party here. . .
[08:52:15] <wllm>	 But no other channel wants to discuss the bad behavior of ops on Wikipedia channels. . .
[08:52:29] <wllm>	 And I think it’s a very important topic.
[08:53:06] <wllm>	 So, why in the *world* do we put up with it?
[08:53:46] <p858snake>	 As i've pointed out, this is the wrong channel and completely off topic in here, email OTRS or the mailing lists 
[08:53:53] <wllm>	 Sure.
[08:54:08] <wllm>	 So, p858snake, what should be done about me?
[08:54:27] <p858snake>	 email OTRS or mailing lists.
[08:54:35] <wllm>	 Cause I’m not going to stop talking about this.
[08:54:41] <wllm>	 Nope.
[08:54:53] <wllm>	 I’ll talk about it here.
[08:55:15] <p858snake>	 continuing completely off topic discussions in this channel can result in being banned from this channel
[08:55:21] <wllm>	 Aha.
[08:55:36] <wllm>	 Go ahead, then. . .
[08:55:51] <wllm>	 Ban your problems away.
[08:56:00] <wllm>	 Everyone else does.
[08:56:09] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[08:57:22] <wllm>	 See, thing is, people don’t seem to like hearing criticism of Wikipedia or the WMF. . . 
[08:57:57] <wllm>	 Like how we stopped measuring the gender gap after all the criticism following the 2011 survey.
[08:57:59] <icinga-wm>	 RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:58:07] <wllm>	 Coincidence?
[08:58:54] <wllm>	 Hmmm. . . I could easily get distracted by this techie stuff.
[08:59:41] <wllm>	 Didn’t know puppet was used on the WMF servers.
[08:59:56] <wllm>	 Anyways, back to the important stuff. . .
[09:00:39] <wllm>	 Wikipedia IRC channels are a bad joke.
[09:01:17] <wllm>	 Hey, what up, Deez_Nuts?!?
[09:01:22] <wllm>	 Just kidding.
[09:01:38] <wllm>	 I don’t know anything about Deez_Nuts.
[09:01:59] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[09:02:30] <wllm>	 But Jamesofur I do know.
[09:03:57] <wllm>	 Ah. . . JD|cloud! What’s up? Another petty tyrannt. . . 
[09:04:29] <wllm>	 Nemo_bis: Long time.
[09:05:23] <wllm>	 SMalyshev. Yeah, y’all need to sit down and listen to this person.
[09:05:46] <wllm>	 Stas knows what he’s talking about.
[09:06:49] <wllm>	 He also has the courage to use his words.
[09:08:16] <wllm>	 The rest of you I don’t really know. You’ve neither tried to silence me or reach out to me on-wiki.
[09:12:11] <Nemo_bis>	 wllm: you must be confused, this channel has nothing to do with all of that.
[09:12:16] <Nemo_bis>	 Please move to #wikimedia-ops
[09:12:23] <wllm>	 Nope.
[09:12:26] <wllm>	 Banned there.
[09:12:33] <Nemo_bis>	 Then just go away. :)
[09:12:34] <wllm>	 For criticizing ops.
[09:12:37] <wllm>	 No.
[09:12:42] <Nemo_bis>	 We also have #wikimedia-overflow
[09:12:52] <wllm>	 Nope. Banned there.
[09:13:05] <wllm>	 I guess what I have to say just isn’t popular.
[09:13:21] <Nemo_bis>	 You can make your own channel
[09:13:26] <wllm>	 Nope.
[09:13:42] <Nemo_bis>	 Well anyway, /ignore wfm
[09:13:47] <wllm>	 I’d rather make my point to more Wikipedians.
[09:14:40] <wllm>	 We’ve talked before, haven’t we, Nemo_bis 
[09:14:42] <wllm>	 ?
[09:17:07] <wllm>	 The problem is. . .
[09:17:31] <wllm>	 When some of us put up with petty tyrannts, the rest of us have to deal with them.
[09:17:47] <wllm>	 And frankly, I’m tired of dealing with them.
[09:19:16] <wllm>	 Wikipedia IRC channels have to change.
[09:19:26] <wllm>	 Or they need to be cut off.
[09:20:42] <wllm>	 This is great. . .
[09:20:58] <wllm>	 I’m banned from channels I’ve never even joined!!!
[09:21:29] <wllm>	 Wow. This is just done, tho, isn’t it. . .
[09:22:07] <wllm>	 I’ll just keep coming back, insisting that these channels reflect the values of the Wikipedia community. . . 
[09:22:29] <wllm>	 And, frankly, there’s not a damn thing anyone can do about it.
[09:23:08] <wllm>	 And here we go. . .
[09:23:48] <wllm>	 Y’all know that I can get in to any public IRC channel, right?
[09:24:32] <wllm>	 I won’t stop embarrassing people who are behaving badly.
[09:39:40] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[09:50:08] <icinga-wm>	 PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 630
[09:55:08] <icinga-wm>	 RECOVERY - check_mysql on db1008 is OK: Uptime: 747042 Threads: 2 Questions: 12712773 Slow queries: 5013 Opens: 22815 Flush tables: 2 Open tables: 64 Queries per second avg: 17.017 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:55:55] <Nemo_bis>	 relrod: 
[09:56:02] <Nemo_bis>	 nevermind
[09:57:30] <YuviPanda>	 (db1008 is doing ok, nbd)
[10:08:33] <grrrit-wm>	 (03PS1) 10Nemo bis: [Planet Wikimedia] Add MisterSanderson to Portuguese planet [puppet] - 10https://gerrit.wikimedia.org/r/254676 
[10:09:19] <grrrit-wm>	 (03CR) 10Nemo bis: [C: 031] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/254653 (owner: 10Addshore)
[11:20:29] <icinga-wm>	 PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: puppet fail
[11:30:29] <icinga-wm>	 PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0]
[11:39:58] <icinga-wm>	 RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[11:46:58] <icinga-wm>	 RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[11:56:38] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[12:17:21] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[12:32:39] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[12:33:29] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[12:37:18] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 6 below the confidence bounds
[12:46:39] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[12:47:40] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[12:52:20] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[12:56:09] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[13:05:49] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[13:19:02] <wikibugs>	 7Puppet, 6Labs, 7Documentation: doc.wikimedia.org puppet documentation for labs/lvm/srv.html gives a 404 - https://phabricator.wikimedia.org/T119329#1823612 (10Reedy)
[13:19:16] <wikibugs>	 7Puppet, 6Labs, 7Documentation: doc.wikimedia.org puppet documentation for labs/lvm/srv.html gives a 404 - https://phabricator.wikimedia.org/T119329#1823613 (10Addshore) I guess this should instead be linked to https://doc.wikimedia.org/puppet/classes/role/labs/lvm/srv.html from wikitech
[13:22:11] <wikibugs>	 7Puppet, 6Labs, 10MediaWiki-extensions-OpenStackManager, 7Documentation: doc.wikimedia.org puppet documentation for labs/lvm/srv.html gives a 404 - https://phabricator.wikimedia.org/T119329#1823624 (10Reedy)
[14:03:49] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[14:18:49] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[14:45:58] <icinga-wm>	 RECOVERY - swift eqiad-prod object availability on graphite1001 is OK: OK: Less than 1.00% under the threshold [95.0]
[15:15:50] <icinga-wm>	 PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:17:10] <icinga-wm>	 PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:17:40] <icinga-wm>	 PROBLEM - puppet last run on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:17:58] <icinga-wm>	 PROBLEM - SSH on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:18:59] <icinga-wm>	 PROBLEM - Disk space on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:19:19] <icinga-wm>	 PROBLEM - RAID on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:23:09] <icinga-wm>	 PROBLEM - configured eth on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:23:20] <icinga-wm>	 PROBLEM - DPKG on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:23:28] <icinga-wm>	 PROBLEM - salt-minion processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:23:28] <icinga-wm>	 PROBLEM - nutcracker process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:23:48] <icinga-wm>	 PROBLEM - HHVM processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:24:49] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:24:50] <icinga-wm>	 PROBLEM - nutcracker port on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:24:58] <icinga-wm>	 PROBLEM - dhclient process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:42:30] <icinga-wm>	 RECOVERY - HHVM processes on mw1140 is OK: PROCS OK: 6 processes with command name hhvm
[15:43:30] <icinga-wm>	 RECOVERY - Disk space on mw1140 is OK: DISK OK
[15:43:48] <icinga-wm>	 RECOVERY - dhclient process on mw1140 is OK: PROCS OK: 0 processes with command name dhclient
[15:43:48] <icinga-wm>	 RECOVERY - nutcracker port on mw1140 is OK: TCP OK - 0.000 second response time on port 11212
[15:43:49] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1140 is OK: OK: nf_conntrack is 0 % full
[15:43:58] <icinga-wm>	 RECOVERY - configured eth on mw1140 is OK: OK - interfaces up
[15:44:08] <icinga-wm>	 RECOVERY - DPKG on mw1140 is OK: All packages OK
[15:44:10] <icinga-wm>	 RECOVERY - salt-minion processes on mw1140 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[15:44:11] <icinga-wm>	 RECOVERY - nutcracker process on mw1140 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[15:44:18] <icinga-wm>	 RECOVERY - SSH on mw1140 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[15:45:39] <icinga-wm>	 RECOVERY - RAID on mw1140 is OK: OK: no RAID installed
[15:45:58] <icinga-wm>	 RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[15:47:19] <icinga-wm>	 RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.154 second response time
[15:47:48] <icinga-wm>	 RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 67249 bytes in 1.006 second response time
[15:48:48] <icinga-wm>	 PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: puppet fail
[15:58:10] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[15:58:50] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[16:05:22] <Steinsplitter>	 Reedy: can you delete https://phabricator.wikimedia.org/T119301 ?
[16:05:42] <Reedy>	 We'd need to find out why swift is erroring
[16:06:14] <Steinsplitter>	 ok
[16:07:03] <Reedy>	 let me have a look if I can see it
[16:09:01] <Reedy>	 I can't see any mentions of that file in the swift logs
[16:11:05] <Reedy>	 Steinsplitter: Are you a commons admin?
[16:11:13] <Steinsplitter>	 yes
[16:11:57] <Reedy>	 Steinsplitter: Can you try to delete it now, gonna watch the logs
[16:12:34] <Steinsplitter>	 Error deleting file: Could not delete file "mwstore://local-swift-eqiad/local-public/3/3e/JGY5201.jpg".
[16:13:01] <Reedy>	 Curious, nothing in the logs on fluorine
[16:14:04] <Reedy>	 2015-11-22 16:12:28 mw1107 commonswiki FileOperation INFO: MoveFileOp failed (batch #755apujbg5adwyasa6fkdr4alcjg2cj): {"src":"mwstore://local-swift-eqiad/local-public/3/3e/JGY5201.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/9/h/8/9h8ilwnt0zdaql9q6x2lkjzv77m4ack.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}  
[16:15:05] <wikibugs>	 6operations, 6Commons, 10Wikimedia-Media-storage: Cannot delete file: File:JGY5201.jpg - https://phabricator.wikimedia.org/T119301#1823673 (10Reedy) Nothing in the SwiftBackend logs, or in the archived logs  There was something in the FileOperation logs  ``` 2015-11-22 16:12:28 mw1107 commonswiki FileOperati...
[16:15:19] <icinga-wm>	 RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:17:49] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[16:20:14] <Reedy>	 Steinsplitter: Has a file of the same name already been deleted?
[16:20:16] <Reedy>	 "dstExists":true
[16:20:41] <Reedy>	 Though, I don't know how swift chooses that destination name
[16:21:49] <Steinsplitter>	 No files deleted  having SHA1=51257469e3a08fe573755196f6027ca692a38f64.
[16:22:38] <Steinsplitter>	 meal now, will be back in some minutes :)
[16:48:57] <wllm>	 Hey all. . .
[16:49:45] <wllm>	 Not going away.
[16:50:07] <wllm>	 Barras2: What up, my petty tyrannt?!?
[16:50:12] <Krenair>	 !ops
[16:50:16] <wllm>	 Here to ban me?
[16:50:22] <wllm>	 !ops
[16:50:46] <wllm>	 You people have a real problem on your hands.
[16:50:52] <Krenair>	 You?
[16:51:00] <wllm>	 Something you can’t just ban away.
[16:51:06] <wllm>	 Yup. You got it.
[16:52:14] <Reedy>	 He seems to just use the same username, but a different hostmask
[16:54:49] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[16:57:42] <Glaisher>	 Maybe this channel should also opt-in to global bans?
[16:58:04] <Reedy>	 wouldn't be the worst idea
[16:58:36] <Barras2>	 Should be done by one of the ops in here... or one of the ops makes me an op and I take care of it ;)
[16:59:22] <RD>	 I'll ask jeff to add it when I see him next.
[16:59:40] <RD>	 Unless another op does it first but .... looking at the list I don't see that happening
[17:04:40] <Krenair>	 RD, I suppose Jeff is the only one you really have frequent contact with?
[17:04:59] <Krenair>	 I don't see him speaking here much
[17:05:10] <RD>	 yeah
[17:05:30] <RD>	 Anybody else would be like "and you are...?"
[17:05:37] <RD>	 lol
[17:05:38] <Barras2>	 If needed, I can do it... But there are so many people with flags here that I don't really see the need to act myself.
[17:06:01] <Krenair>	 I would be careful to filter out the inactive people among the existing ops, but yes.
[17:06:07] <RD>	 If you know anybody, Krenair, tell them to op up and enter this command  /mode +b $j:#wikimedia-bans
[17:06:12] <RD>	 But let us know if that happens
[17:07:24] <RD>	 Reedy can enter it in -dev, where he was just booted from too
[17:07:27] <RD>	 If he wants to!
[17:09:48] <icinga-wm>	 PROBLEM - puppet last run on mw1100 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:22:59] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [500.0]
[17:34:28] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:36:09] <icinga-wm>	 RECOVERY - puppet last run on mw1100 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:39:10] <icinga-wm>	 PROBLEM - puppet last run on mw2063 is CRITICAL: CRITICAL: puppet fail
[18:03:46] <ori>	 RD: cool, did not know about that
[18:04:44] <RD>	 Yeah, freenode introduced it about a year ago, ish.  Definitely helps with cross channel people.
[18:07:58] <icinga-wm>	 RECOVERY - puppet last run on mw2063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:20:29] <grrrit-wm>	 (03PS1) 10Brian Wolff: Increae $wgCopyUploadTimeout to 90 seconds (from default 25) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) 
[18:23:17] <grrrit-wm>	 (03PS2) 10Brian Wolff: Increase $wgCopyUploadTimeout to 90 seconds (from default 25) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) 
[18:23:43] <grrrit-wm>	 (03CR) 10Reedy: [C: 031] Increase $wgCopyUploadTimeout to 90 seconds (from default 25) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) (owner: 10Brian Wolff)
[18:36:12] <ori>	 bawolff: I'm really not familiar with the upload code, so forgive me if I ask stupid questions. T118887, you wrote: "upload by url is still going to suck for super large files, and that's inherent with the current design where we have to fetch the file synchronously with the current request"
[18:36:20] <ori>	 what's the deal with UploadFromUrlJob, then? is it not in use?
[18:36:35] <bawolff>	 Its not
[18:36:37] <bawolff>	 afaik
[18:36:52] <bawolff>	 I think someone half finished support for that, and it got like half reverted
[18:37:22] <bawolff>	 a long time ago
[18:37:47] <bawolff>	 See b20a0d33
[18:40:42] <ori>	 hold my beer, I'm going in
[18:41:01] <Reedy>	 It doesn't look very plubmed in
[18:41:06] <ori>	 so: b20a0d3327 is "Follow-up r81612, disable $wgAllowAsyncCopyUploads"
[18:41:06] <Reedy>	 ori: Bit early for beer isn't it? ;)
[18:41:29] <ori>	 then: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/81612#c14654
[18:42:17] <bawolff>	 upload code is scary
[18:42:58] <bawolff>	 hmm, i guess it was all reverted because people wanted to wait for echo
[18:43:14] <ori>	 which takes us to https://www.mediawiki.org/wiki/Special:Code/MediaWiki/66438
[18:43:16] <ori>	 yeah
[18:46:05] <wikibugs>	 6operations, 10Wikimedia-SVG-rendering, 7Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#1823746 (10Dvorapa) p:5Lowest>3High
[18:46:33] <wikibugs>	 6operations, 10Wikimedia-SVG-rendering, 7Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#461916 (10Dvorapa) p:5High>3Normal
[18:50:16] <wikibugs>	 6operations, 10Wikimedia-SVG-rendering, 7Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#1823750 (10Bawolff) >BTW not happy this issue is marked as a low priority :/  Don't get too excited. Priorities are meani...
[18:53:59] <ori>	 so if i have the story straight, what happened is:(
[18:54:06] <ori>	 - want async upload by url
[18:54:07] <ori>	 - need a way to notify user once done
[18:54:07] <ori>	 - add User::leaveUserMessage() (r66438)
[18:54:08] <ori>	 - add UploadFromUrlJob
[18:54:11] <ori>	 - Aryeh / ^demon / brion agree that User::leaveUserMessage() is badly designed, r66438 reverted. 
[18:54:16] <ori>	 - This leaves UploadFromUrlJob broken, so btongminh disables in r81615 
[18:54:24] <ori>	 - over the course of millenia, the winds of time blow the sand of obscurity over the code, until all traces of the once-flourishing UploadFromUrlJob civilization are buried
[18:58:40] <Dereckson>	 nice to know such job exists
[18:59:15] <Dereckson>	 what about replace User::leaveUserMessage by an Echo notification?
[19:00:08] <Dereckson>	 18:42:57 < bawolff> hmm, i guess it was all reverted because people wanted to wait for echo
[19:00:15] <Dereckson>	 That answers my question.
[19:00:18] <ori>	 seems optimal, i don't know Echo well enough to know if there is a clean way to do that from Core
[19:00:47] <akoopal>	 hi, I just asked a question in tech about a display problem with an image on nlwiki, anybody from operators who can have a look?
[19:01:01] <Dereckson>	 an hook uploadJobDone? Or something more generic asyncJobDone?
[19:01:09] <bawolff>	 I wouldn't exactly trust the remaning async upload by url code. Its probably super bit-rotted at this point
[19:01:15] <ori>	 Dereckson: this should be done, and it needs a task. Any chance you could file one? I think bawolff's change is OK to merge provided there is a comment pointing to a task
[19:01:44] <bawolff>	 and the other async upload code that was merged around that time ended up being rather buggy
[19:01:53] * bawolff will file a task
[19:03:28] <ori>	 thanks, i really appreciate it. i'd file one myself but if i don't take my 4-year-old out now he may beat me to death
[19:04:18] <wikibugs>	 6operations, 10Wikimedia-SVG-rendering, 7Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#1823768 (10Dvorapa) >>! In T44090#1823750, @Bawolff wrote: >>BTW not happy this issue is marked as a low priority :/ >  >...
[19:04:30] <WilS>	 Why did we stop measuring the gender gap?
[19:04:41] <WilS>	 What do we have to hide?
[19:05:18] <Krenair>	 What does this have to do with server operations?
[19:05:22] <WilS>	 RD: Ban me again?
[19:05:25] <WilS>	 Everything.
[19:05:46] <WilS>	 We don't have many women using our servers to edit.
[19:06:06] <WilS>	 Are we a bunch of cowards?
[19:06:43] <Romaine>	 are you just annoyed???
[19:06:48] <WilS>	 No.
[19:06:51] <WilS>	 I'm concerned.
[19:07:06] <WilS>	 You're annoyed. ;)
[19:07:17] <WilS>	 Actually, I don't know if you are. . .
[19:07:18] <Dereckson>	 WilS: you could take the issue on more general channels or on meta, #wikimedia-operations is not to make editorial or inclusive policies, but to maintain servers
[19:07:21] <WilS>	 But some are.
[19:07:24] <WilS>	 Nope.
[19:07:33] <WilS>	 Been banned for saying the same thing.
[19:07:41] <WilS>	 So, I'll say it where I can.
[19:07:46] <bawolff>	 Hmm, so this is what its like to hang out in the popular channels ;)
[19:07:56] <WilS>	 And, if it's not within the WP community. . .
[19:08:08] <WilS>	 Then it will have to be with journalists.
[19:08:47] <grrrit-wm>	 (03PS3) 10Brian Wolff: Increase $wgCopyUploadTimeout to 90 seconds (from default 25) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) 
[19:08:56] <WilS>	 Fortunately, my personal circumstances make it easy for me to get attention.
[19:08:58] <Romaine>	 I can imagine why people do not listen to you, people do not like to listen to people who talk about threats to go to ... whatever
[19:09:05] <WilS>	 Nope.
[19:09:11] <WilS>	 No threat.
[19:09:14] <WilS>	 That's a fact.
[19:09:29] <WilS>	 We're going to face this one.
[19:09:39] <Katie>	 Nobody fixed the ban list in here yet?
[19:09:55] <WilS>	 We're not going to just stop measuring it as a way to "fix" the problem.
[19:10:03] <Katie>	 Oh, wmfgc can do it.
[19:10:17] <WilS>	 2011 was the last thorough demographic editor survey. . .
[19:10:27] <WilS>	 A lot of criticism in the press. . .
[19:10:38] <WilS>	 And we just stopped surveying. . .
[19:11:08] <WilS>	 Coincidence?
[19:11:28] <akoopal>	 WilS: again, the ops just execute, they don't decide, you are in the wrong channel
[19:11:37] <WilS>	 Sure. . .
[19:11:44] <WilS>	 But that's the problem. . .
[19:11:50] <RD>	 File a bug.
[19:11:51] <akoopal>	 and irritating people really doesn't help your case
[19:12:00] <WilS>	 I've been shut out of every other channel. . .
[19:12:08] <WilS>	 Yeah.
[19:12:12] <Romaine>	 Akoopal asked a question a few minutes ago, we have a problem with this image: https://nl.wikipedia.org/wiki/Gebruiker:Akoopal/Kladblok
[19:12:16] <akoopal>	 if you behave everywhere the same
[19:12:25] <ori>	 Romaine: what's the problem?
[19:12:29] <WilS>	 We're "irritated" by the gender gap?
[19:12:31] <Romaine>	 it seems to be uploaded correctly, but only shown with thumb in wrong dimensions
[19:12:41] <Katie>	 Hmmm.
[19:12:43] <akoopal>	 ori: the imaage is horizontally cropped
[19:13:00] <akoopal>	 if you do 'view image' it shows correctly
[19:13:11] <akoopal>	 ehh, stretched, not cropped
[19:15:05] <WilS>	 In 2013 Sue Gardner said this: "I didn't solve it. We didn't solve it. The Wikimedia Foundation didn't solve it. The solution won't come from the Wikimedia Foundation."
[19:15:17] <WilS>	 Referring to the gender gap.
[19:15:37] <WilS>	 Of course, 2013 was the year we stopped measuring it.
[19:15:41] <Romaine>	 but I have seen this kind of images before
[19:15:47] <WilS>	 So, yeah. . .
[19:15:54] <WilS>	 Did we just give up in 2013?
[19:15:57] <Romaine>	 in previous years we had more of such images that were stretched
[19:16:19] <Romaine>	 it only happens with thumb it seems,
[19:16:34] <WilS>	 Are we a bunch of hypocrites?
[19:16:39] <akoopal>	 I tried (preview only) also with 180px
[19:16:52] <WilS>	 Saying that we value NPOV, while 9 out of 10 of us are men?
[19:17:17] <Romaine>	 we tried to purge the cache but did not work it seems
[19:17:40] <WilS>	 And since this is a logged channel that any journalist can refer to later. . .
[19:18:10] <WilS>	 Is stretching images more important than the fact that there are almost no women editors on Wikipedia?
[19:18:11] <Katie>	 No feeding the trolls, please.
[19:18:18] <WilS>	 Sure.
[19:18:26] <Katie>	 We'll find an op to fix the ban list in the next day or two.
[19:18:35] <WilS>	 And shut me up?
[19:18:42] <akoopal>	 ori: seen what I mean
[19:19:11] <akoopal>	 WilS: this channels is about technical issues, you are highly off topic
[19:19:22] <WilS>	 It's all logged, folks. . .
[19:19:22] <bawolff>	 Romaine: Stretched images can happen if cache doesn't get cleared properly
[19:19:54] <Romaine>	 I suspected something like that yes, If I recall correctly it was earlier also cache issue
[19:19:56] <WilS>	 And Katie/MZMcBride. . .
[19:20:06] <WilS>	 You've tried to shut me up before.
[19:20:07] <bawolff>	 Romaine: e.g. Someone uploads an image with different dimensions, but the page its used in cache is old, so the <img> tag has wrong width or height, making it look stretched
[19:20:16] <WilS>	 And you attacked me on wikimedia-l.
[19:20:36] <Romaine>	 is there a way how we can fix the image shown?
[19:20:37] <WilS>	 That really wasn't very bright of you.
[19:21:22] <Reedy>	 c: Can you apply the global bans?
[19:21:28] <Reedy>	 uh, banlist
[19:21:39] <bawolff>	 Romaine: Now a days, we have better support for clearing cache, so its not supposed to happen anymore. (Although right now there's reports of lots of cache clearing bugs at commons, so it might be related to that)
[19:21:59] <Reedy>	 Thankyou
[19:22:02] <Romaine>	 in the past persiods I had no issues with it indeed
[19:22:04] <c>	 Reedy: why you no haz ops here?
[19:22:09] <Katie>	 I'm not sure I see the crop issue.
[19:22:30] <Romaine>	 so it is just a matter of waiting bawolff? or can we do something?
[19:22:34] <akoopal>	 Katie: not crop, stretching
[19:22:35] <Katie>	 I tried [[File:Police - rue de la République Saint-Denis - 18 nov 2015.jpg|500px|text]]
[19:22:39] <Reedy>	 c: heh, infinite quiestion :)
[19:23:03] <Romaine>	 Katie: it is with thumb
[19:23:09] <akoopal>	 Katie: try 180px or thumb
[19:23:48] <bawolff>	 Its only esams
[19:24:01] <bawolff>	 The image from esams has a different hash
[19:24:32] <Katie>	 Hmm, looks the same to me at 'thumb' and 220px...
[19:24:47] <Katie>	 Oh, I doubt I'm hitting esams. :-)
[19:24:57] <bawolff>	 I'm looking at 200px thumb from esams
[19:25:19] <akoopal>	 bawolff: ahh, just that I remember correctly, that is the europe-cluster?
[19:25:32] <bawolff>	 akoopal: Yep, that's europe
[19:26:05] <akoopal>	 do you need a phabricator-ticket?
[19:26:06] <bawolff>	 basically you're experiancing https://phabricator.wikimedia.org/T119038
[19:26:28] <bawolff>	 * I mean I'm looking at 220px, not 200px
[19:27:14] <bawolff>	 oh wait, they're all broken
[19:27:15] <Katie>	 Heh.
[19:27:25] <akoopal>	 hmmm
[19:27:37] <bawolff>	 you should be getting a file with hash 08d600f6f8ff992e3e05a00fadad89c6
[19:28:01] <bawolff>	 esams gets 86f0799f3c6bae1a895e6eca88af7257, eqiad, codfw, ulsfo get 8afb5a1c2d7e8dd120453141eca5b22c
[19:28:16] <bawolff>	 anyhow, cache clearing on images is really broken right now
[19:28:45] <Romaine>	 I assume this is something to be fixed  by the operators and nothing to do for the local users?
[19:28:52] <Katie>	 I find Labs so aggravating.
[19:29:05] <Katie>	 Every time I want to do something like share a screenshot, the Web process is nowhere to be found.
[19:29:09] <bawolff>	 yeah, pretty much. You can work around it by trying to use an odd width (so like 221px)
[19:29:40] <akoopal>	 this one has been worked around by using the cropped version
[19:29:43] <bawolff>	 Romaine: Wait, it might work now. I purged the image a bunch of times, I think one got through
[19:29:52] <akoopal>	 that doesn't have the problem
[19:30:04] <bawolff>	 try doing a hard refresh of the page now
[19:30:13] <Katie>	 https://tools.wmflabs.org/mzmcbride/files/thumbs-on-nl-wiki-2015-11-22.png is what I see.
[19:30:37] <Romaine>	 bawolff: seems solved now for this image
[19:30:46] <akoopal>	 yep, thanks
[19:31:15] <Romaine>	 thanks for the help
[19:31:48] <Katie>	 bawolff: Purged how?
[19:32:01] <akoopal>	 and as the main issue is tracked by a ticket, I am sure it will be investigated daytime
[19:32:04] <Romaine>	 probably file page on Commons
[19:32:24] <Romaine>	 that was namely done in previous occasions, but seem not to work this time for me
[19:33:34] <bawolff>	 Katie: First you go to something like https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/Police_-_rue_de_la_R%C3%A9publique_Saint-Denis_-_18_nov_2015.jpg/220px-Police_-_rue_de_la_R%C3%A9publique_Saint-Denis_-_18_nov_2015.jpg?RANDOMNUMBERHERE to ensure that there is a thumbnail in swift layer, and then you do ?action=purge on the file description page
[19:33:46] <Katie>	 ?action=purge works on thumb URLs?
[19:33:53] <Katie>	 I've only ever used it on the file description page.
[19:34:00] <Katie>	 Oh, that's what you said.
[19:34:20] <Katie>	 So you're just cache-busting.
[19:35:10] <wikibugs>	 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, 10Traffic: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1823813 (10Bawolff) So I was just helping someone with a thumb caching issue on irc -   https://upload.wikimedia.org/wikipedia/...
[19:36:14] <bawolff>	 Katie: yeah, ?action=purge on a file description page only purges thumbnails that have entries in swift. Sometimes you'll see cases where there is an entry in varnish, but not swift, in which case bypassing varnish forces there to be an entry in swift, so the later purge actually works
[19:36:48] <bawolff>	 Katie: btw, if your curious in testing these things, you can test getting thumbs from esams using a command like wget -S 'https://upload-lb.esams.wikimedia.org/wikipedia/commons/thumb/4/47/Police_-_rue_de_la_R%C3%A9publique_Saint-Denis_-_18_nov_2015.jpg/220px-Police_-_rue_de_la_R%C3%A9publique_Saint-Denis_-_18_nov_2015.jpg' --header 'host: upload.wikimedia.org' --no-check-certificate
[19:37:44] <Katie>	 Tricky, tricky.
[19:39:38] <icinga-wm>	 PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: puppet fail
[19:44:16] <akoopal>	 bawolff: thanks for the help before I forget :-)
[19:44:23] <bawolff>	 no problem
[19:44:31] <YuviPanda>	 Reedy / ori I think I gave you both op
[19:44:34] <YuviPanda>	 can you check?
[19:44:44] <RD>	 [14:44:41]	ChanServ	20 ori +Aiortv [modified 1m 16s ago]
[19:44:45] <RD>	 [14:44:41]	ChanServ	21 Reedy +Aiortv [modified 24s ago]
[19:44:58] <RD>	  /cs access #wikimedia-operations list
[19:45:12] <Reedy>	 Cheers YuviPanda
[19:45:17] <YuviPanda>	 :D but I guess I want them to op themselves and check?
[19:45:33] <Reedy>	 :D
[19:45:36] <RD>	 lol
[19:45:46] <akoopal>	 grin :-)
[19:45:46] <YuviPanda>	 so that seems to work!
[19:45:47] <RD>	 It works.
[19:45:49] <RD>	 :p
[19:53:55] <bawolff>	 ori: Do we still use cdb files, or are we trying to get away from them (asking in relation to I26a9e8f2)?
[19:54:26] <Reedy>	 He's AFK
[20:02:39] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0]
[20:08:18] <icinga-wm>	 PROBLEM - HHVM rendering on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:09:00] <icinga-wm>	 PROBLEM - Apache HTTP on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:09:49] <icinga-wm>	 PROBLEM - RAID on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:09:58] <icinga-wm>	 PROBLEM - configured eth on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:09:59] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:10:29] <icinga-wm>	 PROBLEM - SSH on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:10:29] <icinga-wm>	 PROBLEM - salt-minion processes on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:10:39] <icinga-wm>	 PROBLEM - HHVM processes on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:10:50] <icinga-wm>	 PROBLEM - nutcracker port on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:10:50] <icinga-wm>	 PROBLEM - dhclient process on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:10:59] <icinga-wm>	 PROBLEM - nutcracker process on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:11:09] <icinga-wm>	 PROBLEM - DPKG on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:11:09] <icinga-wm>	 PROBLEM - Disk space on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:11:09] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[20:11:59] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [5000000.0]
[20:12:33] <greg-g>	 YuviPanda: ^
[20:12:34] <greg-g>	 ?
[20:12:47] <YuviPanda>	 augh
[20:12:54] * YuviPanda does a quick restart before running away
[20:13:17] <YuviPanda>	 ah
[20:13:19] <YuviPanda>	 host is down
[20:13:32] <YuviPanda>	 greg-g: it's ok then, just hardware issues
[20:13:34] <greg-g>	 right, but then the http error rate and kafka? unrelated?
[20:13:38] <YuviPanda>	 we can afford to lose one and not have a problem
[20:13:45] <YuviPanda>	 the kafka one is definitely unrelated
[20:13:49] <YuviPanda>	 I'm looking at the http one now
[20:14:00] <YuviPanda>	 our http errors have been so low that super tiny spikes are triggering anomaly
[20:14:24] <YuviPanda>	 (https://grafana.wikimedia.org/dashboard/db/varnish-http-errors is the place to look at)
[20:15:18] <YuviPanda>	 looks fine to me
[20:15:40] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[20:16:12] <YuviPanda>	 I'm going to leave mw1123 unack'd since I'm not going to work on it atm
[20:16:48] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[20:16:54] <YuviPanda>	 do page me (emails will reach me too) if there are more cascading failures
[20:17:17] <greg-g>	 YuviPanda: looks good, go afk
[20:17:18] <Reedy>	 I'll file a task for mw1123
[20:17:57] <Reedy>	 Hmm, it responds to ping
[20:19:08] <wikibugs>	 6operations: mw1123 unresponsive - https://phabricator.wikimedia.org/T119339#1823832 (10Reedy) 3NEW
[20:19:11] <icinga-wm>	 RECOVERY - RAID on mw1123 is OK: OK: no RAID installed
[20:19:11] <icinga-wm>	 RECOVERY - configured eth on mw1123 is OK: OK - interfaces up
[20:19:11] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1123 is OK: OK: nf_conntrack is 0 % full
[20:19:16] <Reedy>	 Oh..
[20:19:39] <icinga-wm>	 RECOVERY - SSH on mw1123 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[20:19:39] <icinga-wm>	 RECOVERY - salt-minion processes on mw1123 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:19:49] <icinga-wm>	 RECOVERY - HHVM processes on mw1123 is OK: PROCS OK: 6 processes with command name hhvm
[20:19:54] <YuviPanda>	 want me to take it out of rotation if it is flappy?
[20:20:08] <icinga-wm>	 RECOVERY - nutcracker port on mw1123 is OK: TCP OK - 0.000 second response time on port 11212
[20:20:08] <icinga-wm>	 RECOVERY - dhclient process on mw1123 is OK: PROCS OK: 0 processes with command name dhclient
[20:20:08] <icinga-wm>	 RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.456 second response time
[20:20:11] * YuviPanda considers
[20:20:18] <icinga-wm>	 RECOVERY - nutcracker process on mw1123 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[20:20:19] <icinga-wm>	 RECOVERY - Disk space on mw1123 is OK: DISK OK
[20:20:20] <icinga-wm>	 RECOVERY - DPKG on mw1123 is OK: All packages OK
[20:20:30] <Reedy>	 It's the first time it's gone down it seems...
[20:20:41] <Reedy>	 load average: 37.38, 60.69, 46.79
[20:20:44] <Reedy>	 I guess it was busy
[20:20:57] <YuviPanda>	 ah
[20:21:00] <YuviPanda>	 yeah
[20:21:06] <YuviPanda>	 hhm was killed for going OOM
[20:21:08] <icinga-wm>	 RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[20:21:10] <icinga-wm>	 RECOVERY - HHVM rendering on mw1123 is OK: HTTP OK: HTTP/1.1 200 OK - 67262 bytes in 0.536 second response time
[20:21:10] <Reedy>	 seems to be hanging again
[20:21:39] <YuviPanda>	 I'm ssh'd in, seems ok
[20:21:44] <YuviPanda>	 hhvm using up 500% of CPU
[20:21:53] <Reedy>	 14594 www-data  20   0 14.625g 1.583g  31508 S  2233 13.5   6:26.50 hhvm                                                                                                          
[20:21:55] <Reedy>	 2233%
[20:21:58] <Reedy>	 :D
[20:22:09] <Reedy>	 I was ssh'd in, then it dropped, but my tin ssh didn't
[20:22:19] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[20:22:25] <YuviPanda>	 I just looked at 1124
[20:22:28] <YuviPanda>	 and it's at 600%
[20:22:50] <icinga-wm>	 PROBLEM - puppet last run on mw2098 is CRITICAL: CRITICAL: puppet fail
[20:26:13] <Krinkle>	 https://grafana.wikimedia.org/dashboard/db/server-board?var-server=mw1123&from=now-1h
[20:27:21] <wikibugs>	 6operations: mw1123 unresponsive - https://phabricator.wikimedia.org/T119339#1823861 (10Reedy) 5Open>3stalled ``` [20:20:29] <Reedy> It's the first time it's gone down it seems... [20:20:41] <Reedy> load average: 37.38, 60.69, 46.79 [20:20:44] <Reedy> I guess it was busy [20:20:57] <YuviPanda> ah [20:21:00]...
[20:37:38] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[20:51:09] <icinga-wm>	 RECOVERY - puppet last run on mw2098 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:08:21] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Increase $wgCopyUploadTimeout to 90 seconds (from default 25) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) (owner: 10Brian Wolff)
[21:08:45] <bawolff>	 huh, that was a little delayed...
[21:13:35] <bawolff>	 Is it a known issue that integration.wikimedia.org seems to be 503ing ( https://integration.wikimedia.org/ci/job/operations-mw-config-phpunit/2783/console )
[21:14:02] <Reedy>	 jenkins died?
[21:14:08] <Reedy>	 I think Paladox was complaining earlier
[21:14:12] <Reedy>	 I can reboot it
[21:15:05] <bawolff>	 It took 2 hours for jenkins-bot to post on https://gerrit.wikimedia.org/r/#/c/254700/, which seems abnormal
[21:15:33] * Reedy looks for the docs
[21:16:12] * Krinkle lets Reedy figure it out
[21:16:29] <Krinkle>	 It's good to have someone else be familiar with it as well :)
[21:16:43] <Reedy>	 sudo reboot
[21:16:46] <Reedy>	 RESOLVED FIXED
[21:17:13] <Reedy>	 https://integration.wikimedia.org/ci/safeRestart doesn't want to load
[21:17:26] <Krinkle>	 ssh gallium
[21:17:39] <Reedy>	 yeah
[21:17:47] <Reedy>	 https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Restart
[21:17:59] <Krinkle>	 https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Restart
[21:18:01] <Krinkle>	 Righit
[21:18:15] * Reedy waits for ssh to log me in
[21:19:26] <Reedy>	 Sams-MacBook-Pro:~ reedy$ ssh gallium.eqiad.wmnet
[21:19:26] <Reedy>	 channel 0: open failed: administratively prohibited: open failed
[21:19:26] <Reedy>	 stdio forwarding failed
[21:19:26] <Reedy>	 ssh_exchange_identification: Connection closed by remote host
[21:19:38] <Reedy>	 Sigh
[21:19:50] <Krinkle>	 wfm
[21:20:25] <Krinkle>	 I'm sshing into gallium.wikimedia.org though
[21:20:32] <Krinkle>	 via hooft
[21:20:39] <Krinkle>	 proxyCommand matching *.wikimedia.org
[21:21:06] <Krinkle>	 https://github.com/Krinkle/dotfiles/blob/master/hosts/KrinkleMac/templates/sshconfig#L20-L32
[21:21:14] <Reedy>	 yup, that works
[21:21:26] <Krinkle>	 (slightly dated from mine, but accurate for this purpose)
[21:21:41] <Reedy>	 I suspect the issue was me using the wmnet hostname
[21:23:38] <Reedy>	 and my ssh conn died
[21:25:54] <Reedy>	 ffs
[21:26:03] <Reedy>	 I killed jenkins and it died again
[21:28:28] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[21:28:49] <Reedy>	 reedy@gallium:~$ sudo -u jenkins /etc/init.d/jenkins start
[21:28:49] <Reedy>	 The jenkins init script can only be run as root
[21:29:40] <Reedy>	 !bug 1
[21:29:40] <wm-bot>	 https://bugzilla.wikimedia.org/show_bug.cgi?id=1
[21:30:35] <Reedy>	 !log restarting stuck Jenkins
[21:30:38] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:31:18] <Reedy>	 It's off again
[21:32:17] <Krinkle>	 OK now?
[21:32:33] <grrrit-wm>	 (03CR) 10Reedy: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) (owner: 10Brian Wolff)
[21:33:07] <wikibugs>	 6operations, 6Labs, 10Tool-Labs, 7Database: Replication broken om labsdb1002. - https://phabricator.wikimedia.org/T119315#1823353 (10Multichill)
[21:33:25] <Reedy>	 Looks to be
[21:35:24] <legoktm>	 [13:21:41] <Reedy> I suspect the issue was me using the wmnet hostname <-- yeah, you have to use zuul.eqiad.wmnet for that
[21:36:41] <Reedy>	 bleugh
[21:37:32] <wikibugs>	 6operations, 6Labs, 10Tool-Labs, 7Database: Replication broken on labsdb1002. - https://phabricator.wikimedia.org/T119315#1823944 (10Giftpflanze)
[21:38:16] <Luke081515>	 Can someone look at T119315?
[21:38:50] <legoktm>	 https://wikitech.wikimedia.org/w/index.php?title=Gallium&type=revision&diff=207383&oldid=143020
[21:39:29] <Reedy>	 Luke081515: There's no one about
[21:39:38] <Reedy>	 It'll be dealt with tomorrow
[21:40:07] <Luke081515>	 hm, ok
[22:47:02] <wikibugs>	 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1824064 (10JanZerebecki) Maybe we should decouple expiry from connecting. For example we could implement checking expiry by a Jenkins job that uses the fact that we have all certificates...
[23:00:03] <ori>	 YuviPanda: works, thanks
[23:07:27] <Reedy>	 ori: You're supposed to kick YuviPanda to test it
[23:12:02] <grrrit-wm>	 (03PS4) 10Ori.livneh: Increase $wgCopyUploadTimeout to 90 seconds (from default 25) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) (owner: 10Brian Wolff)
[23:12:53] <grrrit-wm>	 (03CR) 10Ori.livneh: "(PS4: add links to T119336 and T118887 in comment.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) (owner: 10Brian Wolff)