[00:09:18] <icinga-wm>	 PROBLEM - puppet last run on cp4003 is CRITICAL puppet fail
[00:26:57] <icinga-wm>	 RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[00:33:04] <grrrit-wm>	 (03PS1) 10Yuvipanda: [WIP]mesos: Add simple mesos module [puppet] - 10https://gerrit.wikimedia.org/r/208483 
[00:54:17] <grrrit-wm>	 (03PS2) 10Yuvipanda: [WIP]mesos: Add simple mesos module [puppet] - 10https://gerrit.wikimedia.org/r/208483 
[00:54:55] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] [WIP]mesos: Add simple mesos module [puppet] - 10https://gerrit.wikimedia.org/r/208483 (owner: 10Yuvipanda)
[00:55:46] <grrrit-wm>	 (03Abandoned) 10Yuvipanda: [WMF-Patch] Get rid of autoinstall functionality [puppet/mesos] - 10https://gerrit.wikimedia.org/r/208478 (owner: 10Yuvipanda)
[00:55:50] <grrrit-wm>	 (03Abandoned) 10Yuvipanda: [WIP]mesos: Add simple mesos module [puppet] - 10https://gerrit.wikimedia.org/r/208483 (owner: 10Yuvipanda)
[00:56:06] <grrrit-wm>	 (03Restored) 10Yuvipanda: [WIP]mesos: Add simple mesos module [puppet] - 10https://gerrit.wikimedia.org/r/208483 (owner: 10Yuvipanda)
[00:56:18] <grrrit-wm>	 (03Abandoned) 10Yuvipanda: Add .gitreview file [puppet/mesos] - 10https://gerrit.wikimedia.org/r/208477 (owner: 10Yuvipanda)
[00:56:29] <grrrit-wm>	 (03Abandoned) 10Yuvipanda: Make stdlib a submodule [puppet] - 10https://gerrit.wikimedia.org/r/208476 (owner: 10Yuvipanda)
[00:56:34] <grrrit-wm>	 (03Abandoned) 10Yuvipanda: zookeeper: Support installing on debian [puppet] - 10https://gerrit.wikimedia.org/r/208475 (owner: 10Yuvipanda)
[00:56:49] <grrrit-wm>	 (03Abandoned) 10Yuvipanda: mesos: import module + add simple role [puppet] - 10https://gerrit.wikimedia.org/r/208472 (owner: 10Yuvipanda)
[02:22:25] <logmsgbot>	 !log l10nupdate Synchronized php-1.26wmf3/cache/l10n: (no message) (duration: 08m 11s)
[02:22:53] <morebots>	 Logged the message, Master
[02:27:06] <logmsgbot>	 !log LocalisationUpdate completed (1.26wmf3) at 2015-05-03 02:26:02+00:00
[02:27:13] <morebots>	 Logged the message, Master
[02:44:19] <logmsgbot>	 !log l10nupdate Synchronized php-1.26wmf4/cache/l10n: (no message) (duration: 07m 11s)
[02:44:32] <morebots>	 Logged the message, Master
[02:45:39] <grrrit-wm>	 (03PS1) 10Springle: repool db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208490 
[02:46:07] <grrrit-wm>	 (03CR) 10Springle: [C: 032] repool db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208490 (owner: 10Springle)
[02:46:12] <grrrit-wm>	 (03Merged) 10jenkins-bot: repool db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208490 (owner: 10Springle)
[02:47:38] <logmsgbot>	 !log springle Synchronized wmf-config/db-eqiad.php: repool db1068, warm up (duration: 00m 15s)
[02:47:46] <morebots>	 Logged the message, Master
[02:48:33] <logmsgbot>	 !log LocalisationUpdate completed (1.26wmf4) at 2015-05-03 02:47:30+00:00
[02:48:42] <morebots>	 Logged the message, Master
[02:58:49] <wikibugs>	 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums, 10incident-20150410-flowdataloss, and 2 others: Better backup coverage for X1 database cluster - https://phabricator.wikimedia.org/T95835#1255067 (10Mattflaschen) Thanks, we appreciate this.
[02:59:32] <grrrit-wm>	 (03CR) 1020after4: [C: 031] "This looks cool, but I need to test more before +2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208263 (owner: 10Ori.livneh)
[03:00:17] <wikibugs>	 6operations, 6CA-team, 6Commons, 6Reading-Infrastructure-Team: db1068 (s4/commonswiki slave) is missing data about at least 6 users - https://phabricator.wikimedia.org/T91920#1255068 (10Springle) 5Open>3Resolved db1068 is recloned and repooled.  The exact cause is still unknown. Krenair's SAL link sugg...
[03:05:21] <grrrit-wm>	 (03PS1) 10Springle: deploy db2048, db2049, db2050, db2051, db2052, db2053, db2054 [puppet] - 10https://gerrit.wikimedia.org/r/208491 
[03:05:48] <icinga-wm>	 PROBLEM - puppet last run on cp3045 is CRITICAL puppet fail
[03:06:13] <grrrit-wm>	 (03CR) 10Springle: [C: 032] deploy db2048, db2049, db2050, db2051, db2052, db2053, db2054 [puppet] - 10https://gerrit.wikimedia.org/r/208491 (owner: 10Springle)
[03:08:17] <icinga-wm>	 PROBLEM - puppet last run on db2067 is CRITICAL puppet fail
[03:12:38] <wikibugs>	 10Ops-Access-Requests, 6operations: Add Tilman to "researchers" group on stat1003 - https://phabricator.wikimedia.org/T97916#1255074 (10Tbayer) 3NEW
[03:23:39] <icinga-wm>	 RECOVERY - puppet last run on db2067 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures
[03:23:49] <icinga-wm>	 RECOVERY - puppet last run on cp3045 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[03:34:48] <icinga-wm>	 PROBLEM - puppet last run on analytics1037 is CRITICAL Puppet has 2 failures
[03:34:58] <icinga-wm>	 PROBLEM - puppet last run on mw1188 is CRITICAL Puppet has 1 failures
[03:35:08] <icinga-wm>	 PROBLEM - puppet last run on mw1087 is CRITICAL Puppet has 1 failures
[03:35:08] <icinga-wm>	 PROBLEM - puppet last run on db2069 is CRITICAL Puppet has 1 failures
[03:35:18] <icinga-wm>	 PROBLEM - puppet last run on acamar is CRITICAL Puppet has 1 failures
[03:35:58] <icinga-wm>	 PROBLEM - puppet last run on mw1053 is CRITICAL Puppet has 2 failures
[03:50:59] <icinga-wm>	 RECOVERY - puppet last run on analytics1037 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[03:51:00] <icinga-wm>	 RECOVERY - puppet last run on mw1188 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures
[03:51:18] <icinga-wm>	 RECOVERY - puppet last run on mw1087 is OK Puppet is currently enabled, last run 1 second ago with 0 failures
[03:51:29] <icinga-wm>	 RECOVERY - puppet last run on acamar is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[03:52:08] <icinga-wm>	 RECOVERY - puppet last run on mw1053 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures
[03:52:49] <icinga-wm>	 RECOVERY - puppet last run on db2069 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:00:17] <grrrit-wm>	 (03PS1) 10Springle: depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208493 
[04:00:45] <grrrit-wm>	 (03CR) 10Springle: [C: 032] depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208493 (owner: 10Springle)
[04:00:51] <grrrit-wm>	 (03Merged) 10jenkins-bot: depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208493 (owner: 10Springle)
[04:01:45] <logmsgbot>	 !log springle Synchronized wmf-config/db-eqiad.php: depool db1070 (duration: 00m 16s)
[04:01:53] <morebots>	 Logged the message, Master
[04:03:59] <grrrit-wm>	 (03PS1) 10Springle: reassign db1070 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/208495 
[04:05:55] <grrrit-wm>	 (03CR) 10Springle: [C: 032] reassign db1070 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/208495 (owner: 10Springle)
[04:09:11] <White_Master>	 wat
[04:09:17] <White_Master>	 sooorry
[04:09:49] <icinga-wm>	 PROBLEM - puppet last run on cp4004 is CRITICAL puppet fail
[04:09:52] <White_Master>	 DD:
[04:17:48] <icinga-wm>	 PROBLEM - puppet last run on cp4020 is CRITICAL puppet fail
[04:27:44] <icinga-wm>	 RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures
[04:28:42] <springle>	 !log xtrabackup clone db1049 to db1070
[04:28:50] <morebots>	 Logged the message, Master
[04:33:21] <icinga-wm>	 RECOVERY - puppet last run on cp4020 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures
[05:06:22] <icinga-wm>	 PROBLEM - puppet last run on carbon is CRITICAL Puppet has 1 failures
[05:14:16] <logmsgbot>	 !log LocalisationUpdate ResourceLoader cache refresh completed at Sun May  3 05:13:13 UTC 2015 (duration 13m 12s)
[05:14:27] <morebots>	 Logged the message, Master
[05:24:01] <icinga-wm>	 RECOVERY - puppet last run on carbon is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:29:42] <Kriss>	 Welcome ! A new page has been opened http://kristjanrobam.16mb.com . If you have time , please comment on it. Thank you .
[06:29:52] <icinga-wm>	 PROBLEM - puppet last run on mw2013 is CRITICAL Puppet has 1 failures
[06:30:42] <icinga-wm>	 PROBLEM - puppet last run on mw1046 is CRITICAL Puppet has 1 failures
[06:30:51] <icinga-wm>	 PROBLEM - puppet last run on mw2023 is CRITICAL Puppet has 1 failures
[06:31:12] <icinga-wm>	 PROBLEM - puppet last run on cp4001 is CRITICAL puppet fail
[06:33:52] <icinga-wm>	 PROBLEM - puppet last run on mw1025 is CRITICAL Puppet has 2 failures
[06:34:13] <icinga-wm>	 PROBLEM - puppet last run on mw2206 is CRITICAL Puppet has 2 failures
[06:35:02] <icinga-wm>	 PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 1 failures
[06:35:12] <icinga-wm>	 PROBLEM - puppet last run on mw1052 is CRITICAL Puppet has 1 failures
[06:35:52] <icinga-wm>	 PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 2 failures
[06:45:12] <icinga-wm>	 RECOVERY - puppet last run on mw1046 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:46:01] <icinga-wm>	 RECOVERY - puppet last run on mw2013 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:46:42] <icinga-wm>	 RECOVERY - puppet last run on mw1025 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:46:52] <icinga-wm>	 RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:12] <icinga-wm>	 RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:47:12] <icinga-wm>	 RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:48:12] <icinga-wm>	 RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:12] <icinga-wm>	 RECOVERY - puppet last run on mw1052 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:49:12] <icinga-wm>	 RECOVERY - puppet last run on cp4001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:55:57] <Steinsplitter>	 API request failed (internal_api_error_DBQueryError): [f4355014] Database query error
[06:56:07] <Steinsplitter>	 :/ get a dotzend of such error warnings today
[06:56:21] <Steinsplitter>	 API request failed (internal_api_error_DBQueryError): [a5677c97] Database query error
[06:56:23] <Steinsplitter>	 etc.
[07:10:01] <icinga-wm>	 PROBLEM - puppet last run on cp4004 is CRITICAL puppet fail
[07:27:42] <icinga-wm>	 RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures
[07:58:01] <icinga-wm>	 PROBLEM - puppet last run on db2054 is CRITICAL puppet fail
[08:15:32] <icinga-wm>	 RECOVERY - puppet last run on db2054 is OK Puppet is currently enabled, last run 1 second ago with 0 failures
[08:30:53] <wikibugs>	 6operations, 10MediaWiki-extensions-SecurePoll, 12Elections, 7I18n, and 2 others: Cannot select language on votewiki - https://phabricator.wikimedia.org/T97923#1255223 (10Nemo_bis) > I don't think allowing ULS to be used by anons helps because ULS uses setlang= which does not translate the SecurePoll conte...
[08:45:18] <wikibugs>	 6operations, 10Wikimedia-Site-requests, 7I18n, 7Varnish: Anonymous users can't pick language on WMF wikis ($wgULSAnonCanChangeLanguage is set to false) - https://phabricator.wikimedia.org/T58464#1255257 (10Nemo_bis) Ok, I see that with [[https://wikimediafoundation.org/wiki/Staff_and_contractors|the new st...
[08:48:23] <wikibugs>	 6operations, 10MediaWiki-extensions-SecurePoll, 12Elections, 7I18n, and 2 others: Cannot select language on votewiki - https://phabricator.wikimedia.org/T97923#1255260 (10Varnent) >>! In T97923#1255223, @Nemo_bis wrote: > Ok, wgUseSiteJs is false. I see no compelling reason to keep it so, if setting it to...
[09:06:45] <wikibugs>	 6operations, 10MediaWiki-extensions-SecurePoll, 12Elections, 7I18n, and 2 others: Cannot select language on votewiki - https://phabricator.wikimedia.org/T97923#1255281 (10Jalexander) >>! In T97923#1255223, @Nemo_bis wrote: >> I don't think allowing ULS to be used by anons helps because ULS uses setlang= wh...
[09:14:10] <Jianhui67>	 heya people
[09:14:20] <Jianhui67>	 commons seems to be lacking and slow for me
[09:14:48] * Steinsplitter confirms
[09:14:50] <Steinsplitter>	 the api
[09:16:01] <Jianhui67>	 anyone here?
[10:01:11] <icinga-wm>	 PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[10:02:33] <Nemo_bis>	 Curious (unrelated to Commons issues) https://atlas.ripe.net/measurements/1994022/#!map
[10:25:21] <icinga-wm>	 PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 62.50% of data above the critical threshold [24.0]
[10:29:13] <icinga-wm>	 PROBLEM - puppet last run on cp3014 is CRITICAL puppet fail
[10:30:12] <sjoerddebruin>	 Wikimedia is slow as hell today.
[10:38:42] <sjoerddebruin>	 I want to do stuff today...
[10:39:39] <Steinsplitter>	 this is the 4'th report i heare today
[10:39:46] <Steinsplitter>	 looks like all ops sleeping
[10:40:04] <Jianhui67>	 exactly sjoerddebruin
[10:41:21] <icinga-wm>	 PROBLEM - puppet last run on mw1103 is CRITICAL Puppet has 1 failures
[10:47:01] <icinga-wm>	 RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[10:47:08] <sjoerddebruin>	 Hm.
[10:57:12] <icinga-wm>	 RECOVERY - puppet last run on mw1103 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[11:05:32] <icinga-wm>	 RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[11:11:16] <wikibugs>	 6operations, 10Wikidata: edits seem to be very slow - https://phabricator.wikimedia.org/T97930#1255356 (10Lydia_Pintscher)
[11:11:36] <Steinsplitter>	 API request failed (internal_api_error_DBQueryError): [f65e4865] Database query error
[11:11:37] <Steinsplitter>	 ......
[11:11:40] <wikibugs>	 6operations, 10Wikidata: edits seem to be very slow - https://phabricator.wikimedia.org/T97930#1255319 (10Lydia_Pintscher) This doesn't seem to be a Wikidata issue only. Others are complaining about similar issues on other wikis as well.
[11:11:46] <matanya>	 Steinsplitter: did you look for me ?
[11:12:20] <Steinsplitter>	 matanya, not fore you but for some tech (i, Jianhui67, and sjoerddebruin)
[11:19:17] <wikibugs>	 6operations, 10Wikidata: edits seem to be very slow - https://phabricator.wikimedia.org/T97930#1255362 (10Steinsplitter) Editing is very slow on commons too, and other wikis.  (Other users can confirm this / @Jianhui67 , @Sjoerddebruin)
[11:20:11] <wikibugs>	 6operations, 10Wikidata: edits seem to be very slow - https://phabricator.wikimedia.org/T97930#1255364 (10Steinsplitter) API too: ``` API request failed (internal_api_error_DBQueryError): [f65e4865] Database query error ```
[11:20:44] <wikibugs>	 6operations, 6Commons, 10Wikidata: edits seem to be very slow - https://phabricator.wikimedia.org/T97930#1255365 (10Steinsplitter) p:5Triage>3High
[11:22:38] <wikibugs>	 6operations, 6Commons, 10Wikidata: edits seem to be very slow - https://phabricator.wikimedia.org/T97930#1255368 (10Jianhui67) Well, VFC on Commons was lacking for me just now.
[11:31:50] <wikibugs>	 6operations, 6Commons, 10Wikidata, 7Performance: edits seem to be very slow - https://phabricator.wikimedia.org/T97930#1255369 (10Multichill)
[11:34:47] <Steinsplitter>	 strange traceroute: http://pastebin.com/yQ43uhrQ o_O
[11:40:39] <wikibugs>	 6operations, 10wikitech.wikimedia.org, 7LDAP: Remove Erik Moeller's Production Shell Access - https://phabricator.wikimedia.org/T97864#1255372 (10Krenair) @Chip: When creating new tasks, you must set projects, otherwise no one will be notified that you created it. This is especially important with access rem...
[11:40:51] <wikibugs>	 6operations: Remove Erik Moeller's Production Shell Access - https://phabricator.wikimedia.org/T97864#1255374 (10Krenair)
[11:44:12] <wikibugs>	 6operations: Remove Erik Moeller's Production Shell Access - https://phabricator.wikimedia.org/T97864#1255376 (10Krenair) To be honest, I wouldn't be surprised if this access pre-dated Erik's role as WMF staff (if it does, it can't be simply removed as part of WMF internal offboarding...)
[11:54:35] <multichill>	 Good whatever timezone you're in. Anyone awake? Please have a look at https://phabricator.wikimedia.org/T97930 . Users on different sites (Wikidata / Commons) are reporting slow saves and also database errors
[11:57:32] <multichill>	 bblack: For when you wake up ^
[12:02:21] <MaxSem>	 springle, yt? ^^
[12:05:21] <Steinsplitter>	 look at traceroute: http://pastebin.com/yQ43uhrQ    ...
[12:06:02] <Steinsplitter>	 there is something slow in europa...
[12:09:38] <_joe_>	 hey, just got here
[12:09:49] <_joe_>	 why do you say slaves are slow?
[12:10:16] <_joe_>	 only editing is slow or also reading?
[12:10:34] <multichill>	 _joe_: Seems to be editing only
[12:10:56] <_joe_>	 so I'd say a master database more than a slave :)
[12:11:33] <multichill>	 I'm not assuming anything, I'm just reporting what I'm observing so that people who can look under the hood can find the problem. See https://phabricator.wikimedia.org/T9793
[12:11:50] <Krenair>	 https://dbtree.wikimedia.org/ doesn't show any immediately obvious issues with masters
[12:13:04] <_joe_>	 I'm trying to take a look
[12:13:12] <Krenair>	 But this has been seen on both ... commons and wikidata, I think?
[12:13:28] <multichill>	 You could have a look at the error log if you see databas exceptions. Yes, both Commons and Wikidata
[12:13:36] <Krenair>	 I.e. s4 and s5
[12:13:50] <Krenair>	 if it's a database performance problem it's likely affecting dewiki as well as wikidatawiki
[12:14:08] <_joe_>	 yeah I'm looking at applications, actually our backends are less loaded than yesterday because I'd say there is some issue with the jobrunners not queueing jobs :)
[12:14:32] <_joe_>	 Krenair: I am looking at s4 and s5 masters on tendril and I don't see any smoking guns
[12:14:34] <Krenair>	 Actually I've seen that there are a *lot* of enwiki refreshLinks jobs queued
[12:15:02] <Krenair>	 The number was even higher a few days ago though so might not be related
[12:15:04] <_joe_>	 Krenair: that's since Thursday at least
[12:15:09] <_joe_>	 yep
[12:15:24] <Krenair>	 ooh
[12:15:34] <Krenair>	 There's something
[12:15:41] <Krenair>	 _joe_, fluorine.eqiad.wmnet:/a/mw-log/hhvm.log
[12:15:55] <Krenair>	 being  spammed with redis connection timeouts
[12:15:57] <_joe_>	 Krenair: ok I'll take a look
[12:16:03] <Krenair>	 that'd cause slowness
[12:16:07] <_joe_>	 ah. yes
[12:16:28] <_joe_>	 Krenair: thanks
[12:16:30] <Krenair>	 All for one server, which I can ping...
[12:17:16] <Krenair>	 But why are there no alarm bells going off in here about redis being broken?
[12:18:06] <_joe_>	 Krenair: because redis is theoretically working AFAICT
[12:18:16] <Krenair>	 well the mw servers disagree :)
[12:18:47] <Krenair>	 aha
[12:18:52] <_joe_>	 Krenair: no the problem is different
[12:19:06] <_joe_>	 I guess the redis server has all connection slots occupied
[12:19:14] <Krenair>	 Did you do something _joe_?
[12:19:28] <Krenair>	 It's stopped spamming about redis
[12:19:30] <_joe_>	 Krenair: I logged onto the redis server?
[12:19:32] <_joe_>	 :P
[12:20:28] <Krenair>	 tail -f shows nothing new about redis, last entry was 12:17:46
[12:20:52] <_joe_>	 yeah, but as I told you, it seems like a lot of connections to redis are going on
[12:21:18] <_joe_>	 so I don't think something is really wrong there
[12:21:41] <_joe_>	 Warning: timed out after 2 seconds when connecting to 10.64.32.76 [110]: Connection timed out
[12:21:47] <_joe_>	 2 seconds of timeout
[12:21:55] <_joe_>	 so yeah that may be a reason for the slowness
[12:22:09] <MaxSem>	 ewww
[12:22:21] <Krenair>	 https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Redis%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1430655692&g=mem_report&z=large looks fun
[12:22:28] <wikibugs>	 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1255442 (10Multichill) >>! In T87097#999146, @Aklapper wrote: > Added Multichill to the group. Welcome :)  multichill@tools-bastion-01:~$ getent group nda nda:*:1002:jeremyb,parent5446,addsh...
[12:22:34] <_joe_>	 ok I think I have a reason for the problem
[12:22:46] <_joe_>	 the jobrunners are not working, apparently
[12:22:52] <icinga-wm>	 PROBLEM - RAID on db1004 is CRITICAL 1 failed LD(s) (Degraded)
[12:23:08] <wikibugs>	 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1255443 (10Krenair) That's separate... You can probably be added there via a quick request to #operations however?
[12:23:48] <Krenair>	 Well yeah the job runners wouldn't be working if they can't connect to redis
[12:24:19] <_joe_>	 Krenair: it seems like the redis servers are overwhelmed with connections
[12:24:39] <_joe_>	 so they can't communicate with ganglia
[12:24:56] <wikibugs>	 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1255445 (10Multichill) So we have multiple locations to register membership of the same group? Cute.  >>! In T87097#1255443, @Krenair wrote: > That's separate... You can probably be added th...
[12:25:47] <_joe_>	 20:25 windowcat: Updated jobrunners to c95d565e242e6fa3706c088ddab1cc6f716408e1
[12:25:54] <_joe_>	 this seems to be the issue
[12:26:10] <wikibugs>	 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1255446 (10Multichill) 5Resolved>3Open
[12:26:45] <_joe_>	 Krenair: looking at the jobrunners now
[12:27:11] <wikibugs>	 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1255450 (10Krenair) It's not really the same group. LDAP ops+nda+wmf roughly makes up #WMF-NDA, and even then it's not exactly the same. The LDAP groups grant you access to various misc. ser...
[12:27:56] <MaxSem>	 that change is https://gerrit.wikimedia.org/r/#/c/208408/2/redisJobChronService
[12:28:27] <_joe_>	 MaxSem: do you know how to deploy the jobrunner?
[12:28:53] <Krenair>	 https://wikitech.wikimedia.org/wiki/Jobrunner#Deployment
[12:29:24] <_joe_>	 ok let me try one thing, then I'll revert
[12:30:27] <MaxSem>	 ugh, self-merged and immediately deployed
[12:30:57] <MaxSem>	 tbh, nobody else knows shit about jobrunners anyway
[12:30:58] <Krenair>	 Don't ops do that?
[12:31:12] <Krenair>	 Though it is a mediawiki/* repo, so...
[12:31:25] <wikibugs>	 6operations, 6Commons, 10Wikidata, 7Performance: edits seem to be very slow - https://phabricator.wikimedia.org/T97930#1255455 (10Joe) We have identified an issue with connections to Redis, probably due to the fact that the jobrunners are not working since last night. We're working on it now.
[12:32:12] <_joe_>	 nope
[12:32:25] <_joe_>	 aaron is the main maintainer of the jobrunners
[12:32:34] <_joe_>	 but they stopped working immediately after the deploy
[12:32:35] <_joe_>	 so...
[12:32:53] <_joe_>	 MaxSem: I'm reverting that
[12:33:03] <MaxSem>	 _joe_, concur
[12:33:48] <_joe_>	 MaxSem: https://gerrit.wikimedia.org/r/#/c/208516/ care to +1?
[12:34:18] <MaxSem>	 _joe_, can't you just deploy a previous rev in trebuchet?
[12:34:57] <MaxSem>	 whatever, do what's faster
[12:35:04] <_joe_>	 MaxSem: not sure, I've not a great experience with trebuchet :)
[12:35:14] <MaxSem>	 merged
[12:35:58] <_joe_>	 MaxSem: thanks
[12:38:00] <_joe_>	 !log deploying I969fe8d329c1bcbb919a54cb225200ba0e006a03 to the jobrunners trying to make them work again
[12:38:08] <morebots>	 Logged the message, Master
[12:40:13] <_joe_>	 ok jobrunners are working again
[12:40:20] <MaxSem>	 :P
[12:41:28] <_joe_>	 mmmh still something fishy I'd say
[12:41:29] <hoo>	 Do we have any error logs regarding what went wrong?
[12:42:16] <hoo>	 Oh, it's obvious
[12:42:18] <hoo>	 nvm
[12:42:22] <_joe_>	 hoo: not really, I am just correlating the time of day when the problem happened with the release
[12:42:56] <_joe_>	 but redis issues not going away
[12:43:23] <Krenair>	 connection slots still filling up?
[12:43:39] <_joe_>	 Krenair: I'm going to get back to it
[12:44:41] <sjoerddebruin>	 I just want to tell you all good luck. We're all counting on you.
[12:44:52] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[12:45:00] <Nemo_bis>	 Too bad https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=terbium.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1398774923&v=600868&m=Global_JobQueue_length&z=large no longer works
[12:45:00] <_joe_>	 sjoerddebruin: sorry to be late but I got home quite late yesterday :)
[12:45:21] <_joe_>	 ok the redis server on rdb1001 is kaputt
[12:45:29] <sjoerddebruin>	 Yikes
[12:47:20] <_joe_>	 ok something tells me this may have solved something
[12:47:36] <_joe_>	 !log restarting redis server on rdb1001, lagging on the most basic queries
[12:47:43] <morebots>	 Logged the message, Master
[12:48:11] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0]
[12:48:59] <_joe_>	 sjoerddebruin: still seeing slow edits?
[12:49:10] <sjoerddebruin>	 Will look.
[12:49:16] <_joe_>	 thanks :)
[12:49:36] <_joe_>	 because well, redis errors are gone in the logs
[12:49:38] <sjoerddebruin>	 Nulledit seems a lot faster on nlwiki.
[12:50:00] <sjoerddebruin>	 Wikidata is still slow with saving.
[12:51:09] <_joe_>	 mh
[12:51:24] <_joe_>	 yeah the redis failures are back, damn
[12:52:20] <_joe_>	 redis is maxing out one cpu on rdb1001
[12:52:34] <hoo>	 _joe_: Is the redis server the bottleneck or is it being hammered down?
[12:52:41] <MaxSem>	 wake up Aaron?
[12:53:02] <_joe_>	 hoo: I'd say it is the bottleneck
[12:53:17] <hoo>	 We can fail over to the second one
[12:53:18] <_joe_>	 but the reason why it fails may be app-related
[12:53:24] <hoo>	 ok
[12:53:26] <_joe_>	 MaxSem: no I'm not sure it's his fault yet
[12:53:32] <_joe_>	 MaxSem: maybe in a few
[12:55:08] <_joe_>	 https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=rdb1001.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Redis+eqiad ok something is definitely hammering redis
[12:55:18] <_joe_>	 wtf is happening?
[12:55:46] <MaxSem>	 _joe_, did you restart the jobrunners after deploying?
[12:55:54] <_joe_>	 MaxSem: yes, of course
[12:55:55] <hoo>	 _joe_: Did you kill hhvm on the job runners, yet? The timeout is set to 2s, but it still might be hanging for some reason
[12:56:10] <_joe_>	 hoo: no I didn't kill hhvm on the JR
[12:56:19] <_joe_>	 but that's a sensible idea anyways
[12:56:57] <_joe_>	 https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=rdb1001.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Redis+eqiad ugh
[12:58:21] <_joe_>	 it's probably one huge data structure in redis that has grown too long
[12:58:35] <Krenair>	 redis falls over if you do that? :S
[12:58:39] <_joe_>	 hoo, MaxSem do you know what are the names of the job queue keys?
[12:58:54] <_joe_>	 Krenair: yeah it may 
[12:59:20] <Krenair>	 something like this: https://phabricator.wikimedia.org/T87040#984019
[12:59:46] <hoo>	 _joe_: mh, one sec
[12:59:56] <Krenair>	 you can get a list of job types running with `mwscript showJobs.php <wiki name> --group`
[13:00:18] <hoo>	 l-unclaimed
[13:00:27] <hoo>	 _joe_: ^
[13:00:28] <Krenair>	 I wouldn't be surprised if enwiki refreshLinks is taking up a non-trivial amount of space
[13:00:35] <_joe_>	 hoo: what does that mean?
[13:00:52] <hoo>	 that's the queue key
[13:00:58] <hoo>	 but it's getting mapped to something, probably
[13:01:43] <hoo>	 wfForeignMemcKey( $db, $prefix, 'jobqueue', $type, $prop )
[13:01:45] <hoo>	 :/
[13:03:13] <_joe_>	 hoo: I'm tempted to drop the current data in redis (after saving the AOF file) to see if that solves the problem :)
[13:03:44] <hoo>	 _joe_: Well, in that case we can also switch over to the secondary redis shard
[13:04:19] <_joe_>	 hoo: ok I think I might have found one of the problems
[13:04:44] <MaxSem>	 _joe_, with a lot of renames due to SUL migration, would recommend doing that only in dire circumstances
[13:04:45] <_joe_>	 we have 200K keys 'commonswiki:jobqueue:*'
[13:04:49] <_joe_>	 most restbase jobs
[13:04:55] <hoo>	 enwiki has way more
[13:05:02] <_joe_>	 sorry, we're now at 600K
[13:05:12] <_joe_>	 yeah it's not sane
[13:05:22] <Krenair>	 MaxSem, do we still have a lot of renames ongoing?
[13:05:25] <_joe_>	 we should drop those maybe?
[13:05:46] <MaxSem>	 Krenair, manual followups
[13:05:47] <_joe_>	 TIL: use an actual queue instead of redis
[13:06:04] <_joe_>	 commonswiki:jobqueue:RestbaseUpdateJobOnDependencyChange:rootjob:5c3c218255f2fb56e075120477e79092808d10bd
[13:06:10] <hoo>	 _joe_: Some types we can drop, dropping others will cause major harm
[13:06:16] <MaxSem>	 what if the problem is in volume of requests
[13:06:17] <MaxSem>	 ?
[13:06:19] <_joe_>	 I'm at more than 1M of those
[13:06:36] <_joe_>	 MaxSem: from whom?
[13:06:36] <hoo>	 Dropping those will not make the world explode, I think
[13:06:41] <Krenair>	 Not exactly at a ridiculously high rate: https://meta.wikimedia.org/wiki/Special:Log/gblrename
[13:06:44] <_joe_>	 hoo: I concur
[13:08:04] <Krenair>	 Do we entirely rely on redis for keeping job data, or is it all backed by mysql?
[13:08:14] <_joe_>	 Krenair: we use redis
[13:08:14] <hoo>	 Redis only
[13:08:53] <icinga-wm>	 PROBLEM - puppet last run on mw2143 is CRITICAL puppet fail
[13:09:13] <_joe_>	 hoo: btw, bot redis shards are suffering the same issue
[13:10:22] <hoo>	 That's awry
[13:11:08] <hoo>	 _joe_: We can dump the list of jobs and then drop them, probably
[13:11:24] <hoo>	 For jobs that aren't extremely important for data consistency
[13:11:27] <_joe_>	 hoo: ok I am doing that
[13:11:54] <hoo>	 Almost 7m refreshlinks jobs on enwiki
[13:12:28] <_joe_>	 yeah and 10m (when I stopped it) restbase jobs queued 
[13:18:19] <Krenair>	 hoo, there was up to about 13m just a day or so ago..
[13:19:39] <hoo>	 _joe_: Shall we lower the timeout? Also I think we should disable user renames right now
[13:19:56] <_joe_>	 I'm not sure user renames are the problem here
[13:20:05] <hoo>	 Well, they're almost certainly not
[13:20:10] <_joe_>	 the problem are restbase/parsoid jobs being queued and not processed
[13:20:16] <hoo>	 but if one of these fail, we have to recover per hand
[13:20:29] <MaxSem>	 user renames are just the type of jobs we don't ever want to lose
[13:20:46] <hoo>	 fail to be run of fail to be enqueued
[13:20:48] <hoo>	 yes, that
[13:21:35] <Krenair>	 We should probably have a list of such job types somewhere prominent
[13:25:10] <grrrit-wm>	 (03PS1) 10Hoo man: Temporary disable global renames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208518 
[13:25:23] <hoo>	 _joe_: ^ Ok with you? I don't really want to clean up after thes
[13:25:24] <hoo>	 e
[13:25:30] <hoo>	 that patch is just revoking permissions
[13:26:09] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 031] Temporary disable global renames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208518 (owner: 10Hoo man)
[13:26:17] <_joe_>	 hoo: need me to deploy>
[13:26:21] <_joe_>	 *?
[13:26:24] <_joe_>	 it seems ok to me
[13:26:25] <hoo>	 No, I can do that
[13:26:31] <_joe_>	 sorry I'm looking at redis itself
[13:26:32] <grrrit-wm>	 (03CR) 10Hoo man: [C: 032] Temporary disable global renames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208518 (owner: 10Hoo man)
[13:26:38] <grrrit-wm>	 (03Merged) 10jenkins-bot: Temporary disable global renames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208518 (owner: 10Hoo man)
[13:27:32] <logmsgbot>	 !log hoo Synchronized wmf-config/: Temporary disable global renames (duration: 00m 16s)
[13:27:37] <morebots>	 Logged the message, Master
[13:28:32] <icinga-wm>	 RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:31:33] <_joe_>	 hoo: we'd really need to wipe out at least one redis queue
[13:31:46] <_joe_>	 we were wondering what that may mean
[13:31:57] <_joe_>	 in terms of impact
[13:32:12] <hoo>	 Totally depends on the type of jobs you would remove
[13:32:39] <bblack>	 is there any easy way to flush out only the jobs that are non-critical?
[13:32:45] <hoo>	 Removing refreshlinks for example isn't nice, but doesn't cause major harm, we did that often in the past
[13:32:50] <_joe_>	 bblack: no it's a pain
[13:33:09] <bblack>	 maybe there should be separate queues, or some metadata indication of what's expandable, for future reference
[13:33:16] <bblack>	 s/expandable/expendable/
[13:33:42] <_joe_>	 bblack: the best thing I can think of now would be - make rdb1002 not a slave of rdb1001, wipe its queue, point mediawiki and the jobrunners to it
[13:33:48] <_joe_>	 but I need a break
[13:34:09] <bblack>	 ok so perhaps we could just delete refreshlinks, if those are a significant fraction?
[13:34:11] <_joe_>	 I'll be back shortly, we're not on fire anyway and I am too tired to think 
[13:34:25] <_joe_>	 bblack: I suspect the restbase queue is by fare the more harmful
[13:34:40] <bblack>	 well I have very little clue with how to operate on these things at this level
[13:35:49] <_joe_>	 bblack: what preoccupies me is looking at perf on rdb1001 I don't see anything so strange
[13:35:59] <_joe_>	 apart from dictFind using a lot of cpu
[13:36:27] <bblack>	 ^ does sound like something that would happen with lots of keys
[13:36:30] <hoo>	 mh... are we maybe just running into some arbitrary connection limit?
[13:36:45] <_joe_>	 hoo: no, never reached maxclients
[13:37:05] <hoo>	 weird
[13:39:12] <bblack>	 the disk i/o rate for the AOF file doesn't look like anything deadly
[13:40:09] <_joe_>	 bblack: yeah it's not that either
[13:40:53] <_joe_>	 23.91%  redis-server  redis-server        [.] 0x74b7f
[13:40:56] <_joe_>	 this is weird
[13:41:06] <_joe_>	 I have most symbols, but not this one, wtf?
[13:43:42] <MaxSem>	 _joe_, lua interpreter?
[13:43:50] <Krenair>	 looks like AaronSchulz is awake
[13:45:30] <_joe_>	 windowcat: major fuckup with your jobrunner release yesterday, the net effect has been: https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=rdb1001.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Redis+eqiad
[13:45:40] <_joe_>	 and jobrunners broken, and now slow edits
[13:46:04] <_joe_>	 because well, we have a fuckton of parsoid and restbase jobs queued and redis can't handle the number of keys apparently
[13:46:49] <_joe_>	 I'll leave the question of why on earth you a) deployed on saturday b) did that by automerging c) not in an emergency d) not caring to look if everything was ok to a later moment
[13:49:14] <windowcat>	 _joe_: it was deploy to fix constant OOMs (I noticed it indirectly from an issue Hoo reported). All delayed jobs were broken. I was looking at it the afterwards and in some emails with gwicke, but there is still one bit that didn't make sense to me.
[13:50:02] <_joe_>	 windowcat: there are a lot for me actually, the only reason why I pointed to your change is that the breakage timing corresponds
[13:50:04] <windowcat>	 _joe_: anyway, was jonchron restarted after that revert?
[13:50:34] <windowcat>	 jobrunner cpu in ganglia looks the same
[13:50:35] <_joe_>	 no just the jobrunner service
[13:50:47] <_joe_>	 what's jobcron?
[13:51:13] <windowcat>	 _joe_: the one that was changed in that commit, it does the periodic task stuff (undelaying jobs and re-enqueued failed ones)
[13:51:15] <_joe_>	 windowcat: my working hypothesis is that we have way too many keys in redis right now for parsoid and restbase jobs
[13:52:16] <_joe_>	 windowcat: so how do I restart that?
[13:52:19] <_joe_>	 that's new :P
[13:52:21] <bblack>	 root@rdb1001:/tmp# cat /tmp/redis-keys |cut -d: -f1-3|sort|uniq -c|sort -rn
[13:52:24] <bblack>	 1834669 commonswiki:jobqueue:RestbaseUpdateJobOnDependencyChange
[13:52:24] <bblack>	     577 commonswiki:jobqueue:htmlCacheUpdate
[13:52:27] <bblack>	 1462434 commonswiki:jobqueue:ParsoidCacheUpdateJobOnDependencyChange
[13:52:30] <bblack>	  286220 commonswiki:jobqueue:refreshLinks
[13:52:38] <_joe_>	 and this is just commons
[13:52:41] <Krenair>	 Is that something that's supposed to be restarted every time you deploy the jobrunner?
[13:53:14] <windowcat>	 Krenair: if it changes, I guess, the only annoying thing is that the git deploy service restart will only do jobrunner
[13:53:23] <windowcat>	 it needs a virtual service to restart both really
[13:53:28] <Krenair>	 Why is that not documented at https://wikitech.wikimedia.org/wiki/Jobrunner#Deployment ?
[13:54:06] <_joe_>	 !log restarting jobcron on the jobrunners
[13:54:11] <morebots>	 Logged the message, Master
[13:55:13] <_joe_>	 windowcat: I guess the issue now is on redis directly
[13:55:22] <_joe_>	 it has too many keys and it can't handle that
[13:55:25] <windowcat>	 now if they go back to OOMing pre-command, I'd expect load to drop (it doesn't make sense that load was OK before it reached the point of just OOMing, which is what I don't get)
[13:56:01] <windowcat>	 _joe_: you mean too much space or literally to many keys (the number of keys should be modest, the the number of hash/list items will be very high in particularly keys)
[13:56:38] <_joe_>	 I wouldn't define 4M keys for commons only "modest"
[13:56:54] <_joe_>	 windowcat: we want to flush the redis machines from all keys if possible
[13:56:59] <windowcat>	 _joe_: hash subkeys, not keys
[13:57:16] <_joe_>	 oh that too yes
[13:57:30] <_joe_>	 I see in perf redis is spending a lot of time in dictFind
[13:57:48] <_joe_>	 windowcat: and I promise I'll buy you guys a real queue if you behave
[13:57:59] <windowcat>	 _joe_: are you saying there are 4M actual keys? That would only be possible if there were many garbage queues
[13:58:51] <_joe_>	 windowcat: I'm saying that, yes, just for commons
[13:59:04] <_joe_>	 bblack has more info, brb
[13:59:26] <bblack>	 windowcat: redis-cli -a <PWD> KEYS "commonswiki:jobqueue:*" > /tmp/redis-keys
[13:59:34] <bblack>	 ^ is what generated the file the counts above came from
[13:59:57] <hoo>	 Error logs log better right now
[14:00:01] <hoo>	 * look
[14:01:19] <bblack>	 (the actual keys file contents look like this before filtering:)
[14:01:22] <bblack>	 commonswiki:jobqueue:ParsoidCacheUpdateJobOnDependencyChange:rootjob:4080cba0135b6976bb2a557da0ed8dd7b0ab040e
[14:01:25] <bblack>	 commonswiki:jobqueue:RestbaseUpdateJobOnDependencyChange:rootjob:1b4f340b705c3e841b21078676b0bf11d721141e
[14:01:26] <hoo>	 I should not have said that...
[14:01:41] <windowcat>	 ahh, root jobs, not actual queue keys
[14:01:54] <windowcat>	 yeah, there will be many of those (not new)
[14:02:30] <bblack>	 it's normal to have ~4M x "rootjob" for commons sitting in the queue?
[14:02:32] <windowcat>	 they are just timestamp keys that are non-essential...they could be nuked if they caused some problem
[14:03:17] <bblack>	 re-checking with the 4th field included
[14:03:33] <bblack>	 yeah about the same
[14:03:38] <bblack>	 root@rdb1001:~# cat /tmp/redis-keys |cut -d: -f1-4|sort|uniq -c|sort -rn
[14:03:41] <bblack>	 1834665 commonswiki:jobqueue:RestbaseUpdateJobOnDependencyChange:rootjob
[14:03:44] <bblack>	 1462432 commonswiki:jobqueue:ParsoidCacheUpdateJobOnDependencyChange:rootjob
[14:03:47] <bblack>	  286214 commonswiki:jobqueue:refreshLinks:rootjob
[14:03:48] <bblack>	 (+ very long tail of 1x various things)
[14:03:50] <bblack>	     573 commonswiki:jobqueue:htmlCacheUpdate:rootjob
[14:03:53] <bblack>	       1 commonswiki:jobqueue:webVideoTranscode:z-claimed
[14:03:57] <windowcat>	 4M is, off the bat, interesting, since there is normally one per template change
[14:04:43] <windowcat>	 so changing a hugely used template would create many jobs in the queues but not many rootjob keys (which just have a unix timestamp as the value)
[14:05:20] <windowcat>	 on the other hand, they have a very high expiry (relying on LRU), so they could build up over time
[14:05:29] <bblack>	 well it sounds like you understand this much better than I do.  Any ideas on how to fix this?
[14:05:32] <_joe_>	 so yah, can we flush the darn things?
[14:05:58] <bblack>	 I'm still on "our best idea so far is to delete a bunch of parsoid/restbase update rootjob thingies"
[14:06:21] <windowcat>	 not sure how much it will help, but deleting all :rootjob: keys is safe
[14:06:55] <_joe_>	 windowcat: my hipothesis is that your change took all the unclaimed/failed jobs and by error made them individual keys in redis
[14:06:59] <_joe_>	 is that possible?
[14:07:41] <Krenair>	 if it's safe and it might help, let's try it?
[14:07:52] <_joe_>	 bblack: I'd say we shoudl flush those keys, yes
[14:08:17] <bblack>	 ok working on that...
[14:08:31] <windowcat>	 the file I changed doesn't ever read/write rootjob keys, so it couldn't have made those
[14:08:38] <windowcat>	 "	(+ very long tail of 1x various things)"
[14:08:43] * windowcat wonders what those are
[14:10:54] <bblack>	       1 commonswiki:jobqueue:htmlCacheUpdate:l-unclaimed
[14:10:54] <bblack>	       1 commonswiki:jobqueue:htmlCacheUpdate:h-sha1ById
[14:10:54] <bblack>	       1 commonswiki:jobqueue:htmlCacheUpdate:h-idBySha1
[14:10:55] <bblack>	       1 commonswiki:jobqueue:htmlCacheUpdate:h-data
[14:11:05] <bblack>	 things like that, with many other values for htmlCacheUpdate part, etc
[14:11:35] <windowcat>	 right, there a few of those for each non-empty queue, which is unremarkable
[14:16:09] <bblack>	 !log deleting :rootjob: entries for commonswiki from redis
[14:16:13] <morebots>	 Logged the message, Master
[14:17:19] <bblack>	 deletes are getting slower as it goes
[14:17:39] <Nemo_bis>	 15.50 < windowcat> jobrunner cpu in ganglia looks the same
[14:17:45] <Nemo_bis>	 network is very different though
[14:18:43] <_joe_>	 yes thanks nemo
[14:19:00] <_joe_>	 I was about to point that out
[14:19:37] <bblack>	 !log deleting :rootjob: entries for enwiki from redis too
[14:19:40] <morebots>	 Logged the message, Master
[14:19:43] <bblack>	 both are slow now
[14:20:05] <bblack>	 the first 200K or so of the commons ones deleted fast, now trying to do further deletes from the en or commons lists crawls
[14:20:22] <bblack>	 well, then in the time I was typing that, the enwiki one started deleting fast again
[14:20:25] <bblack>	 beats me
[14:20:33] <bblack>	 I'll let that one go for a bit
[14:20:52] <windowcat>	 Nemo_bis: I mentioned I expected i/o to drop since that part of the service will go back to failing before sending any commands
[14:20:59] <bblack>	 enwiki's still ripping through fast now
[14:22:09] <_joe_>	 probably mediawikis are writing to the other node now
[14:22:40] <bblack>	 they move based on some load feedback?
[14:22:58] <_joe_>	 based on timeouts
[14:23:19] <_joe_>	 also, it goes down every 10 minutes
[14:23:21] <bblack>	 en/commons were both in similar ballpark, de 1/4 either of those, so yeah probably no need to go much deeper than commons+en+de
[14:24:40] <_joe_>	 windowcat: do you happen to remember what has that periodicity
[14:25:15] <_joe_>	 maybe is the flush to disk in redis
[14:25:36] <bblack>	 probably
[14:26:02] <bblack>	 one of my deletes just got an isolated Could not connect to Redis at 127.0.0.1:6379: Connection timed out
[14:26:13] <bblack>	 (I'm batching them in 100x keys per cli invocation)
[14:26:20] <windowcat>	 I don't recall in periods of 10 off hand
[14:26:37] <_joe_>	 save ""
[14:26:42] <_joe_>	 so nope
[14:27:02] <bblack>	 over half the enwiki jobs were cleared before the latest delete slowdown
[14:27:37] <bblack>	 and now it has sped back up on processing deletes again
[14:27:38] <windowcat>	 bblack: judging from running 'mwscript runJobs.php enwiki --type enqueue' and http://performance.wikimedia.org/xenon/svgs/daily/2015-05-03.svgz (similar to yesterdays graph) it seems like connections are taking up a huge chunk of time relative to actual commands
[14:28:02] <_joe_>	 windowcat: yeah redis is borked
[14:28:14] <_joe_>	 windowcat: maxing out 1 cpu on those servers
[14:28:23] <bblack>	 now all my deletes are spamming: Could not connect to Redis at 127.0.0.1:6379: Cannot assign requested address
[14:28:30] <bblack>	 did someone just try to restart something?
[14:29:11] <_joe_>	 bblack: not me, but I see some effect on ganglia, rdb1001
[14:29:14] <bblack>	 now they can connect again, I think
[14:29:27] <windowcat>	 of course, it's hard to tell that from xenon alone since it uses stack dump sampling
[14:29:28] <bblack>	 re-fetching key lists so I can restart these deletes from wherever they left off
[14:30:01] <windowcat>	 _joe_: it may be that redis favors commands from clients over new accept requests
[14:30:20] <windowcat>	 so maybe that's normal under too much cpu
[14:30:26] <_joe_>	 windowcat: probably, I don't know redis internals well enough thogh
[14:30:48] <_joe_>	 windowcat: but that was triggered by the JR change you made for sure
[14:31:38] <bblack>	 the random windows of "can't connect to redis" vs "deletes happen fast" vs "deletes happen slow" is disturbing
[14:33:37] <bblack>	 is there a smarter way to wipe just the parsoid and/or restbase rootjobs? do they comprise a larger data structure that can be deleted in one command?
[14:33:53] <_joe_>	 -rw-rw---- 1 redis redis  15G May  3 14:33 rdb1001-6379.aof
[14:34:15] <bblack>	 I'm in another fast-delete window now
[14:34:24] <_joe_>	 bblack: stopping redis, removing this file (or renaming maybe) and then start redis again
[14:34:29] <_joe_>	 that'd be fast
[14:34:31] <_joe_>	 :)
[14:34:32] <bblack>	 enwiki rootjob delete finished
[14:34:34] <bblack>	 commons still going
[14:34:49] <windowcat>	 bblack: you want to nuke all parsoid jobs?
[14:35:31] <_joe_>	 windowcat: yes that's what we're trying to do
[14:35:35] <bblack>	 and restbase
[14:35:36] <_joe_>	 the restbase ones too
[14:35:44] <windowcat>	 nevermind, missed the "root" part
[14:35:47] <_joe_>	 I'm not sure mwscript could pull it off
[14:35:51] <bblack>	 what I'm doing in practice right now is deleting all :rootjob: entries for the top wikis
[14:36:04] <windowcat>	 no faster way to delete "root" jobs that what you are doing
[14:36:05] <_joe_>	 bblack: why not the :jobqueue: ones?
[14:36:12] <windowcat>	 deleting the actual queue is easy from MW
[14:36:21] <bblack>	 well, :jobqueue:rootjob, yes
[14:36:30] <bblack>	 _joe_: it's a subset, and it's the bulk ones
[14:37:00] <_joe_>	 ok sorry
[14:37:07] <bblack>	 !log enwiki + commonswiki jobqueue:*:rootjob wipe complete
[14:37:14] <morebots>	 Logged the message, Master
[14:37:21] <bblack>	 (^ * covers rb/parsoid, and also things like refreshlinks)
[14:37:35] <bblack>	 !log dewiki jobqueue:*:rootjob wipe complete
[14:37:38] <morebots>	 Logged the message, Master
[14:37:49] <bblack>	 give it a couple mins and see if any positive effect?
[14:38:01] <_joe_>	 ok
[14:38:18] <_joe_>	 I don't think that wiil happen, tough
[14:40:11] <bblack>	 ^ pessimist
[14:40:24] <_joe_>	 the weird net graph is still the same
[14:40:26] <_joe_>	 :)
[14:41:14] <bblack>	 the weekly view is interesting too
[14:41:15] <bblack>	 https://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&h=rdb1001.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Redis+eqiad
[14:41:32] <bblack>	 stable -> dropout (presumably the OOM issue) -> back to normal-ish levels, but all spiky
[14:42:32] <bblack>	 what happened at the start of that dropout (which presumably lead to OOM issues which lead to the saturday path?)
[14:42:40] <bblack>	 s/path/patch/
[14:46:57] <_joe_>	 ok so looking at tcpdump on rdb1001 it seems like we do a lot of EVALS
[14:47:13] <_joe_>	 all to keys like frwiki:jobqueue:enqueue:l-unclaimed
[14:47:25] <_joe_>	 or h-sha1ById
[14:49:32] <_joe_>	 redis 127.0.0.1:6379> llen enwiki:jobqueue:enqueue:l-unclaimed
[14:49:32] <_joe_>	 (integer) 463178
[14:49:50] <_joe_>	 windowcat: isn't that like a lot?
[14:52:59] <windowcat>	 _joe_: trying to see what type those are
[14:53:03] <windowcat>	 that queue is kind of new-ish
[14:53:36] <_joe_>	 windowcat: enwiki:jobqueue:enqueue:h-data is *huge*
[14:53:52] <_joe_>	 I suspect dropping it would be a good thing
[14:54:06] <_joe_>	 219974 keys in an hash that we eval via lua seem too much
[14:54:33] <_joe_>	 sorry, that's for commonswiki
[14:54:50] <bblack>	 also, redis process is always locked at 100% CPU (there's idle CPU on the host, but redis is bound up on a single core somewhere, for something that can't scale across threads/cores)
[14:54:54] <windowcat>	 I was about to suggesting nuking it, triggerOpportunisticLinksUpdate() is the only thing that puts those there in mwcore (they are refreshLinks)
[14:55:08] <bblack>	 and "perf top" points at Lua context switching involved in that
[14:55:14] <windowcat>	 _joe_: should I delete that queue and see what happens?
[14:55:23] <_joe_>	 windowcat: which one?
[14:55:27] <windowcat>	 enqueue
[14:55:46] <windowcat>	 they are just opportunistic page links updates we used to not do before
[14:55:47] <_joe_>	 enwiki and commons probably too, but wait for me to see how big hdata for enwiki is
[14:55:51] <windowcat>	 no one will miss them
[14:56:01] <_joe_>	 yeah 1 sec
[14:56:37] <bblack>	 we really need to separate out the "no one will miss this" stuff from the "this is critical for data integrity" stuff better at some architectural level for our redis stuff
[14:56:42] <bblack>	 like, maybe separate clusters
[14:56:44] <windowcat>	 new JobSpecification( 'refreshLinks', $params, array(), $this->mTitle )
[14:57:06] <_joe_>	 yeah h-data for enwiki has more than one million keys
[14:57:10] <windowcat>	 that's missing removeDuplicates...not sure how much that matters though
[14:57:15] * windowcat will look at that line later
[14:57:21] <windowcat>	 (WikiPage.php)
[14:57:23] <_joe_>	 we should really really review how we use redis
[14:57:47] <_joe_>	 windowcat: go on and nuke it
[15:01:11] <_joe_>	 windowcat: are you doing it? or I can do it quickly for you :P
[15:01:23] <windowcat>	 oh, yeah, done for enwiki/commonswiki
[15:02:13] <_joe_>	 still seeing the huge network traffic and all the issues :(
[15:03:33] <bblack>	 well, huge net traffic is similar to historical network traffic levels.  it's just huge relative to dropout yesterday, and very very spiky compared to flat lines of history
[15:05:08] <_joe_>	 bblack: really?
[15:05:33] <_joe_>	 bblack: you're right
[15:05:49] <_joe_>	 May  3 15:05:41 mw1082:  #012Warning: Failed connecting to redis server at 10.64.32.76: Connection timed out
[15:05:52] <_joe_>	 May  3 15:05:41 mw1176:  #012Warning: timed out after 2 seconds when connecting to 10.64.32.76 [110]: Connection timed out
[15:05:55] <_joe_>	 still seeing these
[15:09:18] <_joe_>	 windowcat: do we cache lua scripts in redis? and if so, how do we do that?
[15:09:27] <sjoerddebruin>	 Websites are still slow tbh
[15:09:40] <_joe_>	 sjoerddebruin: yeah we're not out of the mud yet
[15:09:51] <_joe_>	 sorry, this is more subtle than I anticipated
[15:10:06] <_joe_>	 windowcat: who loads, reviews, maintains such scripts?
[15:10:49] <_joe_>	 what I see is that probably some script is hagning 
[15:10:57] <_joe_>	 how do we update/audit those?
[15:11:01] <windowcat>	 _joe_: they are cache programmatically, the scripts are mostly in jobrunner, with a few smaller ones in JobQueueRedis
[15:11:21] <windowcat>	 the cache clears on restart in redis
[15:11:22] <_joe_>	 windowcat: so did we change any script recently?
[15:11:48] <_joe_>	 ok, so maybe what I saw earlier - 5 minutes of sanity after restarting redis
[15:12:19] <windowcat>	 my last change left the scripts untouched, nothing recent changed lua before that either
[15:12:39] <_joe_>	 ...
[15:12:47] <_joe_>	 so... what is wrong here?
[15:13:02] <_joe_>	 windowcat: how much harm would flushing redis completely clean do?
[15:13:32] <_joe_>	 because I'm hopeful that would solve this
[15:13:48] <bblack>	 it's possible a complete flush wouldn't solve this, of course, because the problem is contention in the lua code for new stuff coming in all the time
[15:14:23] <bblack>	 whatever it is, the pointer we have now is CPU contention in Lua code inside redis.  It may not be that the code itself changed, but that access patterns of things hitting the code has changed.
[15:14:25] <_joe_>	 bblack: well the "after a restart things go smooth for five minutes or so" pattern seems to be interesting
[15:14:42] <_joe_>	 bblack: yep, which ws my original idea btw
[15:14:59] <_joe_>	 that's why I think removing what we have in redis may help, or not
[15:15:36] <_joe_>	 oh guys wait
[15:15:44] <_joe_>	 I didn't really restart the jobcron
[15:15:45] <_joe_>	 meh
[15:15:49] <_joe_>	 I'm too tired :(
[15:15:57] <bblack>	 ... because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know ...
[15:16:08] <windowcat>	 _joe_: has anyone tried just stopping the chron thing for sanity?
[15:16:22] <windowcat>	 in theory it doesn't do anything, but it would be nice to rule that out
[15:16:32] <_joe_>	 windowcat: nope I thought I restarted it, restarting for real now
[15:17:32] <_joe_>	 !log restarted jobchron, not jobcron, this time for real
[15:17:37] <morebots>	 Logged the message, Master
[15:17:49] <bblack>	 heh
[15:18:06] <_joe_>	 sorry guys
[15:18:14] <_joe_>	 3 hours of sleep take their toll :(
[15:18:30] <_joe_>	 I was supposed to sleep after lunch, but I saw the reports
[15:18:37] <bblack>	 I'm pretty fresh, I just have no idea wtf I'm doing
[15:18:39] <_joe_>	 sjoerddebruin: is it still slow now?
[15:19:09] <_joe_>	 (sorry to ab-use of your time, btw)
[15:19:16] <bblack>	 also, I have about 15 minutes left to make a call on whether I stay here through this and abort real-life plans, or take off.  So far it's looking like the former
[15:19:32] <_joe_>	 wait 10 minutes :)
[15:19:45] * windowcat keeps hearing glass break outside
[15:19:53] <_joe_>	 it's fixing itself now
[15:19:53] <sjoerddebruin>	 A lot faster.
[15:19:56] <_joe_>	 sorry guys
[15:19:57] <_joe_>	 meh
[15:20:01] <bblack>	 it's angry editors throwing baseballs at your house windowcat
[15:20:01] <_joe_>	 that was it
[15:20:05] <hoo>	 Enumerating jobs is fast again, so we're probably good again
[15:20:16] <_joe_>	 yeah I restarted the right service this time
[15:20:18] <sjoerddebruin>	 Null edits on nlwiki are instant again instead of 10 seconds.
[15:20:33] <_joe_>	 my brain just read "jobcron" instead of "jobchron"
[15:20:49] <_joe_>	 and then I was too tired to really look it went through :(
[15:20:53] <hoo>	 Ok, if it holds up for the next 30-60min I'll undisable renames
[15:20:53] <_joe_>	 bblack: I guess you can go
[15:22:13] <sjoerddebruin>	 Thanks guys! <3
[15:22:17] <bblack>	 well step 1 in my plans is to take a shower and get dressed, so I'll check in after that and see how it's going
[15:22:43] <_joe_>	 sjoerddebruin: yeah I was like 1 hour late, but blame the NBA playoffs for that :P
[15:25:02] <Krenair>	 guess we'll need some incident documentation for this?
[15:25:27] <Krenair>	 or does it not qualify... hmm
[15:25:42] <wikibugs>	 6operations, 6Commons, 10Wikidata, 7Performance: edits seem to be very slow - https://phabricator.wikimedia.org/T97930#1255502 (10Joe) After a few back and forth, We're pretty sure the cause of the outage was due to the changes in the jobchron service on the jobrunners that were released on saturday via...
[15:25:43] <_joe_>	 Krenair: it does
[15:25:58] <hoo>	 +1 for incident documentation
[15:25:59] <_joe_>	 Krenair: for sure it does, it had a user-facing impact
[15:26:12] <Krenair>	 okay
[15:26:15] <_joe_>	 also, I really want the WMF to review our policy on releases
[15:26:17] <Krenair>	 I wasn't sure exactly where we draw the line
[15:26:25] <_joe_>	 so I want this to be on record
[15:26:39] <windowcat>	 hoo: I'm filing a task for https://gerrit.wikimedia.org/r/#/c/208397/ since that still needs fixing
[15:27:01] <wikibugs>	 6operations, 6Commons, 10Wikidata, 7Performance: edits seem to be very slow - https://phabricator.wikimedia.org/T97930#1255507 (10Joe) Users report editing is fast again.
[15:27:06] <hoo>	 windowcat: Thank you 
[15:27:18] <wikibugs>	 6operations, 6Commons, 10Wikidata, 7Performance: edits seem to be very slow - https://phabricator.wikimedia.org/T97930#1255509 (10Joe) 5Open>3Resolved a:3Joe
[15:27:19] <sjoerddebruin>	 I want this button that wakes all operators. http://www.trendinggear.com/wp-content/uploads/2012/10/Red-Panic-Button.jpg
[15:27:41] <_joe_>	 hoo: I'm off now, I really really need to rest. If it comes back again, please don't esitate to phone me
[15:27:53] <_joe_>	 bblack: you too :)
[15:28:07] <windowcat>	 hoo: how long did delaying not work?
[15:28:35] <hoo>	 windowcat: I think only on Saturday... I can pull some numbers and estimate from taht
[15:28:38] <windowcat>	 either it didn't work at all or it would have made an outage (it had even larger batches than with yesterdays change)
[15:29:28] <hoo>	 _joe_: I don't have your number (as I don't have legit access to officewiki)
[15:29:59] <hoo>	 I could be naughty and pull it via shell though, if really needed, I guess
[15:30:30] <wikibugs>	 6operations, 6Commons, 10Wikidata, 7Performance: edits seem to be very slow - https://phabricator.wikimedia.org/T97930#1255518 (10Krenair) We'll need to create incident documentation for this.
[15:30:42] <wikibugs>	 6operations, 7Performance: edits seem to be very slow - https://phabricator.wikimedia.org/T97930#1255520 (10Krenair)
[15:31:03] <_joe_>	 hoo: oh sorry
[15:32:12] <hoo>	 windowcat: Only for yesterday the numbers look off... but I can't really tell from that
[15:32:24] <wikibugs>	 6operations, 7Performance: edits seem to be very slow - https://phabricator.wikimedia.org/T97930#1255319 (10Krenair) CCing people involved in fixing it
[15:32:25] <hoo>	 I only have the numbers of changes performed by these jobs
[15:32:30] <Steinsplitter>	 _joe_: it still take 4ever deleting stuff on commons
[15:33:09] <windowcat>	 hoo: how did that work at all before? I still don't get that.
[15:33:36] <hoo>	 I have no idea
[15:34:08] <hoo>	 But it worked very well, we didn't have any problems with it since we fixed the title stuff (despite this)
[15:35:59] <bblack>	 ok I'm gonna try to carry on with my day as well.  Call me before you call joe.  he needs sleep.  I'll have my laptop along and can log in pretty quickly if you hit my phone.
[15:36:12] <bblack>	 (hoo has the number, or anyone with officewiki access)
[15:36:43] <wikibugs>	 6operations, 7Performance: edits seem to be very slow - https://phabricator.wikimedia.org/T97930#1255535 (10Steinsplitter) Still problems with deleting stuff on commons [It takes forever]  >>! In T97930#1255507, @Joe wrote: > Users report editing is fast again.
[15:37:02] <wikibugs>	 6operations, 7Performance: edits seem to be very slow - https://phabricator.wikimedia.org/T97930#1255536 (10Krenair) https://wikitech.wikimedia.org/wiki/Incident_documentation/20150503-JobQueueRedis
[15:40:50] * windowcat can't edit https://wikitech.wikimedia.org/w/index.php?title=Jobrunner
[15:41:02] <windowcat>	 constant session errors :/
[15:41:26] <windowcat>	 ah, there we go
[15:45:09] <hoo>	 May  3 15:44:45 mw1199:  #012Warning: Failed connecting to redis server at rcs1002.eqiad.wmnet: Connection timed out
[15:45:38] <hoo>	 Ah, these aren't realted to this, nvm
[15:53:59] <Steinsplitter>	 hoo: on commons file deletion is higly slow. it takes forever to delete a file.
[15:55:17] <hoo>	 Not related to the issue above, I think
[15:55:45] <hoo>	 Could be high database or swift load... can you delete with forceprofile?
[15:56:09] <Krenair>	 hoo, when you get a moment can you contribute what you know to https://wikitech.wikimedia.org/wiki/Incident_documentation/20150503-JobQueueRedis ?
[15:56:31] <Steinsplitter>	 wat is forceprofile?
[15:57:40] <Krenair>	 doesn't forceprofile require a password?
[15:58:02] <hoo>	 Krenair: Nope, but oyu need to set the request header + the GET/POST
[16:02:47] <Krenair>	 krenair@terbium:~$ mwscript showJobs.php enwiki --group
[16:02:47] <Krenair>	 refreshLinks: 13676827 queued; 50 claimed (1 active, 49 abandoned); 0 delayed
[16:02:48] <Krenair>	 hmph
[16:10:49] <wikibugs>	 6operations, 10Wikimedia-Site-requests: refreshLinks.php --dfn-only cron jobs do not seem to be running - https://phabricator.wikimedia.org/T97926#1255573 (10Krenair)
[16:22:04] <grrrit-wm>	 (03PS1) 10Hoo man: Revert "Temporary disable global renames" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208520 
[16:22:12] <grrrit-wm>	 (03PS2) 10Hoo man: Revert "Temporary disable global renames" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208520 
[16:23:01] <grrrit-wm>	 (03CR) 10Hoo man: [C: 032] "Last redis connection error happened more than an hour ago." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208520 (owner: 10Hoo man)
[16:23:07] <grrrit-wm>	 (03Merged) 10jenkins-bot: Revert "Temporary disable global renames" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208520 (owner: 10Hoo man)
[16:23:52] <logmsgbot>	 !log hoo Synchronized wmf-config/: Re-enable global renames (duration: 00m 12s)
[16:23:58] <morebots>	 Logged the message, Master
[16:54:22] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on logstash1005 is OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 5, unassigned_shards: 0, timed_out: False, active_primary_shards: 49, cluster_name: production-logstash-eqiad, relocating_shards: 2, active_shards: 147, initializing_shards: 0, number_of_data_nodes: 5
[16:57:23] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on logstash1006 is OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, timed_out: False, active_primary_shards: 49, cluster_name: production-logstash-eqiad, relocating_shards: 2, active_shards: 147, initializing_shards: 0, number_of_data_nodes: 6
[16:59:40] <sjoerddebruin>	 All things still stable. :)
[17:01:31] <Krenair>	 thanks sjoerddebruin 
[17:05:21] <sjoerddebruin>	 Well, let's start with the work that I wanted to do 12 hours ago.
[17:06:11] <multichill>	 Krenair: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150503-JobQueueRedis contains timestamps, but no timezone. I assume UTC?
[17:09:25] <Krenair>	 multichill, should all be UTC yes
[17:09:58] <Krenair>	 unless I accidentally put a WEST one in there
[17:10:42] <Krenair>	 we always assume utc
[17:12:03] <multichill>	 Looks very UTCish. Didn't know we deployed on Saturday evenings btw...
[17:13:09] <multichill>	 Happy to see it fixed. It's very noticable
[17:13:42] <_joe_>	 Krenair: oh you already edited it? thanks a bunch
[17:13:44] <Krenair>	 multichill, we don't usually
[17:14:01] <Krenair>	 _joe_, I added some stuff there. I don't know all the details and it's incomplete
[17:14:07] <_joe_>	 Krenair: I'll add details tomorrow if something is missing, thanks for the great work though <3
[17:16:08] <Krenair>	 needs a conclusion
[17:16:21] <Krenair>	 I'm not sure the ticket that gwicke added for alerts/monitoring is exactly what we need, maybe
[17:16:34] <Krenair>	 needs detail about why the original jobrunner deployment was done
[17:16:43] <Krenair>	 probably some more actionables that can be added
[17:17:39] <multichill>	 Monitoring should always beat users to reporting problems. Not sure what to do to catch a case like this one.
[17:18:21] <Krenair>	 multichill, well the rate of the error logs shouldn't raised metaphoric eyebrows
[17:18:24] <Krenair>	 should've*
[17:18:38] <Krenair>	 they were being spammed with redis connection timeouts
[17:18:43] <Krenair>	 so clearly redis is missing some monitoring
[17:18:46] <multichill>	 Graph that and put a threshold on it? :P
[17:18:58] <Krenair>	 that's roughly what I had in mind
[17:19:03] <Steinsplitter>	 the first report was here at 6?7? AM UTC.  maybe somone schuld always have service...
[17:19:27] <Krenair>	 Steinsplitter, simply noting issues here doesn't automatically alert ops
[17:19:43] <gwicke>	 multichill, Krenair: we used to have metrics, but I think they changed or were broken with the last rewrite
[17:19:45] <Steinsplitter>	 pinging the "On Ops duty:" ?
[17:19:54] <Krenair>	 they won't necessarily be awake
[17:20:14] <multichill>	 That's why I filed a bug. Did not really push because it was just performance, not total breakdown
[17:20:15] <Steinsplitter>	 there is not 24h somone here? omh.
[17:20:48] <Steinsplitter>	 wonderng how that is possible.... 6th popularst webpage schould have imho :P
[17:20:56] <multichill>	 If I would get paged by a co-worker for reduced performance in the middle of the night, he would probably get a bit of a grumpy response :P
[17:21:32] <Steinsplitter>	 :P
[17:25:51] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[17:27:38] <Krenair>	 Steinsplitter, there might be
[17:27:41] <Krenair>	 I'm not saying there isn't
[17:27:47] <Krenair>	 But I'm not aware of details
[17:29:11] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0]
[17:38:32] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors, please check!
[17:39:24] <legoktm>	 hi
[17:39:45] * Krenair waves
[17:40:17] <yuvipanda>	 hi
[17:40:19] <yuvipanda>	 that’s me on icinga
[17:44:02] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[17:46:05] <legoktm>	 looks like I missed all the fun :P
[17:46:14] <Krenair>	 yep
[17:46:19] <Krenair>	 legoktm, https://wikitech.wikimedia.org/wiki/Incident_documentation/20150503-JobQueueRedis has some basics
[17:46:24] <Krenair>	 also see pm
[17:46:25] <legoktm>	 https://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&c=Miscellaneous+eqiad&h=terbium.eqiad.wmnet&jr=&js=&v=10915270&m=Global+JobQueue+length
[17:49:03] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0]
[17:50:49] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct
[18:03:20] <legoktm>	 is anyone going to send something to the ops list?
[18:14:12] <Krenair>	 legoktm, it's not really done yet
[18:14:18] <Krenair>	 some other people were supposed to fill it out
[18:15:26] <Krenair>	 and someone who knows more than me about what happened needs to review it
[18:52:24] <grrrit-wm>	 (03PS3) 10Yuvipanda: [WIP]mesos: Add simple mesos module [puppet] - 10https://gerrit.wikimedia.org/r/208483 
[18:54:17] <grrrit-wm>	 (03PS4) 10Yuvipanda: [WIP]mesos: Add simple mesos module [puppet] - 10https://gerrit.wikimedia.org/r/208483 
[18:54:54] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] [WIP]mesos: Add simple mesos module [puppet] - 10https://gerrit.wikimedia.org/r/208483 (owner: 10Yuvipanda)
[19:23:36] <wikibugs>	 6operations, 10Wikimedia-Site-requests: refreshLinks.php --dfn-only cron jobs do not seem to be running - https://phabricator.wikimedia.org/T97926#1255734 (10Reedy) https://github.com/wikimedia/operations-puppet/blob/98d7853ce884edd5478d31a72369255b8c7b4f6f/manifests/misc/maintenance.pp#L12  s1 doesn't have a...
[19:28:41] <yuvipanda>	  !log  chown www-data: /var/log/mediawiki/refreshLinks/s3@3.log and s2@2.log for Reedy 
[19:28:50] <yuvipanda>	 !log  chown www-data: /var/log/mediawiki/refreshLinks/s3@3.log and s2@2.log for Reedy
[19:28:56] <morebots>	 Logged the message, Master
[19:29:19] <wikibugs>	 6operations, 10Wikimedia-Site-requests: refreshLinks.php --dfn-only cron jobs do not seem to be running - https://phabricator.wikimedia.org/T97926#1255743 (10Reedy) And yuvi just fixed the permissions on the s2 and s3 logs so hopefully they should start getting written to again
[20:00:15] <grrrit-wm>	 (03PS5) 10Yuvipanda: [WIP]mesos: Add simple mesos module [puppet] - 10https://gerrit.wikimedia.org/r/208483 
[20:00:54] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] [WIP]mesos: Add simple mesos module [puppet] - 10https://gerrit.wikimedia.org/r/208483 (owner: 10Yuvipanda)
[20:23:44] <grrrit-wm>	 (03PS1) 10Matanya: access: Remove Erik Moeller's Production Shell Access [puppet] - 10https://gerrit.wikimedia.org/r/208566 
[20:24:24] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] access: Remove Erik Moeller's Production Shell Access [puppet] - 10https://gerrit.wikimedia.org/r/208566 (owner: 10Matanya)
[20:25:43] <matanya>	 yuvipanda: you broke lint
[20:25:52] <yuvipanda>	 jenkins broken lint :P
[20:25:56] <yuvipanda>	 matanya: I haven’t merged anything!
[20:25:57] <yuvipanda>	 so not me
[20:26:01] <matanya>	 that too
[20:26:13] <yuvipanda>	 I think it’s broken only on some slaves
[20:26:16] <matanya>	 https://integration.wikimedia.org/ci/job/operations-puppet-typos/32235/console
[20:27:04] <matanya>	 I suspect the repor refuses to see Erik go, Eloquence you see ? 
[20:28:38] <Krenair>	 matanya, do we know whether his access is as staff or not?
[20:29:19] <matanya>	 Krenair: i don't know, but suspect it is not
[20:29:31] <Krenair>	 So why are you uploading a commit for it then?
[20:29:44] <yuvipanda>	 I think he had access before he was staf
[20:29:46] <yuvipanda>	 *staff
[20:29:49] <yuvipanda>	 :)
[20:29:59] <Krenair>	 If that's the case then he can't be removed for offboarding...
[20:30:00] <yuvipanda>	 I think he might have had access before there was staff?
[20:30:33] <matanya>	 Krenair: just proposing a patch, ops can take it or leave it
[20:30:40] <Reedy>	 Both will go a long way back
[20:31:56] <matanya>	 Reedy: are you flying around ?
[20:32:05] <Reedy>	 Not currently :P
[20:32:17] <Krenair>	 How's the weather been more recently?
[20:32:45] <Krenair>	 Does CirrusSearch not work on wikitech? hmm
[20:33:09] <Reedy>	 No
[20:33:13] <Reedy>	 (re cirrus)
[20:33:20] <Reedy>	 erik has a pretty low uid
[20:33:55] <Reedy>	 Tim is 501, erik is 503
[20:35:04] <Reedy>	 brion 500
[20:35:40] <matanya>	 The Elders of the Internet
[20:36:13] <matanya>	 (great episode by the way)
[20:36:13] <Reedy>	 mark has 531
[20:36:14] <Reedy>	 haha
[20:37:07] <yuvipanda>	 500 is first!
[20:37:10] <yuvipanda>	 Reedy: who was 502?
[20:37:19] <yuvipanda>	 anything under 500 was system reserved, IIRC
[20:37:19] <matanya>	 yuvipanda: 0 is first ...
[20:37:24] <Reedy>	 https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/data/data.yaml
[20:37:26] <yuvipanda>	 matanya: not users
[20:37:33] <Reedy>	 no one apparently
[20:37:33] <Fiona>	 Depends what you consider a user.
[20:37:36] <Reedy>	 jeluf or someone possibly
[20:37:45] <Fiona>	 My guess was Jens.
[20:38:10] <Fiona>	 But people had shell access before that.
[20:38:21] <Reedy>	 how dare they.
[20:38:24] <Fiona>	 Like Jimmy and James Douglas and such, I think.
[20:38:37] <Fiona>	 The current list is from maybe 2005 on?
[20:38:42] <Fiona>	 2004?
[20:39:50] <matanya>	 late 2004 i guess
[20:40:43] <Fiona>	 Re: Erik: I'd say it's his decision whether he keeps access. Dunno if he's said anything or thought about it.
[20:41:33] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 04-1] "This should not be merged unless we can identify Erik's access as a staff right, or he agrees to it. I would -2 but I don't have the right" [puppet] - 10https://gerrit.wikimedia.org/r/208566 (owner: 10Matanya)
[20:41:57] <ori>	 he's going to want it removed
[20:42:05] <matanya>	 it = ?
[20:42:10] <ori>	 access
[20:42:11] <ori>	 erik
[20:42:11] <Krenair>	 shell access, matanya 
[20:42:38] <matanya>	 Krenair: i think you deseve having the right to give -3 :)
[20:42:45] <Krenair>	 haha
[20:43:26] <matanya>	 any, 1 bug, 1 commit, i think i should call it a day
[20:44:28] <Fiona>	 :-)
[20:45:14] <grrrit-wm>	 (03PS1) 10MaxSem: Remove unused group not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/208568 
[20:46:01] <Krenair>	 MaxSem, what happened to PDF QA stuff?
[20:46:21] <MaxSem>	 it was a temp server used for experiments
[20:46:33] <Krenair>	 to do with OCG?
[20:46:41] <MaxSem>	 yup
[20:46:44] <Krenair>	 Hmph.
[20:47:01] <MaxSem>	 -qa is misleading:)
[20:47:06] <Krenair>	 That software could do with testing across every single article :/
[20:47:15] <Krenair>	 I wonder if I could run it in labs actually
[20:47:57] <matanya>	 I doubt it
[20:48:03] <Krenair>	 why not?
[20:48:29] <matanya>	 i'm not sure you have the infra there
[20:48:32] <Krenair>	 I can run it from the CLI locally (after having so much fun installing dependencies)
[20:48:53] <MaxSem>	 is npm install fun?
[20:49:04] <matanya>	 "fun"
[20:49:14] <Krenair>	 I don't recall it being that simple :)
[20:49:15] <MaxSem>	 oh, and a bunch of apt-get install
[20:50:02] <MaxSem>	 anyway, about a billion times easier than the fustercluck it replaces
[20:51:39] <Krenair>	 I know very little about the old system other than that barely anybody knew how it worked and it was in pmtpa, so that's unsurprising
[20:52:28] <MaxSem>	 actually, Matt learned how to install and run it when we worked on OCG
[20:52:33] <Reedy>	 why apt-get install anything when you can build from source and hack it yourself at the same time
[20:52:56] <MaxSem>	 build half a gig of latex crap?
[20:53:44] <Krenair>	 But my inbox remembers a non-trivial number of readers complaining that some articles would simply not successfully render as PDFs via OCG
[20:54:01] <matanya>	 why use binary when you can compile your OS alone (ehm, gentoo, ehm)
[20:54:35] <MaxSem>	 Krenair, that's the WMF tradition to consider something complete once it's deployed
[20:54:43] <Krenair>	 heh :)
[20:54:56] <matanya>	 Krenair: it was moste trouble when it was launched, got better since
[21:23:59] <wikibugs>	 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: Fix ipv6 autoconf issues - https://phabricator.wikimedia.org/T94417#1255853 (10Gage) The token-based solution (Proposal 1) sounds good to me; it seems like the only barrier to adoption is making a policy decision to go with a proposal which doesn't supp...
[22:02:14] <grrrit-wm>	 (03CR) 10Legoktm: Add my script for generating meta:System_administrators#List (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/208395 (owner: 10Alex Monk)
[22:06:58] <grrrit-wm>	 (03CR) 1020after4: [C: 032] "I went through the motions of modifying wikiversions in a mock-deployment and everything seems to work as expected." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208263 (owner: 10Ori.livneh)
[22:08:59] <legoktm>	 twentyafterfour: on a sunday?
[22:10:43] <grrrit-wm>	 (03PS8) 10Alex Monk: Add my script for generating meta:System_administrators#List [puppet] - 10https://gerrit.wikimedia.org/r/208395 
[22:11:10] <icinga-wm>	 PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures
[22:11:34] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Add my script for generating meta:System_administrators#List [puppet] - 10https://gerrit.wikimedia.org/r/208395 (owner: 10Alex Monk)
[22:13:07] <grrrit-wm>	 (03PS9) 10Alex Monk: Add my script for generating meta:System_administrators#List [puppet] - 10https://gerrit.wikimedia.org/r/208395 
[22:13:45] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Add my script for generating meta:System_administrators#List [puppet] - 10https://gerrit.wikimedia.org/r/208395 (owner: 10Alex Monk)
[22:14:02] <Krenair>	 oh ffs
[22:14:28] <Krenair>	 now it's something else
[22:14:35] <Krenair>	 22:13:23 stderr: fatal: Failed to resolve 'HEAD' as a valid ref.
[22:14:35] <Krenair>	 22:13:23 Stopping at 'modules/mesos'; script returned non-zero status.
[22:21:01] <twentyafterfour>	 legoktm: on a sunday, yes. I did this testing on my laptop not on tin
[22:21:24] <twentyafterfour>	 and that won't merge until it's dependent patches are +2'd
[22:22:09] <legoktm>	 twentyafterfour: ah alright
[22:22:23] <legoktm>	 I missed that part :P
[22:22:49] <Krenair>	 "You were tipped 0.22 XPM for your commit on Project wikimedia/mediawiki. Please, log in and tell us your primecoin address to get it."
[22:22:53] <Krenair>	 they're still doing that? :/
[22:23:12] <twentyafterfour>	 this does remind me that I really hate the auto-merge behavior ... I should be able to +2 without something getting merged and potentially deployed
[22:23:34] <legoktm>	 twentyafterfour: that's what +1 is for :P
[22:23:38] <twentyafterfour>	 Krenair: I got one of those messages too
[22:23:47] <Krenair>	 I wonder how many pence I would've received by now if I'd bothered to put in an address
[22:23:48] <legoktm>	 Krenair: there's a way to opt out of that on their website
[22:24:31] <Krenair>	 "Your current balance is 0.83 XPM."
[22:25:00] <Krenair>	 0.83XPM = �0.01 apparently
[22:25:46] <twentyafterfour>	 nice
[22:25:50] <icinga-wm>	 RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures
[22:27:40] <wikibugs>	 6operations, 5Interdatacenter-IPsec: Strongswan: security association reauthentication failure - https://phabricator.wikimedia.org/T96111#1255868 (10BBlack) No further recurrence?  I see we have margin in the configuration now too, which is good.  Thoughts on final configuration (as opposed to stress-test) for...
[22:31:11] <Fiona>	 Hmm, didn't there used to be group 3?
[22:31:15] <Fiona>	 For MediaWiki deployments?
[22:33:15] <Krenair>	 Fiona, I think enwiki used to be separate from the other wikipedias
[22:33:38] <Krenair>	 https://www.mediawiki.org/wiki/MediaWiki_1.22/Roadmap - 1.22wmf5
[22:34:14] <Krenair>	 so there was a group 4
[22:34:23] <Fiona>	 You mean four groups?
[22:34:27] <Fiona>	 group 0 through 3 inclusive?
[22:34:31] <Krenair>	 except then in 1.23wmf10 we started using base 0 instead of base 1
[22:34:42] <Krenair>	 it was 1 through 4
[22:34:45] <Fiona>	 Ah, phase v. group.
[22:37:51] <Krenair>	 I think 1.20wmf1 was the first deployment from git?
[22:38:06] <Krenair>	 We switched over on the 21st of March IIRC
[22:38:18] <Krenair>	 and 1.20wmf1 was branched on the 10th of April
[22:38:40] <Krenair>	 In 1.20wmf1 it was group 0 through 4
[22:38:53] <Krenair>	 because commons got it before other non-wikipedia sites
[22:40:00] <Krenair>	 that didn't happen in 1.20wmf2
[22:42:20] <icinga-wm>	 PROBLEM - puppet last run on mw2165 is CRITICAL puppet fail
[22:42:30] <Fiona>	 All right.
[22:43:29] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1002 is CRITICAL 6.67% of data above the critical threshold [500.0]
[22:44:00] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0]
[22:54:40] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1002 is OK Less than 1.00% above the threshold [250.0]
[22:55:19] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[23:00:10] <icinga-wm>	 RECOVERY - puppet last run on mw2165 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures
[23:54:50] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1002 is CRITICAL 7.14% of data above the critical threshold [500.0]
[23:55:29] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]