[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151229T0000). [00:00:04] awight: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:14] too busy, sorry [00:07:34] RoanKattouw: ostriches: I'd be happy to deploy my own patch... is that reasonable? [00:08:51] Ah CN you fickle beast [00:09:23] I'm sure there are more fun extensions to deploy, maybe Wikidata [00:09:26] *base [00:09:29] awight: I'm assuming it's urgent-ish? [00:09:33] mmm. yes [00:09:36] Did something go all ficklish? [00:09:52] This is blocking translations for CentralNotice, and the Wikimedia 15-year anniversary is coming up in 2 weeks. [00:09:53] awight: {{approved}} [00:09:57] :D [00:10:06] * awight waves around my seal of temporary approval [00:12:04] Quick, to the scanner & 3d printer! [00:12:48] Good thing I'm doing the deploy--I realize now that I was supposed to link to a gerrit patch for the 1.27.0-wmf.9 backport [00:13:03] ... and I only have myself to annoy [00:17:27] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [00:18:39] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:23:27] YuviPanda: ok to remove redis::legacy now? [00:23:45] ori: yup just finished switching over [00:23:51] yay [00:23:53] thanks very much [00:23:56] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [00:23:58] (03PS4) 10Ori.livneh: remove redis::legacy [puppet] - 10https://gerrit.wikimedia.org/r/257548 [00:24:32] (03CR) 10Ori.livneh: [C: 032 V: 032] remove redis::legacy [puppet] - 10https://gerrit.wikimedia.org/r/257548 (owner: 10Ori.livneh) [00:24:41] ori: np [00:24:48] ori: wikibugs is dead tho :D am attempting to fix [00:25:35] It's a little hard to deploy with this WMF Board-induced adrenaline rush... [00:25:57] * YuviPanda whips awight [00:26:04] awight: just remove whatever servers disagree with you [00:26:13] It's in the bylaws [00:26:41] P.S. Never check mail while deploying [00:28:25] 6operations, 6Labs: Untangle labs/production roles from labs/instance roles - https://phabricator.wikimedia.org/T119401#1907554 (10yuvipanda) Actually, I guess the confusion is sorted out, so maybe this should be closed? [00:28:30] yay [00:28:31] wikibugs fixed [00:54:44] (03PS1) 10Ori.livneh: redis: disable transparent hugepages [puppet] - 10https://gerrit.wikimedia.org/r/261301 [00:55:02] (03CR) 10Ori.livneh: [C: 032 V: 032] redis: disable transparent hugepages [puppet] - 10https://gerrit.wikimedia.org/r/261301 (owner: 10Ori.livneh) [00:55:24] PROBLEM - puppet last run on cp2017 is CRITICAL: CRITICAL: puppet fail [01:01:19] O_o [01:01:25] How did you know I was reading the bylaws ;) [01:02:56] !log awight@tin Synchronized php-1.27.0-wmf.9/extensions/CentralNotice: Update CentralNotice: T122251 (duration: 00m 34s) [01:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:04:40] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [01:09:50] Krenair: Actually, yes. VE and Flow especially. [01:09:50] AndyRussG: I've deployed, checking now... [01:10:01] Krenair: ty [01:10:15] awight: coolio! 
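The 00:54 change above ("redis: disable transparent hugepages") appears in the log only as a commit title. For context, Redis upstream recommends disabling transparent hugepages because they can cause latency spikes during fork-based persistence. Below is a minimal puppet sketch of one common way to do this; the resource name and the use of a plain exec are illustrative assumptions, not the contents of change 261301.

    # Sketch only: one common way to disable transparent hugepages for Redis.
    # Not the actual body of https://gerrit.wikimedia.org/r/261301.
    exec { 'disable-transparent-hugepages':
        command  => 'echo never > /sys/kernel/mm/transparent_hugepage/enabled',
        unless   => 'grep -q "\[never\]" /sys/kernel/mm/transparent_hugepage/enabled',
        path     => ['/bin', '/usr/bin'],
        provider => shell,
    }

Note that an exec like this only fixes the running system; a real deployment would also need something (a sysctl/sysfs unit or boot-time script) to reapply the setting after a reboot.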
[01:10:49] PROBLEM - puppet last run on mw2211 is CRITICAL: CRITICAL: puppet fail [01:18:44] Coren, so I'm not sure the issue I'm thinking of will immediately bring Flow to your wiki [01:19:00] but the existing problem is that Parsoid does not recognise your wiki, therefore you have no VE and no PDF download [01:19:09] Ah. [01:19:15] (03PS1) 10Ori.livneh: redis: use file_line on redis.conf to enable latency monitor [puppet] - 10https://gerrit.wikimedia.org/r/261303 [01:19:20] I don't think the PDF download is an issue. [01:20:03] But you want VE [01:20:19] Coren, please see https://phabricator.wikimedia.org/T122548#1907326 [01:20:46] (03PS2) 10Ori.livneh: redis: use file_line on redis.conf to enable latency monitor [puppet] - 10https://gerrit.wikimedia.org/r/261303 [01:21:48] (03CR) 10Ori.livneh: [C: 032 V: 032] redis: use file_line on redis.conf to enable latency monitor [puppet] - 10https://gerrit.wikimedia.org/r/261303 (owner: 10Ori.livneh) [01:22:00] 6operations, 10Parsoid, 10Wikimedia-Site-Requests: please run fetch-sitematrix update - https://phabricator.wikimedia.org/T122548#1907576 (10coren) VE is desired, but as the wiki is very young and the number of non "powerusers" is going to be very low, there is no emergency to deploy outside the normal cycle... [01:22:17] (03PS1) 10Aaron Schulz: Remove redundant RunJobs code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261304 [01:22:28] RECOVERY - puppet last run on cp2017 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [01:25:38] (03PS1) 10Ori.livneh: Fixup for Icfa1a930df: anchor the 'match' so there is only one match [puppet] - 10https://gerrit.wikimedia.org/r/261305 [01:25:49] (03CR) 10Ori.livneh: [C: 032 V: 032] Fixup for Icfa1a930df: anchor the 'match' so there is only one match [puppet] - 10https://gerrit.wikimedia.org/r/261305 (owner: 10Ori.livneh) [01:28:35] Krenair: tl;dr: not urgently needed, can wait. 
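Changes 261303 and 261305 above are again only visible as titles: enable the Redis latency monitor with stdlib's file_line, then anchor the match so only one line can match. A hedged sketch of what such a resource typically looks like follows; the threshold value and resource title are illustrative assumptions rather than the literal patch content.

    # Sketch, not the literal body of 261303/261305: turn on the Redis latency
    # monitor via stdlib's file_line. Anchoring the match at the start of the
    # line keeps it from also matching commented-out or similarly named settings.
    file_line { 'redis-latency-monitor-threshold':
        path  => '/etc/redis/redis.conf',
        line  => 'latency-monitor-threshold 100',
        match => '^latency-monitor-threshold',
    }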
[01:28:48] ok [01:33:37] (03PS1) 10Aaron Schulz: Enable persistent redis connections for job runners [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261306 [01:38:05] (03PS1) 10Ori.livneh: redis: enable latency monitor only on jessies [puppet] - 10https://gerrit.wikimedia.org/r/261309 [01:38:26] (03CR) 10Ori.livneh: [C: 032 V: 032] redis: enable latency monitor only on jessies [puppet] - 10https://gerrit.wikimedia.org/r/261309 (owner: 10Ori.livneh) [01:39:55] RECOVERY - puppet last run on mw2211 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:35:12] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 15m 38s) [02:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:41:49] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Dec 29 02:41:49 UTC 2015 (duration 6m 38s) [02:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:02:51] (03PS2) 10Glaisher: Enable global AbuseFilter at French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257868 (https://phabricator.wikimedia.org/T120568) [06:02:53] (03PS1) 10Glaisher: Set $wgMathFullRestbaseURL so that MathML works even if VE is disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261321 (https://phabricator.wikimedia.org/T122401) [06:03:05] wat [06:05:13] (03PS2) 10Glaisher: Set $wgMathFullRestbaseURL so that MathML works even if VE is disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261321 (https://phabricator.wikimedia.org/T122401) [06:22:26] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:24:07] (03CR) 10Physikerwelt: "I understand the production part:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261321 (https://phabricator.wikimedia.org/T122401) (owner: 10Glaisher) [06:29:57] PROBLEM - puppet last run on mw1086 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:16] (03CR) 10Glaisher: "Yes, at least most modern browsers should be able to handle it, I think. 
Links by RL in CSS/JS files (which has a greater impact) are also" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261321 (https://phabricator.wikimedia.org/T122401) (owner: 10Glaisher) [06:30:38] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: puppet fail [06:31:17] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:36] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:37] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:46] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:34] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:44] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:44] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:55] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:04] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures [06:48:25] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:50:56] PROBLEM - salt-minion processes on technetium is CRITICAL: PROCS CRITICAL: 7 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [06:55:55] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:56:14] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:56:25] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:56:25] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:56:25] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:44] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:57:35] RECOVERY - puppet last run on mw1086 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:35] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:57:54] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:36] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:59:04] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:45] (03CR) 10Physikerwelt: [C: 031] "OK. Thank you." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/261321 (https://phabricator.wikimedia.org/T122401) (owner: 10Glaisher) [07:57:27] PROBLEM - NTP on technetium is CRITICAL: NTP CRITICAL: No response from NTP server [08:59:28] (03CR) 10ArielGlenn: [C: 032 V: 032] jessie 2014.7.5 patch for batch cli returns with broken dict [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260371 (owner: 10ArielGlenn) [09:01:35] (03CR) 10ArielGlenn: [C: 032 V: 032] jessie 2014.7.5 continue reading events even after getting one with wrong tag [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260372 (owner: 10ArielGlenn) [09:08:31] (03CR) 10ArielGlenn: [C: 032 V: 032] make ping_on_rotate work without minion data cache [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260373 (owner: 10ArielGlenn) [09:10:35] (03CR) 10ArielGlenn: [C: 032 V: 032] 2014.7.5 jessie, backport patches for singleton SAuth class [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260374 (owner: 10ArielGlenn) [09:11:33] (03CR) 10ArielGlenn: [C: 032 V: 032] bump version number for wmf build, 2014.7.5+ds-1+wm1 [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260375 (owner: 10ArielGlenn) [09:23:01] (03PS1) 10ArielGlenn: increase zmq queue backlog length for salt cli, based on user config [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/261329 [09:27:24] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1907827 (10ArielGlenn) https://gerrit.wikimedia.org/r/261329 this fixes the above: the salt command line client was throwing away events in the ZMQ backlog because too many came in at once. W... [09:31:38] (03PS1) 10Jcrespo: Setting weight values for s6 to original production values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261331 (https://phabricator.wikimedia.org/T105879) [09:37:40] !log changing the mysql master of db2028, from db1030 to db1050 [09:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:47:37] (03PS1) 10ArielGlenn: bump version number for wmf build, 2014.7.5+ds-1+wm2 [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/261334 [09:49:38] (03CR) 10ArielGlenn: [C: 032 V: 032] increase zmq queue backlog length for salt cli, based on user config [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/261329 (owner: 10ArielGlenn) [09:50:42] (03CR) 10ArielGlenn: [C: 032 V: 032] bump version number for wmf build, 2014.7.5+ds-1+wm2 [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/261334 (owner: 10ArielGlenn) [09:51:49] (03CR) 10Jcrespo: [C: 032] Setting weight values for s6 to original production values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261331 (https://phabricator.wikimedia.org/T105879) (owner: 10Jcrespo) [09:51:50] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1907836 (10ArielGlenn) Jessie package tested on neodymium and works as advertised. 
https://gerrit.wikimedia.org/r/261334 [09:51:57] PROBLEM - Host mw2031 is DOWN: PING CRITICAL - Packet loss = 100% [09:52:19] RECOVERY - Host mw2031 is UP: PING OK - Packet loss = 0%, RTA = 38.05 ms [09:52:52] ^looks like network [09:54:39] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Setting weight values for s6 to original production values (duration: 00m 35s) [09:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:59:22] (03CR) 10ArielGlenn: [C: 032 V: 032] trusty 2014.7.5 patch for batch cli returns with broken dict [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/260560 (owner: 10ArielGlenn) [10:16:07] (03CR) 10ArielGlenn: [C: 032 V: 032] trusty 2014.7.5 continue reading events even after getting one with wrong tag [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/260561 (owner: 10ArielGlenn) [10:33:08] (03CR) 10ArielGlenn: [C: 032 V: 032] make ping_on_rotate work without minion data cache [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/260562 (owner: 10ArielGlenn) [10:33:55] (03CR) 10ArielGlenn: [C: 032 V: 032] 2014.7.5 trusty, backport patches for singleton SAuth class [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/260563 (owner: 10ArielGlenn) [10:34:15] (03CR) 10ArielGlenn: [C: 032 V: 032] bump version number for wmf build, 2014.7.5+ds-1ubuntu1+wm1 [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/260564 (owner: 10ArielGlenn) [10:38:49] (03PS1) 10ArielGlenn: increase zmq queue backlog length for salt cli, based on user config [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/261343 [10:38:51] (03PS1) 10ArielGlenn: bump version number for wmf build, 2014.7.5+ds-1+wm2 [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/261344 [10:43:09] 6operations, 7HTTPS: ssl certificate replacement: dumps.wikimedia.org (expires 2016-02-26) - https://phabricator.wikimedia.org/T122321#1907905 (10ArielGlenn) This can go whenever you like. People downloading can restart or retry from where they were interrupted. 
[10:50:11] (03CR) 10ArielGlenn: [C: 031] dataset: move roles to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260940 (owner: 10Dzahn) [10:52:29] (03CR) 10ArielGlenn: [C: 032 V: 032] precise 2014.7.5 patch for batch cli returns with broken dict [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/260568 (owner: 10ArielGlenn) [10:53:13] (03CR) 10ArielGlenn: [C: 032 V: 032] precise 2014.7.5 continue reading events even after getting one with wrong tag [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/260569 (owner: 10ArielGlenn) [10:55:58] 6operations, 10DBA, 5Patch-For-Review: db1022 duplicate key errors - https://phabricator.wikimedia.org/T105879#1907916 (10jcrespo) 5Open>3Resolved [10:59:13] (03CR) 10ArielGlenn: [C: 032 V: 032] make ping_on_rotate work without minion data cache [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/260570 (owner: 10ArielGlenn) [10:59:58] (03CR) 10ArielGlenn: [C: 032 V: 032] 2014.7.5 precise, backport patches for singleton SAuth class [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/260571 (owner: 10ArielGlenn) [11:00:17] (03CR) 10ArielGlenn: [C: 032 V: 032] bump version number for wmf build, 2014.7.5+ds-1precise1+wm1 [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/260572 (owner: 10ArielGlenn) [11:04:19] (03PS1) 10ArielGlenn: increase zmq queue backlog length for salt cli, based on user config [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/261346 [11:04:20] (03PS1) 10ArielGlenn: bump version number for wmf build, 2014.7.5+ds-1+wm2 [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/261347 [11:28:01] (03PS1) 10ArielGlenn: restart of salt minion should not kill all subprocesses [puppet] - 10https://gerrit.wikimedia.org/r/261349 [11:39:49] (03PS2) 10ArielGlenn: restart of salt minion should not kill all subprocesses [puppet] - 10https://gerrit.wikimedia.org/r/261349 [11:40:59] (03CR) 10ArielGlenn: [C: 032] restart of salt minion should not kill all subprocesses [puppet] - 10https://gerrit.wikimedia.org/r/261349 (owner: 10ArielGlenn) [11:43:46] (03PS1) 10ArielGlenn: fix typo in contents of salt minion systemd conf file [puppet] - 10https://gerrit.wikimedia.org/r/261352 [11:45:08] (03CR) 10ArielGlenn: [C: 032] fix typo in contents of salt minion systemd conf file [puppet] - 10https://gerrit.wikimedia.org/r/261352 (owner: 10ArielGlenn) [11:51:09] it's so quiet in here except for my spam in the channel [11:51:32] better not jinx it I guess or we'll have nagios spam, ewww [11:59:30] <_joe_> apergos: I'm here, just lost in netlink niceties :) [11:59:54] eww sorry to hear it [12:01:24] (03CR) 10ArielGlenn: [C: 032 V: 032] increase zmq queue backlog length for salt cli, based on user config [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/261343 (owner: 10ArielGlenn) [12:04:37] (03CR) 10ArielGlenn: [C: 032 V: 032] bump version number for wmf build, 2014.7.5+ds-1+wm2 [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/261344 (owner: 10ArielGlenn) [12:04:38] (03CR) 10ArielGlenn: [C: 032 V: 032] increase zmq queue backlog length for salt cli, based on user config [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/261346 (owner: 10ArielGlenn) [12:04:39] (03CR) 10ArielGlenn: [C: 032 V: 032] bump version number for wmf build, 2014.7.5+ds-1+wm2 [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/261347 (owner: 10ArielGlenn) [12:09:27] (I am just running random queries in non-production hosts) [12:10:18] go dbs go! 
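The pair of changes at 11:28-11:45 ("restart of salt minion should not kill all subprocesses" plus a typo fix in the systemd conf file) also appear only as titles. With systemd, the usual way to get that behaviour is a drop-in setting KillMode=process; the sketch below assumes that is the mechanism, and the drop-in path is illustrative rather than taken from the actual patch.

    # Assumption: "should not kill all subprocesses" is done via a systemd
    # drop-in. KillMode=process makes systemd stop only the main salt-minion
    # process on restart, leaving any jobs it forked running to completion.
    file { '/etc/systemd/system/salt-minion.service.d':
        ensure => directory,
    }
    file { '/etc/systemd/system/salt-minion.service.d/killmode.conf':
        ensure  => file,
        content => "[Service]\nKillMode=process\n",
        require => File['/etc/systemd/system/salt-minion.service.d'],
    }

A systemctl daemon-reload (typically notified from the file resource) is needed before the new setting takes effect.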
[12:22:20] I am just trying to advance some work for next year's goal (and required anyway) [12:22:57] (03PS3) 10Reedy: Set $wgMathFullRestbaseURL so that MathML works even if VE is disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261321 (https://phabricator.wikimedia.org/T122401) (owner: 10Glaisher) [12:23:02] (03CR) 10Reedy: [C: 032] Set $wgMathFullRestbaseURL so that MathML works even if VE is disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261321 (https://phabricator.wikimedia.org/T122401) (owner: 10Glaisher) [12:23:32] (03Merged) 10jenkins-bot: Set $wgMathFullRestbaseURL so that MathML works even if VE is disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261321 (https://phabricator.wikimedia.org/T122401) (owner: 10Glaisher) [12:24:30] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Attempt to fix math related fatal (duration: 00m 33s) [12:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:30:16] PROBLEM - puppet last run on technetium is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:41:18] 6operations, 6Discovery, 10Maps: Tilerator Error: permission denied for relation planet_osm_polygon - https://phabricator.wikimedia.org/T122270#1908064 (10akosiaris) I 've resynced the databases back then and had to shutdown the pgsql services for that to happen. The problem started by the osm2pgsql process... [12:41:26] 6operations, 6Discovery, 10Maps: Tilerator Error: permission denied for relation planet_osm_polygon - https://phabricator.wikimedia.org/T122270#1908065 (10akosiaris) 5Open>3Resolved a:3akosiaris [12:50:44] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Puppet has 1 failures [12:50:54] PROBLEM - puppet last run on mw2010 is CRITICAL: CRITICAL: Puppet has 1 failures [12:51:54] PROBLEM - puppet last run on mw2004 is CRITICAL: CRITICAL: Puppet has 1 failures [13:08:29] !log labcontrol*, neodymium and palladium updated to latest salt packages (wm2), rest of prod to follow [13:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:05] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:17:06] RECOVERY - puppet last run on mw2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:17:45] RECOVERY - puppet last run on mw2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:21:07] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:23:15] RECOVERY - DPKG on labmon1001 is OK: All packages OK [13:29:19] !log salt wm2 packages now installed on all production hosts except for: mw1041.eqiad.wmnet, technetium.eqiad.wmnet, mw1228.eqiad.wmnet, ms-be1011.eqiad.wmnet [13:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:40:24] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1908117 (10ArielGlenn) The new wm2 packages are now installed on all production hosts except for: mw1041.eqiad.wmnet, technetium.eqiad.wmnet, mw1228.eqiad.wmnet, ms-be1011.eqiad.wmnet. Status... 
[13:54:11] PROBLEM - salt-minion processes on mira is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:54:15] (03PS1) 10Cmjohnson: Addin production DNS for pc1004-6 bug: task# T121888 [dns] - 10https://gerrit.wikimedia.org/r/261358 [13:55:57] (03CR) 10Cmjohnson: [C: 032] Addin production DNS for pc1004-6 bug: task# T121888 [dns] - 10https://gerrit.wikimedia.org/r/261358 (owner: 10Cmjohnson) [14:12:10] PROBLEM - salt-minion processes on tin is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:19:39] RECOVERY - salt-minion processes on mira is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:37:41] RECOVERY - salt-minion processes on tin is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:04:36] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1908140 (10ArielGlenn) Labs salt update Because (as usual) a pile of instances have issues, I'm doing the old standby of the ssh loop, which will update salt only on hosts which have the labc... [15:06:07] !log labs salt instances salt update in progress. It's slow and tedious and automated. A few hundred instances already done, the rest are going one at a time. Only instances that use the labcontrol salt master will be affected. [15:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:07:09] PROBLEM - dhclient process on technetium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:07:36] hm I should log that in the labs log woops [15:14:45] does any ops person perhaps know where the svn repo's are that were not imported to git ? [15:17:09] phabricator, I think [15:17:12] https://phabricator.wikimedia.org/diffusion/ [15:17:55] svn.wikimedia.org nowadays redirects there, so I think that's the canonical location [15:17:59] hmm. [15:18:19] well then i guess the repo was never there, or it wasn't imported... [15:18:42] I wouldn't know -- the release engineering team typically handles all that [15:18:49] or Reedy may know more perhaps? [15:20:10] i'll just keep poking.. or maybe i should just ask mdale, since it's probably his code anyways [15:37:21] (03PS1) 10Bartosz Dziewoński: Disable cross-wiki upload A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261371 (https://phabricator.wikimedia.org/T120867) [15:38:29] (03PS2) 10Bartosz Dziewoński: Disable cross-wiki upload A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261371 (https://phabricator.wikimedia.org/T120867) [16:00:05] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151229T1600). [16:00:05] MatmaRex: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:14] hiho. anyone deploying? [16:00:33] MatmaRex: I can SWAT. 
[16:01:10] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261371 (https://phabricator.wikimedia.org/T120867) (owner: 10Bartosz Dziewoński) [16:01:35] (03Merged) 10jenkins-bot: Disable cross-wiki upload A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261371 (https://phabricator.wikimedia.org/T120867) (owner: 10Bartosz Dziewoński) [16:04:42] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Disable cross-wiki upload A/B test [[gerrit:261371]] (duration: 00m 31s) [16:04:45] ^ MatmaRex check please [16:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:03] on it [16:07:46] thcipriani: all fine :) [16:07:58] MatmaRex: cool, thanks for checking :) [16:45:15] (03PS3) 10EBernhardson: Turn off A/B test for search lang detect via accept-language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258087 (https://phabricator.wikimedia.org/T119529) [16:45:29] (03CR) 10EBernhardson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258087 (https://phabricator.wikimedia.org/T119529) (owner: 10EBernhardson) [16:45:59] (03Merged) 10jenkins-bot: Turn off A/B test for search lang detect via accept-language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258087 (https://phabricator.wikimedia.org/T119529) (owner: 10EBernhardson) [16:47:00] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Turn off AB test for search lang detect via accept-language (duration: 00m 29s) [16:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:51:41] (03PS1) 10Giuseppe Lavagetto: [WiP] add native ipvs manager [debs/pybal] - 10https://gerrit.wikimedia.org/r/261375 [16:52:52] (03CR) 10jenkins-bot: [V: 04-1] [WiP] add native ipvs manager [debs/pybal] - 10https://gerrit.wikimedia.org/r/261375 (owner: 10Giuseppe Lavagetto) [17:00:05] _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151229T1700). [17:00:23] <_joe_> uhm, seems blank [17:00:52] _joe_: sadly I haven't taught jouncebot to ignore empty time slots yet [17:01:17] <_joe_> bd808: it's allright [17:01:36] it would probably need even more metadata in the wiki page which is kind of yuck [17:02:27] using the html generated from wikitext as a structured data store is an "interesting" feature of jouncebot [18:17:31] PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: Puppet has 1 failures [18:41:32] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:51:30] PROBLEM - RAID on technetium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:03:29] PROBLEM - DPKG on technetium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:06:39] PROBLEM - Check size of conntrack table on technetium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:10:00] godog, parvoid: I was told you may know something about CI network problem in https://phabricator.wikimedia.org/T122594? [19:11:28] I wonder if that instance/job is still trying to use the http proxy? [19:16:31] bd808, SMalyshev, I can wget those files just fine on a CI slave. So if it is using the proxy, it need not. 
[19:17:21] *nod* there was some of that ripped out last weekend from when we had some ci slaves inside the prod network [19:17:41] this may be something else that has been missed so far [19:18:02] yeah, seems likely [19:19:52] ok. these files don't change often but the build needs them [19:27:50] bd808: so what was the change that happened recently? [19:28:38] SMalyshev: https://phabricator.wikimedia.org/T122368 [19:28:53] bd808: ah, I can't see it [19:29:33] TL;DR webproxy.eqiad.wmnet:8080 is not useable as a HTTP proxy from labs any longer [19:31:04] yep and a lot of nova cloud or something repos get whined about now from labs [19:31:36] bd808: so that may mess things up, definitely. So, if my CI build needs to fetch stuff from outside is there a way to do it now? [19:31:56] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/ci.pp#L172-L177 [19:32:12] SMalyshev: from labs external http is wide open [19:32:23] so we just need to yank out the old proxy config [19:32:41] bd808: CI runs on labs, right? [19:33:23] so we just need to drop contint::maven_webproxy? [19:33:25] SMalyshev: yes. we used to have some Jenkins slaves that were inside the prod network and needed the proxy but now we don't [19:33:31] I think so, yes [19:34:03] looks like it sets up .m2/settings.xml that will need to be cleaned up [19:34:37] 6operations, 10Continuous-Integration-Infrastructure: Test mwext-qunit-composer database disk image is malformed - https://phabricator.wikimedia.org/T122599#1908567 (10Paladox) 3NEW [19:36:07] bd808: that settings file only contains the proxy. So I think it may be just dropped [19:36:47] *nod* sounds right [19:37:18] SMalyshev: should I make some puppet patches or do you have time and energy to? [19:38:25] (03PS1) 10Smalyshev: Remove maven webproxy since it is not needed anymore after config changes [puppet] - 10https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122594) [19:38:37] bd808: I've just made https://gerrit.wikimedia.org/r/261476 [19:39:45] wait, maybe I deleted wrong one - which one does not need proxy, production, labs or both? [19:40:28] labs does not need it [19:40:54] it should stay on master I guess [19:41:04] (although we run no jobs there now) [19:41:38] (03PS2) 10Smalyshev: Remove maven webproxy since it is not needed anymore after config changes [puppet] - 10https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122594) [19:42:05] bd808: ok, so the patch should be ok as in https://gerrit.wikimedia.org/r/#/c/261476/2/manifests/role/ci.pp [19:42:42] SMalyshev: can you add in an ensure=>absent for /var/lib/jenkins-slave/.m2/settings.xml too? [19:43:09] I think I can cherry-pick to the project to test that and see if it fixes your builds [19:44:37] bd808: that should probably be /mnt/home/jenkins-deploy/.m2/settings.xml ? [19:44:59] yeah, you are correct [19:46:50] (03PS3) 10Smalyshev: Remove maven webproxy since it is not needed anymore after config changes [puppet] - 10https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122594) [19:46:57] ok, updated [19:48:14] (03PS4) 10Smalyshev: Remove maven webproxy since it is not needed anymore after config changes [puppet] - 10https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122594) [19:48:56] bd808: could we test-run it to see if it fixes the problem? [19:49:12] SMalyshev: yeah. working on that bit [19:49:20] bd808: thanks! 
[19:52:35] (03CR) 10BryanDavis: "Cherry-picked to integration-puppetmaster for testing" [puppet] - 10https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122594) (owner: 10Smalyshev) [19:54:55] SMalyshev: forcing puppet runs now. It will take a little bit [19:55:16] * bd808 goes to grab a sandwich while he waits [19:55:58] bd808: thanks! [19:58:15] (03PS1) 10Andrew Bogott: Labs ldap: size_limit by quite a bit. [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) [20:02:52] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [20:06:40] (03PS2) 10Andrew Bogott: Labs ldap: size_limit by quite a bit. [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) [20:08:13] (03CR) 10jenkins-bot: [V: 04-1] Remove maven webproxy since it is not needed anymore after config changes [puppet] - 10https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122594) (owner: 10Smalyshev) [20:09:42] (03PS3) 10Andrew Bogott: Labs ldap: increase size_limit by quite a bit. [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) [20:15:09] (03CR) 10Muehlenhoff: [C: 04-1] "This is needed for OSM which is using a dedicated user, right?" [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) (owner: 10Andrew Bogott) [20:15:33] 6operations, 10Fundraising-Backlog, 10Traffic, 10Unplanned-Sprint-Work, 3Fundraising Sprint Zapp: Firefox SPDY-coalesces requests to geoiplookup over text-lb, causing GeoIP IPv6 failures - https://phabricator.wikimedia.org/T121922#1908701 (10DStrine) [20:16:14] (03CR) 10Andrew Bogott: "Yes, dedicated user. But this patch will also fix https://phabricator.wikimedia.org/T122595 which is (currently) a tool used by lots of " [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) (owner: 10Andrew Bogott) [20:17:56] (03PS1) 10Alexandros Kosiaris: add akosiaris yubikey [puppet] - 10https://gerrit.wikimedia.org/r/261571 [20:18:31] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] add akosiaris yubikey [puppet] - 10https://gerrit.wikimedia.org/r/261571 (owner: 10Alexandros Kosiaris) [20:18:51] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [20:19:23] (03CR) 10ArielGlenn: [C: 031] Labs ldap: increase size_limit by quite a bit. [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) (owner: 10Andrew Bogott) [20:20:21] (03CR) 10Muehlenhoff: "I'd rather rather fix ldaplist to properly use paged requests, I can have a look after the allhands." [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) (owner: 10Andrew Bogott) [20:25:59] SMalyshev: my salt fu wasn't quite right. trying to force puppet to run again... [20:26:36] * apergos peeks in [20:27:07] apergos: nothing to see here :) I was trying to force puppet runs inside the intergration labs project and caught some instances that seem to not play well with the salt master (ols precise slaves) [20:27:11] *old [20:27:16] oh in ci [20:27:21] yea [20:27:28] I have no where near got to those yet [20:30:22] new packages are available [20:34:45] SMalyshev: it worked! https://integration.wikimedia.org/ci/job/wikidata-query-rdf/782/console [20:34:50] bd808: excellent! 
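The size_limit change under review here never shows its body in the log; the intent is to stop the dedicated novaadmin account that OpenStackManager binds with from being capped by the server-wide result limit. In slapd.conf terms that kind of exemption is normally a per-DN limits directive, roughly like the sketch below; the DN, the values, and the file_line wrapping are illustrative guesses, not the contents of change 261535.

    # Illustration only: give one privileged bind DN unlimited search results
    # instead of raising the global sizelimit for every client.
    file_line { 'slapd-novaadmin-limits':
        path  => '/etc/ldap/slapd.conf',
        line  => 'limits dn.exact="uid=novaadmin,ou=people,dc=wikimedia,dc=org" size=unlimited time=unlimited',
        match => '^limits dn.exact="uid=novaadmin',
    }

Muehlenhoff's alternative above (teaching ldaplist to issue paged result requests) would avoid needing any server-side exemption at all.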
[20:36:25] (03CR) 10BryanDavis: [C: 031] "Verified via cherry-pick -- https://integration.wikimedia.org/ci/job/wikidata-query-rdf/782/console" [puppet] - 10https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122594) (owner: 10Smalyshev) [20:37:02] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] [20:50:38] (03PS4) 10Andrew Bogott: Labs ldap: Repeal size_limit and timeouts for novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) [20:52:00] (03PS5) 10Andrew Bogott: Labs ldap: Repeal size_limit and timeouts for novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) [20:52:29] (03PS6) 10Andrew Bogott: Labs ldap: Repeal size_limit and timeouts for novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) [20:53:17] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [20:54:09] YuviPanda: do you know what ^ is? [20:54:42] it's life [20:54:49] am looking, but probably LDAP failure [20:55:18] I’ve restarted ldap a few times recently, probably my fault [20:55:53] yeah it's LDAP failure [20:56:10] is ldap /still/ failing? [20:56:15] I should probably merge https://gerrit.wikimedia.org/r/#/c/258658/ [20:56:27] (03PS2) 10Yuvipanda: labstore: Do not re-use connections for create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/258658 [20:56:34] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Do not re-use connections for create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/258658 (owner: 10Yuvipanda) [20:56:37] oh yeah, probably :) [20:57:05] andrewbogott: it gets restarted by puppet usually [20:59:00] andrewbogott: seems ok now [20:59:00] great — want to +1 https://gerrit.wikimedia.org/r/#/c/261535/ and then I’ll break it some more? [20:59:30] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [21:00:23] (03CR) 10Yuvipanda: "The alternative is touching OSM, screw that :D" [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) (owner: 10Andrew Bogott) [21:01:44] (03PS7) 10Andrew Bogott: Labs ldap: Repeal size_limit and timeouts for novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) [21:06:11] hm, and now jenkins is stuck? [21:07:40] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:07:44] wat wat [21:08:13] it's working, just really slow at times [21:08:23] because LDAP has become a *lot* slower since the switch [21:08:36] sudo takes forever even [21:08:36] since what switch? I haven’t merged anything yet [21:08:40] ignoring tools home page right? [21:08:41] andrewbogott: oh, since the OpenLDAP switch [21:08:46] oh, yeah [21:08:47] apergos: yeah, it's just timeout [21:08:50] maybe dowtime if while you are around? 
[21:08:51] k [21:09:01] jynus: yeah, am navigating icinga now [21:09:14] jynus: although, it should recover on next call and it isn't actually 'down' [21:09:27] I'm considering increasing the timeout [21:09:30] I know :-) [21:09:32] but that's just masking the problem [21:09:47] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) (owner: 10Andrew Bogott) [21:09:50] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 972619 bytes in 5.699 second response time [21:10:00] but I can't really do anything myself about LDAP being slow, so maybe I should let it be to annoy people [21:10:07] let me file a bug about it being slow [21:10:11] maybe create a separata dynamic page with no content? [21:10:33] it's like this big omnibus check [21:10:44] it goes down if any of: NFS, LDAP, webproxy, DNS, instances die [21:12:05] I can't reproduce sudo being slow now [21:12:07] hmm [21:12:10] do not listen to me, I am the first one to admit that it is easy to suggest things, not so much to fix them :-) [21:12:25] (even in this case where there is nothing to fix) [21:12:54] jynus: there's lots of things to fix, just time :) [21:12:58] the home page needs rewriting [21:13:03] mmmm, sounds familiar [21:13:13] I'm hoping to get bd808 involved in rewriting / rethinking tools 'flow' :) [21:15:23] 6operations, 6Labs: Increase timeout for tools-home check - https://phabricator.wikimedia.org/T122615#1908866 (10yuvipanda) [21:19:37] (03CR) 10Andrew Bogott: [C: 032] Labs ldap: Repeal size_limit and timeouts for novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) (owner: 10Andrew Bogott) [21:22:31] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:23:37] oookk [21:23:44] this time it is actually dead! [21:24:03] and yet ldap lives on [21:24:04] and back [21:24:10] this might be NFS? [21:24:34] why should today be any different? [21:24:40] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 972622 bytes in 5.614 second response time [21:32:20] andrewbogott: I don't find anything errant on labstore atm [21:32:46] YuviPanda: ok. I think I’m done restarting ldap for the next while [21:32:48] maybe things will settle [21:32:48] hopefully [21:32:53] if it flaps again I'll increase timeout [21:50:00] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [21:56:10] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [24.0] [21:56:29] Is something amiss with LDAP? I'm seeing odd delays in host name resolution. [21:56:51] Ah, might be NFS [21:57:54] Coren: I restarted ldap a few times recently, but not in the last 30 minutes. [21:58:26] That looked to have been a side effect - I think NFS is crumbling under the load of an errant client atm [21:58:39] PROBLEM - puppet last run on ganeti2002 is CRITICAL: CRITICAL: puppet fail [21:59:06] Or was; it seems to be less worse right now, maybe. [22:00:17] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:00:17] ... or not. Yep, something definitely ailing with NFS right now [22:00:31] It comes and goes in bursts. [22:01:32] hm. [22:01:43] And wikitech too. 
That smells increasingly like ldap [22:01:49] ^^ andrewbogott [22:02:11] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.027 second response time [22:02:41] Any openstack extension page I try to touch on Wikitech that would hit ldap stalls. Wikitech itself is fine [22:03:24] RECOVERY - puppet last run on ganeti2002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [22:04:07] And it's back now. How odd. [22:04:07] You might get a hint about what is going on in wikitech logs, though, since I got a 500 [22:04:26] I thought maybe ganeti2002 hosted serpens... [22:04:31] but 2002 seems fine now, at least [22:05:21] https://wikitech.wikimedia.org/wiki/Special:NovaSudoer gave me a 500 a few mins ago - IIRC that's one of the "only talks to LDAP" special pages. [22:05:32] Maybe the logs will be illuminating? [22:06:35] I noticed ldap slowness doing a df on the last couple instances so that sounds about right [22:07:38] andrewbogott: so things recovered but it seems ldap related? [22:07:59] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:08:03] chasemp: They're still spotty. I think LDAP is still ill but caching papers over many things [22:08:10] chasemp: I still don’t know much. I did change an ldap config half an hour or so ago [22:08:18] where is that change? [22:08:30] https://gerrit.wikimedia.org/r/#/c/261535/ [22:08:56] huh [22:09:31] Could that cause wikitech to accidentally dos the ldap server/ [22:09:44] yeah, possibly [22:10:09] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 5.018 second response time [22:10:19] look at this weirdness [22:10:19] https://grafana.wikimedia.org/dashboard/db/openldap-labs [22:10:41] clients are failing over to serpens it seems as they have issues w/ seaborgium iiuc how that failover works [22:11:08] Coren: unforunately, Wikitech needs those giant queries to be allowed [22:11:18] I can put a hard limit in instead of ‘unlimited’ as a test. Hang on... [22:11:35] andrewbogott: Perhaps we need to add a couple indices then. [22:11:50] !log disabling puppet on seaborgium and serpens [22:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:13:26] !log restarting slapd on seaborgium and serpens [22:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:13:41] andrewbogott: Extra data point: 'ldaplist -l passwd andre' can take 4-5 seconds to return. Definitely bad. [22:13:51] Ah. Seems better now? [22:14:20] still slow for me [22:14:47] restarts are always rought, let’s give it a few [22:15:18] andrewbogott: did you revert then? [22:15:31] chasemp: no, but I added a hard limit of 10,000 by hand [22:16:21] high enough to keep wikitech happy for now but less than ‘unlimited' [22:16:52] ok, restarts are done and both ldaps look happy to me [22:17:07] Things look better on this end too. [22:17:27] It’s more likely the restarts that fixed it though, no reason to think the config change mattered [22:17:27] so what's the theory, wikitech thrashes ldap to teh point where clients start bailing on seaborgium to serpens? [22:17:36] but in general it that failover is not super graceful [22:17:56] ugh [22:18:58] andrewbogott: why go forward with https://gerrit.wikimedia.org/r/#/c/261535/7 now instead of waiting till after all hands for moritz to poke? 
[22:19:00] ok, back when chase was seeing the failovers [22:19:07] seaborgium was saying this a lot: Dec 29 22:10:36 seaborgium slapd[6341]: cmp -256, too old [22:19:25] chasemp: because look at the attached bug/ [22:19:38] also, moritz’s ‘after all hands’ comment was regarding a different bug [22:19:46] the patch as merged is what he suggested. [22:19:57] gotcha, what a mess [22:20:12] looks like the 'too old' message is if replication fails? [22:20:20] that's all I see [22:20:22] * YuviPanda still has vague bad feelings about these being on ganeti [22:20:23] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:21:42] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1909017 (10ArielGlenn) salt updated and responsive on all lab instances that don't have their own salt master, with the following exceptions: towtruck.visualeditor.eqiad.wmflabs -- no route to... [22:21:52] so yeah, each has been complaining about ‘too old’ for a while [22:22:02] andrewbogott: I'm all tz turned around here [22:22:15] it seems like teh drop off of connections from primary to secondary here https://grafana.wikimedia.org/dashboard/db/openldap-labs [22:22:19] started before you merged...? [22:22:58] Also, the restart doesn't seem to have switch many things away from serpens [22:23:43] chasemp: I did a hotfix first, it’s still possible that I triggered it [22:23:55] any idea when the hotfix was time wise? [22:24:16] let me dig a bit [22:24:35] It would be cool if this coorelated and it wasn't just more mystery [22:25:14] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:25:20] I identified the issue at 19:40 [22:25:23] so no earlier than that [22:26:03] yeah, that fits — those dips in the first graph are probably restarts? [22:26:24] sorry, I mean in the ‘open connections’ seaborgium graph [22:27:03] so it's also possible depending on what order you restarted in [22:27:12] um… looking at these graphs it looks like the total amount of traffic increased. The graphs shoot up on serpens but there’s not a corresponding dip on seaborgium is there? [22:27:18] hotfix was only on seaborgium [22:27:22] didn’t restart/touch serpens until the merge [22:27:25] huh [22:28:27] 10Ops-Access-Requests, 6operations: Create new puppet group `discovery-analytics-deploy` - https://phabricator.wikimedia.org/T122620#1909028 (10EBernhardson) 3NEW [22:29:43] I don't understand the replication relationship well enough to know if in itself would cause issues [22:30:13] RECOVERY - check_listener_ipn on thulium is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 6.013 second response time [22:30:40] 6operations, 10Parsoid, 10Wikimedia-Site-Requests: please run fetch-sitematrix update - https://phabricator.wikimedia.org/T122548#1909046 (10Dzahn) p:5Triage>3Normal [22:31:32] replication should happen once/minute [22:32:53] (03PS1) 10Andrew Bogott: openldap: Drop the novaadmin query limit from 'unlimited' [puppet] - 10https://gerrit.wikimedia.org/r/261588 [22:32:57] So… ^ is more conservative than ‘unlimited' [22:33:19] but I still don’t understand how/where wikitech would be doing queries for more than that many records. [22:33:45] It’s weird that traffic is still heavy on serpens [22:33:54] does this thing have query logs? [22:34:01] PROBLEM - SSH on technetium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:35:37] ‘this thing’? 
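On the query-log question just above: slapd's "stats" log level records every connection, operation, and search filter via syslog, which is usually enough to find the expensive queries. A hedged sketch of flipping that on, again wrapped in a file_line purely for illustration (the config path and the choice of level are assumptions, not how the production openldap module is actually managed):

    # Illustration: "loglevel stats" makes slapd log each bind, search filter
    # and result to syslog. Useful for spotting runaway queries, but noisy on
    # a busy server, so pair it with log rotation.
    file_line { 'slapd-loglevel-stats':
        path  => '/etc/ldap/slapd.conf',
        line  => 'loglevel stats',
        match => '^loglevel ',
    }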
[22:35:39] YuviPanda: it seems like it can but we don't [22:35:44] assuming ldap [22:35:51] there are OSM logs on fluorine [22:36:09] we probably should enable logging [22:36:16] I bet we'll find crazy-ass-queries hititng [22:36:18] it [22:36:22] although, OpenDJ handled them just fine [22:36:24] /a/mw-log/ldap.log [22:36:52] interesting I didn't know that existed [22:37:20] PROBLEM - configured eth on technetium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:37:32] it’s newish [22:37:33] (03PS1) 10Dzahn: toollabs: increase timeout for tools-home to 20 [puppet] - 10https://gerrit.wikimedia.org/r/261589 (https://phabricator.wikimedia.org/T122615) [22:38:06] man seaborgium is still saying "Dec 29 22:36:43 seaborgium slapd[13647]: connection_read(660): no connection!" [22:38:08] now and then [22:38:12] 10Ops-Access-Requests, 6operations: Create new puppet group `discovery-analytics-deploy` - https://phabricator.wikimedia.org/T122620#1909096 (10jcrespo) p:5Triage>3Normal [22:38:16] which I assume is failure to contact serpens [22:38:43] I kind of recall those being terminated connections (existing like for timeout) [22:38:50] but maybe I'm wrong [22:39:01] there is a whole rabbit hole of what that can mean [22:39:12] not that that makes it a positive thing [22:39:19] they seem to predate anything we’re doing today though [22:39:23] so probably ignorable for now [22:39:49] well from the log it shows it does most (all) of it's work using novaadmin account [22:39:56] but doesn't really expose the ldap query [22:40:21] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1909102 (10ArielGlenn) icinga is fixed, the remaining failure was due to the wrong key name (I saw a batch of "old-style" names without the project in them, for keys and cleaned them out but mi... [22:41:00] YuviPanda: https://gerrit.wikimedia.org/r/#/c/261589/ [22:41:11] as requested [22:41:21] RECOVERY - configured eth on technetium is OK: OK - interfaces up [22:41:34] mutante: <3 [22:43:14] andrewbogott: I think we are in ok territory connection count wise, seems around 18-19 hundred on seaborgium and 6-7 hundred on serpens [22:43:38] which is around the normal 2.5k right? [22:43:53] oh, goddamn it, the graphs have different scales [22:44:01] ah yes :) [22:44:04] so, yes, you’re right, it looks reasonable [22:44:18] I'm doing a rough aprox on server w/ [22:44:18] ss | grep ldap | awk '{print $6}' | cut -d ":" -f 1 | sort | wc [22:44:20] also [22:44:40] seems in teh ballpark but I also do not fully grok the failover/failback [22:45:00] I'm tempted to stop ldap on serpens to force a state or normalcy w/ clients on seaborgium to prove it can handle it [22:45:15] I mean a redundant ldap situation where we can't lose one ldap server isn't much good [22:45:38] yeah… I think that’s a good thing to test but let’s do it when Moritz is watching [22:46:07] agreed then if it seems stable as-is we'll live with it [22:46:14] meanwhile, I guess I’ll merge https://gerrit.wikimedia.org/r/#/c/261588 ? That sets the limit to more than it is now (10,000) but less than unlimited [22:46:15] I don't think what it is up is inherently bad [22:46:24] I just want to see the deterministic all clients go to seaborgium work [22:46:42] andrewbogott: tbh it seems prudent ot hotfix to that level first [22:46:50] since we don't know the actual sane limit [22:46:51] yeah, ok, will do [22:47:32] andrewbogott: should we attempt to implement paging in OSM instead? 
(if this is indeed the problem) [22:47:58] I also think we should either move it off to real hardware or increase the size of the ganeti instances... but I somehow guess that doesn't make the cores any more powerful [22:48:40] YuviPanda: maybe… if we can get by with 16,000 then OSM will keep working for a couple of years, which I can only hope is long enough [22:48:48] when a ganeti instance is created, part of the commandline that does it is specifying how many "vcpus" it gets [22:48:51] if that helps anything [22:48:59] But since this has to do with user creation… that’s going to stay in mediawiki potentially forever. [22:49:51] PROBLEM - configured eth on technetium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:49:51] hopefully not 'forever', but forever for at least the next 6-9months I guess [22:49:51] mutante: it might, but we need to figure out what’s limiting us first [22:49:56] mutante: yeah, but I don't know if slapd uses multiple cores at all [22:50:02] it seems to [22:50:34] but I have seen it just break neck spike once in teh last few minutes 100% across a few cpu's [22:50:42] I'm in favor of upping the logging level of slapd and doing the logrotate stuff for it [22:50:58] right now I feel blind w/ slapd and maybe that's just not knowing where to look [22:51:31] RECOVERY - configured eth on technetium is OK: OK - interfaces up [22:51:40] * YuviPanda nods furiously [22:51:56] ok, both are hotfixed to the state in https://gerrit.wikimedia.org/r/#/c/261588/ now [22:52:49] gotcha [22:52:50] we can give it a few minutes and make sure nothing freaks out [22:53:24] andrewbogott: so in watching ldap via the flourine log [22:53:34] any clue what the op that does this kind of thing is [22:53:36] 2015-12-29 22:52:43 silver labswiki ldap INFO: 2.1.0 adding tools.isbn [22:53:36] 2015-12-29 22:52:43 silver labswiki ldap INFO: 2.1.0 adding tools.wikipedia-library [22:53:45] seems to have context for like every tool in toollabs [22:53:46] and then some [22:53:51] and happens often [22:54:00] I can't for the life of me figure out what this coudl be doing so regularly [22:54:34] I don’t know offhand [22:56:13] does it look up info for each tool when people load the tools-home page? [22:56:13] (and that makes it slow, right) [22:56:32] like looking up the maintainer for the list? [22:56:56] ah mabye so and maybe that's why tools-home is such a pos [22:57:17] the http check on tools-home is slow becasue there are so many tools [22:57:30] maybe this is also the ldap lookup? guessing though [22:58:31] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1909152 (10ArielGlenn) I lie, there are about 35 instances not yet upgraded that are however happily salt responsive. I'll have to go in and deal with them by hand. [23:00:26] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:00:49] 10Ops-Access-Requests, 6operations: Create new puppet group `discovery-analytics-deploy` - https://phabricator.wikimedia.org/T122620#1909162 (10EBernhardson) can certainly handle this after the all-hands, no need to rush. [23:00:52] ^ andrewbogott? [23:01:08] wat [23:01:17] did it ever formally recover tho? 
[23:01:34] yeah it did pretty quickly I think [23:01:42] I'm looking at top [23:01:44] lots of kworker [23:01:51] and at least some ksoftirqd [23:02:33] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.036 second response time [23:02:37] hmm [23:02:44] Is it possible that there’s some kind of echo effect? Like, I restart ldap and 20 minutes later…? [23:03:13] this could also have been my 'find'. [23:03:22] I killed it as soon as the page popped up [23:03:23] I hate that puppet on ganeti2002 failed for no reason in the middle of this [23:03:31] seriously [23:03:39] but I can’t make that relate, since even if one of these is on that box it would’ve been serpens [23:03:58] we can verify by my running find again, but probably too https://xkcd.com/242/ [23:04:51] idk I'm not in love with the 16k limit now vs 10k [23:05:07] but clearly we are bumping into these limits in some query [23:06:20] http://www.openldap.org/lists/openldap-software/200707/msg00396.html [23:06:25] very clear log levels, openldap [23:06:38] how did we turn the *one* aspect of labs that wasn't fucking up constantly into this nightmare? [23:06:51] chasemp: meaning you’d prefer 10k? [23:07:19] well, what's the disadvantage I guess? (since we saw it sit fine the longest and believe it's fine for the known trama user limit bug) [23:07:39] is the rDNS failures(?) causing gridengine outage related? [23:07:42] alternative emo title, why not 10K? [23:07:44] maybe DNS is hitting LDAP heavily? [23:07:46] chasemp: anything other than ‘unlimited’ is waiting to bite us [23:07:52] right [23:07:54] so, a bigger number means a longer wait :) [23:08:41] but not before we are all back from vaca and travel [23:08:41] but, I’m ok with 10k [23:08:41] 16k is just a rounder number *shrug* [23:08:42] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:08:56] goddamit [23:08:57] we know it has to be above # of users + reasonable % [23:08:59] so it wasn't my 'find' [23:09:00] YuviPanda: find again? [23:09:02] huh [23:09:03] yeah [23:09:05] great [23:09:14] (03PS2) 10Andrew Bogott: openldap: Drop the novaadmin query limit from 'unlimited' [puppet] - 10https://gerrit.wikimedia.org/r/261588 [23:09:19] ^ 10k [23:09:19] at what point do we wake up moritz? [23:09:37] well, let's sit w/ 10k which we know solves the user id problem [23:09:39] and seemed stable [23:09:41] and go from there [23:09:55] NFS instance has a lot of ksoftirqd cpu usage going on now [23:10:00] there's also the gridengine outage [23:10:06] why do these things happen in clusters.... [23:10:07] related or unrelated? [23:10:12] (03PS3) 10Andrew Bogott: openldap: Drop the novaadmin query limit from 'unlimited' [puppet] - 10https://gerrit.wikimedia.org/r/261588 [23:10:23] not fully sure [23:10:24] can I get some +1s? [23:10:52] (the other alternative is locking down new users till post all staff or real fixes) [23:10:55] that's the true triage fix I guess [23:11:20] > http://www.openldap.org/lists/openldap-software/200707/msg00396.html [23:11:21] err [23:11:25] > 11 root rt 0 0 0 0 S 54.1 0.0 76:33.36 watchdog/0 [23:11:25] (03CR) 10Rush: [C: 031] "well we know we are handing out dupe id's and we think 10k is sane based on hotfix so let's try it" [puppet] - 10https://gerrit.wikimedia.org/r/261588 (owner: 10Andrew Bogott) [23:11:28] that's not good... [23:11:38] and points somewhat to the same bug as last time, maybe? 
[23:11:42] causing the soft lockup [23:11:49] on nfs? [23:11:51] yeah [23:12:00] triggered by ldap shenanigans maybe, or who knows [23:12:05] it could also just be issues there [23:12:13] the nfs / ldap tie-in is somewhat opaque to me honestly [23:12:24] but I know enough to know as one goes...so goes the other [23:12:33] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.245 second response time [23:12:41] could be, yeah [23:14:16] andrewbogott: back to 10k^? [23:14:28] kicker is I can't imagine that limit matters unless you are using it, which means if 10K "fixes" the outages [23:14:34] nope, it’s still at 16384 [23:14:38] it's probably at the expense of whatever query is getting hit [23:14:40] huh [23:14:58] right, exactly. I still believe/want to believe that 10k == 16k == unlimited [23:16:51] in terms of behavior [23:16:51] and that our issues are something to do with restart hiccups or something [23:16:52] but… I have no theory at all at this point, really [23:16:52] we need to enable query logging before we can say anything, IMO [23:16:52] * YuviPanda is still trying to dig into the gridengine outage [23:16:53] whelp [23:16:53] that instance is 'stuck' [23:16:53] unreachable [23:16:53] the master [23:16:57] hmm, root login worked after a long delay [23:16:59] ...redundant masters? [23:17:19] theoretically, although the auto failover process depends on.... (YOU GUESSED IT) [23:17:50] oh, nvm, ssh works, it was just super slow [23:17:50] stuck for like many many seconds [23:18:53] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:18:58] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:20:31] chasemp: andrewbogott I vote to wake up moritz [23:20:31] it's interesting: during all of this, has ldap monitoring triggered an alert at all? [23:20:32] do you want me to merge the change that raises the timeout for one of those checks? [23:20:33] mutante: yes [23:20:33] mutante: sure [23:20:38] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [60.0] [23:21:02] is moritz on vacation? [23:21:04] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 973217 bytes in 6.012 second response time [23:21:07] well, you were too, chasemp :) [23:21:14] * andrewbogott restarts slapd yet again [23:21:22] andrewbogott: 10k?
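On the "we need to enable query logging before we can say anything" point: one low-tech way to check whether a given bind DN is actually hitting a server-side size limit (as opposed to a client-requested one) is an ldapsearch that asks for no client limit and counts what comes back. The host, bind DN, base, and filter below are placeholders, not the real directory:

    # -z 0 requests no client-side size limit, so any truncation left
    # over is the server-side limit for this bind DN; ldapsearch exits
    # with the LDAP result code, so status 4 means "Size limit exceeded".
    ldapsearch -x -H ldap://ldap.example.org \
        -D "uid=novaadmin,ou=people,dc=example,dc=org" -W \
        -b "ou=servicegroups,dc=example,dc=org" \
        -z 0 '(objectClass=posixGroup)' dn | grep -c '^dn:'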
[23:21:28] yeah [23:21:48] k [23:21:58] (03PS2) 10Dzahn: toollabs: increase timeout for tools-home to 20 [puppet] - 10https://gerrit.wikimedia.org/r/261589 (https://phabricator.wikimedia.org/T122615) [23:22:00] can we apply to both and restart both and just hunker down here [23:22:07] because it's possible wikitech tries to use serpens, right [23:22:11] yeah, that’s what I did [23:22:13] w/ all the non-hotfixed things [23:22:13] ok [23:22:34] (03PS3) 10Dzahn: toollabs: increase timeout for tools-home to 20 [puppet] - 10https://gerrit.wikimedia.org/r/261589 (https://phabricator.wikimedia.org/T122615) [23:22:34] I haven’t merged the puppet change only because jenkins hasn’t verified yet [23:22:42] (03CR) 10Dzahn: [C: 032] toollabs: increase timeout for tools-home to 20 [puppet] - 10https://gerrit.wikimedia.org/r/261589 (https://phabricator.wikimedia.org/T122615) (owner: 10Dzahn) [23:23:30] andrewbogott: if you are ok w/ it [23:23:41] let's just hunker down as-is with no more restarts at 10k [23:23:41] and see what happens [23:23:56] let's change nothing at all here and see how it goes for a minute :) [23:24:01] * andrewbogott nods [23:25:52] Leeeeeeroy Jeeenkins... [23:26:48] RECOVERY - Persistent high iowait on labstore1001 is OK: OK: Less than 50.00% above the threshold [40.0] [23:27:38] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [23:28:38] (03CR) 10Dzahn: [V: 032] "jenkins is awol" [puppet] - 10https://gerrit.wikimedia.org/r/261589 (https://phabricator.wikimedia.org/T122615) (owner: 10Dzahn) [23:30:05] chasemp: btw, also high load on labstore2001, just that notifications look disabled [23:30:18] I know how to get jenkins running again, but I only know how to do it the bad way [23:30:20] but maybe it says something that it's on both [23:30:46] labstore2001 is just misnamed, should've been called labstorebackup2001 [23:31:16] it's odd it would be affected [23:31:55] YuviPanda: did you set up the nfs check for tools.wmflabs.org? [23:32:24] ? [23:32:34] the tools-checker one? [23:32:36] yeah [23:32:40] that check that is failing for nfs w/ tools.wmflabs.org as a canary [23:32:42] yeah [23:32:43] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.022 second response time [23:32:45] it hasn't come back [23:32:46] there [23:32:54] I was wondering if you would look manually at whatever it checks [23:32:58] as I wasn't convinced [23:33:04] but maybe it was just icinga being icinga [23:33:17] that one is also paging separately and won't be affected by what I merged [23:56:13] !log restarting nodepool on labnodepool1001 [23:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
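The "HTTP OK ... bytes in ... seconds" lines above come from the standard check_http monitoring plugin, which is what the tools-home timeout bump to 20 seconds is tuning. A rough approximation of such a check invocation, assuming the usual Debian plugin path and a plain "/" request path (only the hostname and the 20-second timeout come from the change itself; the rest is a guess):

    # -H names the (virtual) host to query, -u the request path, -t the
    # overall plugin timeout in seconds; tools-home returns the full
    # tool listing (~950 KB in the recovery line above), so the fetch
    # alone can take several seconds.
    /usr/lib/nagios/plugins/check_http -H tools.wmflabs.org -u / -t 20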