[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151229T0000). [00:00:04] awight: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:14] too busy, sorry [00:07:34] RoanKattouw: ostriches: I'd be happy to deploy my own patch... is that reasonable? [00:08:51] Ah CN you fickle beast [00:09:23] I'm sure there are more fun extensions to deploy, maybe Wikidata [00:09:26] *base [00:09:29] awight: I'm assuming it's urgent-ish? [00:09:33] mmm. yes [00:09:36] Did something go all ficklish? [00:09:52] This is blocking translations for CentralNotice, and the Wikimedia 15-year anniversary is coming up in 2 weeks. [00:09:53] awight: {{approved}} [00:09:57] :D [00:10:06] * awight waves around my seal of temporary approval [00:12:04] Quick, to the scanner & 3d printer! [00:12:48] Good thing I'm doing the deploy--I realize now that I was supposed to link to a gerrit patch for the 1.27.0-wmf.9 backport [00:13:03] ... and I only have myself to annoy [00:17:27] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [00:18:39] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:23:27] YuviPanda: ok to remove redis::legacy now? [00:23:45] ori: yup just finished switching over [00:23:51] yay [00:23:53] thanks very much [00:23:56] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [00:23:58] (03PS4) 10Ori.livneh: remove redis::legacy [puppet] - 10https://gerrit.wikimedia.org/r/257548 [00:24:32] (03CR) 10Ori.livneh: [C: 032 V: 032] remove redis::legacy [puppet] - 10https://gerrit.wikimedia.org/r/257548 (owner: 10Ori.livneh) [00:24:41] ori: np [00:24:48] ori: wikibugs is dead tho :D am attempting to fix [00:25:35] It's a little hard to deploy with this WMF Board-induced adrenaline rush... [00:25:57] * YuviPanda whips awight [00:26:04] awight: just remove whatever servers disagree with you [00:26:13] It's in the bylaws [00:26:41] P.S. Never check mail while deploying [00:28:25] 6operations, 6Labs: Untangle labs/production roles from labs/instance roles - https://phabricator.wikimedia.org/T119401#1907554 (10yuvipanda) Actually, I guess the confusion is sorted out, so maybe this should be closed? [00:28:30] yay [00:28:31] wikibugs fixed [00:54:44] (03PS1) 10Ori.livneh: redis: disable transparent hugepages [puppet] - 10https://gerrit.wikimedia.org/r/261301 [00:55:02] (03CR) 10Ori.livneh: [C: 032 V: 032] redis: disable transparent hugepages [puppet] - 10https://gerrit.wikimedia.org/r/261301 (owner: 10Ori.livneh) [00:55:24] PROBLEM - puppet last run on cp2017 is CRITICAL: CRITICAL: puppet fail [01:01:19] O_o [01:01:25] How did you know I was reading the bylaws ;) [01:02:56] !log awight@tin Synchronized php-1.27.0-wmf.9/extensions/CentralNotice: Update CentralNotice: T122251 (duration: 00m 34s) [01:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:04:40] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [01:09:50] Krenair: Actually, yes. VE and Flow especially. [01:09:50] AndyRussG: I've deployed, checking now... [01:10:01] Krenair: ty [01:10:15] awight: coolio! 
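The 00:54 change above ("redis: disable transparent hugepages") appears in the log only as a commit title. For context, Redis upstream recommends disabling transparent hugepages because they can cause latency spikes during fork-based persistence. Below is a minimal puppet sketch of one common way to do this; the resource name and the use of a plain exec are illustrative assumptions, not the contents of change 261301.

    # Sketch only: one common way to disable transparent hugepages for Redis.
    # Not the actual body of https://gerrit.wikimedia.org/r/261301.
    exec { 'disable-transparent-hugepages':
        command  => 'echo never > /sys/kernel/mm/transparent_hugepage/enabled',
        unless   => 'grep -q "\[never\]" /sys/kernel/mm/transparent_hugepage/enabled',
        path     => ['/bin', '/usr/bin'],
        provider => shell,
    }

Note that an exec like this only fixes the running system; a real deployment would also need something (a sysctl/sysfs unit or boot-time script) to reapply the setting after a reboot.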
[01:10:49] PROBLEM - puppet last run on mw2211 is CRITICAL: CRITICAL: puppet fail [01:18:44] Coren, so I'm not sure the issue I'm thinking of will immediately bring Flow to your wiki [01:19:00] but the existing problem is that Parsoid does not recognise your wiki, therefore you have no VE and no PDF download [01:19:09] Ah. [01:19:15] (03PS1) 10Ori.livneh: redis: use file_line on redis.conf to enable latency monitor [puppet] - 10https://gerrit.wikimedia.org/r/261303 [01:19:20] I don't think the PDF download is an issue. [01:20:03] But you want VE [01:20:19] Coren, please see https://phabricator.wikimedia.org/T122548#1907326 [01:20:46] (03PS2) 10Ori.livneh: redis: use file_line on redis.conf to enable latency monitor [puppet] - 10https://gerrit.wikimedia.org/r/261303 [01:21:48] (03CR) 10Ori.livneh: [C: 032 V: 032] redis: use file_line on redis.conf to enable latency monitor [puppet] - 10https://gerrit.wikimedia.org/r/261303 (owner: 10Ori.livneh) [01:22:00] 6operations, 10Parsoid, 10Wikimedia-Site-Requests: please run fetch-sitematrix update - https://phabricator.wikimedia.org/T122548#1907576 (10coren) VE is desired, but as the wiki is very young and the number of non "powerusers" is going to be very low, there is no emergency to deploy outside the normal cycle... [01:22:17] (03PS1) 10Aaron Schulz: Remove redundant RunJobs code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261304 [01:22:28] RECOVERY - puppet last run on cp2017 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [01:25:38] (03PS1) 10Ori.livneh: Fixup for Icfa1a930df: anchor the 'match' so there is only one match [puppet] - 10https://gerrit.wikimedia.org/r/261305 [01:25:49] (03CR) 10Ori.livneh: [C: 032 V: 032] Fixup for Icfa1a930df: anchor the 'match' so there is only one match [puppet] - 10https://gerrit.wikimedia.org/r/261305 (owner: 10Ori.livneh) [01:28:35] Krenair: tl;dr: not urgently needed, can wait. 
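Changes 261303 and 261305 above are again only visible as titles: enable the Redis latency monitor with stdlib's file_line, then anchor the match so only one line can match. A hedged sketch of what such a resource typically looks like follows; the threshold value and resource title are illustrative assumptions rather than the literal patch content.

    # Sketch, not the literal body of 261303/261305: turn on the Redis latency
    # monitor via stdlib's file_line. Anchoring the match at the start of the
    # line keeps it from also matching commented-out or similarly named settings.
    file_line { 'redis-latency-monitor-threshold':
        path  => '/etc/redis/redis.conf',
        line  => 'latency-monitor-threshold 100',
        match => '^latency-monitor-threshold',
    }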
[01:28:48] ok [01:33:37] (03PS1) 10Aaron Schulz: Enable persistent redis connections for job runners [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261306 [01:38:05] (03PS1) 10Ori.livneh: redis: enable latency monitor only on jessies [puppet] - 10https://gerrit.wikimedia.org/r/261309 [01:38:26] (03CR) 10Ori.livneh: [C: 032 V: 032] redis: enable latency monitor only on jessies [puppet] - 10https://gerrit.wikimedia.org/r/261309 (owner: 10Ori.livneh) [01:39:55] RECOVERY - puppet last run on mw2211 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:35:12] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 15m 38s) [02:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:41:49] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Dec 29 02:41:49 UTC 2015 (duration 6m 38s) [02:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:02:51] (03PS2) 10Glaisher: Enable global AbuseFilter at French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257868 (https://phabricator.wikimedia.org/T120568) [06:02:53] (03PS1) 10Glaisher: Set $wgMathFullRestbaseURL so that MathML works even if VE is disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261321 (https://phabricator.wikimedia.org/T122401) [06:03:05] wat [06:05:13] (03PS2) 10Glaisher: Set $wgMathFullRestbaseURL so that MathML works even if VE is disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261321 (https://phabricator.wikimedia.org/T122401) [06:22:26] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:24:07] (03CR) 10Physikerwelt: "I understand the production part:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261321 (https://phabricator.wikimedia.org/T122401) (owner: 10Glaisher) [06:29:57] PROBLEM - puppet last run on mw1086 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:16] (03CR) 10Glaisher: "Yes, at least most modern browsers should be able to handle it, I think. 
Links by RL in CSS/JS files (which has a greater impact) are also" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261321 (https://phabricator.wikimedia.org/T122401) (owner: 10Glaisher) [06:30:38] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: puppet fail [06:31:17] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:36] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:37] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:46] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:34] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:44] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:44] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:55] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:04] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures [06:48:25] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:50:56] PROBLEM - salt-minion processes on technetium is CRITICAL: PROCS CRITICAL: 7 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [06:55:55] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:56:14] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:56:25] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:56:25] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:56:25] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:44] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:57:35] RECOVERY - puppet last run on mw1086 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:35] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:57:54] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:36] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:59:04] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:45] (03CR) 10Physikerwelt: [C: 031] "OK. Thank you." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/261321 (https://phabricator.wikimedia.org/T122401) (owner: 10Glaisher) [07:57:27] PROBLEM - NTP on technetium is CRITICAL: NTP CRITICAL: No response from NTP server [08:59:28] (03CR) 10ArielGlenn: [C: 032 V: 032] jessie 2014.7.5 patch for batch cli returns with broken dict [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260371 (owner: 10ArielGlenn) [09:01:35] (03CR) 10ArielGlenn: [C: 032 V: 032] jessie 2014.7.5 continue reading events even after getting one with wrong tag [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260372 (owner: 10ArielGlenn) [09:08:31] (03CR) 10ArielGlenn: [C: 032 V: 032] make ping_on_rotate work without minion data cache [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260373 (owner: 10ArielGlenn) [09:10:35] (03CR) 10ArielGlenn: [C: 032 V: 032] 2014.7.5 jessie, backport patches for singleton SAuth class [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260374 (owner: 10ArielGlenn) [09:11:33] (03CR) 10ArielGlenn: [C: 032 V: 032] bump version number for wmf build, 2014.7.5+ds-1+wm1 [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260375 (owner: 10ArielGlenn) [09:23:01] (03PS1) 10ArielGlenn: increase zmq queue backlog length for salt cli, based on user config [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/261329 [09:27:24] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1907827 (10ArielGlenn) https://gerrit.wikimedia.org/r/261329 this fixes the above: the salt command line client was throwing away events in the ZMQ backlog because too many came in at once. W... [09:31:38] (03PS1) 10Jcrespo: Setting weight values for s6 to original production values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261331 (https://phabricator.wikimedia.org/T105879) [09:37:40] !log changing the mysql master of db2028, from db1030 to db1050 [09:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:47:37] (03PS1) 10ArielGlenn: bump version number for wmf build, 2014.7.5+ds-1+wm2 [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/261334 [09:49:38] (03CR) 10ArielGlenn: [C: 032 V: 032] increase zmq queue backlog length for salt cli, based on user config [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/261329 (owner: 10ArielGlenn) [09:50:42] (03CR) 10ArielGlenn: [C: 032 V: 032] bump version number for wmf build, 2014.7.5+ds-1+wm2 [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/261334 (owner: 10ArielGlenn) [09:51:49] (03CR) 10Jcrespo: [C: 032] Setting weight values for s6 to original production values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261331 (https://phabricator.wikimedia.org/T105879) (owner: 10Jcrespo) [09:51:50] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1907836 (10ArielGlenn) Jessie package tested on neodymium and works as advertised. 
https://gerrit.wikimedia.org/r/261334 [09:51:57] PROBLEM - Host mw2031 is DOWN: PING CRITICAL - Packet loss = 100% [09:52:19] RECOVERY - Host mw2031 is UP: PING OK - Packet loss = 0%, RTA = 38.05 ms [09:52:52] ^looks like network [09:54:39] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Setting weight values for s6 to original production values (duration: 00m 35s) [09:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:59:22] (03CR) 10ArielGlenn: [C: 032 V: 032] trusty 2014.7.5 patch for batch cli returns with broken dict [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/260560 (owner: 10ArielGlenn) [10:16:07] (03CR) 10ArielGlenn: [C: 032 V: 032] trusty 2014.7.5 continue reading events even after getting one with wrong tag [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/260561 (owner: 10ArielGlenn) [10:33:08] (03CR) 10ArielGlenn: [C: 032 V: 032] make ping_on_rotate work without minion data cache [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/260562 (owner: 10ArielGlenn) [10:33:55] (03CR) 10ArielGlenn: [C: 032 V: 032] 2014.7.5 trusty, backport patches for singleton SAuth class [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/260563 (owner: 10ArielGlenn) [10:34:15] (03CR) 10ArielGlenn: [C: 032 V: 032] bump version number for wmf build, 2014.7.5+ds-1ubuntu1+wm1 [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/260564 (owner: 10ArielGlenn) [10:38:49] (03PS1) 10ArielGlenn: increase zmq queue backlog length for salt cli, based on user config [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/261343 [10:38:51] (03PS1) 10ArielGlenn: bump version number for wmf build, 2014.7.5+ds-1+wm2 [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/261344 [10:43:09] 6operations, 7HTTPS: ssl certificate replacement: dumps.wikimedia.org (expires 2016-02-26) - https://phabricator.wikimedia.org/T122321#1907905 (10ArielGlenn) This can go whenever you like. People downloading can restart or retry from where they were interrupted. 
[10:50:11] (03CR) 10ArielGlenn: [C: 031] dataset: move roles to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260940 (owner: 10Dzahn) [10:52:29] (03CR) 10ArielGlenn: [C: 032 V: 032] precise 2014.7.5 patch for batch cli returns with broken dict [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/260568 (owner: 10ArielGlenn) [10:53:13] (03CR) 10ArielGlenn: [C: 032 V: 032] precise 2014.7.5 continue reading events even after getting one with wrong tag [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/260569 (owner: 10ArielGlenn) [10:55:58] 6operations, 10DBA, 5Patch-For-Review: db1022 duplicate key errors - https://phabricator.wikimedia.org/T105879#1907916 (10jcrespo) 5Open>3Resolved [10:59:13] (03CR) 10ArielGlenn: [C: 032 V: 032] make ping_on_rotate work without minion data cache [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/260570 (owner: 10ArielGlenn) [10:59:58] (03CR) 10ArielGlenn: [C: 032 V: 032] 2014.7.5 precise, backport patches for singleton SAuth class [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/260571 (owner: 10ArielGlenn) [11:00:17] (03CR) 10ArielGlenn: [C: 032 V: 032] bump version number for wmf build, 2014.7.5+ds-1precise1+wm1 [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/260572 (owner: 10ArielGlenn) [11:04:19] (03PS1) 10ArielGlenn: increase zmq queue backlog length for salt cli, based on user config [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/261346 [11:04:20] (03PS1) 10ArielGlenn: bump version number for wmf build, 2014.7.5+ds-1+wm2 [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/261347 [11:28:01] (03PS1) 10ArielGlenn: restart of salt minion should not kill all subprocesses [puppet] - 10https://gerrit.wikimedia.org/r/261349 [11:39:49] (03PS2) 10ArielGlenn: restart of salt minion should not kill all subprocesses [puppet] - 10https://gerrit.wikimedia.org/r/261349 [11:40:59] (03CR) 10ArielGlenn: [C: 032] restart of salt minion should not kill all subprocesses [puppet] - 10https://gerrit.wikimedia.org/r/261349 (owner: 10ArielGlenn) [11:43:46] (03PS1) 10ArielGlenn: fix typo in contents of salt minion systemd conf file [puppet] - 10https://gerrit.wikimedia.org/r/261352 [11:45:08] (03CR) 10ArielGlenn: [C: 032] fix typo in contents of salt minion systemd conf file [puppet] - 10https://gerrit.wikimedia.org/r/261352 (owner: 10ArielGlenn) [11:51:09] it's so quiet in here except for my spam in the channel [11:51:32] better not jinx it I guess or we'll have nagios spam, ewww [11:59:30] <_joe_> apergos: I'm here, just lost in netlink niceties :) [11:59:54] eww sorry to hear it [12:01:24] (03CR) 10ArielGlenn: [C: 032 V: 032] increase zmq queue backlog length for salt cli, based on user config [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/261343 (owner: 10ArielGlenn) [12:04:37] (03CR) 10ArielGlenn: [C: 032 V: 032] bump version number for wmf build, 2014.7.5+ds-1+wm2 [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/261344 (owner: 10ArielGlenn) [12:04:38] (03CR) 10ArielGlenn: [C: 032 V: 032] increase zmq queue backlog length for salt cli, based on user config [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/261346 (owner: 10ArielGlenn) [12:04:39] (03CR) 10ArielGlenn: [C: 032 V: 032] bump version number for wmf build, 2014.7.5+ds-1+wm2 [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/261347 (owner: 10ArielGlenn) [12:09:27] (I am just running random queries in non-production hosts) [12:10:18] go dbs go! 
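The pair of changes at 11:28-11:45 ("restart of salt minion should not kill all subprocesses" plus a typo fix in the systemd conf file) also appear only as titles. With systemd, the usual way to get that behaviour is a drop-in setting KillMode=process; the sketch below assumes that is the mechanism, and the drop-in path is illustrative rather than taken from the actual patch.

    # Assumption: "should not kill all subprocesses" is done via a systemd
    # drop-in. KillMode=process makes systemd stop only the main salt-minion
    # process on restart, leaving any jobs it forked running to completion.
    file { '/etc/systemd/system/salt-minion.service.d':
        ensure => directory,
    }
    file { '/etc/systemd/system/salt-minion.service.d/killmode.conf':
        ensure  => file,
        content => "[Service]\nKillMode=process\n",
        require => File['/etc/systemd/system/salt-minion.service.d'],
    }

A systemctl daemon-reload (typically notified from the file resource) is needed before the new setting takes effect.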
[12:22:20] I am just trying to advance some work for next year's goal (and required anyway) [12:22:57] (03PS3) 10Reedy: Set $wgMathFullRestbaseURL so that MathML works even if VE is disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261321 (https://phabricator.wikimedia.org/T122401) (owner: 10Glaisher) [12:23:02] (03CR) 10Reedy: [C: 032] Set $wgMathFullRestbaseURL so that MathML works even if VE is disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261321 (https://phabricator.wikimedia.org/T122401) (owner: 10Glaisher) [12:23:32] (03Merged) 10jenkins-bot: Set $wgMathFullRestbaseURL so that MathML works even if VE is disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261321 (https://phabricator.wikimedia.org/T122401) (owner: 10Glaisher) [12:24:30] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Attempt to fix math related fatal (duration: 00m 33s) [12:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:30:16] PROBLEM - puppet last run on technetium is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:41:18] 6operations, 6Discovery, 10Maps: Tilerator Error: permission denied for relation planet_osm_polygon - https://phabricator.wikimedia.org/T122270#1908064 (10akosiaris) I 've resynced the databases back then and had to shutdown the pgsql services for that to happen. The problem started by the osm2pgsql process... [12:41:26] 6operations, 6Discovery, 10Maps: Tilerator Error: permission denied for relation planet_osm_polygon - https://phabricator.wikimedia.org/T122270#1908065 (10akosiaris) 5Open>3Resolved a:3akosiaris [12:50:44] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Puppet has 1 failures [12:50:54] PROBLEM - puppet last run on mw2010 is CRITICAL: CRITICAL: Puppet has 1 failures [12:51:54] PROBLEM - puppet last run on mw2004 is CRITICAL: CRITICAL: Puppet has 1 failures [13:08:29] !log labcontrol*, neodymium and palladium updated to latest salt packages (wm2), rest of prod to follow [13:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:05] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:17:06] RECOVERY - puppet last run on mw2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:17:45] RECOVERY - puppet last run on mw2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:21:07] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:23:15] RECOVERY - DPKG on labmon1001 is OK: All packages OK [13:29:19] !log salt wm2 packages now installed on all production hosts except for: mw1041.eqiad.wmnet, technetium.eqiad.wmnet, mw1228.eqiad.wmnet, ms-be1011.eqiad.wmnet [13:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:40:24] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1908117 (10ArielGlenn) The new wm2 packages are now installed on all production hosts except for: mw1041.eqiad.wmnet, technetium.eqiad.wmnet, mw1228.eqiad.wmnet, ms-be1011.eqiad.wmnet. Status... 
[13:54:11] PROBLEM - salt-minion processes on mira is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:54:15] (03PS1) 10Cmjohnson: Addin production DNS for pc1004-6 bug: task# T121888 [dns] - 10https://gerrit.wikimedia.org/r/261358 [13:55:57] (03CR) 10Cmjohnson: [C: 032] Addin production DNS for pc1004-6 bug: task# T121888 [dns] - 10https://gerrit.wikimedia.org/r/261358 (owner: 10Cmjohnson) [14:12:10] PROBLEM - salt-minion processes on tin is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:19:39] RECOVERY - salt-minion processes on mira is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:37:41] RECOVERY - salt-minion processes on tin is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:04:36] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1908140 (10ArielGlenn) Labs salt update Because (as usual) a pile of instances have issues, I'm doing the old standby of the ssh loop, which will update salt only on hosts which have the labc... [15:06:07] !log labs salt instances salt update in progress. It's slow and tedious and automated. A few hundred instances already done, the rest are going one at a time. Only instances that use the labcontrol salt master will be affected. [15:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:07:09] PROBLEM - dhclient process on technetium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:07:36] hm I should log that in the labs log woops [15:14:45] does any ops person perhaps know where the svn repo's are that were not imported to git ? [15:17:09] phabricator, I think [15:17:12] https://phabricator.wikimedia.org/diffusion/ [15:17:55] svn.wikimedia.org nowadays redirects there, so I think that's the canonical location [15:17:59] hmm. [15:18:19] well then i guess the repo was never there, or it wasn't imported... [15:18:42] I wouldn't know -- the release engineering team typically handles all that [15:18:49] or Reedy may know more perhaps? [15:20:10] i'll just keep poking.. or maybe i should just ask mdale, since it's probably his code anyways [15:37:21] (03PS1) 10Bartosz Dziewoński: Disable cross-wiki upload A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261371 (https://phabricator.wikimedia.org/T120867) [15:38:29] (03PS2) 10Bartosz Dziewoński: Disable cross-wiki upload A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261371 (https://phabricator.wikimedia.org/T120867) [16:00:05] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151229T1600). [16:00:05] MatmaRex: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:14] hiho. anyone deploying? [16:00:33] MatmaRex: I can SWAT. 
[16:01:10] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261371 (https://phabricator.wikimedia.org/T120867) (owner: 10Bartosz Dziewoński) [16:01:35] (03Merged) 10jenkins-bot: Disable cross-wiki upload A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261371 (https://phabricator.wikimedia.org/T120867) (owner: 10Bartosz Dziewoński) [16:04:42] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Disable cross-wiki upload A/B test [[gerrit:261371]] (duration: 00m 31s) [16:04:45] ^ MatmaRex check please [16:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:03] on it [16:07:46] thcipriani: all fine :) [16:07:58] MatmaRex: cool, thanks for checking :) [16:45:15] (03PS3) 10EBernhardson: Turn off A/B test for search lang detect via accept-language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258087 (https://phabricator.wikimedia.org/T119529) [16:45:29] (03CR) 10EBernhardson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258087 (https://phabricator.wikimedia.org/T119529) (owner: 10EBernhardson) [16:45:59] (03Merged) 10jenkins-bot: Turn off A/B test for search lang detect via accept-language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258087 (https://phabricator.wikimedia.org/T119529) (owner: 10EBernhardson) [16:47:00] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Turn off AB test for search lang detect via accept-language (duration: 00m 29s) [16:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:51:41] (03PS1) 10Giuseppe Lavagetto: [WiP] add native ipvs manager [debs/pybal] - 10https://gerrit.wikimedia.org/r/261375 [16:52:52] (03CR) 10jenkins-bot: [V: 04-1] [WiP] add native ipvs manager [debs/pybal] - 10https://gerrit.wikimedia.org/r/261375 (owner: 10Giuseppe Lavagetto) [17:00:05] _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151229T1700). [17:00:23] <_joe_> uhm, seems blank [17:00:52] _joe_: sadly I haven't taught jouncebot to ignore empty time slots yet [17:01:17] <_joe_> bd808: it's allright [17:01:36] it would probably need even more metadata in the wiki page which is kind of yuck [17:02:27] using the html generated from wikitext as a structured data store is an "interesting" feature of jouncebot [18:17:31] PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: Puppet has 1 failures [18:41:32] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:51:30] PROBLEM - RAID on technetium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:03:29] PROBLEM - DPKG on technetium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:06:39] PROBLEM - Check size of conntrack table on technetium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:10:00] godog, parvoid: I was told you may know something about CI network problem in https://phabricator.wikimedia.org/T122594? [19:11:28] I wonder if that instance/job is still trying to use the http proxy? [19:16:31] bd808, SMalyshev, I can wget those files just fine on a CI slave. So if it is using the proxy, it need not. 
[19:17:21] *nod* there was some of that ripped out last weekend from when we had some ci slaves inside the prod network [19:17:41] this may be something else that has been missed so far [19:18:02] yeah, seems likely [19:19:52] ok. these files don't change often but the build needs them [19:27:50] bd808: so what was the change that happened recently? [19:28:38] SMalyshev: https://phabricator.wikimedia.org/T122368 [19:28:53] bd808: ah, I can't see it [19:29:33] TL;DR webproxy.eqiad.wmnet:8080 is not useable as a HTTP proxy from labs any longer [19:31:04] yep and a lot of nova cloud or something repos get whined about now from labs [19:31:36] bd808: so that may mess things up, definitely. So, if my CI build needs to fetch stuff from outside is there a way to do it now? [19:31:56] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/ci.pp#L172-L177 [19:32:12] SMalyshev: from labs external http is wide open [19:32:23] so we just need to yank out the old proxy config [19:32:41] bd808: CI runs on labs, right? [19:33:23] so we just need to drop contint::maven_webproxy? [19:33:25] SMalyshev: yes. we used to have some Jenkins slaves that were inside the prod network and needed the proxy but now we don't [19:33:31] I think so, yes [19:34:03] looks like it sets up .m2/settings.xml that will need to be cleaned up [19:34:37] 6operations, 10Continuous-Integration-Infrastructure: Test mwext-qunit-composer database disk image is malformed - https://phabricator.wikimedia.org/T122599#1908567 (10Paladox) 3NEW [19:36:07] bd808: that settings file only contains the proxy. So I think it may be just dropped [19:36:47] *nod* sounds right [19:37:18] SMalyshev: should I make some puppet patches or do you have time and energy to? [19:38:25] (03PS1) 10Smalyshev: Remove maven webproxy since it is not needed anymore after config changes [puppet] - 10https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122594) [19:38:37] bd808: I've just made https://gerrit.wikimedia.org/r/261476 [19:39:45] wait, maybe I deleted wrong one - which one does not need proxy, production, labs or both? [19:40:28] labs does not need it [19:40:54] it should stay on master I guess [19:41:04] (although we run no jobs there now) [19:41:38] (03PS2) 10Smalyshev: Remove maven webproxy since it is not needed anymore after config changes [puppet] - 10https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122594) [19:42:05] bd808: ok, so the patch should be ok as in https://gerrit.wikimedia.org/r/#/c/261476/2/manifests/role/ci.pp [19:42:42] SMalyshev: can you add in an ensure=>absent for /var/lib/jenkins-slave/.m2/settings.xml too? [19:43:09] I think I can cherry-pick to the project to test that and see if it fixes your builds [19:44:37] bd808: that should probably be /mnt/home/jenkins-deploy/.m2/settings.xml ? [19:44:59] yeah, you are correct [19:46:50] (03PS3) 10Smalyshev: Remove maven webproxy since it is not needed anymore after config changes [puppet] - 10https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122594) [19:46:57] ok, updated [19:48:14] (03PS4) 10Smalyshev: Remove maven webproxy since it is not needed anymore after config changes [puppet] - 10https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122594) [19:48:56] bd808: could we test-run it to see if it fixes the problem? [19:49:12] SMalyshev: yeah. working on that bit [19:49:20] bd808: thanks! 
[19:52:35] (03CR) 10BryanDavis: "Cherry-picked to integration-puppetmaster for testing" [puppet] - 10https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122594) (owner: 10Smalyshev) [19:54:55] SMalyshev: forcing puppet runs now. It will take a little bit [19:55:16] * bd808 goes to grab a sandwich while he waits [19:55:58] bd808: thanks! [19:58:15] (03PS1) 10Andrew Bogott: Labs ldap: size_limit by quite a bit. [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) [20:02:52] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [20:06:40] (03PS2) 10Andrew Bogott: Labs ldap: size_limit by quite a bit. [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) [20:08:13] (03CR) 10jenkins-bot: [V: 04-1] Remove maven webproxy since it is not needed anymore after config changes [puppet] - 10https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122594) (owner: 10Smalyshev) [20:09:42] (03PS3) 10Andrew Bogott: Labs ldap: increase size_limit by quite a bit. [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) [20:15:09] (03CR) 10Muehlenhoff: [C: 04-1] "This is needed for OSM which is using a dedicated user, right?" [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) (owner: 10Andrew Bogott) [20:15:33] 6operations, 10Fundraising-Backlog, 10Traffic, 10Unplanned-Sprint-Work, 3Fundraising Sprint Zapp: Firefox SPDY-coalesces requests to geoiplookup over text-lb, causing GeoIP IPv6 failures - https://phabricator.wikimedia.org/T121922#1908701 (10DStrine) [20:16:14] (03CR) 10Andrew Bogott: "Yes, dedicated user. But this patch will also fix https://phabricator.wikimedia.org/T122595 which is (currently) a tool used by lots of " [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) (owner: 10Andrew Bogott) [20:17:56] (03PS1) 10Alexandros Kosiaris: add akosiaris yubikey [puppet] - 10https://gerrit.wikimedia.org/r/261571 [20:18:31] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] add akosiaris yubikey [puppet] - 10https://gerrit.wikimedia.org/r/261571 (owner: 10Alexandros Kosiaris) [20:18:51] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [20:19:23] (03CR) 10ArielGlenn: [C: 031] Labs ldap: increase size_limit by quite a bit. [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) (owner: 10Andrew Bogott) [20:20:21] (03CR) 10Muehlenhoff: "I'd rather rather fix ldaplist to properly use paged requests, I can have a look after the allhands." [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) (owner: 10Andrew Bogott) [20:25:59] SMalyshev: my salt fu wasn't quite right. trying to force puppet to run again... [20:26:36] * apergos peeks in [20:27:07] apergos: nothing to see here :) I was trying to force puppet runs inside the intergration labs project and caught some instances that seem to not play well with the salt master (ols precise slaves) [20:27:11] *old [20:27:16] oh in ci [20:27:21] yea [20:27:28] I have no where near got to those yet [20:30:22] new packages are available [20:34:45] SMalyshev: it worked! https://integration.wikimedia.org/ci/job/wikidata-query-rdf/782/console [20:34:50] bd808: excellent! 
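The size_limit change under review here never shows its body in the log; the intent is to stop the dedicated novaadmin account that OpenStackManager binds with from being capped by the server-wide result limit. In slapd.conf terms that kind of exemption is normally a per-DN limits directive, roughly like the sketch below; the DN, the values, and the file_line wrapping are illustrative guesses, not the contents of change 261535.

    # Illustration only: give one privileged bind DN unlimited search results
    # instead of raising the global sizelimit for every client.
    file_line { 'slapd-novaadmin-limits':
        path  => '/etc/ldap/slapd.conf',
        line  => 'limits dn.exact="uid=novaadmin,ou=people,dc=wikimedia,dc=org" size=unlimited time=unlimited',
        match => '^limits dn.exact="uid=novaadmin',
    }

Muehlenhoff's alternative above (teaching ldaplist to issue paged result requests) would avoid needing any server-side exemption at all.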
[20:36:25] (03CR) 10BryanDavis: [C: 031] "Verified via cherry-pick -- https://integration.wikimedia.org/ci/job/wikidata-query-rdf/782/console" [puppet] - 10https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122594) (owner: 10Smalyshev) [20:37:02] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] [20:50:38] (03PS4) 10Andrew Bogott: Labs ldap: Repeal size_limit and timeouts for novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) [20:52:00] (03PS5) 10Andrew Bogott: Labs ldap: Repeal size_limit and timeouts for novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) [20:52:29] (03PS6) 10Andrew Bogott: Labs ldap: Repeal size_limit and timeouts for novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) [20:53:17] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [20:54:09] YuviPanda: do you know what ^ is? [20:54:42] it's life [20:54:49] am looking, but probably LDAP failure [20:55:18] I’ve restarted ldap a few times recently, probably my fault [20:55:53] yeah it's LDAP failure [20:56:10] is ldap /still/ failing? [20:56:15] I should probably merge https://gerrit.wikimedia.org/r/#/c/258658/ [20:56:27] (03PS2) 10Yuvipanda: labstore: Do not re-use connections for create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/258658 [20:56:34] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Do not re-use connections for create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/258658 (owner: 10Yuvipanda) [20:56:37] oh yeah, probably :) [20:57:05] andrewbogott: it gets restarted by puppet usually [20:59:00] andrewbogott: seems ok now [20:59:00] great — want to +1 https://gerrit.wikimedia.org/r/#/c/261535/ and then I’ll break it some more? [20:59:30] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [21:00:23] (03CR) 10Yuvipanda: "The alternative is touching OSM, screw that :D" [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) (owner: 10Andrew Bogott) [21:01:44] (03PS7) 10Andrew Bogott: Labs ldap: Repeal size_limit and timeouts for novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) [21:06:11] hm, and now jenkins is stuck? [21:07:40] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:07:44] wat wat [21:08:13] it's working, just really slow at times [21:08:23] because LDAP has become a *lot* slower since the switch [21:08:36] sudo takes forever even [21:08:36] since what switch? I haven’t merged anything yet [21:08:40] ignoring tools home page right? [21:08:41] andrewbogott: oh, since the OpenLDAP switch [21:08:46] oh, yeah [21:08:47] apergos: yeah, it's just timeout [21:08:50] maybe dowtime if while you are around? 
[21:08:51] k [21:09:01] jynus: yeah, am navigating icinga now [21:09:14] jynus: although, it should recover on next call and it isn't actually 'down' [21:09:27] I'm considering increasing the timeout [21:09:30] I know :-) [21:09:32] but that's just masking the problem [21:09:47] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) (owner: 10Andrew Bogott) [21:09:50] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 972619 bytes in 5.699 second response time [21:10:00] but I can't really do anything myself about LDAP being slow, so maybe I should let it be to annoy people [21:10:07] let me file a bug about it being slow [21:10:11] maybe create a separata dynamic page with no content? [21:10:33] it's like this big omnibus check [21:10:44] it goes down if any of: NFS, LDAP, webproxy, DNS, instances die [21:12:05] I can't reproduce sudo being slow now [21:12:07] hmm [21:12:10] do not listen to me, I am the first one to admit that it is easy to suggest things, not so much to fix them :-) [21:12:25] (even in this case where there is nothing to fix) [21:12:54] jynus: there's lots of things to fix, just time :) [21:12:58] the home page needs rewriting [21:13:03] mmmm, sounds familiar [21:13:13] I'm hoping to get bd808 involved in rewriting / rethinking tools 'flow' :) [21:15:23] 6operations, 6Labs: Increase timeout for tools-home check - https://phabricator.wikimedia.org/T122615#1908866 (10yuvipanda) [21:19:37] (03CR) 10Andrew Bogott: [C: 032] Labs ldap: Repeal size_limit and timeouts for novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/261535 (https://phabricator.wikimedia.org/T122601) (owner: 10Andrew Bogott) [21:22:31] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:23:37] oookk [21:23:44] this time it is actually dead! [21:24:03] and yet ldap lives on [21:24:04] and back [21:24:10] this might be NFS? [21:24:34] why should today be any different? [21:24:40] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 972622 bytes in 5.614 second response time [21:32:20] andrewbogott: I don't find anything errant on labstore atm [21:32:46] YuviPanda: ok. I think I’m done restarting ldap for the next while [21:32:48] maybe things will settle [21:32:48] hopefully [21:32:53] if it flaps again I'll increase timeout [21:50:00] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [21:56:10] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [24.0] [21:56:29] Is something amiss with LDAP? I'm seeing odd delays in host name resolution. [21:56:51] Ah, might be NFS [21:57:54] Coren: I restarted ldap a few times recently, but not in the last 30 minutes. [21:58:26] That looked to have been a side effect - I think NFS is crumbling under the load of an errant client atm [21:58:39] PROBLEM - puppet last run on ganeti2002 is CRITICAL: CRITICAL: puppet fail [21:59:06] Or was; it seems to be less worse right now, maybe. [22:00:17] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:00:17] ... or not. Yep, something definitely ailing with NFS right now [22:00:31] It comes and goes in bursts. [22:01:32] hm. [22:01:43] And wikitech too. 
That smells increasingly like ldap [22:01:49] ^^ andrewbogott [22:02:11] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.027 second response time [22:02:41] Any openstack extension page I try to touch on Wikitech that would hit ldap stalls. Wikitech itself is fine [22:03:24] RECOVERY - puppet last run on ganeti2002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [22:04:07] And it's back now. How odd. [22:04:07] You might get a hint about what is going on in wikitech logs, though, since I got a 500 [22:04:26] I thought maybe ganeti2002 hosted serpens... [22:04:31] but 2002 seems fine now, at least [22:05:21] https://wikitech.wikimedia.org/wiki/Special:NovaSudoer gave me a 500 a few mins ago - IIRC that's one of the "only talks to LDAP" special pages. [22:05:32] Maybe the logs will be illuminating? [22:06:35] I noticed ldap slowness doing a df on the last couple instances so that sounds about right [22:07:38] andrewbogott: so things recovered but it seems ldap related? [22:07:59] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:08:03] chasemp: They're still spotty. I think LDAP is still ill but caching papers over many things [22:08:10] chasemp: I still don’t know much. I did change an ldap config half an hour or so ago [22:08:18] where is that change? [22:08:30] https://gerrit.wikimedia.org/r/#/c/261535/ [22:08:56] huh [22:09:31] Could that cause wikitech to accidentally dos the ldap server/ [22:09:44] yeah, possibly [22:10:09] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 5.018 second response time [22:10:19] look at this weirdness [22:10:19] https://grafana.wikimedia.org/dashboard/db/openldap-labs [22:10:41] clients are failing over to serpens it seems as they have issues w/ seaborgium iiuc how that failover works [22:11:08] Coren: unforunately, Wikitech needs those giant queries to be allowed [22:11:18] I can put a hard limit in instead of ‘unlimited’ as a test. Hang on... [22:11:35] andrewbogott: Perhaps we need to add a couple indices then. [22:11:50] !log disabling puppet on seaborgium and serpens [22:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:13:26] !log restarting slapd on seaborgium and serpens [22:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:13:41] andrewbogott: Extra data point: 'ldaplist -l passwd andre' can take 4-5 seconds to return. Definitely bad. [22:13:51] Ah. Seems better now? [22:14:20] still slow for me [22:14:47] restarts are always rought, let’s give it a few [22:15:18] andrewbogott: did you revert then? [22:15:31] chasemp: no, but I added a hard limit of 10,000 by hand [22:16:21] high enough to keep wikitech happy for now but less than ‘unlimited' [22:16:52] ok, restarts are done and both ldaps look happy to me [22:17:07] Things look better on this end too. [22:17:27] It’s more likely the restarts that fixed it though, no reason to think the config change mattered [22:17:27] so what's the theory, wikitech thrashes ldap to teh point where clients start bailing on seaborgium to serpens? [22:17:36] but in general it that failover is not super graceful [22:17:56] ugh [22:18:58] andrewbogott: why go forward with https://gerrit.wikimedia.org/r/#/c/261535/7 now instead of waiting till after all hands for moritz to poke? 
[22:19:00] ok, back when chase was seeing the failovers [22:19:07] seaborgium was saying this a lot: Dec 29 22:10:36 seaborgium slapd[6341]: cmp -256, too old [22:19:25] chasemp: because look at the attached bug/ [22:19:38] also, moritz’s ‘after all hands’ comment was regarding a different bug [22:19:46] the patch as merged is what he suggested. [22:19:57] gotcha, what a mess [22:20:12] looks like the 'too old' message is if replication fails? [22:20:20] that's all I see [22:20:22] * YuviPanda still has vague bad feelings about these being on ganeti [22:20:23] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:21:42] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1909017 (10ArielGlenn) salt updated and responsive on all lab instances that don't have their own salt master, with the following exceptions: towtruck.visualeditor.eqiad.wmflabs -- no route to... [22:21:52] so yeah, each has been complaining about ‘too old’ for a while [22:22:02] andrewbogott: I'm all tz turned around here [22:22:15] it seems like teh drop off of connections from primary to secondary here https://grafana.wikimedia.org/dashboard/db/openldap-labs [22:22:19] started before you merged...? [22:22:58] Also, the restart doesn't seem to have switch many things away from serpens [22:23:43] chasemp: I did a hotfix first, it’s still possible that I triggered it [22:23:55] any idea when the hotfix was time wise? [22:24:16] let me dig a bit [22:24:35] It would be cool if this coorelated and it wasn't just more mystery [22:25:14] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:25:20] I identified the issue at 19:40 [22:25:23] so no earlier than that [22:26:03] yeah, that fits — those dips in the first graph are probably restarts? [22:26:24] sorry, I mean in the ‘open connections’ seaborgium graph [22:27:03] so it's also possible depending on what order you restarted in [22:27:12] um… looking at these graphs it looks like the total amount of traffic increased. The graphs shoot up on serpens but there’s not a corresponding dip on seaborgium is there? [22:27:18] hotfix was only on seaborgium [22:27:22] didn’t restart/touch serpens until the merge [22:27:25] huh [22:28:27] 10Ops-Access-Requests, 6operations: Create new puppet group `discovery-analytics-deploy` - https://phabricator.wikimedia.org/T122620#1909028 (10EBernhardson) 3NEW [22:29:43] I don't understand the replication relationship well enough to know if in itself would cause issues [22:30:13] RECOVERY - check_listener_ipn on thulium is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 6.013 second response time [22:30:40] 6operations, 10Parsoid, 10Wikimedia-Site-Requests: please run fetch-sitematrix update - https://phabricator.wikimedia.org/T122548#1909046 (10Dzahn) p:5Triage>3Normal [22:31:32] replication should happen once/minute [22:32:53] (03PS1) 10Andrew Bogott: openldap: Drop the novaadmin query limit from 'unlimited' [puppet] - 10https://gerrit.wikimedia.org/r/261588 [22:32:57] So… ^ is more conservative than ‘unlimited' [22:33:19] but I still don’t understand how/where wikitech would be doing queries for more than that many records. [22:33:45] It’s weird that traffic is still heavy on serpens [22:33:54] does this thing have query logs? [22:34:01] PROBLEM - SSH on technetium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:35:37] ‘this thing’? 
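On the query-log question just above: slapd's "stats" log level records every connection, operation, and search filter via syslog, which is usually enough to find the expensive queries. A hedged sketch of flipping that on, again wrapped in a file_line purely for illustration (the config path and the choice of level are assumptions, not how the production openldap module is actually managed):

    # Illustration: "loglevel stats" makes slapd log each bind, search filter
    # and result to syslog. Useful for spotting runaway queries, but noisy on
    # a busy server, so pair it with log rotation.
    file_line { 'slapd-loglevel-stats':
        path  => '/etc/ldap/slapd.conf',
        line  => 'loglevel stats',
        match => '^loglevel ',
    }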
[22:35:39] YuviPanda: it seems like it can but we don't [22:35:44] assuming ldap [22:35:51] there are OSM logs on fluorine [22:36:09] we probably should enable logging [22:36:16] I bet we'll find crazy-ass-queries hititng [22:36:18] it [22:36:22] although, OpenDJ handled them just fine [22:36:24] /a/mw-log/ldap.log [22:36:52] interesting I didn't know that existed [22:37:20] PROBLEM - configured eth on technetium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:37:32] it’s newish [22:37:33] (03PS1) 10Dzahn: toollabs: increase timeout for tools-home to 20 [puppet] - 10https://gerrit.wikimedia.org/r/261589 (https://phabricator.wikimedia.org/T122615) [22:38:06] man seaborgium is still saying "Dec 29 22:36:43 seaborgium slapd[13647]: connection_read(660): no connection!" [22:38:08] now and then [22:38:12] 10Ops-Access-Requests, 6operations: Create new puppet group `discovery-analytics-deploy` - https://phabricator.wikimedia.org/T122620#1909096 (10jcrespo) p:5Triage>3Normal [22:38:16] which I assume is failure to contact serpens [22:38:43] I kind of recall those being terminated connections (existing like for timeout) [22:38:50] but maybe I'm wrong [22:39:01] there is a whole rabbit hole of what that can mean [22:39:12] not that that makes it a positive thing [22:39:19] they seem to predate anything we’re doing today though [22:39:23] so probably ignorable for now [22:39:49] well from the log it shows it does most (all) of it's work using novaadmin account [22:39:56] but doesn't really expose the ldap query [22:40:21] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1909102 (10ArielGlenn) icinga is fixed, the remaining failure was due to the wrong key name (I saw a batch of "old-style" names without the project in them, for keys and cleaned them out but mi... [22:41:00] YuviPanda: https://gerrit.wikimedia.org/r/#/c/261589/ [22:41:11] as requested [22:41:21] RECOVERY - configured eth on technetium is OK: OK - interfaces up [22:41:34] mutante: <3 [22:43:14] andrewbogott: I think we are in ok territory connection count wise, seems around 18-19 hundred on seaborgium and 6-7 hundred on serpens [22:43:38] which is around the normal 2.5k right? [22:43:53] oh, goddamn it, the graphs have different scales [22:44:01] ah yes :) [22:44:04] so, yes, you’re right, it looks reasonable [22:44:18] I'm doing a rough aprox on server w/ [22:44:18] ss | grep ldap | awk '{print $6}' | cut -d ":" -f 1 | sort | wc [22:44:20] also [22:44:40] seems in teh ballpark but I also do not fully grok the failover/failback [22:45:00] I'm tempted to stop ldap on serpens to force a state or normalcy w/ clients on seaborgium to prove it can handle it [22:45:15] I mean a redundant ldap situation where we can't lose one ldap server isn't much good [22:45:38] yeah… I think that’s a good thing to test but let’s do it when Moritz is watching [22:46:07] agreed then if it seems stable as-is we'll live with it [22:46:14] meanwhile, I guess I’ll merge https://gerrit.wikimedia.org/r/#/c/261588 ? That sets the limit to more than it is now (10,000) but less than unlimited [22:46:15] I don't think what it is up is inherently bad [22:46:24] I just want to see the deterministic all clients go to seaborgium work [22:46:42] andrewbogott: tbh it seems prudent ot hotfix to that level first [22:46:50] since we don't know the actual sane limit [22:46:51] yeah, ok, will do [22:47:32] andrewbogott: should we attempt to implement paging in OSM instead? 
(if this is indeed the problem) [22:47:58] I also think we should either move it off to real hardware or increase the size of the ganeti instances... but I somehow guess that doesn't make the cores any more powerful [22:48:40] YuviPanda: maybe… if we can get by with 16,000 then OSM will keep working for a couple of years, which I can only hope is long enough [22:48:48] when a ganeti instance is created, part of the commandline that does it is specifying how many "vcpus" it gets [22:48:51] if that helps anything [22:48:59] But since this has to do with user creation… that’s going to stay in mediawiki potentially forever. [22:49:51] PROBLEM - configured eth on technetium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:49:51] hopefully not 'forever', but forever for at least the next 6-9months I guess [22:49:51] mutante: it might, but we need to figure out what’s limiting us first [22:49:56] mutante: yeah, but I don't know if slapd uses multiple cores at all [22:50:02] it seems to [22:50:34] but I have seen it just break neck spike once in teh last few minutes 100% across a few cpu's [22:50:42] I'm in favor of upping the logging level of slapd and doing the logrotate stuff for it [22:50:58] right now I feel blind w/ slapd and maybe that's just not knowing where to look [22:51:31] RECOVERY - configured eth on technetium is OK: OK - interfaces up [22:51:40] * YuviPanda nods furiously [22:51:56] ok, both are hotfixed to the state in https://gerrit.wikimedia.org/r/#/c/261588/ now [22:52:49] gotcha [22:52:50] we can give it a few minutes and make sure nothing freaks out [22:53:24] andrewbogott: so in watching ldap via the flourine log [22:53:34] any clue what the op that does this kind of thing is [22:53:36] 2015-12-29 22:52:43 silver labswiki ldap INFO: 2.1.0 adding tools.isbn [22:53:36] 2015-12-29 22:52:43 silver labswiki ldap INFO: 2.1.0 adding tools.wikipedia-library [22:53:45] seems to have context for like every tool in toollabs [22:53:46] and then some [22:53:51] and happens often [22:54:00] I can't for the life of me figure out what this coudl be doing so regularly [22:54:34] I don’t know offhand [22:56:13] does it look up info for each tool when people load the tools-home page? [22:56:13] (and that makes it slow, right) [22:56:32] like looking up the maintainer for the list? [22:56:56] ah mabye so and maybe that's why tools-home is such a pos [22:57:17] the http check on tools-home is slow becasue there are so many tools [22:57:30] maybe this is also the ldap lookup? guessing though [22:58:31] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1909152 (10ArielGlenn) I lie, there are about 35 instances not yet upgraded that are however happily salt responsive. I'll have to go in and deal with them by hand. [23:00:26] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:00:49] 10Ops-Access-Requests, 6operations: Create new puppet group `discovery-analytics-deploy` - https://phabricator.wikimedia.org/T122620#1909162 (10EBernhardson) can certainly handle this after the all-hands, no need to rush. [23:00:52] ^ andrewbogott? [23:01:08] wat [23:01:17] did it ever formally recover tho? 
[23:01:34] yeah it did pretty quickly I think [23:01:42] I'm looking at top [23:01:44] lots of kworker [23:01:51] and at least some ksoftirqd [23:02:33] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.036 second response time [23:02:37] hmm [23:02:44] Is it possible that there’s some kind of echo effect? Like, I restart ldap and 20 minutes later…? [23:03:13] this could also have been my 'find'. [23:03:22] I killed it as soon as the page popped up [23:03:23] I hate that puppet on ganeti2002 failed for no reason in the middle of this [23:03:31] seriously [23:03:39] but I can’t make that relate, since even if one of these is on that box it would’ve been serpens [23:03:58] we can verify by my running find again, but probably too https://xkcd.com/242/ [23:04:51] idk I'm not in love with the 16k limit now vs 10k [23:05:07] but clearly we are bumping into these limits in some query [23:06:20] http://www.openldap.org/lists/openldap-software/200707/msg00396.html [23:06:25] very clear log levels, openldap [23:06:38] how did we turn the *one* aspect of labs that wasn't fucking up constantly into this nightmare? [23:06:51] chasemp: meaning you’d prefer 10k? [23:07:19] well, what's the disadvantage I guess? (since we saw it sit fine the longest and believe it's fine for the known trama user limit bug) [23:07:39] is the rDNS failures(?) causing gridengine outage related? [23:07:42] alternative emo title, why not 10K? [23:07:44] maybe DNS is hitting LDAP heavily? [23:07:46] chasemp: anything other than ‘unlimited’ is waiting to bite us [23:07:52] right [23:07:54] so, a bigger number means a longer wait :) [23:08:41] but not before we are all back from vaca and travel [23:08:41] but, I’m ok with 10k [23:08:41] 16k is just a rounder number *shrug* [23:08:42] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:08:56] goddamit [23:08:57] we know it has to be above # of users + reasonable % [23:08:59] so it wasn't my 'find' [23:09:00] YuviPanda: find again? [23:09:02] huh [23:09:03] yeah [23:09:05] great [23:09:14] (03PS2) 10Andrew Bogott: openldap: Drop the novaadmin query limit from 'unlimited' [puppet] - 10https://gerrit.wikimedia.org/r/261588 [23:09:19] ^ 10k [23:09:19] at what point do we wake up moritz? [23:09:37] well, let's sit w/ 10k which we know solves the user id problem [23:09:39] and seemed stable [23:09:41] and go from there [23:09:55] NFS instance has a lot of ksoftirqd cpu usage going on now [23:10:00] there's also the gridengine outage [23:10:06] why do these things happen in clusters.... [23:10:07] related or unrelated? [23:10:12] (03PS3) 10Andrew Bogott: openldap: Drop the novaadmin query limit from 'unlimited' [puppet] - 10https://gerrit.wikimedia.org/r/261588 [23:10:23] not fully sure [23:10:24] can I get some +1s? [23:10:52] (the other alternative is locking down new users till post all staff or real fixes) [23:10:55] that's the true triage fix I guess [23:11:20] > http://www.openldap.org/lists/openldap-software/200707/msg00396.html [23:11:21] err [23:11:25] > 11 root rt 0 0 0 0 S 54.1 0.0 76:33.36 watchdog/0 [23:11:25] (03CR) 10Rush: [C: 031] "well we know we are handing out dupe id's and we think 10k is sane based on hotfix so let's try it" [puppet] - 10https://gerrit.wikimedia.org/r/261588 (owner: 10Andrew Bogott) [23:11:28] that's not good... [23:11:38] and points somewhat to the same bug as last time, maybe? 
[23:11:42] causing the soft lockup [23:11:49] on nfs? [23:11:51] yeah [23:12:00] triggered by ldap shenanigans maybe, or who knows [23:12:05] it could also just be issues there [23:12:13] the nfs / ldap tie-in is somewhat opaque to me honestly [23:12:24] but I know enough to know as one goes...so goes the other [23:12:33] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.245 second response time [23:12:41] could be, yeah [23:14:16] andrewbogott: back to 10k^? [23:14:28] kicker is I can't imagine that limit matters unless you are using it, which means if 10K "fixes" the outages [23:14:34] nope, it’s still at 16384 [23:14:38] it's probably at the expense of whatever query is getting hit [23:14:40] huh [23:14:58] right, exactly. I still believe/want to believe that 10k == 16k == unlimited [23:16:51] in terms of behavior [23:16:51] and that our issues are something to do with restart hiccups or something [23:16:52] but… I have no theory at all at this point, really [23:16:52] we need to enable query logging before we can say anything, IMO [23:16:52] * YuviPanda is still trying to dig into the gridengine outage [23:16:53] whelp [23:16:53] that instance is 'stuck' [23:16:53] unreachable [23:16:53] the master [23:16:57] hmm, root login worked after a long delay [23:16:59] ...redundant masters? [23:17:19] theoretically, although the auto failover process depends on.... (YOU GUESSED IT) [23:17:50] oh, nvm, ssh works, it was just super slow [23:17:50] stuck for like many many seconds [23:18:53] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:18:58] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:20:31] chasemp: andrewbogott I vote to wake up moritz [23:20:31] it's interesting: during all of this, has ldap monitoring triggered an alert at all? [23:20:32] do you want me to merge the change that raises the timeout for one of those checks? [23:20:33] mutante: yes [23:20:33] mutante: sure [23:20:38] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [60.0] [23:21:02] is moritz on vacation? [23:21:04] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 973217 bytes in 6.012 second response time [23:21:07] well, you were too, chasemp :) [23:21:14] * andrewbogott restarts slapd yet again [23:21:22] andrewbogott: 10k?
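On the "we need to enable query logging before we can say anything" point: one low-tech way to check whether a given bind DN is actually hitting a server-side size limit (as opposed to a client-requested one) is an ldapsearch that asks for no client limit and counts what comes back. The host, bind DN, base, and filter below are placeholders, not the real directory:

    # -z 0 requests no client-side size limit, so any truncation left
    # over is the server-side limit for this bind DN; ldapsearch exits
    # with the LDAP result code, so status 4 means "Size limit exceeded".
    ldapsearch -x -H ldap://ldap.example.org \
        -D "uid=novaadmin,ou=people,dc=example,dc=org" -W \
        -b "ou=servicegroups,dc=example,dc=org" \
        -z 0 '(objectClass=posixGroup)' dn | grep -c '^dn:'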
[23:21:28] yeah [23:21:48] k [23:21:58] (03PS2) 10Dzahn: toollabs: increase timeout for tools-home to 20 [puppet] - 10https://gerrit.wikimedia.org/r/261589 (https://phabricator.wikimedia.org/T122615) [23:22:00] can we apply to both and restart both and just hunker down here [23:22:07] because it's possible wikitech tries to use serpens, right [23:22:11] yeah, that’s what I did [23:22:13] w/ all the non-hotfixed things [23:22:13] ok [23:22:34] (03PS3) 10Dzahn: toollabs: increase timeout for tools-home to 20 [puppet] - 10https://gerrit.wikimedia.org/r/261589 (https://phabricator.wikimedia.org/T122615) [23:22:34] I haven’t merged the puppet change only because jenkins hasn’t verified yet [23:22:42] (03CR) 10Dzahn: [C: 032] toollabs: increase timeout for tools-home to 20 [puppet] - 10https://gerrit.wikimedia.org/r/261589 (https://phabricator.wikimedia.org/T122615) (owner: 10Dzahn) [23:23:30] andrewbogott: if you are ok w/ it [23:23:41] let's just hunker down as-is with no more restarts at 10k [23:23:41] and see what happens [23:23:56] let's change nothing at all here and see how it goes for a minute :) [23:24:01] * andrewbogott nods [23:25:52] Leeeeeeroy Jeeenkins... [23:26:48] RECOVERY - Persistent high iowait on labstore1001 is OK: OK: Less than 50.00% above the threshold [40.0] [23:27:38] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [23:28:38] (03CR) 10Dzahn: [V: 032] "jenkins is awol" [puppet] - 10https://gerrit.wikimedia.org/r/261589 (https://phabricator.wikimedia.org/T122615) (owner: 10Dzahn) [23:30:05] chasemp: btw, also high load on labstore2001, just that notifications look disabled [23:30:18] I know how to get jenkins running again, but I only know how to do it the bad way [23:30:20] but maybe it says something that it's on both [23:30:46] labstore2001 is just misnamed, should've been called labstorebackup2001 [23:31:16] it's odd it would be affected [23:31:55] YuviPanda: did you set up the nfs check for tools.wmflabs.org? [23:32:24] ? [23:32:34] the tools-checker one? [23:32:36] yeah [23:32:40] that check that is failing for nfs w/ tools.wmflabs.org as a canary [23:32:42] yeah [23:32:43] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.022 second response time [23:32:45] it hasn't come back [23:32:46] there [23:32:54] I was wondering if you would look manually at whatever it checks [23:32:58] as I wasn't convinced [23:33:04] but maybe it was just icinga being icinga [23:33:17] that one is also paging separately and won't be affected by what I merged [23:56:13] !log restarting nodepool on labnodepool1001 [23:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
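The "HTTP OK ... bytes in ... seconds" lines above come from the standard check_http monitoring plugin, which is what the tools-home timeout bump to 20 seconds is tuning. A rough approximation of such a check invocation, assuming the usual Debian plugin path and a plain "/" request path (only the hostname and the 20-second timeout come from the change itself; the rest is a guess):

    # -H names the (virtual) host to query, -u the request path, -t the
    # overall plugin timeout in seconds; tools-home returns the full
    # tool listing (~950 KB in the recovery line above), so the fetch
    # alone can take several seconds.
    /usr/lib/nagios/plugins/check_http -H tools.wmflabs.org -u / -t 20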