[00:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170301T0000). Please do the needful.
[00:00:19] Nothing to SWAT
[00:03:21] RainbowSprinkles there's a question for you at https://gerrit-review.googlesource.com/#/c/98775/5//COMMIT_MSG@11
[00:03:31] I can't answer that one as I don't know why.
[00:06:30] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[00:09:30] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[00:09:51] ^^ is that meant to be doing that?
[00:10:56] no I'll look :)
[00:13:51] ok thanks
[00:13:52] :)
[00:15:30] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[00:16:09] Operations, Performance-Team, DC-Switchover-Prep-Q3-2016-17, Epic, and 3 others: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - https://phabricator.wikimedia.org/T156922#3063202 (Krinkle) >>! In T156922#3061198, @Joe wrote: > - I wiped all memcacheds in codfw >...
[00:17:23] (PS1) Rush: nova: fullstack test give 480s before failing on creation [puppet] - https://gerrit.wikimedia.org/r/340445
[00:18:50] (PS2) Tim Landscheidt: Tools: Use LDAP for mail queries [puppet] - https://gerrit.wikimedia.org/r/237871
[00:24:29] (CR) Tim Landscheidt: "Okay, I'm now reasonably confident that I covered all bases. Thanks @paravoid for the advice."
[puppet] - https://gerrit.wikimedia.org/r/237871 (owner: Tim Landscheidt)
[00:25:12] (CR) Rush: [C: 2] nova: fullstack test give 480s before failing on creation [puppet] - https://gerrit.wikimedia.org/r/340445 (owner: Rush)
[00:34:51] (PS1) Reedy: Add wikimedia-mirror.dh.bytemark.co.uk to dumps::rsync_clients [puppet] - https://gerrit.wikimedia.org/r/340447
[00:35:00] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:36:07] (PS13) Krinkle: [WIP] mediawiki: Add cache-warmup to maintenance [puppet] - https://gerrit.wikimedia.org/r/339802 (https://phabricator.wikimedia.org/T156922)
[00:41:00] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[00:41:35] (PS14) Krinkle: [WIP] mediawiki: Add cache-warmup to maintenance [puppet] - https://gerrit.wikimedia.org/r/339802 (https://phabricator.wikimedia.org/T156922)
[00:48:51] (PS2) Reedy: Add wikimedia-mirror.dh.bytemark.co.uk to dumps::rsync_clients [puppet] - https://gerrit.wikimedia.org/r/340447
[00:52:27] (PS15) Krinkle: [WIP] mediawiki: Add cache-warmup to maintenance [puppet] - https://gerrit.wikimedia.org/r/339802 (https://phabricator.wikimedia.org/T156922)
[00:56:18] (PS2) Reedy: Deprecate DonationInterface i18n messages [mediawiki-config] - https://gerrit.wikimedia.org/r/340421 (https://phabricator.wikimedia.org/T159098) (owner: Awight)
[00:56:25] (CR) Reedy: [C: 2] Deprecate DonationInterface i18n messages [mediawiki-config] - https://gerrit.wikimedia.org/r/340421 (https://phabricator.wikimedia.org/T159098) (owner: Awight)
[00:58:05] (Merged) jenkins-bot: Deprecate DonationInterface i18n messages [mediawiki-config] - https://gerrit.wikimedia.org/r/340421 (https://phabricator.wikimedia.org/T159098) (owner: Awight)
[00:58:14] (CR) jenkins-bot: Deprecate DonationInterface i18n messages
[mediawiki-config] - https://gerrit.wikimedia.org/r/340421 (https://phabricator.wikimedia.org/T159098) (owner: Awight)
[00:59:14] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Remove DonationInterface loading as gone from master (primarily to unbreak beta) (duration: 00m 40s)
[00:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:19] (PS5) Tim Landscheidt: Tools: Allow proxymanager to add and remove proxy forward entries [puppet] - https://gerrit.wikimedia.org/r/266448
[01:00:21] (PS3) Tim Landscheidt: Tools: Switch portgrabber and portreleaser to proxymanager [puppet] - https://gerrit.wikimedia.org/r/268279
[01:00:22] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Remove DonationInterface loading as gone from master (primarily to unbreak beta) (duration: 00m 42s)
[01:00:23] (PS2) Tim Landscheidt: Tools: Decommission proxylistener [puppet] - https://gerrit.wikimedia.org/r/268346
[01:00:25] (PS2) Tim Landscheidt: Tools: Remove obsolete code [puppet] - https://gerrit.wikimedia.org/r/268347
[01:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:26] (CR) jerkins-bot: [V: -1] Tools: Switch portgrabber and portreleaser to proxymanager [puppet] - https://gerrit.wikimedia.org/r/268279 (owner: Tim Landscheidt)
[01:01:45] (CR) jerkins-bot: [V: -1] Tools: Decommission proxylistener [puppet] - https://gerrit.wikimedia.org/r/268346 (owner: Tim Landscheidt)
[01:01:56] Reedy: Thanks again--that probably gives us back a significant number of cycles on app servers!
[01:01:58] (CR) jerkins-bot: [V: -1] Tools: Remove obsolete code [puppet] - https://gerrit.wikimedia.org/r/268347 (owner: Tim Landscheidt)
[01:05:21] (PS4) Tim Landscheidt: Tools: Switch portgrabber and portreleaser to proxymanager [puppet] - https://gerrit.wikimedia.org/r/268279
[01:05:24] (PS3) Tim Landscheidt: Tools: Decommission proxylistener [puppet] - https://gerrit.wikimedia.org/r/268346
[01:05:26] (PS3) Tim Landscheidt: Tools: Remove obsolete code [puppet] - https://gerrit.wikimedia.org/r/268347
[01:07:06] (CR) Tim Landscheidt: [C: -1] "Depends on I2d643fc902208eafaaa0d7814e586f0c326f16b5 deployed on tools-proxy-*." [puppet] - https://gerrit.wikimedia.org/r/268279 (owner: Tim Landscheidt)
[01:07:44] (CR) Tim Landscheidt: [C: -1] "Depends on I717c8d220625971b169e7a578500e89c69545d74 being deployed." [puppet] - https://gerrit.wikimedia.org/r/268346 (owner: Tim Landscheidt)
[01:12:39] (PS4) Tim Landscheidt: Tools: Remove obsolete code [puppet] - https://gerrit.wikimedia.org/r/268347
[01:14:43] (CR) Tim Landscheidt: [C: -1] "Depends on I2c62dbcc6f18adb0d84ea31a8ee999b44e514963 having been deployed." [puppet] - https://gerrit.wikimedia.org/r/268347 (owner: Tim Landscheidt)
[01:17:00] PROBLEM - puppet last run on mw1285 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[01:17:44] (PS1) Dzahn: mgmt: script to detect vendor by mgmt ssh banner [puppet] - https://gerrit.wikimedia.org/r/340450
[01:28:15] (PS2) Tim Landscheidt: Tools: Puppetize gridengine global configuration [puppet] - https://gerrit.wikimedia.org/r/230477 (https://phabricator.wikimedia.org/T95747)
[01:29:12] (CR) jerkins-bot: [V: -1] Tools: Puppetize gridengine global configuration [puppet] - https://gerrit.wikimedia.org/r/230477 (https://phabricator.wikimedia.org/T95747) (owner: Tim Landscheidt)
[01:30:42] (PS3) Tim Landscheidt: Tools: Puppetize gridengine global configuration [puppet] - https://gerrit.wikimedia.org/r/230477 (https://phabricator.wikimedia.org/T95747)
[01:30:50] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:34:38] (CR) jerkins-bot: [V: -1] Tools: Puppetize gridengine global configuration [puppet] - https://gerrit.wikimedia.org/r/230477 (https://phabricator.wikimedia.org/T95747) (owner: Tim Landscheidt)
[01:36:34] (PS4) Tim Landscheidt: Tools: Puppetize gridengine global configuration [puppet] - https://gerrit.wikimedia.org/r/230477 (https://phabricator.wikimedia.org/T95747)
[01:44:59] (PS4) Tim Landscheidt: apache: Fix some issues with apache::static_site [puppet] - https://gerrit.wikimedia.org/r/328466 (https://phabricator.wikimedia.org/T153816)
[01:45:01] (PS4) Tim Landscheidt: [WIP] aptly: Make aptly work with Apache [puppet] - https://gerrit.wikimedia.org/r/328467 (https://phabricator.wikimedia.org/T153814)
[01:46:00] RECOVERY - puppet last run on mw1285 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[01:53:06] (PS1) Mattflaschen: Enable Flow on 'Viquiprojecte Discussió' on cawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/340455 (https://phabricator.wikimedia.org/T159047)
[01:55:52] Operations,
MediaWiki-extensions-InterwikiSorting, Wikidata, Wikimedia-Extension-setup, and 3 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#3063449 (Ladsgroup) pywikibot mailing list?
[01:56:48] (PS2) Mattflaschen: Enable Flow on 'Viquiprojecte Discussió' on cawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/340455 (https://phabricator.wikimedia.org/T159047)
[01:58:50] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[02:10:00] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[02:11:00] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4058836 keys, up 120 days 17 hours - replication_delay is 0
[02:23:12] (CR) Tim Landscheidt: "When I work around the SSL issue with "curl -iH 'X-Forwarded-Proto: https' http://toolsbeta-puppetmaster7.toolsbeta.eqiad.wmflabs/pool/abc" [puppet] - https://gerrit.wikimedia.org/r/328467 (https://phabricator.wikimedia.org/T153814) (owner: Tim Landscheidt)
[02:24:23] (PS5) Tim Landscheidt: aptly: Make aptly work with Apache [puppet] - https://gerrit.wikimedia.org/r/328467 (https://phabricator.wikimedia.org/T153814)
[02:26:12] (CR) Tim Landscheidt: [C: -1] "(See above.)" [puppet] - https://gerrit.wikimedia.org/r/328467 (https://phabricator.wikimedia.org/T153814) (owner: Tim Landscheidt)
[02:29:22] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.13) (duration: 08m 03s)
[02:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:50] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:35:30] PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[02:37:24] (PS2) Tim Landscheidt: Tools: Use exported resources for ssh host keys [puppet] - https://gerrit.wikimedia.org/r/329382 (https://phabricator.wikimedia.org/T153163)
[02:39:06] (CR) Tim Landscheidt: [C: -1] "DO NOT SUBMIT. The Tools puppetmaster does not support exported resources yet (but the code works, and that's very encouraging)." [puppet] - https://gerrit.wikimedia.org/r/329382 (https://phabricator.wikimedia.org/T153163) (owner: Tim Landscheidt)
[02:42:20] PROBLEM - Check systemd state on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:42:30] PROBLEM - puppet last run on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:42:40] PROBLEM - Check whether ferm is active by checking the default input chain on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:43:20] RECOVERY - Check systemd state on bast3001 is OK: OK - running: The system is fully operational
[02:43:20] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 24 minutes ago with 0 failures
[02:43:30] RECOVERY - Check whether ferm is active by checking the default input chain on bast3001 is OK: OK ferm input default policy is set
[02:47:30] (PS3) BBlack: authdns: re-structure prep for discovery [puppet] - https://gerrit.wikimedia.org/r/340156 (https://phabricator.wikimedia.org/T156100)
[02:47:40] (CR) BBlack: [V: 2 C: 2] authdns: re-structure prep for discovery [puppet] - https://gerrit.wikimedia.org/r/340156 (https://phabricator.wikimedia.org/T156100) (owner: BBlack)
[02:47:57] (CR) BBlack: [V: 2 C: 2] geo config structure changes for discovery [dns] - https://gerrit.wikimedia.org/r/340154 (https://phabricator.wikimedia.org/T156100) (owner: BBlack)
[02:48:09] (PS4) BBlack: geo config structure changes for discovery [dns] - https://gerrit.wikimedia.org/r/340154
(https://phabricator.wikimedia.org/T156100)
[02:48:11] (CR) BBlack: [V: 2 C: 2] geo config structure changes for discovery [dns] - https://gerrit.wikimedia.org/r/340154 (https://phabricator.wikimedia.org/T156100) (owner: BBlack)
[02:53:22] (CR) 20after4: "Perhaps apache::static_site is just broken? I don't see it used anywhere else. Try apache::site instead?" [puppet] - https://gerrit.wikimedia.org/r/328467 (https://phabricator.wikimedia.org/T153814) (owner: Tim Landscheidt)
[02:58:50] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[03:00:38] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.14) (duration: 13m 51s)
[03:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:00:50] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:03:30] RECOVERY - puppet last run on kraz is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[03:06:25] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Mar 1 03:06:24 UTC 2017 (duration 5m 46s)
[03:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:08:09] (CR) Tim Landscheidt: [C: -1] "Not really fond of that because I don't like writing Apache configurations from scratch as I tend to miss some security-relevant boilerpla" [puppet] - https://gerrit.wikimedia.org/r/328467 (https://phabricator.wikimedia.org/T153814) (owner: Tim Landscheidt)
[03:13:09] (PS3) Tim Landscheidt: Tools: Use exported resources for ssh host keys [puppet] - https://gerrit.wikimedia.org/r/329382 (https://phabricator.wikimedia.org/T153163)
[03:14:57] Operations, Puppet, Patch-For-Review: apache::static_site is not working - https://phabricator.wikimedia.org/T153816#3063519 (scfc)
[03:16:17] (CR) Tim Landscheidt: "DO NOT SUBMIT.
The Tools puppetmaster does not support exported resources yet (but the code works, and that's very encouraging)." [puppet] - https://gerrit.wikimedia.org/r/329382 (https://phabricator.wikimedia.org/T153163) (owner: Tim Landscheidt)
[03:19:14] (PS3) Tim Landscheidt: Allow use of PuppetDB in labs for ssh_known_hosts [puppet] - https://gerrit.wikimedia.org/r/333471 (https://phabricator.wikimedia.org/T72792) (owner: Alex Monk)
[03:23:08] (CR) Tim Landscheidt: [C: 1] "I tested that the key generated by Sshkey['gerrit'] would still show up in ssh_known_hosts and added a guard around the File['ssh_known_ho" [puppet] - https://gerrit.wikimedia.org/r/333471 (https://phabricator.wikimedia.org/T72792) (owner: Alex Monk)
[03:28:50] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[03:29:33] (CR) Krinkle: "@Joe Reminder: confctl doesn't exist on wasat. Tested with a workaround (got the output from an app server instead)." [puppet] - https://gerrit.wikimedia.org/r/339802 (https://phabricator.wikimedia.org/T156922) (owner: Krinkle)
[03:33:10] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[03:40:33] (PS11) Krinkle: Changes to perf consumer of event logging events [puppet] - https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: Nuria)
[03:41:19] (CR) jerkins-bot: [V: -1] Changes to perf consumer of event logging events [puppet] - https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: Nuria)
[03:54:54] (PS2) Tim Landscheidt: puppetdb: Reduce shared_buffers in Labs to 128 MBytes [puppet] - https://gerrit.wikimedia.org/r/329390
[03:54:56] (PS1) Tim Landscheidt: puppetdb: Set defaults for replication in Labs [puppet] - https://gerrit.wikimedia.org/r/340460
[03:54:58] (PS1) Tim Landscheidt: puppet: Make standalone puppetmasters optionally use PuppetDB [puppet] - https://gerrit.wikimedia.org/r/340461 (https://phabricator.wikimedia.org/T153577)
[03:55:00] (PS1) Tim Landscheidt: puppetdb: Allow to use Apache as frontend [puppet] - https://gerrit.wikimedia.org/r/340462 (https://phabricator.wikimedia.org/T154105)
[04:02:10] RECOVERY - puppet last run on elastic1031 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[04:17:50] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:24:51] (PS12) Krinkle: Changes to perf consumer of event logging events [puppet] - https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: Nuria)
[04:25:02] (CR) Krinkle: "Fixed failing test to ensure both return Other._ instead of Other.- or None."
[puppet] - https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: Nuria)
[04:25:40] Krinkle: thank you for doing that, been swamped with budget/annual plan stuff
[04:25:45] (CR) jerkins-bot: [V: -1] Changes to perf consumer of event logging events [puppet] - https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: Nuria)
[04:25:51] Krinkle: will add minor vs as requested
[04:26:24] (PS13) Krinkle: Changes to perf consumer of event logging events [puppet] - https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: Nuria)
[04:27:15] (CR) jerkins-bot: [V: -1] Changes to perf consumer of event logging events [puppet] - https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: Nuria)
[04:28:00] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:31:42] (CR) Tim Landscheidt: "> […]" [puppet] - https://gerrit.wikimedia.org/r/333473 (https://phabricator.wikimedia.org/T72792) (owner: Alex Monk)
[04:38:27] (PS14) Krinkle: Changes to perf consumer of event logging events [puppet] - https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: Nuria)
[04:38:32] (CR) Krinkle: "tox fix" [puppet] - https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: Nuria)
[04:43:18] (PS3) Tim Landscheidt: puppetdb: Use tuning.conf only in production [puppet] - https://gerrit.wikimedia.org/r/329390
[04:43:20] (PS2) Tim Landscheidt: puppetdb: Set defaults for replication in Labs [puppet] - https://gerrit.wikimedia.org/r/340460
[04:43:22] (PS2) Tim Landscheidt: puppet: Make standalone puppetmasters optionally use PuppetDB [puppet] - https://gerrit.wikimedia.org/r/340461 (https://phabricator.wikimedia.org/T153577)
[04:43:25] (PS2) Tim
Landscheidt: puppetdb: Allow to use Apache as frontend [puppet] - https://gerrit.wikimedia.org/r/340462 (https://phabricator.wikimedia.org/T154105)
[04:45:50] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[04:47:50] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:49:34] Puppet, Labs, Tool-Labs, Patch-For-Review: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577#3063602 (scfc) I'm pretty sure the patches work except that I can't get them to work on `toolsbeta-puppetmaster7` due to some PostgreSQL hiccups (our puppet...
[04:49:49] Puppet, Labs, Tool-Labs, Patch-For-Review: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577#3063603 (scfc) (… or anyone else can do that.)
[04:54:50] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:57:00] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[05:09:30] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:14:50] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[05:23:50] RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[05:37:30] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[05:58:20] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:13:40] PROBLEM - puppet last run on wdqs1003 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[06:28:20] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:38:30] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:41:40] RECOVERY - puppet last run on wdqs1003 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:53:18] <_joe_> Krinkle: is the warmup script ready in your opinion? I wanted to add the necessary boilerplate to your puppet change and merge it today
[06:53:32] <_joe_> uhm, he might not be around :P
[06:54:10] (CR) Giuseppe Lavagetto: [C: 1] "Seems ok and the impact is limited anyways." [puppet] - https://gerrit.wikimedia.org/r/339657 (https://phabricator.wikimedia.org/T158782) (owner: Gehel)
[07:05:08] !log Deploy alter table enwiki.revision - dbstore2002 - T132416
[07:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:15] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[07:06:30] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[07:10:31] (PS2) Dzahn: mgmt: script to detect vendor by mgmt ssh banner [puppet] - https://gerrit.wikimedia.org/r/340450 (https://phabricator.wikimedia.org/T156673)
[07:17:30] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:28:00] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[07:35:00] (PS1) EBernhardson: Test disable super_detect_noop script [mediawiki-config] - https://gerrit.wikimedia.org/r/340472
[07:39:57] !run pt-table-checksum on eowiki (s2) - T154485
[07:39:58] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485
[07:41:48] (PS2) Jcrespo: Revert "mariadb: Depool db1026 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/340417
[07:45:30] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[07:47:22] (PS16) Giuseppe Lavagetto: mediawiki: Add cache-warmup to maintenance [puppet] - https://gerrit.wikimedia.org/r/339802 (https://phabricator.wikimedia.org/T156922) (owner: Krinkle)
[07:52:13] (CR) Giuseppe Lavagetto: [C: -2] "There is absolutely no case in which using apache instead of nginx can be a good idea here; also, this makes the code too fat and less mai" [puppet] - https://gerrit.wikimedia.org/r/340462 (https://phabricator.wikimedia.org/T154105) (owner: Tim Landscheidt)
[07:57:00] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[08:06:58] !run pt-table-checksum on fiwiki (s2) - T154485
[08:06:58] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485
[08:12:30] PROBLEM - puppet last run on mw1283 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[08:13:54] (CR) Filippo Giunchedi: prometheus: add node tlsproxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/339465 (owner: Ema)
[08:17:03] (PS1) Ema: varnish: increase check_varnish_expiry_mailbox_lag alerting threshold [puppet] - https://gerrit.wikimedia.org/r/340475 (https://phabricator.wikimedia.org/T145661)
[08:18:35] !log installing libgd2 security updates on trusty (jessie already fixed)
[08:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:51] (CR) Ema: [C: 2] varnish: increase check_varnish_expiry_mailbox_lag alerting threshold [puppet] - https://gerrit.wikimedia.org/r/340475 (https://phabricator.wikimedia.org/T145661) (owner: Ema)
[08:21:37] (PS3) ArielGlenn: Add wikimedia-mirror.dh.bytemark.co.uk to dumps::rsync_clients [puppet] - https://gerrit.wikimedia.org/r/340447 (owner: Reedy)
[08:22:50] PROBLEM - Juniper alarms on mr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.199
[08:23:00] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.199 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[08:23:20] (CR) ArielGlenn: [C: 2] Add wikimedia-mirror.dh.bytemark.co.uk to dumps::rsync_clients [puppet] - https://gerrit.wikimedia.org/r/340447 (owner: Reedy)
[08:24:40] RECOVERY - Juniper alarms on mr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[08:24:50] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 38, down: 0, dormant: 0, excluded: 0, unused: 0
[08:27:50] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[08:41:17] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool db1026 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/340417 (owner: Jcrespo)
[08:41:30] RECOVERY - puppet last run on mw1283 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[08:42:40] (Merged) jenkins-bot: Revert "mariadb: Depool db1026 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/340417 (owner: Jcrespo)
[08:42:49] (CR) jenkins-bot: Revert "mariadb: Depool db1026 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/340417 (owner: Jcrespo)
[08:49:13] (PS2) Jcrespo: Revert "mariadb: Depool db1056 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/340416
[08:49:38] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1026 after maintenance (duration: 00m 40s)
[08:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:53] (PS1) Gehel: maps - fix nodejs 6 version [puppet] - https://gerrit.wikimedia.org/r/340480 (https://phabricator.wikimedia.org/T150354)
[08:52:48] (CR) Gehel: [C: 2] maps - fix nodejs 6 version [puppet] - https://gerrit.wikimedia.org/r/340480 (https://phabricator.wikimedia.org/T150354) (owner: Gehel)
[08:54:50] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[08:57:16] (PS1) Giuseppe Lavagetto: discovery: add global MW-related entry [puppet] - https://gerrit.wikimedia.org/r/340481
[08:57:39] (CR) Muehlenhoff: "But let's please drop that entire check once the node 4->6 migration is done, otherwise we'll have to update that with every node security" [puppet] - https://gerrit.wikimedia.org/r/340480 (https://phabricator.wikimedia.org/T150354) (owner: Gehel)
[08:58:29] (CR) Gehel: "Of course! And I'm hoping to have it done this week or early next week..."
[puppet] - https://gerrit.wikimedia.org/r/340480 (https://phabricator.wikimedia.org/T150354) (owner: Gehel)
[09:01:08] (PS2) Giuseppe Lavagetto: discovery: add global MW-related entry [puppet] - https://gerrit.wikimedia.org/r/340481
[09:01:36] (CR) Giuseppe Lavagetto: [V: 2 C: 2] discovery: add global MW-related entry [puppet] - https://gerrit.wikimedia.org/r/340481 (owner: Giuseppe Lavagetto)
[09:01:40] PROBLEM - DPKG on analytics1039 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:02:40] RECOVERY - DPKG on analytics1039 is OK: All packages OK
[09:06:48] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool db1056 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/340416 (owner: Jcrespo)
[09:07:52] (Merged) jenkins-bot: Revert "mariadb: Depool db1056 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/340416 (owner: Jcrespo)
[09:08:00] (CR) jenkins-bot: Revert "mariadb: Depool db1056 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/340416 (owner: Jcrespo)
[09:14:16] !log Deploy alter table s3 (all wikis) user_groups table - T155605
[09:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:21] T155605: Schema changes for expiring user groups - https://phabricator.wikimedia.org/T155605
[09:15:10] Operations, Ops-Access-Requests, Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3063875 (ema)
[09:15:48] (PS2) Ema: shell access for Shreyas Lakhtakia [puppet] - https://gerrit.wikimedia.org/r/339686 (owner: RobH)
[09:15:58] (CR) Ema: [V: 2 C: 2] shell access for Shreyas Lakhtakia [puppet] - https://gerrit.wikimedia.org/r/339686 (owner: RobH)
[09:20:29] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1056 after maintenance (duration: 00m 41s)
[09:20:33]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:10] Operations, Revision-Scoring-As-A-Service-Backlog: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#3063893 (fgiunchedi) @Halfak I jumped the gun on replication I think :) Taking a step back, I took a look at ores code and it seems redis is used for caching scores and celery...
[09:27:11] Operations, Ops-Access-Requests, Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3053488 (ema) No objections noted, account created. @shrlak please let us know if...
[09:27:20] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:28:20] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[09:33:50] !log running alter table on db1037 T147747
[09:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:16] Operations, discovery-system: confctl SubjectAltNameWarning after python-urllib3 upgrade - https://phabricator.wikimedia.org/T156232#3063947 (faidon) Ping! @ema or @joe?
[09:50:12] Operations: Puppet certificate missing subjectAltName - https://phabricator.wikimedia.org/T158757#3063952 (ema) p:Triage>Normal
[10:17:11] Operations, Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3064022 (ArielGlenn) The only thing this leaves out is future work: - possible use of swift or a similar system to store some datasets for which nfs access is not needed - ev...
[10:32:02] (03PS3) 10Jcrespo: [WIP]mariadb: Include a new option "socket" for all servers [puppet] - 10https://gerrit.wikimedia.org/r/339004 [10:32:49] (03PS3) 10Jcrespo: [WIP] Create scripts for batch sql execution [puppet] - 10https://gerrit.wikimedia.org/r/338809 [10:33:08] (03CR) 10jerkins-bot: [V: 04-1] [WIP]mariadb: Include a new option "socket" for all servers [puppet] - 10https://gerrit.wikimedia.org/r/339004 (owner: 10Jcrespo) [10:33:30] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [10:33:40] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:33:50] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:38:40] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [10:39:30] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [10:42:13] (03PS1) 10Filippo Giunchedi: hieradata: add oresrdb in codfw [puppet] - 10https://gerrit.wikimedia.org/r/340485 (https://phabricator.wikimedia.org/T139372) [10:42:53] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 0.1.35 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/340306 (https://phabricator.wikimedia.org/T149903) (owner: 10Gilles) [10:46:00] (03PS1) 10Jcrespo: Add db-codfw.php to noc.wikimedia.org visible config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340487 [10:47:50] (03PS2) 10Jcrespo: Add db-codfw.php to noc.wikimedia.org visible config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340487 [10:47:57] (03CR) 10Marostegui: [C: 031] "+100, very useful - I have always wondered why it wasn't there" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340487 (owner: 
10Jcrespo) [10:51:01] (03PS3) 10Jcrespo: Add db-codfw.php to noc.wikimedia.org visible config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340487 [10:51:33] 06Operations, 07Puppet: Puppet certificate missing subjectAltName - https://phabricator.wikimedia.org/T158757#3064059 (10Peachey88) [11:02:33] !log uploaded lz4 0.0~r131 for jessie-wikimedia to apt.wikimedia.org (required by HHVM 3.18) [11:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:50] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [11:06:23] 06Operations, 10Revision-Scoring-As-A-Service-Backlog, 13Patch-For-Review: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#3064098 (10fgiunchedi) [11:06:26] 06Operations, 10Revision-Scoring-As-A-Service-Backlog, 10hardware-requests, 13Patch-For-Review: Create one oresrdb VM in codfw - https://phabricator.wikimedia.org/T159207#3064095 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi VM in service [11:07:44] (03CR) 10Filippo Giunchedi: "I'm rsync'ing prometheus metrics bast3001 -> bast3002. This can be merged when the transfer is finished" [dns] - 10https://gerrit.wikimedia.org/r/340272 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [11:08:10] PROBLEM - DPKG on thumbor1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:08:18] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#3064100 (10fgiunchedi) [11:08:28] thumbor1001 is me [11:09:10] RECOVERY - DPKG on thumbor1001 is OK: All packages OK [11:10:08] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3064105 (10Gehel) Deployment plan (some [[ https://wikitech.wikimedia.org/wiki/Application_servers | additional docs o... 
[11:10:59] godog: https://youtu.be/tjTrFo-bITU?t=1m4s [11:11:23] elukey: hahaha [11:15:24] (03PS3) 10ArielGlenn: dumps: Redesign progress report page [puppet] - 10https://gerrit.wikimedia.org/r/339332 (https://phabricator.wikimedia.org/T155697) (owner: 10Ladsgroup) [11:22:50] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:27:02] !log upgrading nginx on meiterium/archiva.wikimedia.org to 1.11.4 (using openssl 1.1) [11:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:58] !log running alter table on db2037 T147747 [11:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:19] (03PS1) 10Gehel: elasticsearch: provide elasticsearch 5.x in the repo [puppet] - 10https://gerrit.wikimedia.org/r/340491 (https://phabricator.wikimedia.org/T159168) [11:39:30] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table flaggedrevs_labswikimedia.user_groups doesnt exist on query. Default database: flaggedrevs_labswikimedia. [Query snipped] [11:39:58] ^marostegui, that is yesterday issue on the delayed slaves [11:41:43] yeah [11:41:45] I was tailing it [11:41:56] As I was waiting for it [11:42:03] :( [11:43:00] I will silence it as there will be more coming today [11:43:30] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [11:45:33] (03PS4) 10Ema: vcl: grace, keep and expired TTLs [puppet] - 10https://gerrit.wikimedia.org/r/340335 [11:47:10] PROBLEM - puppet last run on mw1173 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:50:50] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [11:55:27] (03PS5) 10Ema: vcl: grace, keep and expired TTLs [puppet] - 10https://gerrit.wikimedia.org/r/340335 [12:05:30] PROBLEM - HP RAID on db2037 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [12:08:20] PROBLEM - Disk space on labstore1003 is CRITICAL: DISK CRITICAL - free space: /boot 10 MB (4% inode=99%) [12:09:38] probably high iops makes the check timeout? [12:11:12] fixing labstore1003 [12:11:23] 06Operations, 10Revision-Scoring-As-A-Service-Backlog, 13Patch-For-Review: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#3064206 (10Ladsgroup) >>! In T139372#3063893, @fgiunchedi wrote: > in case of active/active configuration the caches will be split though I don't think tha... [12:11:50] PROBLEM - DPKG on labvirt1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:12:13] all disks are ok there [12:12:20] RECOVERY - Disk space on labstore1003 is OK: DISK OK [12:14:41] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:14:50] RECOVERY - HP RAID on db2037 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor [12:15:10] RECOVERY - puppet last run on mw1173 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:15:50] RECOVERY - DPKG on labvirt1001 is OK: All packages OK [12:20:05] (03Draft1) 10Paladox: Zuul: Add quotes around running for git-daemon service [puppet] - 10https://gerrit.wikimedia.org/r/340495 [12:20:08] (03PS2) 10Paladox: Zuul: Add quotes around running for git-daemon service [puppet] - 10https://gerrit.wikimedia.org/r/340495 [12:22:04] !log upgrade thumbor to 0.1.13 on thumbor100[12] [12:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:41] (03Draft1) 10Paladox: Zuul: Make sure git-daemon starts after installing it [puppet] - 10https://gerrit.wikimedia.org/r/340496 [12:27:10] (03PS2) 10Paladox: Zuul: Make sure git-daemon starts after installing it [puppet] - 10https://gerrit.wikimedia.org/r/340496 (https://phabricator.wikimedia.org/T157785) [12:28:16] (03CR) 10Paladox: Zuul: Make sure git-daemon starts after installing it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/340496 (https://phabricator.wikimedia.org/T157785) (owner: 10Paladox) [12:30:00] (03PS3) 10Paladox: Zuul: Make sure git-daemon starts after installing it [puppet] - 10https://gerrit.wikimedia.org/r/340496 (https://phabricator.wikimedia.org/T157785) [12:34:43] (03PS4) 10Paladox: Zuul: Make sure git-daemon starts after installing it [puppet] - 10https://gerrit.wikimedia.org/r/340496 (https://phabricator.wikimedia.org/T157785) [12:41:40] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:41:50] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:43:40] RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [12:43:52] !log installing apache2 security updates on mw1261 [12:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:30] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [12:44:40] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:56:42] (03PS15) 10Fdans: Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [12:57:38] (03CR) 10jerkins-bot: [V: 04-1] Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [12:59:28] (03PS1) 10Gehel: relforge: use experimental apt repo to have access to elasticsearch 5 [puppet] - 10https://gerrit.wikimedia.org/r/340500 (https://phabricator.wikimedia.org/T159168) [13:00:41] (03CR) 10Muehlenhoff: [C: 031] relforge: use experimental apt repo to have access to elasticsearch 5 [puppet] - 10https://gerrit.wikimedia.org/r/340500 (https://phabricator.wikimedia.org/T159168) (owner: 10Gehel) [13:03:42] (03PS16) 10Fdans: Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [13:04:35] (03CR) 10jerkins-bot: [V: 04-1] Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [13:04:41] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3064313 (10chasemp) >>! 
In T118154#3064022, @ArielGlenn wrote: > The only thing this leaves out is future work: > > - possible use of swift or a similar system to store some dat... [13:13:38] (03Abandoned) 10Gehel: elasticsearch: provide elasticsearch 5.x in the repo [puppet] - 10https://gerrit.wikimedia.org/r/340491 (https://phabricator.wikimedia.org/T159168) (owner: 10Gehel) [13:16:33] !log run pt-table-checksum on idwiki - T154485 [13:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:38] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485 [13:22:30] (03PS2) 10Gehel: relforge: use experimental apt repo to have access to elasticsearch 5 [puppet] - 10https://gerrit.wikimedia.org/r/340500 (https://phabricator.wikimedia.org/T159168) [13:34:33] jouncebot: refresh [13:34:37] I refreshed my knowledge about deployments. [13:34:39] jouncebot: next [13:34:39] In 0 hour(s) and 25 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170301T1400) [13:37:40] (03PS3) 10Hashar: interface: IPAddr.new() requires an address family [puppet] - 10https://gerrit.wikimedia.org/r/336840 [13:37:57] (03CR) 10Hashar: "PS3 fix ruby style (rubocop)" [puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar) [13:49:14] (03PS6) 10Ema: vcl: grace, keep and expired TTLs [puppet] - 10https://gerrit.wikimedia.org/r/340335 [13:56:30] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3064388 (10chasemp) ```We need to take labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet offline to update them from Ubuntu Precise to Debian Jessie on 2017-03-08. This... 
[13:56:43] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3064389 (10chasemp) [13:57:03] (03PS8) 10Elukey: [WIP] Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 [13:58:18] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 (owner: 10Elukey) [14:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170301T1400). Please do the needful. [14:00:05] dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:36] o/ [14:00:48] dcausse: wanna push them ? [14:00:53] hashar: sure [14:01:03] they look all fine to me [14:01:12] though I can tell what havoc it might cause on elasticsearch :}} [14:01:18] I can NOT tell [14:01:19] :D [14:01:23] :) [14:01:38] do proceed. 
I am around if you need any assistance [14:01:45] ok swating [14:01:58] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340472 (owner: 10EBernhardson) [14:02:42] o/ [14:02:56] also around, but looks like things are taken care of [14:03:22] (03PS9) 10Elukey: [WIP] Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 [14:05:19] (03Merged) 10jenkins-bot: Test disable super_detect_noop script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340472 (owner: 10EBernhardson) [14:06:30] (03CR) 10jenkins-bot: Test disable super_detect_noop script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340472 (owner: 10EBernhardson) [14:07:14] (03CR) 10Elukey: "Decided to reduce the scope of the change not adding the profile for single/multiple Redis instance on the same host (my original use case" [puppet] - 10https://gerrit.wikimedia.org/r/333880 (owner: 10Elukey) [14:12:42] !log dcausse@tin Synchronized wmf-config/CirrusSearch-common.php: [cirrus] Test disable super_detect_noop script (duration: 00m 47s) [14:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:41] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339408 (owner: 10DCausse) [14:16:37] (03PS2) 10DCausse: [cirrus] cleanup old A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339408 [14:17:18] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339408 (owner: 10DCausse) [14:17:58] hashar: I +2ed but then realized that I needed to rebase, do I need to remove/readd my +2? 
[14:18:34] dcausse: yes :( [14:18:38] dcausse: hit [Rebase] [14:18:41] drop your CR+2 vote [14:18:43] and vote again [14:18:56] (03CR) 10DCausse: [cirrus] cleanup old A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339408 (owner: 10DCausse) [14:19:07] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339408 (owner: 10DCausse) [14:22:56] (03PS10) 10Elukey: [WIP] Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 [14:23:09] (03Merged) 10jenkins-bot: [cirrus] cleanup old A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339408 (owner: 10DCausse) [14:23:21] (03CR) 10jenkins-bot: [cirrus] cleanup old A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339408 (owner: 10DCausse) [14:26:06] !log elukey@tin Started deploy [analytics/refinery@33db287]: (no justification provided) [14:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:50] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=81%) [14:27:04] 06Operations, 10hardware-requests: additional graphite machines request, 1x per DC - https://phabricator.wikimedia.org/T126253#3064462 (10fgiunchedi) [14:27:06] 06Operations, 13Patch-For-Review: rack and set up graphite1003 - https://phabricator.wikimedia.org/T132717#3064460 (10fgiunchedi) 05Open>03Resolved This was completed some time ago but never resolved, doing so now. 
[14:27:17] stat1002 is due to me, fixing it [14:27:21] ah [14:27:30] !log elukey@tin Finished deploy [analytics/refinery@33db287]: (no justification provided) (duration: 01m 24s) [14:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:38] I was going to say, checking it if I had to shout to someone :-) [14:27:47] !log dcausse@tin Synchronized wmf-config/CirrusSearch-common.php: [cirrus] cleanup old A/B test (duration: 00m 40s) [14:27:50] RECOVERY - Disk space on stat1002 is OK: DISK OK [14:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:00] I remember those filling in because user-created temporary files [14:28:28] 10Blocked-on-Operations, 06Operations, 10Graphite, 06WMDE-Analytics-Engineering, and 3 others: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#3064466 (10fgiunchedi) [14:28:31] 06Operations, 10Graphite, 13Patch-For-Review: put additional graphite machines in service - https://phabricator.wikimedia.org/T134889#3064464 (10fgiunchedi) 05Open>03Resolved This was completed some time ago but never resolved, doing so now. 
[14:29:14] !log EU SWAT Done [14:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:25] !log elukey@tin Started deploy [analytics/refinery@33db287]: (no justification provided) [14:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:38] !log elukey@tin Finished deploy [analytics/refinery@33db287]: (no justification provided) (duration: 01m 13s) [14:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:50] PROBLEM - DPKG on labvirt1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:35:30] 06Operations, 10Wikimedia-General-or-Unknown, 06Wikisource: Upgrade Ghostscript to 9.15 or later - https://phabricator.wikimedia.org/T110849#3064479 (10Aklapper) [14:35:50] RECOVERY - DPKG on labvirt1001 is OK: All packages OK [14:37:55] (03PS11) 10Elukey: [WIP] Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 [14:42:19] (03PS12) 10Elukey: [WIP] Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 [14:46:23] 06Operations, 10Wikimedia-General-or-Unknown, 06Wikisource: Upgrade Ghostscript to 9.15 or later - https://phabricator.wikimedia.org/T110849#1588033 (10MoritzMuehlenhoff) We can't easily upgrade ghostscript (for the reasons already provided and also because is also provides a library). The next Debian releas... [14:52:33] (03PS4) 10Gehel: portals: do not rewrite 404 errors [puppet] - 10https://gerrit.wikimedia.org/r/339657 (https://phabricator.wikimedia.org/T158782) [14:54:45] !log starting deployment of mediawiki apache config - T158782 [14:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:51] T158782: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782 [14:55:58] gehel: have we tested in deployment-prep? 
[14:56:20] elukey: yes, the change is cherry-picked on deployment-prep [14:56:23] super [14:56:29] just wanted to make sure :) [14:56:44] elukey: good to know that you are watching me (or not)... [14:56:50] (03CR) 10Gehel: [C: 032] portals: do not rewrite 404 errors [puppet] - 10https://gerrit.wikimedia.org/r/339657 (https://phabricator.wikimedia.org/T158782) (owner: 10Gehel) [14:57:09] let me know if you need any help, apache is always a bit scary to deploy [14:57:36] * gehel is a bit scared, but should be able to manage [14:58:40] gehel: if you want to be really sure, disable puppet on all mw1*, run it on codfw, and then in batches in eqiad [14:59:09] in codfw you could even run apache-fast-test on a couple of hosts to have some sort of feedback [14:59:18] * gehel is following the plan at https://phabricator.wikimedia.org/T158782#3064105 [14:59:21] not sure if needed but last time I used this procedure [14:59:42] but I should add deploying codfw first, thanks! [15:00:43] ah I didn't know it, nice :) [15:03:05] 06Operations, 10Domains, 10Traffic: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3064563 (10Aklapper) @Kaarel_Vaidla: Did the last comments help? [15:05:41] !log mwdebug1001 looks good, deploying on mw1209 - T158782 [15:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:46] T158782: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782 [15:10:39] !log mw1209 looks good, deploying on codfw - T158782 [15:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:26] (03PS17) 10Hashar: jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 [15:14:17] (03CR) 10Hashar: "I have removed the ajp13port=1, it is disabled by default." 
[puppet] - 10https://gerrit.wikimedia.org/r/337404 (owner: 10Hashar) [15:18:51] !log testing a few host on codfw looks good, deploying on eqiad - T158782 [15:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:58] T158782: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782 [15:21:31] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3064588 (10ArielGlenn) At Marks' request, here is a precise description of what dataset1003 and 1004 would be doing. (Maybe we want to give them different names?) Dataset1003 -... [15:23:50] !log elukey@tin Started deploy [analytics/refinery@b4a8fcc]: (no justification provided) [15:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:05] !log elukey@tin Finished deploy [analytics/refinery@b4a8fcc]: (no justification provided) (duration: 02m 15s) [15:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:33] !log deploying on eqiad completed - T158782 [15:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:38] T158782: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782 [15:29:04] _joe_: it seems I haven't broken anything in horrible ways. I'll keep an eye on things for a bit, but it looks good... 
[15:29:23] <_joe_> gehel: cool [15:30:42] (03PS1) 10Ema: varnishtest: mock VCL configuration [puppet] - 10https://gerrit.wikimedia.org/r/340511 [15:35:09] !log running alter table on db1034 T147747 [15:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:33] (03PS2) 10Muehlenhoff: Script for offboarding a user from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/340346 (https://phabricator.wikimedia.org/T142825) [15:40:26] (03CR) 10jerkins-bot: [V: 04-1] Script for offboarding a user from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/340346 (https://phabricator.wikimedia.org/T142825) (owner: 10Muehlenhoff) [15:42:01] (03PS3) 10Muehlenhoff: Script for offboarding a user from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/340346 (https://phabricator.wikimedia.org/T142825) [15:44:12] !log joal@tin Started deploy [analytics/refinery@b4a8fcc]: (no justification provided) [15:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:25] !log joal@tin Finished deploy [analytics/refinery@b4a8fcc]: (no justification provided) (duration: 00m 13s) [15:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:15] !log joal@tin Started deploy [analytics/refinery@f4a5020]: (no justification provided) [15:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:33] sorry for the spam - messed up my previous deployment :S [15:45:48] (03CR) 10Ema: [C: 032] vcl: grace, keep and expired TTLs [puppet] - 10https://gerrit.wikimedia.org/r/340335 (owner: 10Ema) [15:45:52] !log Resume pt-table-checksum on idwiki (s2) - T154485 [15:45:57] (03PS7) 10Ema: vcl: grace, keep and expired TTLs [puppet] - 10https://gerrit.wikimedia.org/r/340335 [15:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:57] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485 
[15:46:03] (03CR) 10Ema: [V: 032 C: 032] vcl: grace, keep and expired TTLs [puppet] - 10https://gerrit.wikimedia.org/r/340335 (owner: 10Ema) [15:47:48] !log joal@tin Finished deploy [analytics/refinery@f4a5020]: (no justification provided) (duration: 02m 33s) [15:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:05] (03CR) 10Tim Landscheidt: "The (production) Labs puppetmaster (role::labs::puppetmaster) is a special case of role::puppetmaster::standalone, i. e. using Apache as a" [puppet] - 10https://gerrit.wikimedia.org/r/340462 (https://phabricator.wikimedia.org/T154105) (owner: 10Tim Landscheidt) [15:57:20] (03PS2) 10Ema: varnishtest: mock VCL configuration [puppet] - 10https://gerrit.wikimedia.org/r/340511 [16:00:57] 06Operations, 10Revision-Scoring-As-A-Service-Backlog, 13Patch-For-Review: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#3064683 (10fgiunchedi) >>! In T139372#3064206, @Ladsgroup wrote: >>>! In T139372#3063893, @fgiunchedi wrote: >> in case of active/active configuration the... [16:02:00] PROBLEM - puppet last run on kafka1020 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:02:00] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4086967 keys, up 121 days 7 hours - replication_delay is 626 [16:02:10] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 632 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4087150 keys, up 121 days 7 hours - replication_delay is 632 [16:02:51] I can see [16:02:52] master_link_status:down [16:02:52] master_last_io_seconds_ago:-1 [16:02:52] master_sync_in_progress:0 [16:02:52] slave_repl_offset:1 [16:02:54] master_link_down_since_seconds:659 [16:03:03] but pretty sure it will recover soon [16:03:16] 06Operations, 10fundraising-tech-ops, 10netops: set up firewall policies for barium replacement civi1001 - https://phabricator.wikimedia.org/T159336#3064690 (10Jgreen) a:05Jgreen>03None [16:03:29] (03CR) 10Volans: [C: 04-1] "See few possible improvements inline." (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/340346 (https://phabricator.wikimedia.org/T142825) (owner: 10Muehlenhoff) [16:07:07] ahhh the link down is with rdb2005.codfw.wmnet, since rdb2005.codfw.wmnet. and rdb2006 replicates between themselves. [16:07:10] weeeird [16:08:22] aaand rdb2005 has issues replicating with rdb1007 (his master) [16:08:31] so it is a cascade failure [16:09:29] (03PS11) 10BBlack: [WIP] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) [16:11:36] (03CR) 10BBlack: [WIP] DNS: service discovery (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [16:12:10] PROBLEM - puppet last run on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:18:09] (03PS12) 10BBlack: [WIP] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) [16:24:00] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 34 minutes ago with 0 failures [16:27:17] (03PS4) 10Tim Landscheidt: puppetdb: Use tuning.conf only in production [puppet] - 10https://gerrit.wikimedia.org/r/329390 [16:27:19] (03PS3) 10Tim Landscheidt: puppetdb: Set defaults for replication in Labs [puppet] - 10https://gerrit.wikimedia.org/r/340460 [16:27:21] (03PS3) 10Tim Landscheidt: puppet: Make standalone puppetmasters optionally use PuppetDB [puppet] - 10https://gerrit.wikimedia.org/r/340461 (https://phabricator.wikimedia.org/T153577) [16:27:23] (03PS3) 10Tim Landscheidt: puppetdb: Allow to use Apache as frontend [puppet] - 10https://gerrit.wikimedia.org/r/340462 (https://phabricator.wikimedia.org/T154105) [16:29:15] (03CR) 10Tim Landscheidt: "(Only rebased for update in dependency change.)" [puppet] - 10https://gerrit.wikimedia.org/r/340462 (https://phabricator.wikimedia.org/T154105) (owner: 10Tim Landscheidt) [16:30:10] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4088947 keys, up 121 days 8 hours - replication_delay is 0 [16:31:00] RECOVERY - puppet last run on kafka1020 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:31:00] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4088835 keys, up 121 days 8 hours - replication_delay is 0 [16:31:09] 06Operations, 10fundraising-tech-ops, 10netops: reassign wmf7010/frpm1001 to host "civi1001.frack.eqiad.wmnet" - https://phabricator.wikimedia.org/T159342#3064786 (10Jgreen) [16:37:05] 06Operations, 06Discovery, 06Discovery-Search (Current work), 13Patch-For-Review: Add elasticsearch 5 .deb to reprepro experimental repository 
- https://phabricator.wikimedia.org/T159168#3058848 (10Gehel) deb uploaded (`reprepro -C experimental includedeb jessie-wikimedia ~gehel/elasticsearch_5.2.2_all.deb`) [16:37:22] (03PS3) 10Gehel: relforge: use experimental apt repo to have access to elasticsearch 5 [puppet] - 10https://gerrit.wikimedia.org/r/340500 (https://phabricator.wikimedia.org/T159168) [16:44:21] (03PS13) 10BBlack: [WIP] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) [16:45:18] (03PS1) 10RobH: lists.w.o : updating a new LE interval check for smtp certificates [puppet] - 10https://gerrit.wikimedia.org/r/340525 [16:47:53] 06Operations, 10Traffic, 10Wikimedia-Mailing-lists, 13Patch-For-Review: convert lists.wikimedia.org certificate to LetsEncrypt (deadline:2017-03-02) - https://phabricator.wikimedia.org/T154917#3064881 (10RobH) I checked with Brandon, and he reminded me that while he did update the script and now it support... [16:49:55] (03PS2) 10Volans: Cumin: fine tuning ssh_options [puppet] - 10https://gerrit.wikimedia.org/r/340104 (https://phabricator.wikimedia.org/T159127) [16:50:40] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] [16:51:35] (03CR) 10Dzahn: [C: 031] lists.w.o : updating a new LE interval check for smtp certificates [puppet] - 10https://gerrit.wikimedia.org/r/340525 (owner: 10RobH) [16:53:57] (03PS4) 10Ema: tlsproxy: add prometheus support [puppet] - 10https://gerrit.wikimedia.org/r/339465 [16:56:14] (03CR) 10Ema: tlsproxy: add prometheus support (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/339465 (owner: 10Ema) [16:56:40] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:57:20] (03CR) 10BBlack: [V: 04-1] lists.w.o : updating a new LE interval check for smtp certificates (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/340525 (owner: 10RobH) [16:57:52] bblack: duly noted, will fix! [16:59:09] (03PS2) 10RobH: lists.w.o : updating a new LE interval check for smtp certificates [puppet] - 10https://gerrit.wikimedia.org/r/340525 [17:03:13] 06Operations, 10Mail, 10Traffic: convert mail servers from GS to LE certificates - https://phabricator.wikimedia.org/T159346#3064917 (10RobH) [17:04:23] (03CR) 10jerkins-bot: [V: 04-1] [WIP] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [17:07:01] (03PS14) 10BBlack: [WIP] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) [17:10:09] (03CR) 10BBlack: [C: 031] lists.w.o : updating a new LE interval check for smtp certificates [puppet] - 10https://gerrit.wikimedia.org/r/340525 (owner: 10RobH) [17:10:40] (03CR) 10Dzahn: "does a lint or style check actually say to do this? i am always unsure about these. but it seems like "no quotes" is correct per https://d" [puppet] - 10https://gerrit.wikimedia.org/r/340495 (owner: 10Paladox) [17:11:26] (03CR) 10Paladox: "> does a lint or style check actually say to do this? i am always" [puppet] - 10https://gerrit.wikimedia.org/r/340495 (owner: 10Paladox) [17:11:33] (03CR) 10RobH: [C: 032] lists.w.o : updating a new LE interval check for smtp certificates [puppet] - 10https://gerrit.wikimedia.org/r/340525 (owner: 10RobH) [17:11:37] (03CR) 10Dzahn: "well.. it can be both" [puppet] - 10https://gerrit.wikimedia.org/r/340495 (owner: 10Paladox) [17:11:59] (03CR) 10Paladox: "> well.. it can be both" [puppet] - 10https://gerrit.wikimedia.org/r/340495 (owner: 10Paladox) [17:12:56] "=>" is called a "fat comma" or a "hash rocket"? 
heh, ok [17:13:07] lol [17:14:29] check successfully updated [17:14:37] icinga didnt die. wooo \o/ [17:14:52] :) [17:15:03] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#3064954 (10RobH) [17:15:04] 06Operations, 10Traffic, 10Wikimedia-Mailing-lists, 13Patch-For-Review: convert lists.wikimedia.org certificate to LetsEncrypt (deadline:2017-03-02) - https://phabricator.wikimedia.org/T154917#3064952 (10RobH) 05Open>03Resolved Updated the check and resolving this task. [17:15:13] (03CR) 10jerkins-bot: [V: 04-1] [WIP] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [17:15:43] paladox: when i look at this globally, across all modules.. i guess we have a solid 50/50 inconsistency :) [17:15:54] oh lol [17:16:35] (03PS3) 10Dzahn: Zuul: Add quotes around running for git-daemon service [puppet] - 10https://gerrit.wikimedia.org/r/340495 (owner: 10Paladox) [17:16:40] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240497 (10RobH) [17:19:24] 06Operations, 10Mail, 10Traffic: convert mail servers from GS to LE certificates - https://phabricator.wikimedia.org/T159346#3064964 (10RobH) The last time the mx systems had cert work done was via T144568. 
[17:20:41] (03PS1) 10Hashar: contint: Zuul no more interact with Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/340529 [17:22:13] (03PS15) 10BBlack: [WIP] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) [17:22:47] (03CR) 10Dzahn: [C: 032] Zuul: Add quotes around running for git-daemon service [puppet] - 10https://gerrit.wikimedia.org/r/340495 (owner: 10Paladox) [17:25:17] 06Operations, 10Revision-Scoring-As-A-Service-Backlog, 13Patch-For-Review: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#2430068 (10Joe) >>! In T139372#3064206, @Ladsgroup wrote: >>>! In T139372#3063893, @fgiunchedi wrote: >> in case of active/active configuration the caches... [17:25:41] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:25:56] paladox: what is the difference between a "standard comment" and a "velocity comment"? [17:26:04] (03CR) 10Hashar: "Puppet compile https://puppet-compiler.wmflabs.org/5617/ shows the [jenkins] section is entirely removed." [puppet] - 10https://gerrit.wikimedia.org/r/340529 (owner: 10Hashar) [17:26:22] Well, a velocity comment uses a template in which you write the comment [17:26:36] and standard comment is used from the its-base plugin [17:26:41] mutante ^^ [17:26:53] aha, ok [17:27:19] yep [17:27:34] and the event-type "change-merged" exists? [17:27:42] yep [17:27:52] that's Gerrit's REST API it uses. [17:28:04] (03PS5) 10Dzahn: Gerrit: Report repo in comment on merged patches too [puppet] - 10https://gerrit.wikimedia.org/r/340435 (https://phabricator.wikimedia.org/T159202) (owner: 10Paladox) [17:29:44] Anyone having trouble connecting to prod via ssh? Can't get into any of the bastions, timing out. tools-bastion.wmflabs working fine, however. 
[17:30:40] hashar: can do that now if you are here and want to monitor [17:30:59] Krinkle: i can still ssh to bast1001 as normal [17:32:15] Krinkle: possibly "fatal: no matching mac found:" ? [17:32:42] "Accepted publickey for krinkle" [17:32:51] mutante: No, it was just timing out. But I just got through now. [17:32:57] hm,ok [17:32:58] Took 10 attempts. Tried all 4 bastions (one at a time) [17:33:09] anyhow, nvm I guess.. [17:33:13] ok [17:33:30] (03PS3) 10Volans: Cumin: fine tuning ssh_options [puppet] - 10https://gerrit.wikimedia.org/r/340104 (https://phabricator.wikimedia.org/T159127) [17:35:08] 06Operations, 10Continuous-Integration-Config, 06Release-Engineering-Team, 06Wikipedia-Android-App-Backlog, and 2 others: Investigate how to improve Android CI performance and stability - https://phabricator.wikimedia.org/T158014#3064996 (10Niedzielski) @hashar, sorry to bring up the GPU question again wit... [17:42:03] PROBLEM - MariaDB Slave Lag: s1 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 611.69 seconds [17:43:08] I will ack that, it is not critical, and it is most likely the extra pressure of peak time + the one time alter table running now [17:43:46] 06Operations, 10Continuous-Integration-Config, 06Release-Engineering-Team, 06Wikipedia-Android-App-Backlog, and 2 others: Investigate how to improve Android CI performance and stability - https://phabricator.wikimedia.org/T158014#3023430 (10Legoktm) I guess {T159165} could be related? 
[17:43:56] puppet has been failing on copper for about a day [17:44:05] something with package "docker-engine" [17:46:41] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 644.52 seconds Jcrespo temporary overload due to ongoing schema change, it will go away on its own [17:47:12] mutante, let's try to git blame it on someone :-) [17:47:34] mutante: it is trying to force a version that is no longer in the repo [17:47:54] yes, i see "E: Version '1.12.5-0~debian-jessie' for 'docker-engine' was not found" [17:48:00] 06Operations, 06Performance-Team: Investigate if we can graph the age of the Thumbor processes in Grafana - https://phabricator.wikimedia.org/T159352#3065071 (10Gilles) [17:49:03] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 616 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4093497 keys, up 121 days 9 hours - replication_delay is 616 [17:49:13] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 624 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4093691 keys, up 121 days 9 hours - replication_delay is 624 [17:50:51] mutante: Hm.. seeing any errors now? That one connection got through but only after it was pending for a good minute. I closed it (doh!), and can't seem to get it back open. Trying bast4 at the moment. [17:51:13] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4089367 keys, up 121 days 9 hours - replication_delay is 44 [17:52:08] !log running alter table on db2044 T147747 [17:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:31] Krinkle: i see that a session is opened for your user and then closed again [17:53:10] mutante: Hm.. got through now. Takes like 2 minutes. Weird. 
[17:53:22] latency is fine after that [17:53:28] Krinkle: have you tried with -vvv to see where it stops? [17:54:06] !log autoremoving old kernels on terbium to make room on /boot [17:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:12] auth.log says : disconnected by user [17:54:17] Krinkle: ^^^ [17:54:18] volans: debug1: Connecting to bast2001.wikimedia.org [2620::...] port 22. [17:54:26] that's where it stops [17:54:33] I'm there right now in a tab [17:54:41] a problem on ipv6? [17:54:52] ping is responsive without issue [17:55:00] Krinkle: try -4 [17:55:33] then after 1-2 minutes, it's established and all is fine [17:55:53] volans: yeah, -4 is immediate every time [17:55:56] interesting [17:56:11] What's it doing for 2 minutes! [17:56:19] (03PS1) 10Dzahn: builder: update version of docker-engine [puppet] - 10https://gerrit.wikimedia.org/r/340536 [17:56:22] so definitely something related to the IPv6 path [17:56:41] (03CR) 10Volans: [C: 032] Cumin: fine tuning ssh_options [puppet] - 10https://gerrit.wikimedia.org/r/340104 (https://phabricator.wikimedia.org/T159127) (owner: 10Volans) [17:56:53] 06Operations, 06Performance-Team: Move coal from graphite machine(s) - https://phabricator.wikimedia.org/T159354#3065109 (10fgiunchedi) [17:57:02] 06Operations, 10Continuous-Integration-Config, 06Release-Engineering-Team, 06Wikipedia-Android-App-Backlog, and 2 others: Investigate how to improve Android CI performance and stability - https://phabricator.wikimedia.org/T158014#3065122 (10hashar) I can see how a GPU would accelerate UI drawing etc, reque... 
[18:01:13] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 644 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4089755 keys, up 121 days 9 hours - replication_delay is 644 [18:02:03] (03CR) 10Dzahn: [C: 032] builder: update version of docker-engine [puppet] - 10https://gerrit.wikimedia.org/r/340536 (owner: 10Dzahn) [18:02:10] (03PS6) 10Paladox: Gerrit: Report repo in comment on merged patches too [puppet] - 10https://gerrit.wikimedia.org/r/340435 (https://phabricator.wikimedia.org/T159202) [18:03:40] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/339465 (owner: 10Ema) [18:03:50] (03PS2) 10Dzahn: builder: update version of docker-engine [puppet] - 10https://gerrit.wikimedia.org/r/340536 [18:05:10] (03CR) 10Volans: "see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/340536 (owner: 10Dzahn) [18:07:26] (03PS3) 10Dzahn: builder: update version of docker-engine [puppet] - 10https://gerrit.wikimedia.org/r/340536 [18:08:46] (03PS17) 10Krinkle: mediawiki: Add cache-warmup to maintenance [puppet] - 10https://gerrit.wikimedia.org/r/339802 (https://phabricator.wikimedia.org/T156922) [18:08:52] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: Add cache-warmup to maintenance [puppet] - 10https://gerrit.wikimedia.org/r/339802 (https://phabricator.wikimedia.org/T156922) (owner: 10Krinkle) [18:09:49] (03PS18) 10Giuseppe Lavagetto: mediawiki: Add cache-warmup to maintenance [puppet] - 10https://gerrit.wikimedia.org/r/339802 (https://phabricator.wikimedia.org/T156922) (owner: 10Krinkle) [18:10:05] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mediawiki: Add cache-warmup to maintenance [puppet] - 10https://gerrit.wikimedia.org/r/339802 (https://phabricator.wikimedia.org/T156922) (owner: 10Krinkle) [18:10:12] 06Operations, 06Discovery, 06Discovery-Search, 10Wikidata, 10Wikidata-Query-Service: collect usual GC metrics for Blazegraph JVMs - 
https://phabricator.wikimedia.org/T159248#3065233 (10Gehel) Elasticsearch configuration is done through puppet (https://github.com/wikimedia/puppet/blob/production/modules/e... [18:11:03] RECOVERY - MariaDB Slave Lag: s1 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 266.54 seconds [18:11:14] RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [18:11:56] (03PS19) 10Giuseppe Lavagetto: mediawiki: Add cache-warmup to maintenance [puppet] - 10https://gerrit.wikimedia.org/r/339802 (https://phabricator.wikimedia.org/T156922) (owner: 10Krinkle) [18:12:13] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4089361 keys, up 121 days 9 hours - replication_delay is 0 [18:13:03] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4089609 keys, up 121 days 9 hours - replication_delay is 0 [18:13:48] (03CR) 10Muehlenhoff: Script for offboarding a user from LDAP (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/340346 (https://phabricator.wikimedia.org/T142825) (owner: 10Muehlenhoff) [18:14:01] (03PS4) 10Muehlenhoff: Script for offboarding a user from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/340346 (https://phabricator.wikimedia.org/T142825) [18:15:29] (03CR) 10jerkins-bot: [V: 04-1] Script for offboarding a user from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/340346 (https://phabricator.wikimedia.org/T142825) (owner: 10Muehlenhoff) [18:15:45] jouncebot: next [18:15:45] In 0 hour(s) and 44 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170301T1900) [18:16:33] thcipriani: FYI I'll merge your scap upgrade patch now [18:16:57] godog: awesome, thanks :) [18:17:04] (03PS2) 10Filippo Giunchedi: Scap: update version to 3.5.3-1 [puppet] - 10https://gerrit.wikimedia.org/r/340159 
(https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [18:17:07] ^ RainbowSprinkles FYI [18:18:08] (03PS5) 10Muehlenhoff: Script for offboarding a user from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/340346 (https://phabricator.wikimedia.org/T142825) [18:18:10] (03CR) 10Filippo Giunchedi: [C: 032] Scap: update version to 3.5.3-1 [puppet] - 10https://gerrit.wikimedia.org/r/340159 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [18:20:03] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:20:18] thcipriani: Awesome thx <3 [18:20:47] (03PS1) 10Giuseppe Lavagetto: confd: add ability to define a global and per-template prefix [puppet] - 10https://gerrit.wikimedia.org/r/340537 [18:20:49] (03PS1) 10Giuseppe Lavagetto: profile::discovery::client: create confd-generate files for discovery [puppet] - 10https://gerrit.wikimedia.org/r/340538 (https://phabricator.wikimedia.org/T149617) [18:21:30] <_joe_> bblack: ^^ this is the complement to your patch: in order to make a yaml file that mediawiki will be able to consume [18:23:53] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [18:24:23] ruh roh [18:24:24] hrm [18:25:00] well, not seeing 3.5.3-1 available in apt-cache policy scap [18:25:43] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:25:53] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:26:33] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [18:26:42] thcipriani: probably needs a manual apt-get update before the next cron run [18:26:44] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:28:13] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[nodejs] [18:28:28] hrm, I thought that apt-get update was part of puppet-run [18:28:51] ^ godog do you normally do an apt-get update for scap updates? [18:29:02] yeah, that was a mental typo, I meant the puppet run [18:31:08] thcipriani: I thought the same, i.e. apt-get update would be run [18:31:50] (03PS1) 10Krinkle: mediawiki-cache-warmup: Remove unused var, reduce concurrency, log slowest-5 [puppet] - 10https://gerrit.wikimedia.org/r/340539 [18:32:06] https://github.com/wikimedia/puppet/blob/production/modules/base/templates/puppet-run.erb#L48 ¯\_(ツ)_/¯ [18:32:34] ah ok so via cron is fine, all good [18:32:50] yeah, must've made the change at a weird moment, I guess [18:32:53] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [18:33:54] nice, looks all good on tin now [18:34:06] * thcipriani tries sync-file [18:35:41] !log thcipriani@tin Synchronized README: test sync for scap 3.5.3-1 (duration: 00m 46s) [18:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:04] thcipriani: tin failure was me running puppet but not apt-get update [18:36:15] ahhh, ok [18:37:05] godog: anyway, looks like we're all good for the time being. Many thanks! 
[18:37:13] awesomesauce [18:37:19] :) [18:37:22] I'm off [18:42:03] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 633 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4091046 keys, up 121 days 10 hours - replication_delay is 633 [18:42:13] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 642 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4091248 keys, up 121 days 10 hours - replication_delay is 642 [18:49:03] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:50:01] 06Operations, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, and 3 others: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - https://phabricator.wikimedia.org/T156922#2990060 (10jcrespo) > I guess we'll have to coordinate with the DBAs to be sure this won't ha... [18:50:13] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:51:13] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4090087 keys, up 121 days 10 hours - replication_delay is 0 [18:52:03] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4089767 keys, up 121 days 10 hours - replication_delay is 0 [18:55:40] thcipriani: could you add me as an admin to https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep please? [18:56:43] So that I can do https://phabricator.wikimedia.org/T158628 [18:57:35] addshore: done! [18:57:40] thcipriani: thanks!!! 
[18:57:46] np :) [18:59:46] (03CR) 10Dzahn: [C: 032] Gerrit: Report repo in comment on merged patches too [puppet] - 10https://gerrit.wikimedia.org/r/340435 (https://phabricator.wikimedia.org/T159202) (owner: 10Paladox) [18:59:55] (03PS7) 10Dzahn: Gerrit: Report repo in comment on merged patches too [puppet] - 10https://gerrit.wikimedia.org/r/340435 (https://phabricator.wikimedia.org/T159202) (owner: 10Paladox) [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170301T1900). [19:01:08] nothing in swat! [19:01:29] well, I can schedule something addshore [19:01:43] PROBLEM - puppet last run on mw1264 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:01:43] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0] [19:01:52] but no time now to verify [19:04:30] !log terbium - uses ubuntu.wikimedia.org in APT sources but that does not exist anymore. 
replaced 'ubuntu' with 'mirrors' globally, apt-get update [19:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:03] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4089835 keys, up 121 days 10 hours - replication_delay is 626 [19:05:13] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 635 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4090245 keys, up 121 days 10 hours - replication_delay is 635 [19:06:40] 06Operations, 10Continuous-Integration-Config, 06Release-Engineering-Team, 06Wikipedia-Android-App-Backlog, and 2 others: Investigate how to improve Android CI performance and stability - https://phabricator.wikimedia.org/T158014#3065377 (10Fjalapeno) @hashar aside from the the question of if it will help... [19:07:03] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:45] !log terbium - install multiple pending package upgrades [19:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:33] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [19:10:03] PROBLEM - DPKG on terbium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:11:46] that's because it's installing right now [19:13:03] RECOVERY - DPKG on terbium is OK: All packages OK [19:13:33] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:14:33] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [19:16:13] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4088407 keys, up 121 days 10 hours - replication_delay is 11 [19:19:13] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [19:26:03] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [19:26:13] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 611 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4088407 keys, up 121 days 10 hours - replication_delay is 611 [19:30:43] RECOVERY - puppet last run on mw1264 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [19:30:49] !log labsdb1001, labtestcontrol2001, labtestvirt2001 - fix APT sources list. replace ubuntu.wikimedia (deleted) with mirrors.wikimedia [19:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:41] !log ocg1001, db1047, californium, db1051, rcs1002, db1041, iridium - fix APT sources list. 
replace ubuntu.wikimedia (deleted) with mirrors.wikimedia, apt-get update [19:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:13] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4087850 keys, up 121 days 11 hours - replication_delay is 0 [19:50:04] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4087483 keys, up 121 days 11 hours - replication_delay is 0 [19:53:15] thcipriani: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Add_a_wiki says "Run sync-common-all", is that a scap command? or? [19:53:28] oh man [19:53:36] or out of date? ;) [19:53:39] yeah [19:54:28] * thcipriani digs [19:55:16] https://github.com/wikimedia/scap/commit/957086b109bd6546bc4fafdc8e08d54b0c0db8f2 [19:55:27] > sync-common-all is a little used alias for scap [19:55:33] I didn't remember that one [19:55:41] so just: scap sync [19:56:04] cool! [19:59:05] !log smalyshev@tin Started deploy [wdqs/wdqs@2b8ffef]: Bump memory limit for Java to 16g [19:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] RainbowSprinkles: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170301T2000). Please do the needful. [20:00:28] choo choo [20:01:38] thcipriani: you doing the release today ? [20:02:08] matanya: nope, RainbowSprinkles is on it already judging from his message :) [20:02:36] :) [20:02:38] thcipriani: having to do a full scap sync for beta after running addwiki seems a bit odd in my mind.. Why is that needed? [20:02:41] !log smalyshev@tin Finished deploy [wdqs/wdqs@2b8ffef]: Bump memory limit for Java to 16g (duration: 03m 36s) [20:02:43] ah, RainbowSprinkles did you notice the echo issue? 
[20:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:52] No [20:03:17] Special:Notifications Exception from line 290 of /srv/mediawiki/php-1.29.0-wmf.14/extensions/Echo/includes/model/Event.php: DateTimeZone::__construct(): Unknown or bad timezone (+00:00) [20:04:03] Meh [20:04:23] RainbowSprinkles: worth a phab ticket ? [20:04:29] addshore: does seem like it might be unnecessary, but I'm not very familiar with the process to add wikis, honestly :( [20:04:33] Yes, but it's not related to yesterday's deploy to group0 [20:05:01] thcipriani: fair! thanks! [20:05:36] (03PS1) 10Addshore: Add beta hewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340559 (https://phabricator.wikimedia.org/T158628) [20:05:39] (03PS1) 10Chad: Group1 to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340560 [20:06:46] addshore: People should get used to doing full scaps anyway :) [20:06:49] That's the future! 
[20:09:01] 06Operations, 06Services, 10Traffic, 07Performance: Look into a solution for replaying traffic for load testing - https://phabricator.wikimedia.org/T129682#3065625 (10Eevans) [20:09:41] 06Operations, 06Services, 10Traffic, 07Performance: Look into a solution for replaying traffic for testing - https://phabricator.wikimedia.org/T129682#2112642 (10Eevans) a:03Eevans [20:11:34] ASAT: All Scaps All the Time [20:11:40] we need that deployed ASAT [20:17:20] (03CR) 10Chad: [C: 032] Group1 to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340560 (owner: 10Chad) [20:18:19] 06Operations, 06Services, 10Traffic, 07Performance: Look into a solution for replaying traffic for testing - https://phabricator.wikimedia.org/T129682#3065712 (10Eevans) [20:18:27] (03Merged) 10jenkins-bot: Group1 to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340560 (owner: 10Chad) [20:18:36] (03CR) 10jenkins-bot: Group1 to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340560 (owner: 10Chad) [20:18:43] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] [20:21:43] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] [20:23:42] matanya: Hmm, maybe it is wmf.14 related [20:23:52] seems so [20:24:14] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:20] RainbowSprinkles: i opened https://phabricator.wikimedia.org/T159372 [20:24:32] Wonder why I didn't spot it last night [20:24:35] Seems to come in bursts [20:25:02] but https://phabricator.wikimedia.org/T121644 is not related to the version apparently [20:25:39] so if you wish you can link it to the blockers and call RoanKattouw for help :) [20:26:07] RoanKattouw: Halp! 
[20:26:32] (03PS1) 10Chad: Revert "Group1 to wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340564 [20:26:38] (03CR) 10Chad: [V: 032 C: 032] Revert "Group1 to wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340564 (owner: 10Chad) [20:27:53] !log all trusty hosts via salt - fix APT sources list. replace ubuntu.wikimedia (deleted) with mirrors.wikimedia, apt-get update (re: https://phabricator.wikimedia.org/rOPUPe9da17d739233a4db197e947e627cf2a47ce6e6f) [20:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:11] (03CR) 10Dduvall: "> alright. yea, but on phabricator we were just talking about moving" [puppet] - 10https://gerrit.wikimedia.org/r/340164 (owner: 10Dduvall) [20:35:24] !log [neodymium:~] $ sudo salt --out=txt -b 10 -C 'G@lsb_distrib_codename:trusty' cmd.run "sed -i 's/ubuntu.wikimedia/mirrors.wikimedia/g' /etc/apt/sources.list && apt-get update" (https://phabricator.wikimedia.org/rOPUPe9da17d739233a4db197e947e627cf2a47ce6e6f#2080366) [20:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:17] 06Operations, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3065770 (10Pchelolo) [20:39:06] RainbowSprinkles: Sorry, was making lunch. Looking [20:39:24] 06Operations, 06Services, 10Traffic, 07Performance: Look into a solution for replaying traffic for testing - https://phabricator.wikimedia.org/T129682#3065784 (10Eevans) [20:40:59] Ha! 
[20:41:08] https://www.irccloud.com/pastebin/oYQxwg68/ [20:42:23] So all the "fix" for T73489 did was make it not actually break, but log it as if it was an exception [20:42:23] T73489: Echo: Special:Notifications Exception from line of : DateTimeZone::__construct(): Unknown or bad timezone (+00:00) - https://phabricator.wikimedia.org/T73489 [20:42:23] (03CR) 10Krinkle: "@Fdans: I don't understand the changes from PS15, was that intentional?" [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [20:42:23] !log netmon1001, labsdb1006,labsdb1007, fluorine, helium same fix as above, were not covered by salt targeting as they are precise. this is all now. ubuntu.wikimedia.org does not appear in sources when checking * [20:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:27] RainbowSprinkles: So, these exceptions aren't actually thrown, they're caught and logged, and we don't render any notifs that threw an exception on unserializing [20:42:29] (03PS5) 10Krinkle: webperf: Remove unused deprecate.py [puppet] - 10https://gerrit.wikimedia.org/r/338929 [20:42:33] I think that warrants a lower status than UBN [20:42:42] It's holding the train [20:42:46] mutante: Could you help me roll out https://gerrit.wikimedia.org/r/#/c/338929/ later today? 
(or someone else perhaps) [20:48:56] I'm arguing it doesn't need to [20:48:56] Then shut up the exception :) [20:48:56] And there is likely no relationship with wmf.14 [20:48:56] (if you don't care about it) [20:48:56] Happy to log it differently [20:48:56] We need to log additional info anyway, to figure out which rows are causing this [20:48:56] WARNING/FATAL are blockers now :) [20:48:57] (03CR) 10jenkins-bot: Revert "Group1 to wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340564 (owner: 10Chad) [20:48:57] I'm being strict on this now [20:48:57] :) [21:09:24] (03CR) 10Dzahn: [V: 031 C: 031] "double-checked manually and ran in compiler on "*". no-op" [puppet] - 10https://gerrit.wikimedia.org/r/334301 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [21:09:44] (03PS8) 10Dzahn: openstack: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334301 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [21:09:53] Krinkle: Heh, similar. In this case, scap would *follow* the symlink in production. So wanting to sync a symlink to a directory involves linting all of its contents! [21:10:04] Sync'ing the php -> php-* symlink was rather slow ;-) [21:10:18] Interesting. [21:10:25] And that's what we want? [21:10:32] I missed which way around it changed [21:10:42] I made it so it doesn't try to lint it anymore [21:10:45] There's no need [21:11:31] (03PS1) 10Hashar: jenkins: move plugins cache to /var/run [puppet] - 10https://gerrit.wikimedia.org/r/340576 [21:11:36] (03CR) 10Dzahn: [C: 032] openstack: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334301 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [21:11:59] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues
[21:12:46] Operations, Services, Traffic, Performance: Look into solutions for replaying traffic to testing environment(s) - https://phabricator.wikimedia.org/T129682#3065895 (Eevans)
[21:13:14] legoktm: We need to backport https://gerrit.wikimedia.org/r/#/c/329025/ to wmf.14 as well. Seeing it break CentralAuth's rename queue special page
[21:13:19] (breaking without it, that is)
[21:13:42] RainbowSprinkles: uhhh it was merged back in December???
[21:13:45] Er, maybe misreading....
[21:13:46] Fatal error: Call to undefined method RenameQueueTablePager::getLinkRenderer() in /srv/mediawiki/php-1.29.0-wmf.14/extensions/CentralAuth/includes/specials/SpecialGlobalRenameQueue.php on line 832
[21:14:03] that's different
[21:14:32] and my bad, that means I didn't test it fully
[21:14:34] Yeah, got confused cuz similar
[21:14:42] (CR) Hashar: "Should probably move them to /var/cache instead" [puppet] - https://gerrit.wikimedia.org/r/340576 (owner: Hashar)
[21:14:50] RainbowSprinkles: you can revert https://gerrit.wikimedia.org/r/#/c/336009, I'll fix it properly in a bit
[21:16:32] (PS2) Dzahn: mgmt: script to change mgmt password on HP servers [puppet] - https://gerrit.wikimedia.org/r/340567 (owner: Papaul)
[21:16:58] Uno momento
[21:18:01] (CR) Dzahn: "would you like it if we integrate this into the existing script for Dell hosts and make it an option? like "-d" for Dell and "-h" for HP ?"
[puppet] - https://gerrit.wikimedia.org/r/340567 (owner: Papaul)
[21:18:45] (PS1) Hashar: jenkins: expand war in /var/cache instead of /var/run [puppet] - https://gerrit.wikimedia.org/r/340580
[21:19:53] Operations, MediaWiki-ResourceLoader, Performance-Team, Traffic: Expires header for load.php should be relative to request time instead of cache time - https://phabricator.wikimedia.org/T105657#3065929 (Krinkle)
[21:21:19] (PS2) Hashar: jenkins: move plugins cache from /var/run to /var/cache [puppet] - https://gerrit.wikimedia.org/r/340576
[21:21:23] legoktm: Reverting just in wmf.14: https://gerrit.wikimedia.org/r/#/c/340584/
[21:21:49] (PS8) Dzahn: Linting changes for docker/etcd/kubernetes profiles [puppet] - https://gerrit.wikimedia.org/r/334303 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[21:21:51] thanks, I'll get it fixed in master
[21:24:25] !log arlolra@tin Started deploy [parsoid/deploy@32ca3fb]: Updating parsoid to 9f96b2a0
[21:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:24:46] Notice: Array to string conversion in /srv/mediawiki/php-1.29.0-wmf.13/includes/TemplateParser.php(131)(0f0cd66993c2354c79ae70f19c2c0ec7) : eval()'d code on line 40
[21:24:50] Rather spammy in wmf.13
[21:24:55] Wonder what that's about
[21:25:57] I was noticing that on my test wiki the other day, but I didn't investigate :s
[21:26:25] It was showing up on Special:recentchanges
[21:27:00] bug in the HTML template?
[21:27:18] (CR) Dzahn: [C: 2] Linting changes for docker/etcd/kubernetes profiles [puppet] - https://gerrit.wikimedia.org/r/334303 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[21:28:09] !log demon@tin Synchronized php-1.29.0-wmf.14/extensions/CentralAuth/: Unbreak pending real fix (duration: 00m 49s)
[21:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:09] legoktm: CentralAuth should shut up now ^
[21:29:23] (PS6) Dzahn: lvm/lvs: Linting changes [puppet] - https://gerrit.wikimedia.org/r/334294 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[21:30:43] RECOVERY - puppet last run on db1057 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[21:32:04] !log arlolra@tin Finished deploy [parsoid/deploy@32ca3fb]: Updating parsoid to 9f96b2a0 (duration: 07m 39s)
[21:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:14] !log arlolra@tin Started deploy [parsoid/deploy@32ca3fb]: Updating parsoid to 9f96b2a0
[21:32:18] thanks
[21:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:13] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 149 bytes in 0.073 second response time
[21:36:13] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 149 bytes in 0.073 second response time
[21:37:27] !log arlolra@tin Finished deploy [parsoid/deploy@32ca3fb]: Updating parsoid to 9f96b2a0 (duration: 05m 14s)
[21:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:37] !log arlolra@tin Started deploy [parsoid/deploy@32ca3fb]: Updating parsoid to 9f96b2a0
[21:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:53] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[21:40:55] !log
demon@tin Synchronized php-1.29.0-wmf.14/extensions/Echo/includes/model/Event.php: better logging and such (duration: 00m 40s)
[21:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:27] !log arlolra@tin Finished deploy [parsoid/deploy@32ca3fb]: Updating parsoid to 9f96b2a0 (duration: 03m 50s)
[21:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:37] !log arlolra@tin Started deploy [parsoid/deploy@32ca3fb]: Updating parsoid to 9f96b2a0
[21:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:37] !log arlolra@tin Finished deploy [parsoid/deploy@32ca3fb]: Updating parsoid to 9f96b2a0 (duration: 02m 00s)
[21:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:23] !log arlolra@tin Started deploy [parsoid/deploy@32ca3fb]: (no justification provided)
[21:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:39] !log arlolra@tin Finished deploy [parsoid/deploy@32ca3fb]: (no justification provided) (duration: 00m 15s)
[21:44:44] !log arlolra@tin Started deploy [parsoid/deploy@32ca3fb]: (no justification provided)
[21:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:59] !log arlolra@tin Finished deploy [parsoid/deploy@32ca3fb]: (no justification provided) (duration: 00m 15s)
[21:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:49:00] Operations, Continuous-Integration-Config, Release-Engineering-Team, Wikipedia-Android-App-Backlog, and 2 others: Investigate how to improve Android CI performance and stability - https://phabricator.wikimedia.org/T158014#3065994 (hashar) I have absolutely no idea :-/ People with actual knowledg...
[22:07:43] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[22:07:46] (CR) Papaul: [C: 2] mgmt: script to detect vendor by mgmt ssh banner [puppet] - https://gerrit.wikimedia.org/r/340450 (https://phabricator.wikimedia.org/T156673) (owner: Dzahn)
[22:08:43] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set
[22:09:53] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0]
[22:13:03] PROBLEM - puppet last run on wtp1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:22:42] Operations, Performance-Team, Traffic, Performance: enwiki Main_Page timeouts - https://phabricator.wikimedia.org/T104225#1410845 (Krinkle) ``` [22:18 UTC] krinkle at fluorine.eqiad.wmnet in /a/mw-log $ grep Main_Page slow-parse.log ``` | 2017-03-01 15:04:16 | mw2108 | enwiki | slow-parse | 3.35...
[22:23:53] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:41:03] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[22:43:33] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:44:33] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[22:53:53] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[22:59:33] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:01:33] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[23:09:00] (PS1) Hashar: contint: remove ganglia diskstat plugin [puppet] - https://gerrit.wikimedia.org/r/340657
[23:18:53] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:22:50] (CR) Dzahn: [C: 2] "no-op on all http://puppet-compiler.wmflabs.org/5618/" [puppet] - https://gerrit.wikimedia.org/r/334294 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[23:24:11] (PS2) Dzahn: contint: remove ganglia diskstat plugin [puppet] - https://gerrit.wikimedia.org/r/340657 (owner: Hashar)
[23:24:33] mutante: will need to purge the diskstat plugin manually on the hosts
[23:24:47] yes, i saw that
[23:25:08] iirc there is the .py file and some kind of config file
[23:25:27] yes, /usr/lib/ganglia/ and /etc/ganglia/conf.d/
[23:25:33] magic :}
[23:25:45] (CR) Dzahn: [C: 2] contint: remove ganglia diskstat plugin [puppet] - https://gerrit.wikimedia.org/r/340657 (owner: Hashar)
[23:28:52] !log contint1002, contint2001: rm /usr/lib/ganglia/python_modules/diskstat.py*; rm /etc/ganglia/conf.d/diskstat.pyconf (re: gerrit 340657)
[23:28:55] hashar: done
[23:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:48] (CR) Dzahn: "< mutante> !log contint1002, contint2001: rm /usr/lib/ganglia/python_modules/diskstat.py*; rm /etc/ganglia/conf.d/diskstat.pyconf (re: ge" [puppet] - https://gerrit.wikimedia.org/r/340657 (owner: Hashar)
[23:30:54] thcipriani: ganglia is gone :)
[23:32:27] (CR) Dzahn: "Error: Syntax error at '<<'; expected '}' at /mnt/jenkins-workspace/puppet-compiler/5619/change/src/modules/aptrepo/manifests/rsync.pp:21 " [puppet] - https://gerrit.wikimedia.org/r/334317 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[23:34:46] have a good
afternoon *
[23:37:15] (PS6) Dzahn: Linting fixes (multiple modules) [puppet] - https://gerrit.wikimedia.org/r/334317 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[23:47:53] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures