[00:00:04] twentyafterfour: My dear minions, it's time we take the moon! Just kidding. Time for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180315T0000). [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:01:17] thanks for checking! [00:01:44] twentyafterfour: I'm running over a bit with SWAT, is that OK? [00:02:02] 10Operations, 10netops: Connection timeout from 195.77.175.64/29 to text-lb.esams.wikimedia.org - https://phabricator.wikimedia.org/T189689#4050109 (10Platonides) In fact, those timeouts appear even on the early hops. The ticket also includes a successful tcp traceroute, and no explicit mention of timeouts. T... [00:02:20] !log tgr@tin Synchronized php-1.31.0-wmf.24/extensions/VisualEditor/modules/ve-mw: VE fixes: T189267, T189381 (duration: 01m 16s) [00:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:27] T189381: Clear VE autosave when making changes in 2010 editor - https://phabricator.wikimedia.org/T189381 [00:02:27] T189267: Invisible templates are unexpectedly large - https://phabricator.wikimedia.org/T189267 [00:03:46] !log tgr@tin Synchronized php-1.31.0-wmf.25/extensions/VisualEditor/modules/ve-mw: VE fixes: T189267, T189381 (duration: 01m 15s) [00:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:57] MatmaRex: should be live [00:04:36] (does that mean VE autosaves now?) [00:05:14] tgr: i think you need to sync lib/ve too [00:05:23] and maybe do something special because it's a submodule [00:05:36] for wmf25 [00:05:41] Amir1: your wiki edit says 1.25 but the backports are to 1.24 actually, is that correct? [00:06:28] I'm backporting to 1.25 (1.24 includes it already) [00:06:30] my bad [00:06:46] sorry, 1.24. 1.25 includes it [00:07:09] doh, sorry! didn't notice those were different sets of patches, MatmaRex [00:08:06] tgr: oh, sorry. 
yeah, one patch for wmf.24 only (it was already in wmf.25), and one different one for wmf.25 only (difficult to backport) [00:08:35] submodule update with --recursive should have updated it but let me check [00:08:54] !log tgr@tin Synchronized php-1.31.0-wmf.25/extensions/VisualEditor/: VE fixes followup (duration: 01m 15s) [00:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:17] tgr: thanks! a quick check in production looks good on wmf.24 (en.wp) and wmf.25 (mw.org) [00:12:11] yeah, lib/ve is at the right commit on tin [00:13:39] (03CR) 10Chico Venancio: [C: 031] toolsdb: Remove stale accounts if present in maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/419630 (https://phabricator.wikimedia.org/T188680) (owner: 10Bstorm) [00:15:44] Amir1: the test failures are irrelevant, right? [00:15:56] tgr: yup [00:16:05] ok, I'll force it through [00:16:45] strangely phan passes on master of the extension but fails on every branch [00:17:00] selenium is random failure for this case AFAIK [00:17:43] Amir1: on mwdebug1002 [00:18:13] (03CR) 10Dzahn: [C: 032] mediawiki/apache: Add romd.wikimedia.org ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/412898 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [00:18:48] hmm, can you double check it's there? [00:18:50] duh, forgot the submodule update [00:19:00] :D [00:19:14] it happens to me all. the. time. too [00:19:31] sorry, now it's really on mwdebug1002 [00:20:13] tgr: it's good to go \o/ [00:20:23] one of the Apache ServerAliases in prod is called... 
[00:20:26] FIXME [00:22:56] !log tgr@tin Synchronized php-1.31.0-wmf.24/includes/user/ExternalUserNames.php: T189320 Add ExternalUserNames::getLocal() to get local part of username (duration: 01m 15s) [00:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:02] T189320: Wikidata external changes have usernames with "wikidata>" prefixes - https://phabricator.wikimedia.org/T189320 [00:25:17] !log tgr@tin Synchronized php-1.31.0-wmf.24/extensions/Wikibase/client/includes/RecentChanges/ExternalChangeFactory.php: T189320 Use only local part of username when building the RC line (duration: 01m 18s) [00:25:21] Amir1: should be live, thanks for your patience [00:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:40] thank you for getting this done :) [00:26:00] heh mutante, are you now being nitpicky with how it is named? [00:26:02] xD [00:26:15] (03CR) 10Dzahn: [C: 032] "confirmed on mwdebug1002" [puppet] - 10https://gerrit.wikimedia.org/r/412898 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [00:27:19] Platonides: the ServerAlias or the commit message? yes [00:27:32] a config line like this: [00:27:33] ServerAlias *.wikimedia.org # FIXME: Should this still be here? [00:27:54] if you run apache2ctl -S to show all ServerAlias on this.. guess what you are getting [00:28:51] alias #, alias FIXME:, alias Should, alias this, alias still, alias be, wild alias *.wikimedia.org, wild alias here? [00:29:26] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:29:36] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:29:36] PROBLEM - puppet last run on kafka1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:29:43] awww.. 
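For reference, the `apache2ctl -S` output quoted above can be reproduced with a minimal Python sketch. It assumes the directive's arguments are simply split on whitespace, because Apache only treats `#` as a comment marker at the start of a line, not after a directive. The function name `classify_aliases` is hypothetical, this is not Apache's actual parser, and the wildcard test (a name containing `*` or `?`) is an assumption inferred from the output pasted in the channel.

```python
def classify_aliases(line: str) -> list[tuple[str, str]]:
    """Split a ServerAlias line on whitespace and label each argument.

    Illustrative only: every token after the directive name is treated as
    an alias, including '#' and the words of the would-be comment.
    """
    directive, *args = line.split()
    if directive.lower() != "serveralias":
        raise ValueError(f"not a ServerAlias line: {line!r}")
    result = []
    for name in args:
        # Assumption: names containing '*' or '?' are reported as wildcards.
        kind = "wild alias" if any(c in name for c in "*?") else "alias"
        result.append((kind, name))
    return result


line = "ServerAlias *.wikimedia.org # FIXME: Should this still be here?"
for kind, name in classify_aliases(line):
    print(kind, name)
```

Under these assumptions the safer form is to put the FIXME note on its own line above the directive, so that none of its words can be read as hostnames.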
[00:29:46] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:30:23] runs puppet on some of those [00:30:35] assumes puppetdb [00:30:55] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:31:00] yea, it's that again [00:31:05] PROBLEM - puppet last run on nitrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:31:05] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:32:15] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:32:26] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:32:56] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:32:56] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:32:57] puppetdb on nitrogen, auto-restarted 4 min ago [00:33:06] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:33:06] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:33:06] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:33:46] silencing it for a few min [00:35:56] Urbanecm: romd is also being rolled out now... max 30 min [00:43:24] hi all... Looks like some cruft left over from the scap problems earlier today... 
A client-side message somehow is not getting thru when it's supposed to [00:44:14] Visible here: https://meta.wikimedia.org/w/index.php?title=Special:CentralNotice&subaction=noticeDetail&notice=test [00:44:27] 'centralnotice-impression-events-sample-rate-field' [00:45:44] (not a severe issue at all, the feature is not really in use yet... so, not worth worrying about this before other stuff that's surely more important) [00:54:36] (03PS2) 10Dzahn: cassandra/icinga: make monitoring configurable, skip on dev [puppet] - 10https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050) [00:56:49] (03CR) 10jerkins-bot: [V: 04-1] cassandra/icinga: make monitoring configurable, skip on dev [puppet] - 10https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050) (owner: 10Dzahn) [00:58:06] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [00:58:46] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [00:59:06] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [00:59:06] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:59:06] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [00:59:35] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:59:36] RECOVERY - puppet last run on kafka1014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [01:00:55] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [01:01:05] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [01:02:15] RECOVERY - 
puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [01:02:26] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [01:02:56] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [01:02:56] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [01:03:06] RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [01:03:35] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [01:09:26] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [01:11:50] (03PS3) 10Dzahn: cassandra/icinga: make monitoring configurable, skip on dev [puppet] - 10https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050) [01:40:31] (03CR) 10MZMcBride: [C: 031] "For what it's worth, I left a couple comments on the relevant Phabricator task and haven't yet received any reply. Compare to this Gerrit " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408073 (https://phabricator.wikimedia.org/T186463) (owner: 10Zoranzoki21) [02:16:49] 10Operations, 10Traffic, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4052301 (10ayounsi) [02:26:05] 10Operations, 10Traffic, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4052304 (10ayounsi) Updated the task description for small pacific islands. The methodology is the following. Made possible by the fact tha... 
[02:38:57] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.24) (duration: 10m 10s) [02:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:47:25] 10Operations, 10Traffic, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4052309 (10ayounsi) [02:57:47] (03PS1) 10Andrew Bogott: labtestweb: standardize horizon vhost file [puppet] - 10https://gerrit.wikimedia.org/r/419655 [03:01:17] (03CR) 10Andrew Bogott: [C: 032] labtestweb: standardize horizon vhost file [puppet] - 10https://gerrit.wikimedia.org/r/419655 (owner: 10Andrew Bogott) [03:26:21] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 841.65 seconds [04:05:21] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 186.15 seconds [04:06:00] (03PS1) 10Ottomata: [WIP] Puppetization for newer SWAP (JupyterHub) deployed via scap [puppet] - 10https://gerrit.wikimedia.org/r/419656 (https://phabricator.wikimedia.org/T183145) [04:07:13] (03CR) 10BryanDavis: "nice :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419630 (https://phabricator.wikimedia.org/T188680) (owner: 10Bstorm) [04:10:31] PROBLEM - HHVM rendering on mw2226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:11:21] RECOVERY - HHVM rendering on mw2226 is OK: HTTP OK: HTTP/1.1 200 OK - 73918 bytes in 0.297 second response time [05:37:28] 10Operations, 10Traffic, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4052382 (10ayounsi) [05:50:45] (03PS1) 10Madhuvishy: dumps: Add static ipv6 to new distribution servers [puppet] - 10https://gerrit.wikimedia.org/r/419659 [05:54:12] 10Operations, 10Traffic, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - 
https://phabricator.wikimedia.org/T189252#4052383 (10ayounsi) [05:58:28] (03PS2) 10Madhuvishy: dumps: Add static ipv6 to new distribution servers [puppet] - 10https://gerrit.wikimedia.org/r/419659 [05:59:30] (03CR) 10Madhuvishy: [C: 032] dumps: Add static ipv6 to new distribution servers [puppet] - 10https://gerrit.wikimedia.org/r/419659 (owner: 10Madhuvishy) [06:29:43] !log Stop MySQL on db1064 for mariadb upgrade [06:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:42] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419661 [06:47:32] (03PS1) 10Muehlenhoff: Update MAC of mc2036 after mainboard replacement [puppet] - 10https://gerrit.wikimedia.org/r/419662 [06:49:07] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419661 (owner: 10Marostegui) [06:50:19] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419661 (owner: 10Marostegui) [06:50:37] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419661 (owner: 10Marostegui) [06:51:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1064 (duration: 01m 15s) [06:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:41] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [06:54:42] (03PS1) 10Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419663 (https://phabricator.wikimedia.org/T187089) [06:56:27] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419663 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [06:58:25] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419663 
(https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [06:58:34] (03PS1) 10Muehlenhoff: Depool rdb1001 for kernel update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419664 [06:59:08] (03CR) 10Muehlenhoff: [C: 032] Update MAC of mc2036 after mainboard replacement [puppet] - 10https://gerrit.wikimedia.org/r/419662 (owner: 10Muehlenhoff) [06:59:38] (03CR) 10jerkins-bot: [V: 04-1] Depool rdb1001 for kernel update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419664 (owner: 10Muehlenhoff) [07:00:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1084 (duration: 01m 15s) [07:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:03] !log Deploy schema change on db1084 - T187089 T185128 T153182 [07:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:10] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [07:01:10] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [07:01:10] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [07:01:42] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419663 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [07:03:34] (03PS1) 10Marostegui: db-eqiad.php: Depool es1017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419665 [07:03:54] (03PS1) 10Marostegui: es1017.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/419666 [07:04:28] (03PS2) 10Muehlenhoff: Depool rdb1001 for kernel update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419664 [07:04:57] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool es1017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419665 (owner: 10Marostegui) [07:06:12] (03Merged) 
10jenkins-bot: db-eqiad.php: Depool es1017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419665 (owner: 10Marostegui) [07:07:00] (03CR) 10jenkins-bot: db-eqiad.php: Depool es1017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419665 (owner: 10Marostegui) [07:07:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool es1017 (duration: 01m 14s) [07:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:59] !log Stop mariadb on es1017 for kernel, mariadb and socket location upgrade [07:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:19] (03CR) 10Marostegui: [C: 032] es1017.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/419666 (owner: 10Marostegui) [07:17:12] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool es1017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419667 [07:19:07] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool es1017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419667 (owner: 10Marostegui) [07:19:46] (03CR) 10Ayounsi: [V: 032 C: 032] Wheels for Netbox 2.3.1 [wheels/netbox] - 10https://gerrit.wikimedia.org/r/419383 (owner: 10Ayounsi) [07:20:17] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool es1017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419667 (owner: 10Marostegui) [07:20:32] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool es1017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419667 (owner: 10Marostegui) [07:21:39] !log reimaging mc2036 after hardware replacement T185587 [07:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:44] T185587: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587 [07:21:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool es1017 (duration: 01m 14s) [07:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:24] (03PS1) 10Ayounsi: 
Update python requirements based on new wheels for 2.3.1 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/419668 [07:28:01] RECOVERY - Disk space on kubernetes1003 is OK: DISK OK [07:28:14] (03CR) 10Ayounsi: [V: 032 C: 032] Update python requirements based on new wheels for 2.3.1 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/419668 (owner: 10Ayounsi) [07:28:51] PROBLEM - Disk space on kubernetes1002 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/2bf8ae39-2822-11e8-9685-aa0000fe6bdf/volumes/kubernetes.iosecret/tiller-token-cw3sp is not accessible: Permission denied [07:29:18] (03PS1) 10Jcrespo: dbproxy: Failover from db1020 to db1051 [puppet] - 10https://gerrit.wikimedia.org/r/419669 (https://phabricator.wikimedia.org/T189656) [07:29:42] !log ayounsi@tin Started deploy [netbox/deploy@7310860]: Upgrading Netbox to v2.3.1 [07:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:50] (03CR) 10Marostegui: [C: 031] dbproxy: Failover from db1020 to db1051 [puppet] - 10https://gerrit.wikimedia.org/r/419669 (https://phabricator.wikimedia.org/T189656) (owner: 10Jcrespo) [07:30:21] !log ayounsi@tin Finished deploy [netbox/deploy@7310860]: Upgrading Netbox to v2.3.1 (duration: 00m 39s) [07:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:26] !log ayounsi@tin Started deploy [netbox/deploy@7310860]: Upgrading Netbox to v2.3.1 [07:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:31] !log ayounsi@tin Finished deploy [netbox/deploy@7310860]: Upgrading Netbox to v2.3.1 (duration: 00m 05s) [07:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:51] RECOVERY - Disk space on kubernetes1002 is OK: DISK OK [07:33:01] (03PS1) 10Alexandros Kosiaris: Pass --tiller-image to helm init [deployment-charts] - 10https://gerrit.wikimedia.org/r/419670 [07:33:39] (03PS1) 10Jcrespo: mariadb: Switchover master from db1020 
to db1051 [puppet] - 10https://gerrit.wikimedia.org/r/419671 (https://phabricator.wikimedia.org/T189656) [07:33:50] PROBLEM - Disk space on kubernetes1001 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/e3a1ca6a-2822-11e8-9685-aa0000fe6bdf/volumes/kubernetes.iosecret/tiller-token-72p10 is not accessible: Permission denied [07:34:01] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4052448 (10elukey) During the first puppet run I have seen two issues: 1) the /etc/hadoop directory seems not present when the /etc/h... [07:34:50] RECOVERY - Disk space on kubernetes1001 is OK: DISK OK [07:35:50] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - kubelet_operational_latencies is 20778 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:36:51] PROBLEM - Disk space on kubernetes1002 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/384a8206-2823-11e8-9685-aa0000fe6bdf/volumes/kubernetes.iosecret/tiller-token-72p10 is not accessible: Permission denied [07:38:00] (03CR) 10Marostegui: [C: 031] mariadb: Switchover master from db1020 to db1051 [puppet] - 10https://gerrit.wikimedia.org/r/419671 (https://phabricator.wikimedia.org/T189656) (owner: 10Jcrespo) [07:39:51] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - kubelet_operational_latencies is 1801 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:42:30] (03PS3) 10Alexandros Kosiaris: Prepare kubernetes nodes for serving mathoid traffic [puppet] - 10https://gerrit.wikimedia.org/r/410489 (https://phabricator.wikimedia.org/T184919) [07:42:40] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Prepare kubernetes nodes for serving mathoid traffic [puppet] - 10https://gerrit.wikimedia.org/r/410489 (https://phabricator.wikimedia.org/T184919) (owner: 10Alexandros Kosiaris) [07:45:02] (03PS1) 10Jcrespo: mariadb: 
Reenable notifications on db1051 [puppet] - 10https://gerrit.wikimedia.org/r/419672 (https://phabricator.wikimedia.org/T189656) [07:45:22] (03PS1) 10Ayounsi: Netbox: change the owner of conf files so scap can use them [puppet] - 10https://gerrit.wikimedia.org/r/419673 [07:46:31] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/mathoid on labpuppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/mathoid is broken [07:46:42] that's me ^ [07:46:50] PROBLEM - Confd template for /srv/config-master/pybal/codfw/mathoid on labpuppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/mathoid is broken [07:46:52] (03CR) 10Ayounsi: [C: 032] Netbox: change the owner of conf files so scap can use them [puppet] - 10https://gerrit.wikimedia.org/r/419673 (owner: 10Ayounsi) [07:47:00] I wonder though what it it is that I broke [07:47:11] (03PS2) 10Jcrespo: mariadb: Reenable notifications on db1051 [puppet] - 10https://gerrit.wikimedia.org/r/419672 (https://phabricator.wikimedia.org/T189656) [07:48:24] !log ayounsi@tin Started deploy [netbox/deploy@7310860]: Upgrading Netbox to v2.3.1 [07:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:31] !log ayounsi@tin Finished deploy [netbox/deploy@7310860]: Upgrading Netbox to v2.3.1 (duration: 01m 07s) [07:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:16] (03CR) 10Jcrespo: [C: 032] mariadb: Reenable notifications on db1051 [puppet] - 10https://gerrit.wikimedia.org/r/419672 (https://phabricator.wikimedia.org/T189656) (owner: 10Jcrespo) [07:58:52] (03PS1) 10Alexandros Kosiaris: Fix conftool data wnnet typo [puppet] - 10https://gerrit.wikimedia.org/r/419675 [08:01:25] !log switching db2044 to be a direct replica of db1051 [08:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:05] it failed [08:02:13] I see it [08:02:21] oh, I know what it is [08:02:25] the typical 
old-format thingy [08:02:28] bad repl password [08:02:30] yep [08:02:39] so if I update the pass on the real master [08:02:44] it should fix itself [08:02:47] (03PS1) 10Ayounsi: Update submodules [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/419676 [08:03:18] yeah, I have had to run FLUSH PRIVILEGES on some cases (not all) [08:03:32] (03CR) 10Ayounsi: [V: 032 C: 032] Update submodules [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/419676 (owner: 10Ayounsi) [08:05:03] I actually had to stop and start slave [08:05:06] !log ayounsi@tin Started deploy [netbox/deploy@278aec4]: Upgrading Netbox to v2.3.1 [08:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:11] because connecting had failed too many times [08:05:17] ah right [08:05:25] or maybe it has a larger timeout [08:06:08] !log ayounsi@tin Finished deploy [netbox/deploy@278aec4]: Upgrading Netbox to v2.3.1 (duration: 01m 02s) [08:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:30] but it should be ok now, I think [08:06:36] yeah [08:06:38] it looks good now [08:06:48] I will check for other bad passes [08:07:17] yeah, I guess other's are also affected [08:08:06] nagios [08:08:18] will swithch that to unix_socket [08:09:50] (03PS1) 10Alexandros Kosiaris: puppetmaster::frontend: Enable confd monitoring [puppet] - 10https://gerrit.wikimedia.org/r/419677 [08:10:39] (03CR) 10Alexandros Kosiaris: [C: 032] Fix conftool data wnnet typo [puppet] - 10https://gerrit.wikimedia.org/r/419675 (owner: 10Alexandros Kosiaris) [08:10:41] (03CR) 10Alexandros Kosiaris: [C: 032] puppetmaster::frontend: Enable confd monitoring [puppet] - 10https://gerrit.wikimedia.org/r/419677 (owner: 10Alexandros Kosiaris) [08:11:01] I think that is all [08:11:11] good [08:12:52] PROBLEM - Netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Proxy Error - 619 bytes in 0.087 second response time [08:15:49] :-( [08:15:52] RECOVERY - 
Netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 339 bytes in 0.009 second response time [08:16:06] ah, I just saw the log [08:16:23] happens to be a server using the dbs we are going to manage [08:18:18] !log disable puppet on db1051, db1020 for switchover preparation [08:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:42] (03CR) 10Jcrespo: [C: 032] dbproxy: Failover from db1020 to db1051 [puppet] - 10https://gerrit.wikimedia.org/r/419669 (https://phabricator.wikimedia.org/T189656) (owner: 10Jcrespo) [08:20:50] (03PS2) 10Jcrespo: dbproxy: Failover from db1020 to db1051 [puppet] - 10https://gerrit.wikimedia.org/r/419669 (https://phabricator.wikimedia.org/T189656) [08:21:53] PROBLEM - Netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Proxy Error - 619 bytes in 0.083 second response time [08:23:35] ran puppet and checked config on dbproxy1002 and dbproxy1007 [08:23:40] cool [08:23:46] will merge the other patch too [08:23:52] 10Operations, 10ops-eqiad, 10Analytics-Kanban: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#4052486 (10elukey) 05Open>03Resolved a:03elukey Thanks Chris, going to close the task and re-open if 1062 freezes again! [08:23:55] ok [08:24:05] akosiaris: we are close to start the actual failover [08:24:13] (03PS2) 10Jcrespo: mariadb: Switchover master from db1020 to db1051 [puppet] - 10https://gerrit.wikimedia.org/r/419671 (https://phabricator.wikimedia.org/T189656) [08:25:29] I am ready [08:25:32] 10Operations, 10netops: Connection timeout from 195.77.175.64/29 to text-lb.esams.wikimedia.org - https://phabricator.wikimedia.org/T189689#4052490 (10Samtar) @ayounsi I'm happy to pass on comments and suggestions to the OTRS ticket. @Platonides Please feel free to update the task description to something mor... 
[08:26:00] (03CR) 10Jcrespo: [C: 032] mariadb: Switchover master from db1020 to db1051 [puppet] - 10https://gerrit.wikimedia.org/r/419671 (https://phabricator.wikimedia.org/T189656) (owner: 10Jcrespo) [08:26:16] ^just a premerge [08:26:51] I am fully ready [08:26:53] RECOVERY - Netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 339 bytes in 0.008 second response time [08:28:10] !log foreachwikiindblist "% private.dblist" extensions/WikimediaMaintenance/filebackend/setZoneAccess.php --backend=local-multiwrite --private [08:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:17] !log setZoneAccess done [08:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:07] (03PS5) 10Filippo Giunchedi: Add Thumbor private container user configuration keys [puppet] - 10https://gerrit.wikimedia.org/r/414631 (https://phabricator.wikimedia.org/T187822) (owner: 10Gilles) [08:30:25] ok, let's do this [08:30:43] ok [08:30:45] let's go [08:31:08] ok [08:31:16] !log setting m2 as read only [08:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:26] let me know when ready so I can kill connections [08:31:54] read only on [08:31:55] PROBLEM - Netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Proxy Error - 619 bytes in 0.078 second response time [08:31:56] ok [08:32:15] all good from my side [08:32:17] catched up [08:32:34] let me know when you want me to reload proxies [08:32:48] starting puppet [08:33:08] heartbeat running [08:33:15] marostegui: please failover [08:33:19] ok [08:33:28] on both, of course [08:33:36] failed on dbproxy1002 [08:33:37] checking [08:33:50] errors on config [08:33:51] checking what is it [08:34:12] ah right [08:34:13] i know it [08:34:26] is it missing a port? 
[08:34:38] reloaded 1002 [08:34:41] going for 1007 [08:35:06] all done [08:35:41] I can see both pointing to db1051 [08:36:39] ok, setting db1051 as read write [08:36:43] go for it [08:36:52] I am ready to send a patch to test [08:37:24] akosiaris: time to check services [08:37:35] ok checking otrs first [08:37:44] I am seding a patch to gerrit [08:37:45] marostegui: feel free to try with https://gerrit.wikimedia.org/r/c/414631 I have to merge that anyways [08:37:50] (03PS1) 10Marostegui: dbproxy1002.yaml: Add missing port [puppet] - 10https://gerrit.wikimedia.org/r/419680 (https://phabricator.wikimedia.org/T189656) [08:37:56] RECOVERY - Netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 341 bytes in 0.490 second response time [08:38:16] https://gerrit.wikimedia.org/r/#/c/419680/ -> can you check a +1 there? :) [08:38:31] ok just logged into OTRS, looks ok [08:38:43] (03CR) 10Jcrespo: [C: 04-1] dbproxy1002.yaml: Add missing port [puppet] - 10https://gerrit.wikimedia.org/r/419680 (https://phabricator.wikimedia.org/T189656) (owner: 10Marostegui) [08:38:52] it is actually a -1 :-) [08:38:59] true XDDD [08:39:00] the port is in the line bewlo [08:39:03] but it works [08:39:09] it did require an apache restart but OTRS is OK [08:39:28] marostegui: not only my fault, I literally asked you for a review! 
[08:39:32] (03PS2) 10Marostegui: dbproxy1002.yaml: Add missing port [puppet] - 10https://gerrit.wikimedia.org/r/419680 (https://phabricator.wikimedia.org/T189656) [08:39:33] 0:-) [08:39:36] (03PS1) 10Marostegui: db-eqiad.php: Restore es1017 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419681 [08:39:47] (03CR) 10Jcrespo: [C: 032] dbproxy1002.yaml: Add missing port [puppet] - 10https://gerrit.wikimedia.org/r/419680 (https://phabricator.wikimedia.org/T189656) (owner: 10Marostegui) [08:39:53] jynus: I actually checked the ips and all that :_( [08:40:14] hey, we both misseed it, even it it cannot be more clear on hiera [08:40:20] maybe we can change the format [08:40:28] gerrit works fine for me [08:40:34] yeah, gerrit looks cool [08:40:38] to have 3306 as the default port [08:40:45] sorry for that [08:40:48] (03PS3) 10Muehlenhoff: Switch debdeploy clients to Python 3 (WIP) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/413397 [08:41:00] you will not belive it wasn't intentional sorry [08:41:00] db1020 is also looking good without new connections [08:41:10] (03CR) 10Filippo Giunchedi: [C: 032] Add Thumbor private container user configuration keys [puppet] - 10https://gerrit.wikimedia.org/r/414631 (https://phabricator.wikimedia.org/T187822) (owner: 10Gilles) [08:41:11] did you kill stuff there? 
[08:41:14] yeah [08:41:17] cool [08:41:36] (03PS6) 10Filippo Giunchedi: Add Thumbor private container user configuration keys [puppet] - 10https://gerrit.wikimedia.org/r/414631 (https://phabricator.wikimedia.org/T187822) (owner: 10Gilles) [08:42:09] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Add Thumbor private container user configuration keys [puppet] - 10https://gerrit.wikimedia.org/r/414631 (https://phabricator.wikimedia.org/T187822) (owner: 10Gilles) [08:42:11] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore es1017 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419681 (owner: 10Marostegui) [08:42:21] !log end of maintenance for m2 [08:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:35] but please keep checking of something is weird [08:42:36] Nice job jynus! :-Ç) [08:42:39] :-) [08:42:41] not that nice! [08:42:50] more like fail [08:42:59] It wasn't a fail at all [08:43:25] (03Merged) 10jenkins-bot: db-eqiad.php: Restore es1017 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419681 (owner: 10Marostegui) [08:43:30] (03CR) 10Muehlenhoff: "Patch is fine, but the commit message needs to reflect that we're re-adding Oliver, not adding him under a new account" [puppet] - 10https://gerrit.wikimedia.org/r/416993 (https://phabricator.wikimedia.org/T188945) (owner: 10RobH) [08:43:38] akosiaris: should I send a test email to otrs? 
[08:43:43] (03CR) 10jenkins-bot: db-eqiad.php: Restore es1017 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419681 (owner: 10Marostegui) [08:43:50] jynus: doing so already [08:43:57] perfect, thanks [08:44:00] (03PS1) 10Alexandros Kosiaris: kubernetes: Allow setting service-node-port-range [puppet] - 10https://gerrit.wikimedia.org/r/419682 [08:44:09] !log roll-restart thumbor in eqiad/codfw to enable access to swift private container [08:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:34] I am going to change db1020 replication [08:44:36] just in case [08:44:43] and the other administrative changes [08:45:13] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore normal weight for es1017 (duration: 01m 14s) [08:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:47] (03PS4) 10Vgutierrez: Re-add shell user ironholds/oliver keyes [puppet] - 10https://gerrit.wikimedia.org/r/416993 (https://phabricator.wikimedia.org/T188945) (owner: 10RobH) [08:48:31] (03PS5) 10Vgutierrez: Re-add shell user ironholds/oliver keyes [puppet] - 10https://gerrit.wikimedia.org/r/416993 (https://phabricator.wikimedia.org/T188945) (owner: 10RobH) [08:49:28] (03CR) 10Vgutierrez: [C: 032] Re-add shell user ironholds/oliver keyes [puppet] - 10https://gerrit.wikimedia.org/r/416993 (https://phabricator.wikimedia.org/T188945) (owner: 10RobH) [08:50:15] jynus: marostegui mx1001 can't connect to m2-slave.eqiad.wmnet [08:50:31] I 've already reloaded exim [08:51:06] it's trying to some reason to connecto to m2-slave (probably detected that m2-master had problems) [08:51:49] can you reload it so it trys to connect to m2-master? 
[08:51:56] I am doing so now [08:52:18] turns out stopping it does not stop the queue handling children [08:52:27] I have to kill them manually [08:52:29] it should never connect to the slave direcy [08:52:44] we can remove it from the config [08:52:47] fine by me [08:52:47] there is precisely a proxy in between to handle that [08:53:00] application-transparent [08:53:03] well there probably wasn't when that configuration was set [08:53:09] it's not writing anyway anything ever [08:53:12] sure [08:53:17] let's fix the issue first [08:53:26] I think it's fixed [08:53:33] then we talk improvements [08:53:36] mails are once more flowing in [08:53:57] these are the kind of issues I wouldn't be able to see by my own [08:54:26] yeah I also forgot about mx1001 doing that [08:54:33] that is ok [08:54:41] can you show me the config? [08:54:41] I remembered it as I was testing the inbound email stuff [08:54:51] I may be able to suggest a better config [08:55:04] modules/otrs/templates/exim4.conf.otrs.erb [08:55:21] line 21 [08:55:42] so it has like a failover mechanism? [08:55:47] yes [08:55:51] clearly no longer needed [08:55:54] yeah, I would remove it [08:56:03] the proxy already does that for you [08:56:14] twentyafterfour: awake? :) I'm good to go on the new scap version btw when you want/can (cc thcipriani) [08:56:16] and in this case, we prefered to have some "microdowntime" [08:56:26] of disallowing writes [08:56:30] funnily enough that configuration part is NOT just on the OTRS hosts exim4 [08:56:30] during the failover [08:56:55] it's on mx1001 as well in order to protect us against incoming spam to the otrs queues [08:56:58] it should be able to connect, though [08:57:42] probably it was there before the proxy happened [08:57:42] 10Operations, 10HHVM, 10User-Elukey: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4052538 (10Elitre) [08:58:04] could have been there when gerrit failed yesterday? 
[08:58:10] hmm it's trying once more to connect to the slave [08:58:12] due to the firewall issues? [08:58:24] yes [08:58:25] oh wait [08:58:28] akosiaris: which ip is that running from? [08:58:34] maybe it is a firewall issue [08:58:35] it can't connect to m2-master either now [08:58:50] probably never was and I misdiagnosed [08:58:52] maybe it needs a firewall exception [08:59:05] it could have been something new this week [08:59:17] 208.80.154.76 and 208.80.154.91 [08:59:21] check if you can connect from myql client [08:59:30] ok, so we need a firewall exception for that [08:59:32] no I can't [08:59:42] is it a hard outage? [08:59:42] telnet m2-master.eqiad.wmnet 3306 [08:59:42] Trying 10.64.32.156... [08:59:50] to understand the impact [09:00:02] emails are temporarily rejected [09:00:09] ok, I can fix that [09:00:23] it's not unbreak now, but it is unbreak within the next 1 hour tops [09:00:33] oh, I am fixing it [09:00:46] just it has been like that for 2 days probably [09:01:07] it can't be... we would be missing emails [09:01:32] aha.. 
it actually was [09:02:58] I 'll remove m2-slave from the config in the meantime [09:03:44] (03PS1) 10Jcrespo: dbproxy: Allow exim hosts to connect to misc proxies [puppet] - 10https://gerrit.wikimedia.org/r/419685 (https://phabricator.wikimedia.org/T189656) [09:04:00] (03CR) 10jerkins-bot: [V: 04-1] dbproxy: Allow exim hosts to connect to misc proxies [puppet] - 10https://gerrit.wikimedia.org/r/419685 (https://phabricator.wikimedia.org/T189656) (owner: 10Jcrespo) [09:04:09] jynus: lemme check is those are enough [09:04:16] we might have to add a few more [09:04:30] (03PS2) 10Jcrespo: dbproxy: Allow exim hosts to connect to misc proxies [puppet] - 10https://gerrit.wikimedia.org/r/419685 (https://phabricator.wikimedia.org/T189656) [09:04:48] yeah we do [09:05:04] ok, we can merge this right now or I can amend it [09:05:17] let's amend it, no need to hurry that much [09:05:24] please tell [09:05:43] RECOVERY - Check systemd state on puppetmaster2002 is OK: OK - running: The system is fully operational [09:05:44] wiki-mail-codfw.wikimedia.org, mx2001.wikimedia.org as well [09:05:49] that should be all [09:06:18] and you can do @resolve((mx1001.wikimedia.org mx2001.wikimedia.org etcetc)) [09:06:28] no need for multiple ferm::service definitions [09:06:41] I know, was going to do that [09:06:44] ok [09:06:51] I thought they were separate "services" [09:07:15] well the wiki-mail IP is not supposed to be used for outgoing connections to mysql [09:07:26] but source address selection in IPv6 sucks [09:07:32] so let's be prudent [09:07:39] we don't need ipv6 [09:07:41] not that we have IPv6 enabled right now for that thing [09:07:47] we will at some point in time [09:07:48] because mysql doesn't allow iv6 [09:07:53] ah true [09:07:56] I forgot about that [09:07:59] accounts right now [09:08:05] it supports them [09:08:19] anyway I was just being prudent [09:08:43] (03PS3) 10Jcrespo: dbproxy: Allow exim hosts to connect to misc proxies [puppet] - 
10https://gerrit.wikimedia.org/r/419685 (https://phabricator.wikimedia.org/T189656) [09:09:06] (03CR) 10Alexandros Kosiaris: [C: 031] dbproxy: Allow exim hosts to connect to misc proxies [puppet] - 10https://gerrit.wikimedia.org/r/419685 (https://phabricator.wikimedia.org/T189656) (owner: 10Jcrespo) [09:09:17] check carefuly for typos [09:09:35] I did [09:09:41] (03CR) 10Jcrespo: [C: 032] dbproxy: Allow exim hosts to connect to misc proxies [puppet] - 10https://gerrit.wikimedia.org/r/419685 (https://phabricator.wikimedia.org/T189656) (owner: 10Jcrespo) [09:10:00] (03PS1) 10Filippo Giunchedi: hieradata: pool puppetmaster2002 [puppet] - 10https://gerrit.wikimedia.org/r/419689 (https://phabricator.wikimedia.org/T184562) [09:10:08] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Pass --tiller-image to helm init [deployment-charts] - 10https://gerrit.wikimedia.org/r/419670 (owner: 10Alexandros Kosiaris) [09:10:48] rules added now [09:11:04] (03PS6) 10Vgutierrez: Re-add shell user ironholds/oliver keyes [puppet] - 10https://gerrit.wikimedia.org/r/416993 (https://phabricator.wikimedia.org/T188945) (owner: 10RobH) [09:11:30] exim4 restarted and I am monitoring logs [09:12:44] wow this was very confusing [09:12:59] ? [09:13:04] I am still logs for example from the various queue children still trying to connect to m2-slave [09:13:07] seeing* [09:13:21] despite having restarted the service [09:13:32] this doesn't help in debugging [09:13:38] can you check the connection manually? 
[09:13:45] it should at least get a denied permission [09:13:49] yeah everything is fine now [09:13:52] ok [09:13:59] but it was confusing debugging this [09:14:14] 6 different exim hosts, using m2-master, m2-slave [09:14:36] a bit of a mess [09:14:45] it consumed way way more time than I expected [09:15:34] 10Operations, 10ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#4052570 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff>03Papaul I tried to reimage, but the BIOS prints "PXE-E61: Media test failure, check cable" for 20 times or so until it falls back to attempting to boot f... [09:15:59] yeah, the m2-slave shouln't be there anymore since the proxies were setup- and that predates my @WMF [09:16:02] *me [09:16:33] !log reset slave all @db1051 [09:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:20] test email just went through [09:17:32] it didn't before? [09:17:36] nope [09:17:41] mmm [09:17:48] that is a long time without emails [09:17:51] and there is a probably a huge queue of these emails that is going to go through [09:18:02] about 2 days of emails :( [09:18:42] maybe some monitoring :-) ? [09:19:09] <_joe_> akosiaris: for otrs? [09:19:10] (03CR) 10Filippo Giunchedi: "I spot-checked some hosts using puppetmaster.test.codfw.wmnet and no errors and diffs reported" [puppet] - 10https://gerrit.wikimedia.org/r/419689 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [09:19:13] yes. We used to have some exim queue thing when on ganglia. We don't currently have something IIRC [09:19:22] _joe_: yes [09:19:41] <_joe_> ouch [09:19:43] akosiaris: I don't fully understand the impact- root email flowed apparently ok? 
[09:20:02] (03PS2) 10Vgutierrez: adding ironholds/oliver keyes to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/416996 (https://phabricator.wikimedia.org/T188945) (owner: 10RobH) [09:20:24] so, incoming mail servers (mx1001, mx2001) do a lookup in OTRS mysql for recipients [09:20:49] the same lookup happens on the otrs server's exim as well [09:21:00] the reason is to do a quick lookup in order to avoid incoming spam [09:21:02] (03CR) 10Vgutierrez: [C: 04-1] "waiting approval on next Monday meeting" [puppet] - 10https://gerrit.wikimedia.org/r/416996 (https://phabricator.wikimedia.org/T188945) (owner: 10RobH) [09:21:09] the latter was working but the former no [09:21:39] so the main incoming mail servers would return a 4XX code to senders temporarily rejecting emails [09:22:15] isn't rejection a bit hard, at least at exim level? [09:22:28] *temporarily* [09:22:32] sure [09:22:38] (03PS1) 10Muehlenhoff: Add a component for Varnish 5.1 [puppet] - 10https://gerrit.wikimedia.org/r/419690 (https://phabricator.wikimedia.org/T188545) [09:22:54] there is really no other action [09:23:05] you either accept, reject temporarily or permanently [09:23:16] we don't want either the accept or the permanent reject [09:23:47] 10Operations, 10Ops-Access-Requests, 10Research, 10Research-collaborations, and 2 others: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4052577 (10Vgutierrez) everything ready, your user will be added to analytics-privatedata-users after it's ap... 
[09:23:59] (03PS2) 10Alexandros Kosiaris: kubernetes: Allow setting service-node-port-range [puppet] - 10https://gerrit.wikimedia.org/r/419682 [09:24:01] (03PS1) 10Alexandros Kosiaris: exim4: Remove mentions of m2-slave [puppet] - 10https://gerrit.wikimedia.org/r/419693 [09:24:14] 10Operations, 10Ops-Access-Requests, 10Ops-Access-Reviews, 10Research, and 3 others: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4052582 (10Vgutierrez) 05Open>03stalled [09:24:49] grr that assign button in gerrit got the best of me once more [09:25:06] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes: Allow setting service-node-port-range [puppet] - 10https://gerrit.wikimedia.org/r/419682 (owner: 10Alexandros Kosiaris) [09:25:51] vgutierrez: I 've merged your patch 8316c3a as well [09:27:10] akosiaris: aparently there was a huge debt in or near m2- it was pooled and with 1200 days of uptime [09:27:29] (03PS2) 10Filippo Giunchedi: hieradata: pool puppetmaster2002 [puppet] - 10https://gerrit.wikimedia.org/r/419689 (https://phabricator.wikimedia.org/T184562) [09:27:56] wow [09:27:57] which means very lacking on maintenance [09:28:18] akosiaris: not a problem [09:28:21] (03PS1) 10Elukey: profile::hadoop: refactor depencencies to reduce confusion [puppet] - 10https://gerrit.wikimedia.org/r/419694 (https://phabricator.wikimedia.org/T188294) [09:29:20] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: pool puppetmaster2002 [puppet] - 10https://gerrit.wikimedia.org/r/419689 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [09:30:30] !log repool puppetmaster2002 [09:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:04] (03PS2) 10Elukey: profile::hadoop: refactor depencencies to reduce confusion [puppet] - 10https://gerrit.wikimedia.org/r/419694 (https://phabricator.wikimedia.org/T188294) [09:31:06] there is a spike in bad performance since yesterday at 22h [09:31:38] 
https://grafana.wikimedia.org/dashboard/db/navigation-timing-alerts [09:34:10] 10Operations, 10Packaging: Build .deb package of python3-typing for jessie - https://phabricator.wikimedia.org/T189729#4052597 (10Aklapper) -> #packaging [09:36:37] (03Abandoned) 10Muehlenhoff: Add a component for Varnish 5.1 [puppet] - 10https://gerrit.wikimedia.org/r/419690 (https://phabricator.wikimedia.org/T188545) (owner: 10Muehlenhoff) [09:38:21] 10Operations, 10Packaging: Build .deb package of python3-aiokafka - https://phabricator.wikimedia.org/T189741#4052600 (10Aklapper) [09:41:26] (03PS1) 10Volans: Fix tests indentation and formatting [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/419696 [09:41:28] (03PS1) 10Volans: Force docker dependency <3.0.0 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/419697 [09:41:55] (03CR) 10jerkins-bot: [V: 04-1] Fix tests indentation and formatting [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/419696 (owner: 10Volans) [09:42:32] right, I should send them inverted... 
[09:43:17] (03PS2) 10Volans: Force docker dependency <3.0.0 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/419697 [09:43:19] (03PS2) 10Volans: Fix tests indentation and formatting [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/419696 [09:44:01] (03CR) 10Hashar: [C: 032] Force docker dependency <3.0.0 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/419697 (owner: 10Volans) [09:44:29] (03CR) 10Hashar: [C: 032] Fix tests indentation and formatting [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/419696 (owner: 10Volans) [09:44:31] (03Merged) 10jenkins-bot: Force docker dependency <3.0.0 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/419697 (owner: 10Volans) [09:44:58] (03Merged) 10jenkins-bot: Fix tests indentation and formatting [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/419696 (owner: 10Volans) [09:45:52] oh, that was quick :D [09:52:36] (03CR) 10Jcrespo: [C: 031] "This is how I would do it- let the proxy failover or fail if necesary." 
[puppet] - 10https://gerrit.wikimedia.org/r/419693 (owner: 10Alexandros Kosiaris) [09:55:16] (03CR) 10Elukey: "So following this approach we'd end up changing Druid's deb policies:" [puppet] - 10https://gerrit.wikimedia.org/r/419694 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [09:55:35] (03CR) 10Jcrespo: [C: 031] "Not related to this patch, but you may want to check the mysql credentials on db1051 for otrs@ account: root@db1051:~$ pt-show-grants | gr" [puppet] - 10https://gerrit.wikimedia.org/r/419693 (owner: 10Alexandros Kosiaris) [09:56:35] !log installing curl security updates on Debian [09:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:53] !log apt.w.o: move varnish=5.1.3-1wm3, varnish-modules=0.12.1-1+wmf1, libvmod-netmapper=1.6-1 from jessie-wikimedia/experimental to jessie-wikimedia/main T188545 [09:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:58] T188545: Post Varnish 5 migration cleanup - https://phabricator.wikimedia.org/T188545 [10:02:48] (03CR) 10Ema: [C: 032] 5.1.3-1wm4: extrachance retry fixes from upstream [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/419460 (owner: 10Ema) [10:03:19] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Depool rdb1001 for kernel update (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419664 (owner: 10Muehlenhoff) [10:05:12] (03PS3) 10Muehlenhoff: Depool rdb1001 for kernel update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419664 [10:08:01] (03PS1) 10Gehel: elastic: decommission elastic1021 [puppet] - 10https://gerrit.wikimedia.org/r/419702 (https://phabricator.wikimedia.org/T189727) [10:09:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, and 2 others: decommission elastic1021 - https://phabricator.wikimedia.org/T189727#4052649 (10Gehel) [10:10:28] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack/setup/install 
analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4052652 (10elukey) >>! In T188294#4052448, @elukey wrote: > During the first puppet run I have seen two issues: > > 1) the /etc/hadoo... [10:10:43] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/mathoid on labpuppetmaster1001 is OK: No errors detected [10:11:24] RECOVERY - Confd template for /srv/config-master/pybal/codfw/mathoid on labpuppetmaster1001 is OK: No errors detected [10:11:58] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4052654 (10fgiunchedi) puppetmaster2002 was repooled today and is working as intended. puppetdb on nihal had a spike in commands processed while... [10:12:40] !log gehel@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=elastic1021.eqiad.wmnet [10:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:23] (03PS2) 10Alexandros Kosiaris: exim4: Remove mentions of m2-slave [puppet] - 10https://gerrit.wikimedia.org/r/419693 [10:13:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, and 2 others: decommission elastic1021 - https://phabricator.wikimedia.org/T189727#4052655 (10Gehel) Preliminary decommissioning steps are done (pending the merge of https://gerrit.wikimedia.org/r/#/c/419702/). A few notes: Since elastic1021 is dow... 
[10:17:17] (03PS3) 10Alexandros Kosiaris: exim4: Remove mentions of m2-slave [puppet] - 10https://gerrit.wikimedia.org/r/419693 [10:18:53] (03PS4) 10Muehlenhoff: Depool rdb1001 for kernel update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419664 [10:19:01] (03PS1) 10Filippo Giunchedi: hieradata: depool puppetmaster1002 for stretch reimage [puppet] - 10https://gerrit.wikimedia.org/r/419704 (https://phabricator.wikimedia.org/T184562) [10:22:16] !log apt.w.o: upload varnish=5.1.3-1wm4 to jessie-wikimedia/main (upstream "extrachance" fixes) T174932 [10:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:21] T174932: Recurrent 'mailbox lag' critical alerts and 500s - https://phabricator.wikimedia.org/T174932 [10:22:50] (03CR) 10Muehlenhoff: [C: 032] Depool rdb1001 for kernel update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419664 (owner: 10Muehlenhoff) [10:23:09] (03CR) 10jenkins-bot: Depool rdb1001 for kernel update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419664 (owner: 10Muehlenhoff) [10:24:40] (03CR) 10Faidon Liambotis: [C: 031] "As far as the exim4 syntax goes, that looks fine. 
The m2-master/slave part, I'll leave to Jaime :)" [puppet] - 10https://gerrit.wikimedia.org/r/419693 (owner: 10Alexandros Kosiaris) [10:24:47] !log jmm@tin Synchronized wmf-config/ProductionServices.php: Depooling rdb1001 for kernel security update (duration: 01m 14s) [10:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:05] (03PS4) 10Jcrespo: exim4: Remove mentions of m2-slave [puppet] - 10https://gerrit.wikimedia.org/r/419693 (owner: 10Alexandros Kosiaris) [10:26:32] (03PS2) 10Gehel: elastic: decommission elastic1021 [puppet] - 10https://gerrit.wikimedia.org/r/419702 (https://phabricator.wikimedia.org/T189727) [10:28:24] (03CR) 10Alexandros Kosiaris: [C: 031] exim4: Remove mentions of m2-slave [puppet] - 10https://gerrit.wikimedia.org/r/419693 (owner: 10Alexandros Kosiaris) [10:28:26] (03CR) 10Jcrespo: [C: 032] exim4: Remove mentions of m2-slave [puppet] - 10https://gerrit.wikimedia.org/r/419693 (owner: 10Alexandros Kosiaris) [10:28:54] (03PS4) 10Ahmed123: Enable rollbacker user right at arwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419535 (https://phabricator.wikimedia.org/T189732) [10:32:16] !log rebooting rdb1001 for kernel security update [10:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:38] (03CR) 10DCausse: "nitpick: in hieradata/regex.yaml there's a specific regex for elastic1021 we could remove:" [puppet] - 10https://gerrit.wikimedia.org/r/419702 (https://phabricator.wikimedia.org/T189727) (owner: 10Gehel) [10:39:59] (03PS3) 10Gehel: elastic: decommission elastic1021 [puppet] - 10https://gerrit.wikimedia.org/r/419702 (https://phabricator.wikimedia.org/T189727) [10:40:28] (03CR) 10Gehel: "I knew I would miss something... 
Corrected" [puppet] - 10https://gerrit.wikimedia.org/r/419702 (https://phabricator.wikimedia.org/T189727) (owner: 10Gehel) [10:40:30] (03PS5) 10Ahmed123: Enable rollbacker user right at arwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419535 (https://phabricator.wikimedia.org/T189732) [10:41:24] (03CR) 10DCausse: [C: 031] elastic: decommission elastic1021 [puppet] - 10https://gerrit.wikimedia.org/r/419702 (https://phabricator.wikimedia.org/T189727) (owner: 10Gehel) [10:41:52] (03PS1) 10Ema: varnish: remove gethdr_extrachance [puppet] - 10https://gerrit.wikimedia.org/r/419705 (https://phabricator.wikimedia.org/T181315) [10:42:31] (03CR) 10Vgutierrez: [C: 031] varnish: remove gethdr_extrachance [puppet] - 10https://gerrit.wikimedia.org/r/419705 (https://phabricator.wikimedia.org/T181315) (owner: 10Ema) [10:43:48] (03PS1) 10Muehlenhoff: Revert "Depool rdb1001 for kernel update" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419706 [10:44:24] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4052691 (10elukey) Also I found this bit interesting: https://puppet.com/docs/puppet/4.8/function.html#include ``` [10:46:44] (03CR) 10Ema: [C: 04-1] "Note that this should only be merged while upgrading the fleet, with puppet disabled. 
Otherwise this would result in dropping the paramete" [puppet] - 10https://gerrit.wikimedia.org/r/419705 (https://phabricator.wikimedia.org/T181315) (owner: 10Ema) [10:46:50] (03CR) 10Muehlenhoff: [C: 032] Revert "Depool rdb1001 for kernel update" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419706 (owner: 10Muehlenhoff) [10:47:32] (03PS1) 10Gehel: wdqs: add pigz package [puppet] - 10https://gerrit.wikimedia.org/r/419707 (https://phabricator.wikimedia.org/T187766) [10:48:38] (03CR) 10Ema: [C: 04-1] "> Note that this should only be merged while upgrading the fleet," [puppet] - 10https://gerrit.wikimedia.org/r/419705 (https://phabricator.wikimedia.org/T181315) (owner: 10Ema) [10:48:40] !log jmm@tin Synchronized wmf-config/ProductionServices.php: Repooling rdb1001 after kernel security update (duration: 01m 14s) [10:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:46] (03CR) 10jenkins-bot: Revert "Depool rdb1001 for kernel update" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419706 (owner: 10Muehlenhoff) [10:51:19] (03CR) 10Muehlenhoff: "We're already using pigz in the mariadb and bacula classes and since it's tiny and generally useful we could just as well add it to standa" [puppet] - 10https://gerrit.wikimedia.org/r/419707 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [10:52:36] (03PS1) 10Muehlenhoff: Depool rdb1003 for kernel update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419708 [10:53:30] (03CR) 10Gehel: "@jynus told me tried adding it to standard packages, but it was failing for some reason. 
Grepping through our puppet code, I could not und" [puppet] - 10https://gerrit.wikimedia.org/r/419707 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [10:54:45] (03CR) 10Jcrespo: "Related If1d8337a16cb48e1cfba10d" [puppet] - 10https://gerrit.wikimedia.org/r/419707 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [10:56:56] (03CR) 10Arturo Borrero Gonzalez: [C: 032] Toolforge:puppet-apt-pinning upgrading kubernetes version for PAWS [puppet] - 10https://gerrit.wikimedia.org/r/419599 (https://phabricator.wikimedia.org/T189680) (owner: 10Chico Venancio) [10:56:59] (03PS1) 10Jcrespo: Revert "Revert "Install parallel gzip (pigz) and parallel xz (pxz) on all servers"" [puppet] - 10https://gerrit.wikimedia.org/r/419709 [10:57:27] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Ping me if you need me to merge this." [puppet] - 10https://gerrit.wikimedia.org/r/419599 (https://phabricator.wikimedia.org/T189680) (owner: 10Chico Venancio) [10:58:19] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "Install parallel gzip (pigz) and parallel xz (pxz) on all servers"" [puppet] - 10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) [10:58:34] (03CR) 10Jcrespo: "I've create I44d410fdced4f6122e but now needs amends to remove it from mariadb and wdqs" [puppet] - 10https://gerrit.wikimedia.org/r/419707 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [10:59:47] (03CR) 10Muehlenhoff: [C: 032] Depool rdb1003 for kernel update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419708 (owner: 10Muehlenhoff) [11:00:14] (03CR) 10jenkins-bot: Depool rdb1003 for kernel update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419708 (owner: 10Muehlenhoff) [11:01:30] !log jmm@tin Synchronized wmf-config/ProductionServices.php: Depooling rdb1003 for kernel security update (duration: 01m 14s) [11:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:25] (03PS2) 10Esanders: Enable wgCiteResponsiveReferences on ptwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/419528 (https://phabricator.wikimedia.org/T189658) [11:04:33] !log rebooting rdb1003 for kernel security update [11:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:40] (03PS1) 10Gehel: base: add pigz to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/419710 [11:11:25] (03PS1) 10Muehlenhoff: Revert "Depool rdb1003 for kernel update" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419711 [11:13:55] (03CR) 10Muehlenhoff: [C: 032] Revert "Depool rdb1003 for kernel update" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419711 (owner: 10Muehlenhoff) [11:14:35] (03CR) 10Muehlenhoff: [V: 032 C: 032] Revert "Depool rdb1003 for kernel update" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419711 (owner: 10Muehlenhoff) [11:15:29] (03PS2) 10Gehel: Revert "Revert "Install parallel gzip (pigz) and parallel xz (pxz) on all servers"" [puppet] - 10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) [11:16:12] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "Install parallel gzip (pigz) and parallel xz (pxz) on all servers"" [puppet] - 10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) [11:16:31] (03PS3) 10Gehel: Revert "Revert "Install parallel gzip (pigz) and parallel xz (pxz) on all servers"" [puppet] - 10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) [11:16:36] !log jmm@tin Synchronized wmf-config/ProductionServices.php: Repooling rdb1003 after kernel security update (duration: 01m 14s) [11:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:26] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "Install parallel gzip (pigz) and parallel xz (pxz) on all servers"" [puppet] - 10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) [11:17:38] (03PS4) 10Gehel: Install parallel gzip (pigz) and parallel xz (pxz) on all servers [puppet] - 10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) 
[11:18:46] (03CR) 10jenkins-bot: Revert "Depool rdb1003 for kernel update" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419711 (owner: 10Muehlenhoff) [11:19:46] (03PS1) 10Muehlenhoff: Depool rdb1005 for kernel security update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419712 [11:25:53] (03CR) 10Muehlenhoff: [C: 032] Depool rdb1005 for kernel security update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419712 (owner: 10Muehlenhoff) [11:29:00] !log jmm@tin Synchronized wmf-config/ProductionServices.php: Depooling rdb1005 for kernel security update (duration: 01m 10s) [11:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:18] (03CR) 10jenkins-bot: Depool rdb1005 for kernel security update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419712 (owner: 10Muehlenhoff) [11:32:23] (03PS1) 10Marostegui: db-eqiad.php: Depool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419713 [11:33:26] (03PS1) 10Marostegui: es1013.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/419714 [11:38:14] PROBLEM - Check health of redis instance on 6381 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381 [11:38:49] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: depool puppetmaster1002 for stretch reimage [puppet] - 10https://gerrit.wikimedia.org/r/419704 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [11:38:53] PROBLEM - Check health of redis instance on 6379 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [11:38:57] (03PS2) 10Filippo Giunchedi: hieradata: depool puppetmaster1002 for stretch reimage [puppet] - 10https://gerrit.wikimedia.org/r/419704 (https://phabricator.wikimedia.org/T184562) [11:39:14] RECOVERY - Check health of redis instance on 6381 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 125308 keys, up 9 hours 25 minutes - 
replication_delay is 0 [11:39:54] RECOVERY - Check health of redis instance on 6379 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 126532 keys, up 9 hours 31 minutes - replication_delay is 0 [11:41:19] (03PS1) 10Muehlenhoff: Revert "Depool rdb1005 for kernel security update" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419716 [11:42:11] !log depool puppetmaster1002 for stretch reimage [11:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:53] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419713 (owner: 10Marostegui) [11:44:08] (03Merged) 10jenkins-bot: db-eqiad.php: Depool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419713 (owner: 10Marostegui) [11:44:39] (03PS1) 10Volans: python-build: remove requirement for source code [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/419717 [11:44:42] (03CR) 10Muehlenhoff: [C: 032] Revert "Depool rdb1005 for kernel security update" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419716 (owner: 10Muehlenhoff) [11:47:19] (03PS1) 10Arturo Borrero Gonzalez: openstack: reimage labtestmetal2001 [puppet] - 10https://gerrit.wikimedia.org/r/419718 (https://phabricator.wikimedia.org/T189682) [11:47:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool es1013 for socket path location update (duration: 01m 14s) [11:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:08] !log reimage puppetmaster1002 with stretch [11:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:01] (03CR) 10jenkins-bot: db-eqiad.php: Depool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419713 (owner: 10Marostegui) [11:49:49] !log jmm@tin Synchronized wmf-config/ProductionServices.php: Repooling rdb1005 after kernel security update (duration: 01m 14s) [11:49:53] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:13] !log rebooted rdb1005 for kernel security update [11:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:27] (03PS1) 10Muehlenhoff: Depool rdb1007 for kernel security update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419719 [11:52:04] !log Stop MySQL on es1013 for socket path upgrade [11:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:33] (03PS2) 10Marostegui: es1013.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/419714 [11:54:05] (03CR) 10Marostegui: [C: 032] es1013.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/419714 (owner: 10Marostegui) [11:54:18] (03CR) 10Muehlenhoff: [C: 032] Depool rdb1007 for kernel security update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419719 (owner: 10Muehlenhoff) [11:55:24] (03PS3) 10Alexandros Kosiaris: Remove the now defunct ops-staff-group [puppet] - 10https://gerrit.wikimedia.org/r/409872 [11:55:41] (03CR) 10Alexandros Kosiaris: "this has been sitting around long enough. 
merging" [puppet] - 10https://gerrit.wikimedia.org/r/409872 (owner: 10Alexandros Kosiaris) [11:55:49] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove the now defunct ops-staff-group [puppet] - 10https://gerrit.wikimedia.org/r/409872 (owner: 10Alexandros Kosiaris) [11:56:05] (03CR) 10Giuseppe Lavagetto: [C: 031] python-build: remove requirement for source code [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/419717 (owner: 10Volans) [11:56:07] !log jmm@tin Synchronized wmf-config/ProductionServices.php: Depooling rdb1007 for kernel security update (duration: 01m 14s) [11:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:51] !log rebooting rdb1007 for kernel security update [11:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:28] (03PS1) 10Volans: Initial import [software/puppetboard/deploy] - 10https://gerrit.wikimedia.org/r/419720 [12:00:30] (03PS1) 10Volans: Built artifacts for jessie and stretch [software/puppetboard/deploy] - 10https://gerrit.wikimedia.org/r/419721 [12:00:36]  [12:02:34] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419724 [12:02:36] (03PS1) 10Rduran: Encapsulate the osc behavior into a class [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 [12:03:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419724 (owner: 10Marostegui) [12:05:07] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419724 (owner: 10Marostegui) [12:05:30] (03PS1) 10Muehlenhoff: Revert "Depool rdb1007 for kernel security update" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419726 [12:06:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool es1013 after socket path location update (duration: 01m 14s) [12:06:55] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:23] (03CR) 10Muehlenhoff: [C: 032] Revert "Depool rdb1007 for kernel security update" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419726 (owner: 10Muehlenhoff) [12:08:36] (03CR) 10Volans: [V: 032 C: 032] python-build: remove requirement for source code [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/419717 (owner: 10Volans) [12:08:51] !log jmm@tin Synchronized wmf-config/ProductionServices.php: Repooling rdb1007 after kernel security update (duration: 01m 14s) [12:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:43] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: reimage labtestmetal2001 [puppet] - 10https://gerrit.wikimedia.org/r/419718 (https://phabricator.wikimedia.org/T189682) (owner: 10Arturo Borrero Gonzalez) [12:09:52] (03PS2) 10Arturo Borrero Gonzalez: openstack: reimage labtestmetal2001 [puppet] - 10https://gerrit.wikimedia.org/r/419718 (https://phabricator.wikimedia.org/T189682) [12:26:16] !log T189682 reimage labtestmetal2001 with jessie and a new partition layout [12:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:22] T189682: reimage labtestmetal2001 with virt compatible partman as Jessie - https://phabricator.wikimedia.org/T189682 [12:30:48] (03PS1) 10Jcrespo: prometheus-mysql-exporter: Reflect latest m2 changes, remove dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/419727 (https://phabricator.wikimedia.org/T186596) [12:32:06] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4052922 (10Marostegui) [12:33:50] 10Operations, 10DBA, 10Patch-For-Review: Switchover m2 master from db1020 to db1051 - https://phabricator.wikimedia.org/T189656#4052935 (10jcrespo) [12:33:55] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and 
x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4052936 (10jcrespo) [12:33:58] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4052937 (10jcrespo) [12:36:02] 10Operations, 10DBA, 10Patch-For-Review: Switchover m2 master from db1020 to db1051 - https://phabricator.wikimedia.org/T189656#4052954 (10jcrespo) [12:36:07] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4052955 (10jcrespo) [12:36:33] !log installing curl security updates on jessie/stretch [12:36:34] 10Operations, 10DBA, 10Patch-For-Review: Switchover m2 master from db1020 to db1051 - https://phabricator.wikimedia.org/T189656#4049095 (10jcrespo) [12:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:44] (03PS2) 10Volans: Initial import [software/puppetboard/deploy] - 10https://gerrit.wikimedia.org/r/419720 [12:36:46] (03PS2) 10Volans: Built artifacts for jessie and stretch [software/puppetboard/deploy] - 10https://gerrit.wikimedia.org/r/419721 [12:38:19] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4052963 (10jcrespo) [12:39:56] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4052967 (10jcrespo) a:05Marostegui>03jcrespo [12:40:43] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4052968 (10Marostegui) [12:42:24] (03PS1) 10Ayounsi: LibreNMS: IRC alerts on -operations [puppet] - 10https://gerrit.wikimedia.org/r/419731 [12:42:28] (03CR) 10Jcrespo: [C: 032] prometheus-mysql-exporter: Reflect latest m2 changes, remove dbstore1001 [puppet] - 
10https://gerrit.wikimedia.org/r/419727 (https://phabricator.wikimedia.org/T186596) (owner: 10Jcrespo) [12:51:05] (03PS2) 10Rduran: Encapsulate the osc behavior into a class [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 [12:54:53] !log Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds [12:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:58] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [12:55:01] sjoerddebruin2: ^ [12:57:48] (03PS1) 10Rush: openstack: pass in relevant params for template fulfillment [puppet] - 10https://gerrit.wikimedia.org/r/419733 (https://phabricator.wikimedia.org/T188266) [12:58:29] 10Operations, 10Community-Liaisons, 10Security-Reviews, 10Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#4053007 (10Elitre) >>! In T109606#4045762, @Bawolff wrote: > So to clarify - There is still interest in using lime survey, right (The third party site, not the software pack... [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 8 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180315T1300). [13:00:05] rxy and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:09] (03CR) 10Rush: "labtestcontrol2003.wikimedia.org,labtestneutron2001.codfw.wmnet,labtestvirt2003.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/419733 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:00:20] I can SWAT today [13:00:55] Urbanecm_: your patch has merge conflict [13:01:03] I'm here [13:01:08] Urbanecm_: (on rebase) [13:01:13] please rebase your patch [13:01:18] zeljkof, will do in a minute [13:01:38] rxy: around for swat? [13:01:50] (03PS2) 10Zfilipin: Change autoconfirmed settings and Enable flood group at zhwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418008 (https://phabricator.wikimedia.org/T189289) (owner: 10Rxy) [13:02:09] PROBLEM - dhclient process on labtestmetal2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:02:33] zeljkof, if rxy isn't around, I can check his patch too [13:03:03] Urbanecm: great, apparently he is not around, please do [13:03:07] (03CR) 10Rush: [C: 032] "http://puppet-compiler.wmflabs.org/10458/" [puppet] - 10https://gerrit.wikimedia.org/r/419733 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:03:13] (03PS2) 10Rush: openstack: pass in relevant params for template fulfillment [puppet] - 10https://gerrit.wikimedia.org/r/419733 (https://phabricator.wikimedia.org/T188266) [13:03:23] zeljkof, ok, taking it over. Please push it to mwdebug1002 and ping me :) [13:03:58] PROBLEM - puppet last run on labtestmetal2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:04:08] ^ arturo as it came back up it started alerting [13:04:43] I see [13:04:47] (03PS2) 10Urbanecm: New throttle rule, clean expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418875 (https://phabricator.wikimedia.org/T189442) [13:04:55] zeljkof, my throttle patch was rebased [13:05:09] RECOVERY - dhclient process on labtestmetal2001 is OK: PROCS OK: 0 processes with command name dhclient [13:05:14] chasemp: but it was downtimed, right? 
`12:14:33 | labtestmetal2001.codfw.wmnet | Downtimed on Icinga` [13:05:46] arturo: when you reinstall and it drops out of icinga during resource collection and gets recreated w/ new install it does not carry over downtime iiuc [13:05:55] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418875 (https://phabricator.wikimedia.org/T189442) (owner: 10Urbanecm) [13:06:20] I usually check icinga as a host comes up reinstalled to re-silence, maybe there is some better way I don't know [13:06:20] Urbanecm: I'll deploy throttle rule first, I'll ping you when the second commit is at mwdebug [13:06:47] zeljkof, order is entirely up to you ;). Ack, will wait. [13:07:19] (03Merged) 10jenkins-bot: New throttle rule, clean expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418875 (https://phabricator.wikimedia.org/T189442) (owner: 10Urbanecm) [13:08:09] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418008 (https://phabricator.wikimedia.org/T189289) (owner: 10Rxy) [13:09:24] (03Merged) 10jenkins-bot: Change autoconfirmed settings and Enable flood group at zhwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418008 (https://phabricator.wikimedia.org/T189289) (owner: 10Rxy) [13:09:50] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:418875|New throttle rule, clean expired rules (T189442)]] (duration: 01m 15s) [13:09:54] !log restarting HHVM on canaries to pick up curl security update [13:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:56] T189442: Request to lift the cap on IP address to create accounts on mrwiki - https://phabricator.wikimedia.org/T189442 [13:09:59] Urbanecm: throttle rule deployed [13:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:05] ack [13:10:42] Urbanecm: the other patch is at mwdebug [13:11:04] zeljkof, testing [13:12:46] zeljkof, working, please 
deploy [13:12:56] Urbanecm: deploying [13:13:00] ack [13:14:05] zeljkof: Hello [13:14:14] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:418008|Change autoconfirmed settings and Enable flood group at zhwikiquote (T189289)]] (duration: 01m 15s) [13:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:20] T189289: Request to change autoconfirmed settings, allow autoconfirmed user to suppress redirects and allow sysop to grant and remove flood flags on zh.wikiquote - https://phabricator.wikimedia.org/T189289 [13:14:23] Jayprakash12345: hi [13:14:32] Urbanecm: deployed, please test [13:14:54] Jayprakash12345: I see your patch, ready for deployment :) [13:15:02] Seems to be fine, thanks! [13:15:12] Jayprakash12345: I'll ping you in a few minutes when it's at mwdebug [13:15:12] zeljkof: In my patch, 417762, a script needs to be run before merging the patch [13:15:22] Jayprakash12345: _before_ merging? [13:15:30] it's usually _after_ [13:15:43] zeljkof: In my patch, 417762, mwscript extensions/WikimediaMaintenance/createExtensionTables.php shorturl --wiki=knwikisource [13:15:44] Urbanecm: thanks for deploying with #releng! ;) [13:16:12] Well, I cannot deploy my patches with anyone else :D [13:16:22] Urbanecm: :D [13:16:42] zeljkof, Jayprakash12345's patch needs the script _after_ :) [13:16:45] Syntax is in the patch itself [13:16:52] Have a nice day both zeljkof and Jayprakash12345! 
[13:17:03] Urbanecm: thanks for the tip [13:17:36] yw [13:17:44] zeljkof: See marco's Comment on https://phabricator.wikimedia.org/T189287 [13:18:14] (03PS2) 10Zfilipin: Enable ShortUrl Extension at knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417762 (https://phabricator.wikimedia.org/T189287) (owner: 10Jayprakash12345) [13:18:17] 10Operations, 10DBA, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532#4053048 (10Paladox) [13:18:42] (03PS3) 10Rduran: Encapsulate the osc behavior into a class [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 [13:19:24] Jayprakash12345: looking.... o.O [13:21:14] (03PS4) 10Rduran: [WIP] Add port of osc_host.sh [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 [13:22:02] ^LOL@ TODO: write TODOs [13:23:34] (03PS5) 10Rduran: [WIP] Add port of osc_host.sh [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 [13:25:03] Jayprakash12345: where did you find createExtensionTables.php? [13:25:12] Extension:ShortUrl [13:25:36] sorry, Extension:ShortUrl says `(Optional) Run populateShortUrlTable.php` [13:26:35] https://gerrit.wikimedia.org/r/#/c/416338/ [13:26:38] like [13:26:50] Jayprakash12345: instructions say to run update.php? [13:27:02] 10Operations, 10ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#4053082 (10Papaul) @MoritzMuehlenhoff the cable is connected, please check switch side if the port is active. I have no light on that port. 
[13:27:21] I am not fully sure, but this extension needs a DB table [13:27:38] hashar: Are you here? [13:28:21] Jayprakash12345: that patch is for a different extension :| [13:28:41] Jayprakash12345: I am reluctant to deploy it if neither of us knows which script to run :| [13:29:35] see https://gerrit.wikimedia.org/r/#/c/386779/ [13:29:43] Hashar [13:29:43] Nov 7 7:41 PM [13:29:43] ↩ [13:29:43] Patch Set 6: Code-Review+2 [13:29:43] Table created [13:29:44] $ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=pawiki ShortUrl [13:29:45] Creating ShortUrl tables...done! [13:30:23] (03Abandoned) 10Rduran: [WIP] Add port of osc_host.sh [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419366 (owner: 10Rduran) [13:30:37] zeljkof: ? [13:31:29] Jayprakash12345: looking [13:33:06] !log T189682 reimage labtestmetal2001 with jessie and a new partition layout, again. Last time didn't pick the right partman config [13:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:12] T189682: reimage labtestmetal2001 with virt compatible partman as Jessie - https://phabricator.wikimedia.org/T189682 [13:33:39] (03CR) 10Zfilipin: "According to 386779:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417762 (https://phabricator.wikimedia.org/T189287) (owner: 10Jayprakash12345) [13:34:39] Jayprakash12345: ok, looks like hashar did it and nothing broke, will deploy [13:35:17] <_joe_> I would suggest you should not deploy unless you're sure of what you are doing :) [13:35:52] _joe_: _sure_?! I'm never _sure_ what I am doing ;) [13:36:08] zeljkof: What is the result of mwscript extensions/WikimediaMaintenance/createExtensionTables.php shorturl --wiki=knwikisource ? 
[13:36:11] <_joe_> zeljkof: well apart from the healthy skepticism I mean :P [13:36:17] I mean, I'm deploying code other people wrote, there are always things that can go wrong there [13:36:34] Jayprakash12345: did not run it yet, just a minute [13:36:55] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417762 (https://phabricator.wikimedia.org/T189287) (owner: 10Jayprakash12345) [13:37:27] _joe_: it's a very similar patch, and hashar (whom I trust) ran the script, so it's likely the thing to do [13:38:00] (03PS1) 10Marostegui: db-eqiad.php: Fully repool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419735 [13:38:09] (03Merged) 10jenkins-bot: Enable ShortUrl Extension at knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417762 (https://phabricator.wikimedia.org/T189287) (owner: 10Jayprakash12345) [13:40:22] Jayprakash12345: the patch is at mwdebug1002, I did not run the script yet, will in a minute, you can start testing [13:41:29] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:41:29] PROBLEM - puppet last run on mc1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:42:10] PROBLEM - puppet last run on dbproxy1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:42:19] PROBLEM - puppet last run on wtp1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:43:19] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 148462969 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:43:19] PROBLEM - puppet last run on dbproxy1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:43:29] PROBLEM - puppet last run on mw1313 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:43:50] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:44:10] PROBLEM - puppet last run on db1088 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:44:10] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:44:10] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:44:20] PROBLEM - puppet last run on cp1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:44:23] (03CR) 10Zfilipin: [C: 032] "zfilipin@terbium:~$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=knwikisource ShortUrl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417762 (https://phabricator.wikimedia.org/T189287) (owner: 10Jayprakash12345) [13:44:51] Jayprakash12345: I have run the script, looks good, please test at mwdebug1002 and let me know if I can deploy [13:44:58] ok [13:45:10] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 79127197 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:45:19] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 6924 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:46:25] zeljkof: Looks Good,https://kn.wikisource.org/s/2 Please deploy [13:47:03] Jayprakash12345: ok, deploying [13:47:52] (03PS1) 10Andrew Bogott: Rename role::mariadb::wikitech to role::mariadb::labtestwikitech [puppet] - 10https://gerrit.wikimedia.org/r/419736 [13:48:10] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 5298 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:48:30] I'm taking a look at the puppet 
failures [13:48:33] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:418008|Change autoconfirmed settings and Enable flood group at zhwikiquote (T189289)]] (duration: 01m 14s) [13:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:38] T189289: Request to change autoconfirmed settings, allow autoconfirmed user to suppress redirects and allow sysop to grant and remove flood flags on zh.wikiquote - https://phabricator.wikimedia.org/T189289 [13:48:50] Jayprakash12345: deployed, please test and thanks for deploying with #releng! ;) [13:48:52] looks like nitrogen oom'ing tho [13:49:10] RECOVERY - puppet last run on db1088 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:49:11] !log EU SWAT finished [13:49:12] zeljkof: Thanks [13:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:21] !log reboot kafka1001 (eventbus/job-queues eqiad) for kernel updates [13:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:48] (03PS1) 10Rush: openstack: trial of mixed mitaka/liberty nova compute [puppet] - 10https://gerrit.wikimedia.org/r/419737 (https://phabricator.wikimedia.org/T187954) [14:01:34] Urbanecm: thanks for testing and deploy my patch. 
i forgot that [14:03:54] (03PS2) 10Rush: openstack: trial of mixed mitaka/liberty nova compute [puppet] - 10https://gerrit.wikimedia.org/r/419737 (https://phabricator.wikimedia.org/T187954) [14:04:18] * rxy annoying summer time [14:04:25] (03PS1) 10Imarlier: navtiming.py: Make sure to record country specific when oversampling [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) [14:04:30] (03CR) 10jerkins-bot: [V: 04-1] openstack: trial of mixed mitaka/liberty nova compute [puppet] - 10https://gerrit.wikimedia.org/r/419737 (https://phabricator.wikimedia.org/T187954) (owner: 10Rush) [14:04:53] (03CR) 10jerkins-bot: [V: 04-1] navtiming.py: Make sure to record country specific when oversampling [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [14:05:07] (03CR) 10Rush: "labtestvirt2003.codfw.wmnet,labtestvirt2001.codfw.wmnet,labvirt1001.eqiad.wmnet,labvirt1002.eqiad.wmnet,labvirt1021.eqiad.wmnet,labvirt102" [puppet] - 10https://gerrit.wikimedia.org/r/419737 (https://phabricator.wikimedia.org/T187954) (owner: 10Rush) [14:08:44] RECOVERY - puppet last run on dbproxy1002 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [14:09:14] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [14:09:24] RECOVERY - puppet last run on cp1073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:09:49] (03CR) 10Andrew Bogott: "For the moment this doesn't work on Stretch because of lots of things like" [puppet] - 10https://gerrit.wikimedia.org/r/419736 (owner: 10Andrew Bogott) [14:11:24] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:11:24] RECOVERY - puppet last run on mc1020 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:12:14] RECOVERY - 
puppet last run on dbproxy1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:12:15] RECOVERY - puppet last run on wtp1028 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:13:34] RECOVERY - puppet last run on mw1313 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:13:54] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:14:15] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:16:17] (03PS3) 10Elukey: profile::hadoop: refactor depencencies to reduce confusion [puppet] - 10https://gerrit.wikimedia.org/r/419694 (https://phabricator.wikimedia.org/T188294) [14:16:22] (03PS2) 10Imarlier: navtiming.py: Make sure to record country specific when oversampling [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) [14:16:52] (03CR) 10jerkins-bot: [V: 04-1] navtiming.py: Make sure to record country specific when oversampling [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [14:19:08] 10Operations, 10cloud-services-team: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4053181 (10chasemp) [14:21:25] 10Operations, 10cloud-services-team: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4053183 (10chasemp) a:05chasemp>03RobH >>! In T183937#4051948, @RobH wrote: > Ok, escalating this to @chasemp for completion. The systems are installed and calling into puppet. Their 1G port... 
[14:21:48] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/10460/ doesn't touch druid anymore :)" [puppet] - 10https://gerrit.wikimedia.org/r/419694 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [14:22:10] 10Operations, 10ORES, 10Scoring-platform-team (Current): Reboot oresrdb - https://phabricator.wikimedia.org/T189781#4053189 (10Halfak) [14:23:14] PROBLEM - Host oresrdb2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:23:41] (03PS3) 10Rush: wip: openstack: trial of mixed mitaka/liberty nova compute [puppet] - 10https://gerrit.wikimedia.org/r/419737 (https://phabricator.wikimedia.org/T187954) [14:23:43] that's me ^ [14:24:00] (03PS3) 10Imarlier: navtiming.py: Make sure to record country specific when oversampling [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) [14:24:18] (03CR) 10jerkins-bot: [V: 04-1] wip: openstack: trial of mixed mitaka/liberty nova compute [puppet] - 10https://gerrit.wikimedia.org/r/419737 (https://phabricator.wikimedia.org/T187954) (owner: 10Rush) [14:25:28] (03CR) 10Ottomata: profile::hadoop: refactor depencencies to reduce confusion (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419694 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [14:27:14] (03CR) 10Elukey: profile::hadoop: refactor depencencies to reduce confusion (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419694 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [14:28:13] RECOVERY - Host oresrdb2002 is UP: PING OK - Packet loss = 0%, RTA = 36.06 ms [14:30:56] !log reboot druid1004 for kernel updates [14:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:04] (03CR) 10Ottomata: profile::hadoop: refactor depencencies to reduce confusion (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419694 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [14:43:15] (03CR) 10Elukey: ">" (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/419694 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [14:46:33] PROBLEM - Host oresrdb1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:46:42] that's me ^ [14:46:48] (03PS4) 10Elukey: profile::hadoop: refactor depencencies to reduce confusion [puppet] - 10https://gerrit.wikimedia.org/r/419694 (https://phabricator.wikimedia.org/T188294) [14:47:03] RECOVERY - Host oresrdb1001 is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms [14:54:30] (03PS1) 10Giuseppe Lavagetto: Convert netbox to use the docker-pkg build system [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/419744 [14:54:37] (03CR) 10Ottomata: [C: 031] "One nit, but +1 from me! Merge at will thank you!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419694 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [14:54:50] Do we have any issues with the job queue right now? [14:55:09] 13 global renames haven't even started [14:58:27] <_joe_> Hauskatze: what job is a global rename, sorry? [14:58:32] (03PS5) 10Elukey: profile::hadoop: refactor depencencies to reduce confusion [puppet] - 10https://gerrit.wikimedia.org/r/419694 (https://phabricator.wikimedia.org/T188294) [14:58:57] _joe_: either LocalRenameJob or RenameJob afaik [14:59:14] <_joe_> Hauskatze: eek yeah something went very wrong with the redis restarts this morning [14:59:29] they're stuck then? 
[14:59:59] according to wikitech [15:00:00] <_joe_> Hauskatze: yeah I can fix this [15:00:01] mwscript showJobs.php --wiki=<...> --type LocalRenameUserJob [15:00:01] mwscript showJobs.php --wiki=<...> --type RenameUserJob [15:00:07] those are the jobs [15:00:11] <_joe_> Hauskatze: no don't worry I know what happened [15:00:22] _joe_: appreciate your assistance :) [15:00:53] if you're using the unblock script, please avoid --ignorestatus or Special:CentralAuth dates, etc will be messed [15:00:57] (03PS2) 10Marostegui: db-eqiad.php: Fully repool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419735 [15:01:10] (03CR) 10Elukey: [C: 032] profile::hadoop: refactor depencencies to reduce confusion [puppet] - 10https://gerrit.wikimedia.org/r/419694 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [15:01:28] <_joe_> Hauskatze: nope, the issue is outside of mediawiki [15:01:47] <_joe_> it's a tiny crappy piece of techdebt I can't wait to remove from existance [15:02:15] I'll leave it then on your experienced hands :) [15:02:36] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419735 (owner: 10Marostegui) [15:05:04] <_joe_> !log restarted jobrunner, jobchron on the eqiad jobrunners [15:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:06] <_joe_> Hauskatze: your job should start being processed now [15:07:36] _joe_ did they stop processing? 
[15:07:50] <_joe_> elukey: since the redis restarts [15:07:51] (03PS4) 10Imarlier: navtiming.py: Make sure to record country specific when oversampling [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) [15:08:05] <_joe_> because they fail so badly that the php children threads do not return [15:08:14] <_joe_> so the main process remains up [15:08:22] * elukey cries in a corner [15:08:23] <_joe_> and systemd cannot know it needs to restart them [15:08:29] _joe_: yes, I see on centralauth_p that they're being picked up [15:08:35] thanks [15:08:57] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419735 (owner: 10Marostegui) [15:10:13] <_joe_> Hauskatze: thanks for reporting this [15:10:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool es1013 after socket path location update (duration: 01m 15s) [15:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:35] <_joe_> elukey: yeah, shame on me, I should've remembered [15:10:48] _joe_: my pleasure [15:10:52] <_joe_> or at least check the queue after moritzm did the reboots [15:10:54] <_joe_> sigh [15:10:56] (03PS2) 10Chico Venancio: Toolforge:puppet-apt-pinning upgrading kubernetes version for PAWS [puppet] - 10https://gerrit.wikimedia.org/r/419599 (https://phabricator.wikimedia.org/T189680) [15:12:22] _joe_: I would say thank you to have fixed the issue in no time but as you prefer :D [15:13:01] _joe_: maybe Type=forking in the systemd unit might help? 
depends on the details though [15:13:17] !log Stop MySQL on s6 codfw master (db2039) this will break replication on s6 codfw [15:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:34] hoo|away: thanks [15:17:02] :) [15:17:35] <_joe_> volans: no, it won't [15:17:57] <_joe_> volans: type=forking helps only with certain daemons because the parent process exits [15:18:16] <_joe_> here it stays running, but it basically does nothing but hold the children running :P [15:18:38] agree it doesn't help in this case then [15:18:54] <_joe_> but yes, it could be fixed [15:19:03] <_joe_> we could have better alerts on such events [15:19:13] <_joe_> so many things we could improve [15:19:17] <_joe_> if only we had time [15:19:37] <_joe_> and btw, today this happened because I was in a rush and working with three people at the same time [15:20:07] <_joe_> else I would've remembered this from last time :P [15:20:08] !log reboot druid1003 for kernel updates [15:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:20] no procedure should be dependent on human memory ;) [15:21:23] s/memory// [15:22:46] <_joe_> volans: well, this is a bug in a script that has seen maybe 1 week of work in its multi-year history of running the jobqueue for wikimedia, we're several levels below solid, documented systems [15:24:00] <_joe_> and btw, we're already in the process of replacing this system with a newer, less broken one :P [15:27:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, and 2 others: decommission elastic1021 - https://phabricator.wikimedia.org/T189727#4053338 (10RobH) a:05Gehel>03RobH Thanks @gehel, I'll steal and proceed from here! 
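The failure mode _joe_ describes (the parent process stays up while its children no longer return, so the service manager sees nothing wrong) is the kind of thing a progress-based check can catch. What follows is only a minimal sketch, not the jobrunner's actual code: `RunnerState`, its fields, and the 300-second threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunnerState:
    """Hypothetical snapshot of a job runner; none of these fields come from the log."""
    parent_alive: bool        # the only thing a plain process supervisor can observe
    last_job_finished: float  # UNIX timestamp of the most recently completed job
    queued_jobs: int          # number of jobs currently waiting


def looks_stuck(state: RunnerState, now: float, max_idle_seconds: float = 300.0) -> bool:
    """Return True when the parent is up but queued work has not completed recently."""
    if not state.parent_alive:
        # A dead parent is the case the supervisor already detects and restarts.
        return False
    if state.queued_jobs == 0:
        # Nothing is waiting, so a lack of progress is expected.
        return False
    return (now - state.last_job_finished) > max_idle_seconds
```

A check like this could feed an alert of the sort _joe_ mentions wanting, since it looks at job progress and not only at whether the main process exists.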
[15:27:30] <_joe_> I'll write an incident report when I have the time to [15:28:09] (03PS1) 10Elukey: Assign role::analytics_cluster::hadoop::worker to analytics1076 [puppet] - 10https://gerrit.wikimedia.org/r/419757 (https://phabricator.wikimedia.org/T188294) [15:30:57] (03CR) 10Elukey: [C: 032] Assign role::analytics_cluster::hadoop::worker to analytics1076 [puppet] - 10https://gerrit.wikimedia.org/r/419757 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [15:31:45] 10Operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#4053354 (10Aklapper) What's left to do in this task? [15:37:36] (03PS1) 10Filippo Giunchedi: hieradata: add puppetmaster1002 back, offline [puppet] - 10https://gerrit.wikimedia.org/r/419758 (https://phabricator.wikimedia.org/T184562) [15:39:06] (03PS2) 10Filippo Giunchedi: hieradata: add puppetmaster1002 back, offline [puppet] - 10https://gerrit.wikimedia.org/r/419758 (https://phabricator.wikimedia.org/T184562) [15:39:58] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: add puppetmaster1002 back, offline [puppet] - 10https://gerrit.wikimedia.org/r/419758 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [15:41:31] (03PS1) 10Andrew Bogott: Script and config to sync images from wikitech [wikitech-static] - 10https://gerrit.wikimedia.org/r/419760 [15:42:34] (03PS2) 10Bstorm: toolsdb: Remove stale accounts if present in maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/419630 (https://phabricator.wikimedia.org/T188680) [15:45:18] (03CR) 10Bstorm: toolsdb: Remove stale accounts if present in maintain-dbusers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419630 (https://phabricator.wikimedia.org/T188680) (owner: 10Bstorm) [15:46:58] (03PS1) 10Rush: openstack: apply profile::openstack::labtestn::neutron::ml2 [puppet] - 10https://gerrit.wikimedia.org/r/419761 (https://phabricator.wikimedia.org/T188266) [15:47:40] !log installing libvirt security updates 
[15:47:41] (03CR) 10Rush: [C: 032] openstack: apply profile::openstack::labtestn::neutron::ml2 [puppet] - 10https://gerrit.wikimedia.org/r/419761 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [15:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:44] (03PS1) 10Volans: Puppetboard: add role, profile and configuration [puppet] - 10https://gerrit.wikimedia.org/r/419762 (https://phabricator.wikimedia.org/T184563) [15:48:46] (03PS1) 10Volans: Puppetboard: add varnish director entries [puppet] - 10https://gerrit.wikimedia.org/r/419763 (https://phabricator.wikimedia.org/T184563) [15:49:19] (03PS1) 10Filippo Giunchedi: hieradata: repool puppetmaster1002 [puppet] - 10https://gerrit.wikimedia.org/r/419764 (https://phabricator.wikimedia.org/T184562) [15:50:03] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: repool puppetmaster1002 [puppet] - 10https://gerrit.wikimedia.org/r/419764 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [15:50:32] (03CR) 10Chico Venancio: [C: 031] toolsdb: Remove stale accounts if present in maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/419630 (https://phabricator.wikimedia.org/T188680) (owner: 10Bstorm) [15:51:19] !log repool puppetmaster1002 [15:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:48] (03PS1) 10Rush: openstack: type in template path for ml2 [puppet] - 10https://gerrit.wikimedia.org/r/419765 (https://phabricator.wikimedia.org/T188266) [15:52:07] (03PS2) 10Rush: openstack: type in template path for ml2 [puppet] - 10https://gerrit.wikimedia.org/r/419765 (https://phabricator.wikimedia.org/T188266) [15:52:32] (03CR) 10jerkins-bot: [V: 04-1] openstack: type in template path for ml2 [puppet] - 10https://gerrit.wikimedia.org/r/419765 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [15:53:25] (03PS3) 10Rush: openstack: typo in template path for ml2 [puppet] - 10https://gerrit.wikimedia.org/r/419765 
(https://phabricator.wikimedia.org/T188266) [15:53:35] (03PS4) 10Rush: openstack: typo in template path for ml2 [puppet] - 10https://gerrit.wikimedia.org/r/419765 (https://phabricator.wikimedia.org/T188266) [15:54:00] (03CR) 10jerkins-bot: [V: 04-1] openstack: typo in template path for ml2 [puppet] - 10https://gerrit.wikimedia.org/r/419765 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [15:54:56] (03PS5) 10Rush: openstack: typo in template path for ml2 [puppet] - 10https://gerrit.wikimedia.org/r/419765 (https://phabricator.wikimedia.org/T188266) [15:55:45] (03PS6) 10Rush: openstack: typo in template path for ml2 [puppet] - 10https://gerrit.wikimedia.org/r/419765 (https://phabricator.wikimedia.org/T188266) [15:56:19] !log Stop MySQL on s5 codfw master (db2052) this will break replication on s5 codfw [15:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:30] Reedy: I'm looking at https://www.mediawiki.org/wiki/Manual:Imagelinks_table, specifically "Note that some images may be on a foreign file repository." [15:56:49] !log pruning obsolete packages from jessie-wikimedia/experimental [15:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:57] Does the imagelinks table know where the images are? Or is there some kind of fall-through search like "If we can't find the file locally then check commons, then check etc." [15:58:26] !log updated facts on both CI puppet-compilers [15:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:41] andrewbogott: there is a globalimagelinks table [15:59:06] thedj: ok, I'll look at that. Thanks [15:59:19] https://github.com/wikimedia/mediawiki-extensions-GlobalUsage [15:59:31] https://www.mediawiki.org/wiki/Extension:GlobalUsage [16:00:04] godog, moritzm, and _joe_: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet SWAT(Max 8 patches) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180315T1600). [16:00:04] marlier: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:07] thedj: wait, does mediawiki actually use that to find the source of a given image? Or is that just available for external consumers? [16:00:16] no, it's for reverse lookups [16:00:32] (03CR) 10Rush: [C: 032] openstack: typo in template path for ml2 [puppet] - 10https://gerrit.wikimedia.org/r/419765 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [16:00:36] i'm here [16:00:53] for actual use, it does a lookup, that searches through the multiple defined filerepo's (of the config) [16:00:53] thedj: sorry, I don't know which way is 'forward' and which is 'reverse' in this case [16:01:12] thedj: ok, so then, what I said before "If we can't find the file locally then check commons, then check etc." is correct [16:01:17] correct [16:01:20] (that's good news for me, I think) [16:01:24] excellent, thank you [16:01:41] (03PS2) 10Muehlenhoff: Add conftool::scripts to Prometheus servers [puppet] - 10https://gerrit.wikimedia.org/r/415328 [16:05:41] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4053461 (10jcrespo) [16:05:44] 10Operations, 10DBA, 10Patch-For-Review: Switchover m2 master from db1020 to db1051 - https://phabricator.wikimedia.org/T189656#4053455 (10jcrespo) 05Open>03Resolved a:03jcrespo This is technically done, not without issues, but not a lot of real actionable once those are fixed. We can prepare an incide... [16:07:43] andrewbogott: most of the time, we use wfFindFile, which according to the docs: [16:07:46] RepoGroup::singleton()->findFile(). 
Use RepoGroup::singleton()->getLocalRepo()->findFile() if you need to get files only from the local repository. [16:08:29] so by default, look all the defined repo's, until you find something, if you want just a local file, make sure you just search in the localrepo :) [16:09:17] (03PS3) 10Arturo Borrero Gonzalez: Toolforge:puppet-apt-pinning upgrading kubernetes version for PAWS [puppet] - 10https://gerrit.wikimedia.org/r/419599 (https://phabricator.wikimedia.org/T189680) (owner: 10Chico Venancio) [16:10:39] anyone mind if I add two patches to the current SWAT window? [16:11:31] PROBLEM - Check systemd state on labtestneutron2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:16:38] (03PS1) 10Filippo Giunchedi: utils: fetch puppet ca server from agent config [puppet] - 10https://gerrit.wikimedia.org/r/419767 (https://phabricator.wikimedia.org/T184562) [16:16:48] (03CR) 10Gilles: navtiming.py: Make sure to record country specific when oversampling (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [16:18:17] (03PS1) 10Marostegui: db-codfw.php: Depool db2050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419768 [16:18:59] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for smartd [puppet] - 10https://gerrit.wikimedia.org/r/419769 (https://phabricator.wikimedia.org/T135991) [16:19:37] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4053507 (10Cmjohnson) adding second ethernet cables labvirt1021 (eth3) ge-4/0/34 (added sfp-t to new switch) labvirt1022 (eth3) ge-8/0/23 [16:19:41] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419768 (owner: 10Marostegui) [16:20:06] (03PS6) 10Rduran: [WIP] Add port of osc_host.sh [software/wmfmariadbpy] - 
10https://gerrit.wikimedia.org/r/419725 [16:20:54] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419768 (owner: 10Marostegui) [16:21:27] (03PS1) 10Rush: openstack: pass network_flat_interface to ml2 setup [puppet] - 10https://gerrit.wikimedia.org/r/419771 (https://phabricator.wikimedia.org/T188266) [16:21:39] (03PS2) 10Rush: openstack: pass network_flat_interface to ml2 setup [puppet] - 10https://gerrit.wikimedia.org/r/419771 (https://phabricator.wikimedia.org/T188266) [16:22:19] (03CR) 10jerkins-bot: [V: 04-1] openstack: pass network_flat_interface to ml2 setup [puppet] - 10https://gerrit.wikimedia.org/r/419771 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [16:22:22] (03Abandoned) 10Filippo Giunchedi: utils: fetch puppet ca server from agent config [puppet] - 10https://gerrit.wikimedia.org/r/419767 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [16:22:38] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2050 for data checks (duration: 01m 15s) [16:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:40] 10Operations, 10ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#4053533 (10RobH) Looks like its acutally set to admin state down: ge-8/0/0 down down mc2036 I've gone ahead and re-enabled it, and ensured its in the proper (internal) vlan. 
[16:24:03] (03PS3) 10Rush: openstack: pass network_flat_interface to ml2 setup [puppet] - 10https://gerrit.wikimedia.org/r/419771 (https://phabricator.wikimedia.org/T188266) [16:24:20] (03PS4) 10Rush: openstack: pass network_flat_interface to ml2 setup [puppet] - 10https://gerrit.wikimedia.org/r/419771 (https://phabricator.wikimedia.org/T188266) [16:27:02] (03CR) 10Rush: "labtestneutron2001.codfw.wmnet,labtestcontrol2003.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/419771 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [16:28:01] !log ppchelko@tin Started deploy [restbase/deploy@8dbc93c]: Release lint and media endpoints [16:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:42] (03PS1) 10Filippo Giunchedi: Depool codfw puppetmaster [dns] - 10https://gerrit.wikimedia.org/r/419774 (https://phabricator.wikimedia.org/T184562) [16:30:28] (03PS1) 10Volans: Puppetboard: add dummy secret [labs/private] - 10https://gerrit.wikimedia.org/r/419775 [16:31:10] (03CR) 10Rush: [C: 032] openstack: pass network_flat_interface to ml2 setup [puppet] - 10https://gerrit.wikimedia.org/r/419771 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [16:33:17] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-apache-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419778 (https://phabricator.wikimedia.org/T135991) [16:33:48] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for prometheus-apache-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419778 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:34:02] (03PS1) 10Filippo Giunchedi: hiera: use puppet.codfw.wmnet alias for labtestpuppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/419781 (https://phabricator.wikimedia.org/T184562) [16:34:11] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-apache-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419778 
(https://phabricator.wikimedia.org/T135991) [16:37:22] 10Operations, 10Ops-Access-Requests: Requesting deployment access for samwilson - https://phabricator.wikimedia.org/T189414#4053643 (10RobH) [16:37:42] 10Operations, 10Ops-Access-Requests, 10Ops-Access-Reviews, 10Patch-For-Review: Requesting access to terbium.eqiad.wmnet for bmansurov - https://phabricator.wikimedia.org/T189285#4053655 (10RobH) [16:37:58] (03PS1) 10Arturo Borrero Gonzalez: openstack: enable audit check for kernel versions [puppet] - 10https://gerrit.wikimedia.org/r/419783 (https://phabricator.wikimedia.org/T188266) [16:38:58] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#4053660 (10RobH) Sorry for the delay in getting back to this. I got an answer back from @Mega... [16:39:14] (03PS3) 10Muehlenhoff: Drop use of experimental repository component for caches [puppet] - 10https://gerrit.wikimedia.org/r/415814 (https://phabricator.wikimedia.org/T188545) [16:40:05] (03PS2) 10RobH: new shell user katie lin [puppet] - 10https://gerrit.wikimedia.org/r/412745 (https://phabricator.wikimedia.org/T187623) [16:40:15] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/10465/" [puppet] - 10https://gerrit.wikimedia.org/r/419781 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [16:40:43] (03CR) 10EBernhardson: Add cirrussearch settings for wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [16:41:28] (03CR) 10RobH: [C: 032] new shell user katie lin [puppet] - 10https://gerrit.wikimedia.org/r/412745 (https://phabricator.wikimedia.org/T187623) (owner: 10RobH) [16:43:22] !log ppchelko@tin Finished deploy [restbase/deploy@8dbc93c]: Release lint and media endpoints (duration: 15m 22s) [16:43:27] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:10] (03PS2) 10RobH: adding shell user katie to groups [puppet] - 10https://gerrit.wikimedia.org/r/412747 (https://phabricator.wikimedia.org/T187623) [16:46:17] (03CR) 10RobH: [C: 032] adding shell user katie to groups [puppet] - 10https://gerrit.wikimedia.org/r/412747 (https://phabricator.wikimedia.org/T187623) (owner: 10RobH) [16:46:23] 10Operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#4053691 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff Long done, can be closed. [16:47:01] 10Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#4053696 (10RobH) a:05jrobell>03None [16:47:54] 10Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3981063 (10RobH) 05Open>03Resolved a:03RobH @katielin: Your shell access is now live. I'd give it about 30 m... [16:48:37] (03CR) 10Herron: [C: 031] Depool codfw puppetmaster [dns] - 10https://gerrit.wikimedia.org/r/419774 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [16:50:54] andrewbogott: when you get a chance, does https://gerrit.wikimedia.org/r/c/419781/ look good to you? 
[16:52:58] (03CR) 10RobH: [C: 031] elastic: decommission elastic1021 [puppet] - 10https://gerrit.wikimedia.org/r/419702 (https://phabricator.wikimedia.org/T189727) (owner: 10Gehel) [16:53:43] (03CR) 10Filippo Giunchedi: [C: 031] Puppetboard: add dummy secret [labs/private] - 10https://gerrit.wikimedia.org/r/419775 (owner: 10Volans) [16:54:14] (03CR) 10Volans: [V: 032 C: 032] Puppetboard: add dummy secret [labs/private] - 10https://gerrit.wikimedia.org/r/419775 (owner: 10Volans) [16:56:30] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4037747 (10ema) The most surprising thing about journald's rate limiting, in the context of the issue that triggered this task, is that by looking at pybal logs we... [16:58:27] (03PS1) 10Paladox: Change kind cache: short-circuit on root commits [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/419790 [16:58:36] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4053749 (10MoritzMuehlenhoff) >>! In T189290#4053740, @ema wrote: > It would have been much more useful to get such messages into `journalctl -u pybal.service`'s ou... [16:59:19] (03CR) 10Paladox: "I've cherry picked this from https://gerrit-review.googlesource.com/c/gerrit/+/166310" [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/419790 (owner: 10Paladox) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: Dear deployers, time to do the Services – Graphoid / Parsoid / Citoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180315T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. 
[17:00:47] no parsoid deploy today [17:01:04] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4053757 (10Vgutierrez) yup... or at least a warning on `systemctl status pybal` similar to the one that you get when the journal has been rotated [17:02:13] (03PS2) 10Arturo Borrero Gonzalez: openstack: enable audit check for kernel versions [puppet] - 10https://gerrit.wikimedia.org/r/419783 (https://phabricator.wikimedia.org/T188266) [17:02:52] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: enable audit check for kernel versions [puppet] - 10https://gerrit.wikimedia.org/r/419783 (https://phabricator.wikimedia.org/T188266) (owner: 10Arturo Borrero Gonzalez) [17:02:54] (03PS2) 10Volans: Puppetboard: add role, profile and configuration [puppet] - 10https://gerrit.wikimedia.org/r/419762 (https://phabricator.wikimedia.org/T184563) [17:02:56] (03PS2) 10Volans: Puppetboard: add varnish director entries [puppet] - 10https://gerrit.wikimedia.org/r/419763 (https://phabricator.wikimedia.org/T184563) [17:03:46] (03CR) 10Smalyshev: [C: 031] wdqs: add pigz package [puppet] - 10https://gerrit.wikimedia.org/r/419707 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [17:06:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473#4053781 (10RobH) So all of these hosts were on the eqiad spare tracking, but need to be decommissioned: Asset Tag Hostname WMF3129 wmf3129 WMF3248 old ms... 
[17:07:29] (03PS1) 10Filippo Giunchedi: install_server: use stretch for puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/419794 (https://phabricator.wikimedia.org/T184562) [17:07:31] (03PS1) 10Filippo Giunchedi: cache: depool puppetmaster2001 from config-master.w.o [puppet] - 10https://gerrit.wikimedia.org/r/419795 (https://phabricator.wikimedia.org/T184562) [17:10:36] (03PS2) 10Andrew Bogott: hiera: use puppet.codfw.wmnet alias for labtestpuppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/419781 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [17:10:38] (03PS1) 10Volans: Add keyholder dummy keys for puppetboard [labs/private] - 10https://gerrit.wikimedia.org/r/419796 (https://phabricator.wikimedia.org/T184563) [17:11:07] (03CR) 10Andrew Bogott: [C: 032] hiera: use puppet.codfw.wmnet alias for labtestpuppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/419781 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [17:14:24] (03CR) 10Volans: [V: 032 C: 032] Add keyholder dummy keys for puppetboard [labs/private] - 10https://gerrit.wikimedia.org/r/419796 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [17:18:42] !log installing dbus updates from stretch 9.4 point release [17:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:11] (03CR) 10Volans: "Compiler results available at:" [puppet] - 10https://gerrit.wikimedia.org/r/419762 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [17:20:40] !log bsitzmann@tin Started deploy [mobileapps/deploy@97d9085]: Update mobileapps to c5e1522 (T184327) [17:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:46] T184327: Add section header to references output - https://phabricator.wikimedia.org/T184327 [17:20:50] godog: I merged that change but it broke my certs and now I'm struggling to sort things out. Want to have a look? 
[17:21:12] (03CR) 10Volans: [C: 04-2] "This should not be merged until puppetboard is properly installed, configured and tested." [puppet] - 10https://gerrit.wikimedia.org/r/419763 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [17:23:09] jouncebot: next [17:23:10] In 0 hour(s) and 36 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180315T1800) [17:24:28] (03PS1) 10Volans: Add missing reserved LVS IPs comments [dns] - 10https://gerrit.wikimedia.org/r/419799 [17:24:30] (03PS1) 10Volans: Add puppetboard.wikimedia.org entry [dns] - 10https://gerrit.wikimedia.org/r/419800 [17:25:34] (03CR) 10Volans: [C: 04-2] "To not be merged until I9819bb8080121b9b7ff741b4e28b88f9688d6cce is merged." [dns] - 10https://gerrit.wikimedia.org/r/419800 (owner: 10Volans) [17:25:54] !log ppchelko@tin Started deploy [changeprop/deploy@9f4f380]: Purge media endpoint and update sources [17:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:06] (03CR) 10Volans: "While making a different patch I found those missing comments." 
[dns] - 10https://gerrit.wikimedia.org/r/419799 (owner: 10Volans) [17:26:10] (03PS2) 10Filippo Giunchedi: Depool codfw puppetmaster [dns] - 10https://gerrit.wikimedia.org/r/419774 (https://phabricator.wikimedia.org/T184562) [17:26:12] (03PS1) 10Filippo Giunchedi: lower TTL for puppetmaster-related CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/419802 (https://phabricator.wikimedia.org/T184562) [17:26:19] !log bsitzmann@tin Finished deploy [mobileapps/deploy@97d9085]: Update mobileapps to c5e1522 (T184327) (duration: 05m 38s) [17:26:22] (03CR) 10jerkins-bot: [V: 04-1] Depool codfw puppetmaster [dns] - 10https://gerrit.wikimedia.org/r/419774 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [17:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:24] T184327: Add section header to references output - https://phabricator.wikimedia.org/T184327 [17:26:27] andrewbogott: sure, labtestpuppetmaster2001 ? [17:26:33] yes [17:27:13] andrewbogott: what certs/errors are you seeing? [17:27:18] !log ppchelko@tin Finished deploy [changeprop/deploy@9f4f380]: Purge media endpoint and update sources (duration: 01m 23s) [17:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:23] godog: all of them :) [17:27:36] Specifically, the cert request never appears on puppetmaster2001 [17:27:38] so I can't sign it [17:27:43] but it complains about mismatches nonetheless [17:27:51] 10Operations, 10ops-eqiad: setup backup1001.eqiad.wmnet - https://phabricator.wikimedia.org/T189801#4053902 (10RobH) p:05Triage>03Normal [17:27:58] (03PS2) 10Volans: Add puppetboard.wikimedia.org entry [dns] - 10https://gerrit.wikimedia.org/r/419800 (https://phabricator.wikimedia.org/T184563) [17:29:12] andrewbogott: I wasn't expecting that change to require a cert request, is a puppet run on labtestpuppetmaster2001 what fails? 
[17:29:22] yes [17:29:48] (03PS5) 10Imarlier: navtiming.py: Make sure to record country specific when oversampling [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) [17:30:06] andrewbogott: ok I'm looking [17:30:12] thanks [17:31:02] (03CR) 10Imarlier: navtiming.py: Make sure to record country specific when oversampling (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [17:32:22] (03CR) 10Imarlier: "Also added a test that will switch on the new logic." [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [17:36:00] PROBLEM - puppet last run on labtestvirt2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:39:03] andrewbogott: ah I see what the problem is, we can use puppet or the hostname but not puppet.$site.wmnet because it isn't in the SAN. I see the patch originally was to test a puppet v4 master, would it be ok to point to eqiad (i.e. just "puppet") as the master since that's v4 now? [17:39:25] godog: I think that's fine [17:41:02] 10Operations, 10ops-eqiad: WMF4727 hardware issue - disks dont detect in installer - https://phabricator.wikimedia.org/T189804#4053963 (10RobH) p:05Triage>03Normal [17:41:06] andrewbogott: ok, and you removed the cert for labtestpuppetmaster2001 from one/some puppetmaster* ? 
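godog's diagnosis above (the agent may address the master as "puppet" or by hostname, but not as puppet.$site.wmnet, because that name is not in the certificate's SAN) comes down to a name-matching check. The sketch below is an illustration under stated assumptions and not Puppet's own verification code: `name_matches_san` is a hypothetical helper, and the SAN list in the usage note is a guess at what such a certificate might contain.

```python
from typing import Iterable


def name_matches_san(server_name: str, san_entries: Iterable[str]) -> bool:
    """Return True if server_name is covered by one of the DNS SAN entries."""
    name = server_name.lower().rstrip(".")
    for entry in san_entries:
        pattern = entry.lower().rstrip(".")
        if pattern == name:
            return True
        # A leading "*." wildcard covers exactly one leftmost label.
        if pattern.startswith("*."):
            _, _, rest = name.partition(".")
            if rest and rest == pattern[2:]:
                return True
    return False
```

With an assumed SAN list of `["puppet", "puppetmaster1001.eqiad.wmnet"]`, the name `puppet` is accepted while `puppet.codfw.wmnet` is rejected, which matches the behaviour described in the conversation.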
[17:41:22] godog: everywhere that I could find it :) [17:41:59] ok I'll add it back [17:42:53] (03PS1) 10Bstorm: tools-static: handle redirects at the proxy level [puppet] - 10https://gerrit.wikimedia.org/r/419807 (https://phabricator.wikimedia.org/T189761) [17:44:35] (03PS1) 10Filippo Giunchedi: hieradata: use default puppet master for labtestpuppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/419808 [17:44:37] (03PS4) 10Gehel: elastic: decommission elastic1021 [puppet] - 10https://gerrit.wikimedia.org/r/419702 (https://phabricator.wikimedia.org/T189727) [17:47:42] (03CR) 10Gehel: [C: 032] elastic: decommission elastic1021 [puppet] - 10https://gerrit.wikimedia.org/r/419702 (https://phabricator.wikimedia.org/T189727) (owner: 10Gehel) [17:47:51] (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/compiler02/10469/labtestpuppetmaster2001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/419808 (owner: 10Filippo Giunchedi) [17:47:58] (03PS2) 10Filippo Giunchedi: hieradata: use default puppet master for labtestpuppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/419808 [17:48:10] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] hieradata: use default puppet master for labtestpuppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/419808 (owner: 10Filippo Giunchedi) [17:49:04] (03PS2) 10Gehel: wdqs: add pigz package [puppet] - 10https://gerrit.wikimedia.org/r/419707 (https://phabricator.wikimedia.org/T187766) [17:52:33] (03CR) 10Gehel: [C: 032] wdqs: add pigz package [puppet] - 10https://gerrit.wikimedia.org/r/419707 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [17:52:57] (03PS2) 10Gehel: base: add pigz to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/419710 [17:53:47] (03Abandoned) 10Gehel: base: add pigz to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/419710 (owner: 10Gehel) [17:53:57] (03PS5) 10Gehel: Install parallel gzip (pigz) and parallel xz (pxz) on all servers [puppet] - 
10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) [17:54:16] andrewbogott: mhh ok I think also part of the problem is that puppet agent is configured with "server" pointing to production, but "ca_server" is pointing to labtestpuppetmaster itself [17:54:54] andrewbogott: so labtestpuppetmaster should it be self-hosted or act as another production puppet agent ? [17:55:01] godog: That rings a bell, I think that things are maybe not set up to allow for ca_server to vary between the agent and server section [17:55:15] labtestpuppetmaster is a puppet master, but not its own puppetmaster [17:56:11] godog: history here https://phabricator.wikimedia.org/T176437 [17:56:25] (03PS1) 10Rush: openstack: ml2 for neutron server require package [puppet] - 10https://gerrit.wikimedia.org/r/419809 (https://phabricator.wikimedia.org/T188266) [17:58:00] (03PS6) 10Gehel: Install parallel gzip (pigz) and parallel xz (pxz) on all servers [puppet] - 10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) [17:58:05] (03PS2) 10Rush: openstack: ml2 for neutron server require package [puppet] - 10https://gerrit.wikimedia.org/r/419809 (https://phabricator.wikimedia.org/T188266) [17:58:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, and 2 others: decommission elastic1021 - https://phabricator.wikimedia.org/T189727#4054047 (10RobH) [17:58:44] andrewbogott: ok I replaced the ca in /var/lib/puppet/ssl/certs/ca.pem with the production ca and that seems to have done the trick [17:59:10] andrewbogott: so labtestpuppetmaster is now a regular client, and its certificate is accepted by puppetmaster1001 [17:59:12] godog: is that going to affect clients using that master? [17:59:27] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, and 2 others: decommission elastic1021 - https://phabricator.wikimedia.org/T189727#4051243 (10RobH) Since the host is down, I cannot power it on and disable puppet. I have disabled the switch port, so if it does power on it will be fine. 
[17:59:29] (03CR) 10Rush: [C: 032] openstack: ml2 for neutron server require package [puppet] - 10https://gerrit.wikimedia.org/r/419809 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [17:59:32] I don't think so, but I'm not sure [18:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Morning SWAT (Max 8 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180315T1800). [18:00:05] MaxSem and mdholloway: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:11] o/ [18:00:15] I'll do it [18:01:53] (03PS3) 10MaxSem: Undeploy the disabled ArticleCreationWorkflow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419492 (https://phabricator.wikimedia.org/T186570) [18:01:53] andrewbogott: I'm not sure where the previous ca.pem came from, I moved it out of the way as ca.pem_ [18:02:22] godog: ok, thanks [18:02:24] (03CR) 10MaxSem: [C: 032] Undeploy the disabled ArticleCreationWorkflow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419492 (https://phabricator.wikimedia.org/T186570) (owner: 10MaxSem) [18:04:14] (03Merged) 10jenkins-bot: Undeploy the disabled ArticleCreationWorkflow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419492 (https://phabricator.wikimedia.org/T186570) (owner: 10MaxSem) [18:04:31] (03PS3) 10Filippo Giunchedi: Depool codfw puppetmaster [dns] - 10https://gerrit.wikimedia.org/r/419774 (https://phabricator.wikimedia.org/T184562) [18:08:14] * mdholloway shakes his fist at T189329 [18:09:46] !log maxsem@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/419492/ (duration: 01m 15s) [18:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:43] !log 
maxsem@tin Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/419492/ (duration: 01m 16s) [18:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:21] MaxSem: Lemme know if you want me to test the pingzzz. [18:12:30] (03PS1) 10Andrew Bogott: get_images.py: added all_used which pulls images from commons [wikitech-static] - 10https://gerrit.wikimedia.org/r/419810 [18:13:04] œf cóürśę [18:13:24] (03PS2) 10Andrew Bogott: get_images.py: added all_used which pulls images from commons [wikitech-static] - 10https://gerrit.wikimedia.org/r/419810 [18:13:34] (03PS2) 10MaxSem: Enable ping from edit summary in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417329 (https://phabricator.wikimedia.org/T188469) [18:13:38] (03CR) 10MaxSem: [C: 032] Enable ping from edit summary in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417329 (https://phabricator.wikimedia.org/T188469) (owner: 10MaxSem) [18:14:53] (03Merged) 10jenkins-bot: Enable ping from edit summary in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417329 (https://phabricator.wikimedia.org/T188469) (owner: 10MaxSem) [18:15:25] (03CR) 10Andrew Bogott: [V: 032 C: 032] Script and config to sync images from wikitech [wikitech-static] - 10https://gerrit.wikimedia.org/r/419760 (owner: 10Andrew Bogott) [18:15:34] (03CR) 10Andrew Bogott: [V: 032 C: 032] get_images.py: added all_used which pulls images from commons [wikitech-static] - 10https://gerrit.wikimedia.org/r/419810 (owner: 10Andrew Bogott) [18:15:56] (03PS1) 10RobH: decom elastic1021 [puppet] - 10https://gerrit.wikimedia.org/r/419811 (https://phabricator.wikimedia.org/T189727) [18:16:40] (03CR) 10RobH: [C: 032] decom elastic1021 [puppet] - 10https://gerrit.wikimedia.org/r/419811 (https://phabricator.wikimedia.org/T189727) (owner: 10RobH) [18:18:15] Niharika: on mwdebug1002 [18:18:46] Ack. 
[18:18:46] (03PS1) 10RobH: decom elastic1021 prod dns [dns] - 10https://gerrit.wikimedia.org/r/419813 (https://phabricator.wikimedia.org/T189727) [18:19:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, 10Discovery-Search (Current work): decommission elastic1021 - https://phabricator.wikimedia.org/T189727#4054121 (10RobH) [18:20:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, 10Discovery-Search (Current work): decommission elastic1021 - https://phabricator.wikimedia.org/T189727#4051243 (10RobH) a:05RobH>03Cmjohnson This is now ready for disk wipe and unracking. Please note that since the system won't power on, you... [18:22:28] (03PS1) 10Andrew Bogott: Added wikitech-static mediawiki config [wikitech-static] - 10https://gerrit.wikimedia.org/r/419814 [18:22:30] (03PS1) 10Andrew Bogott: Added apache vhosts [wikitech-static] - 10https://gerrit.wikimedia.org/r/419815 [18:22:41] (03CR) 10Andrew Bogott: [V: 032 C: 032] Added wikitech-static mediawiki config [wikitech-static] - 10https://gerrit.wikimedia.org/r/419814 (owner: 10Andrew Bogott) [18:22:51] (03CR) 10Andrew Bogott: [V: 032 C: 032] Added apache vhosts [wikitech-static] - 10https://gerrit.wikimedia.org/r/419815 (owner: 10Andrew Bogott) [18:23:18] MaxSem: Works! 
[18:24:41] (03CR) 10DCausse: Add cirrussearch settings for wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [18:25:05] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/417329/ (duration: 01m 15s) [18:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:18] (03PS3) 10DCausse: Add cirrussearch settings for wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) [18:25:25] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10hardware-requests, 10netops: unrack/decom pfw1-codfw and pfw2-codfw - https://phabricator.wikimedia.org/T176427#4054146 (10RobH) 05Open>03Resolved [18:25:29] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: codfw: rack frack refresh equipment - https://phabricator.wikimedia.org/T169643#4054147 (10RobH) [18:25:41] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10hardware-requests, 10netops: unrack/decom pfw1-codfw and pfw2-codfw - https://phabricator.wikimedia.org/T176427#3624959 (10RobH) [18:25:50] (03PS1) 10Andrew Bogott: Added import-wikitech script, which manages data and images [wikitech-static] - 10https://gerrit.wikimedia.org/r/419816 [18:26:00] (03CR) 10Zhuyifei1999: tools-static: handle redirects at the proxy level (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419807 (https://phabricator.wikimedia.org/T189761) (owner: 10Bstorm) [18:26:12] (03CR) 10Andrew Bogott: [V: 032 C: 032] Added import-wikitech script, which manages data and images [wikitech-static] - 10https://gerrit.wikimedia.org/r/419816 (owner: 10Andrew Bogott) [18:28:02] (03PS1) 10Andrew Bogott: import-wikitech-sh: make executable [wikitech-static] - 10https://gerrit.wikimedia.org/r/419817 [18:28:13] (03CR) 10Andrew Bogott: [V: 032 C: 032] import-wikitech-sh: make executable [wikitech-static] - 
10https://gerrit.wikimedia.org/r/419817 (owner: 10Andrew Bogott) [18:28:17] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10hardware-requests, 10netops: unrack/decom pfw1-codfw and pfw2-codfw - https://phabricator.wikimedia.org/T176427#4054160 (10RobH) So we cannot actually blank the port description on the scs once its been set; I set them each to 'empty port #' with their... [18:30:33] (03PS1) 10Andrew Bogott: Move images to 'images' [wikitech-static] - 10https://gerrit.wikimedia.org/r/419818 [18:30:47] (03CR) 10Andrew Bogott: [V: 032 C: 032] Move images to 'images' [wikitech-static] - 10https://gerrit.wikimedia.org/r/419818 (owner: 10Andrew Bogott) [18:31:04] (03CR) 10Bstorm: tools-static: handle redirects at the proxy level (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419807 (https://phabricator.wikimedia.org/T189761) (owner: 10Bstorm) [18:32:11] (03PS1) 10Andrew Bogott: Tell mediawiki to look for local images in /srv/mediawiki/images [wikitech-static] - 10https://gerrit.wikimedia.org/r/419820 [18:32:19] (03CR) 10Andrew Bogott: [V: 032 C: 032] Tell mediawiki to look for local images in /srv/mediawiki/images [wikitech-static] - 10https://gerrit.wikimedia.org/r/419820 (owner: 10Andrew Bogott) [18:32:34] (03PS2) 10Bstorm: tools-static: handle redirects at the proxy level [puppet] - 10https://gerrit.wikimedia.org/r/419807 (https://phabricator.wikimedia.org/T189761) [18:33:34] * Niharika is taking over SWAT [18:33:44] mdholloway: Can I deploy all your patches in one go? [18:34:01] Niharika: yep! that should be fine [18:34:47] mdholloway: And can you test them on mwdebug1002? [18:34:56] They're not there yet. [18:34:59] Niharika: sure [18:35:01] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1264 bytes in 0.177 second response time [18:35:08] Okay, I'll let you know when they are. [18:35:17] Zuul being lazy as usual. 
[18:36:01] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 32387 bytes in 0.343 second response time [18:37:23] Niharika: Zuul is excellent training in the virtue of patience ;) [18:38:03] "Zuul being lazy as usual" sounds like one of those palindromes... [18:38:25] mdholloway: That's true. You can't curse it or punch it. You can constantly refresh the tab but it doesn't really help. [18:45:21] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4054235 (10awight) [18:48:53] mdholloway: The patches are on mwdebug1002. [18:49:05] Niharika: thanks, checking now! [18:50:13] Niharika: looks good! [18:50:24] thanks again for SWATting! [18:50:33] mdholloway: No problem. Syncing now. [18:51:38] !log niharika29@tin Synchronized php-1.31.0-wmf.25/extensions/MobileApp/: https://gerrit.wikimedia.org/r/#/c/419785/; https://gerrit.wikimedia.org/r/#/c/419784/; https://gerrit.wikimedia.org/r/#/c/419776/ (duration: 01m 14s) [18:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:50] mdholloway: Done! [18:52:03] Niharika: \o/ Thanks again! [19:00:05] no_justification: Your horoscope predicts another unfortunate MediaWiki train deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180315T1900). [19:00:05] No GERRIT patches in the queue for this window AFAICS. [19:00:25] (03PS1) 10Andrew Bogott: sync files with --verbose [wikitech-static] - 10https://gerrit.wikimedia.org/r/419823 [19:00:33] (03CR) 10Andrew Bogott: [V: 032 C: 032] sync files with --verbose [wikitech-static] - 10https://gerrit.wikimedia.org/r/419823 (owner: 10Andrew Bogott) [19:05:15] (03CR) 10Dzahn: [C: 031] "looks good. i would call it "make bmansurov a deployer". 
that's what it is, terbium is just a side-effect" [puppet] - 10https://gerrit.wikimedia.org/r/419387 (https://phabricator.wikimedia.org/T189285) (owner: 10Vgutierrez) [19:05:59] (03CR) 10Volans: [C: 031] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/419802 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [19:06:20] (03PS1) 10Andrew Bogott: config: change instructions to copy vs symlink [wikitech-static] - 10https://gerrit.wikimedia.org/r/419825 [19:06:30] (03CR) 10Dzahn: [C: 031] "if he wants to deploy of course.. otherwise restricted is enough" [puppet] - 10https://gerrit.wikimedia.org/r/419387 (https://phabricator.wikimedia.org/T189285) (owner: 10Vgutierrez) [19:07:38] (03PS2) 10Ottomata: Puppetization for newer SWAP (JupyterHub) [puppet] - 10https://gerrit.wikimedia.org/r/419656 (https://phabricator.wikimedia.org/T183145) [19:08:00] (03PS1) 10Chad: group1 to wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419826 [19:08:02] (03CR) 10Chad: [C: 032] group1 to wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419826 (owner: 10Chad) [19:08:14] (03CR) 10jerkins-bot: [V: 04-1] Puppetization for newer SWAP (JupyterHub) [puppet] - 10https://gerrit.wikimedia.org/r/419656 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [19:08:38] (03CR) 10Volans: [C: 031] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/419774 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [19:09:15] (03Merged) 10jenkins-bot: group1 to wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419826 (owner: 10Chad) [19:09:20] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/419795 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [19:09:45] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/419794 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [19:14:06] !log demon@tin rebuilt and synchronized wikiversions files: 
group2 to wmf.25 [19:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:35] (03PS3) 10Ottomata: Puppetization for newer SWAP (JupyterHub) [puppet] - 10https://gerrit.wikimedia.org/r/419656 (https://phabricator.wikimedia.org/T183145) [19:15:11] (03CR) 10jerkins-bot: [V: 04-1] Puppetization for newer SWAP (JupyterHub) [puppet] - 10https://gerrit.wikimedia.org/r/419656 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [19:15:54] (03PS4) 10Ottomata: Puppetization for newer SWAP (JupyterHub) [puppet] - 10https://gerrit.wikimedia.org/r/419656 (https://phabricator.wikimedia.org/T183145) [19:16:30] (03CR) 10jerkins-bot: [V: 04-1] Puppetization for newer SWAP (JupyterHub) [puppet] - 10https://gerrit.wikimedia.org/r/419656 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [19:17:21] (03PS2) 10Dzahn: mediawiki_singlenode: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/415508 [19:17:57] (03PS3) 10Bstorm: tools-static: handle redirects at the proxy level [puppet] - 10https://gerrit.wikimedia.org/r/419807 (https://phabricator.wikimedia.org/T189761) [19:22:25] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/10474/" [puppet] - 10https://gerrit.wikimedia.org/r/415508 (owner: 10Dzahn) [19:25:48] (03PS5) 10Ottomata: Puppetization for newer SWAP (JupyterHub) [puppet] - 10https://gerrit.wikimedia.org/r/419656 (https://phabricator.wikimedia.org/T183145) [19:26:24] (03CR) 10jerkins-bot: [V: 04-1] Puppetization for newer SWAP (JupyterHub) [puppet] - 10https://gerrit.wikimedia.org/r/419656 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [19:29:47] (03CR) 10Ottomata: "Woot, no op on 1001, expected on 1003,1004" [puppet] - 10https://gerrit.wikimedia.org/r/419656 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [19:29:53] (03PS6) 10Ottomata: Puppetization for newer SWAP (JupyterHub) [puppet] - 10https://gerrit.wikimedia.org/r/419656 
(https://phabricator.wikimedia.org/T183145) [19:29:57] (03CR) 10Ottomata: [V: 032 C: 032] Puppetization for newer SWAP (JupyterHub) [puppet] - 10https://gerrit.wikimedia.org/r/419656 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [19:38:41] (03PS4) 10Bstorm: tools-static: handle redirects at the proxy level [puppet] - 10https://gerrit.wikimedia.org/r/419807 (https://phabricator.wikimedia.org/T189761) [19:39:22] (03CR) 10Bstorm: [C: 032] tools-static: handle redirects at the proxy level [puppet] - 10https://gerrit.wikimedia.org/r/419807 (https://phabricator.wikimedia.org/T189761) (owner: 10Bstorm) [19:40:18] (03PS1) 10Rush: openstack: neutron change for agent registration [puppet] - 10https://gerrit.wikimedia.org/r/419833 (https://phabricator.wikimedia.org/T188266) [19:42:18] (03CR) 10Rush: "labtestcontrol2003.wikimedia.org,labtestvirt2003.codfw.wmnet,labtestneutron2001.codfw.wmnet,labvirt1001.eqiad.wmnet,labcontrol1001.wikimed" [puppet] - 10https://gerrit.wikimedia.org/r/419833 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [19:48:40] (03PS1) 10Ottomata: Use venv instead of jupyter-venv for user venv dirs [puppet] - 10https://gerrit.wikimedia.org/r/419835 (https://phabricator.wikimedia.org/T183145) [19:51:18] (03PS4) 10Dzahn: cassandra/icinga: make monitoring configurable, skip on dev [puppet] - 10https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050) [19:54:49] (03CR) 10Ottomata: [C: 032] Use venv instead of jupyter-venv for user venv dirs [puppet] - 10https://gerrit.wikimedia.org/r/419835 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [19:54:54] (03PS2) 10Ottomata: Use venv instead of jupyter-venv for user venv dirs [puppet] - 10https://gerrit.wikimedia.org/r/419835 (https://phabricator.wikimedia.org/T183145) [19:54:56] (03CR) 10Ottomata: [V: 032 C: 032] Use venv instead of jupyter-venv for user venv dirs [puppet] - 10https://gerrit.wikimedia.org/r/419835 
(https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [19:56:48] (03CR) 10Andrew Bogott: [V: 032 C: 032] config: change instructions to copy vs symlink [wikitech-static] - 10https://gerrit.wikimedia.org/r/419825 (owner: 10Andrew Bogott) [19:57:45] (03PS2) 10Rush: openstack: neutron change for agent registration [puppet] - 10https://gerrit.wikimedia.org/r/419833 (https://phabricator.wikimedia.org/T188266) [19:58:14] (03PS1) 10Andrew Bogott: import-wikitech: set proper ownership on /srv/mediawiki/images [wikitech-static] - 10https://gerrit.wikimedia.org/r/419838 [19:58:28] (03CR) 10Andrew Bogott: [V: 032 C: 032] import-wikitech: set proper ownership on /srv/mediawiki/images [wikitech-static] - 10https://gerrit.wikimedia.org/r/419838 (owner: 10Andrew Bogott) [20:03:13] (03PS1) 10Rush: rabbitmq: setup monitor manifest user via rabbitmq::user [puppet] - 10https://gerrit.wikimedia.org/r/419839 (https://phabricator.wikimedia.org/T188266) [20:05:30] !log disable puppet for cloud things for a safe rollout [20:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:20] (03PS3) 10Rush: openstack: neutron change for agent registration [puppet] - 10https://gerrit.wikimedia.org/r/419833 (https://phabricator.wikimedia.org/T188266) [20:10:16] (03CR) 10Rush: [C: 032] openstack: neutron change for agent registration [puppet] - 10https://gerrit.wikimedia.org/r/419833 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [20:10:53] ottomata: "Ottomata: Use venv instead of jupyter-venv for user venv dirs" merge? [20:12:10] ottomata: I'm merging your sleeper change on puppetmaster1001, I looked at the change and I'm not afraid of it. [20:12:31] oh oops [20:12:33] yes thanks [20:12:34] sorry [20:12:36] meant to merge that [20:12:38] great choice! 
[20:15:11] RECOVERY - Check systemd state on labtestneutron2002 is OK: OK - running: The system is fully operational [20:28:35] (03CR) 10Chad: [C: 032] robots.txt: Combine various NS_SPECIAL disallows [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411255 (owner: 10Chad) [20:29:51] (03Merged) 10jenkins-bot: robots.txt: Combine various NS_SPECIAL disallows [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411255 (owner: 10Chad) [20:30:51] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [20:31:54] (03PS1) 10Dereckson: Update production SSH key for dereckson [puppet] - 10https://gerrit.wikimedia.org/r/419844 [20:32:22] (03CR) 10Dereckson: [C: 031] Update production SSH key for dereckson [puppet] - 10https://gerrit.wikimedia.org/r/419844 (owner: 10Dereckson) [20:38:25] !log demon@tin Synchronized robots.txt: minor tidying (duration: 00m 58s) [20:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:04] (03PS1) 10Rush: openstack: neutron ml2 for labtestn virt [puppet] - 10https://gerrit.wikimedia.org/r/419847 (https://phabricator.wikimedia.org/T188266) [20:43:16] (03CR) 10Rush: [C: 032] openstack: neutron ml2 for labtestn virt [puppet] - 10https://gerrit.wikimedia.org/r/419847 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [20:44:49] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4054495 (10mmodell) >>! In T180628#4050430, @demon wrote: > I can't think of any compelling reason why it would be *required* on the masters...but I... 
[20:45:19] (03PS1) 10Andrew Bogott: get_images: don't wait after the last retry [wikitech-static] - 10https://gerrit.wikimedia.org/r/419848 [20:45:21] (03PS1) 10Andrew Bogott: Don't purge old thumbnails in get_images.py [wikitech-static] - 10https://gerrit.wikimedia.org/r/419849 [20:45:33] (03CR) 10Andrew Bogott: [V: 032 C: 032] get_images: don't wait after the last retry [wikitech-static] - 10https://gerrit.wikimedia.org/r/419848 (owner: 10Andrew Bogott) [20:45:40] (03CR) 10Andrew Bogott: [V: 032 C: 032] Don't purge old thumbnails in get_images.py [wikitech-static] - 10https://gerrit.wikimedia.org/r/419849 (owner: 10Andrew Bogott) [20:51:25] 10Operations, 10monitoring, 10Patch-For-Review: restbase: skip icinga monitoring if on "dev" machines - https://phabricator.wikimedia.org/T189050#4054532 (10Dzahn) p:05Triage>03Normal partially resolved [20:53:49] (03PS4) 10DCausse: Add cirrussearch settings for wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) [20:54:42] (03PS1) 10Andrew Bogott: get_images: look on commons for anything that fails to download [wikitech-static] - 10https://gerrit.wikimedia.org/r/419851 [20:54:50] (03CR) 10Andrew Bogott: [V: 032 C: 032] get_images: look on commons for anything that fails to download [wikitech-static] - 10https://gerrit.wikimedia.org/r/419851 (owner: 10Andrew Bogott) [20:57:18] (03PS1) 10Andrew Bogott: typo fix [wikitech-static] - 10https://gerrit.wikimedia.org/r/419853 [20:58:35] (03CR) 10Andrew Bogott: [V: 032 C: 032] typo fix [wikitech-static] - 10https://gerrit.wikimedia.org/r/419853 (owner: 10Andrew Bogott) [21:29:23] (03PS1) 10Ladsgroup: labs: Disable reading from term_search_key from wb_terms table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419859 (https://phabricator.wikimedia.org/T189776) [21:34:05] (03PS1) 10Rush: openstack: glance bootstrapping with debian image [puppet] - 10https://gerrit.wikimedia.org/r/419871 
(https://phabricator.wikimedia.org/T188266) [21:39:10] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:39:11] PROBLEM - puppet last run on ms-be1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:39:11] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:40:11] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:40:40] PROBLEM - puppet last run on dbproxy1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:41:11] PROBLEM - puppet last run on etcd1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:41:30] PROBLEM - puppet last run on ganeti1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:41:30] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:41:30] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:42:20] PROBLEM - puppet last run on db1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:42:26] puppetdb [21:43:11] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:43:20] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:43:20] PROBLEM - puppet last run on mw1234 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:44:11] PROBLEM - puppet last run on ms-fe1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:45:02] (03PS1) 10Andrew Bogott: set wgUploadPath [wikitech-static] - 10https://gerrit.wikimedia.org/r/419875 [21:45:10] (03CR) 10Andrew Bogott: [V: 032 C: 032] set wgUploadPath [wikitech-static] - 10https://gerrit.wikimedia.org/r/419875 (owner: 10Andrew Bogott) [21:50:45] (03Draft2) 10Reedy: Add COPYING [wikitech-static] - 10https://gerrit.wikimedia.org/r/419877 [21:51:32] (03CR) 10Ladsgroup: [C: 032] labs: Disable reading from term_search_key from wb_terms table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419859 (https://phabricator.wikimedia.org/T189776) (owner: 10Ladsgroup) [21:52:45] 10Operations, 10Cassandra, 10Services, 10hardware-requests, 10User-Eevans: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4054767 (10Eevans) p:05Triage>03Normal [21:52:56] (03Merged) 10jenkins-bot: labs: Disable reading from term_search_key from wb_terms table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419859 (https://phabricator.wikimedia.org/T189776) (owner: 10Ladsgroup) [21:53:51] tin is rebased ^ [22:04:11] (03PS3) 10MaxSem: wiki replicas: add GlobalPreferences to maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/403833 (https://phabricator.wikimedia.org/T184666) [22:07:20] RECOVERY - puppet last run on db1016 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [22:08:11] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [22:08:20] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [22:08:20] RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [22:09:04] (03CR) 10Andrew 
Bogott: [V: 032 C: 032] "thanks!" [wikitech-static] - 10https://gerrit.wikimedia.org/r/419877 (owner: 10Reedy) [22:09:10] RECOVERY - puppet last run on cp1071 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:09:11] RECOVERY - puppet last run on ms-fe1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:09:11] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:09:11] RECOVERY - puppet last run on ms-be1013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:10:41] RECOVERY - puppet last run on dbproxy1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:11:11] RECOVERY - puppet last run on etcd1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:11:30] RECOVERY - puppet last run on ganeti1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:11:30] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:11:30] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:15:11] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:29:45] no_justification you may want to have a look at https://phabricator.wikimedia.org/T189827 (does that get added to wmf.25 blockers?) 
[22:31:15] Yeah add to the blockers task plz [22:31:20] ok [22:31:37] I'll look a bit closer, just getting home from the store [22:33:30] done [22:35:42] AF seems a bit broken [22:35:42] https://phabricator.wikimedia.org/T189829 [22:40:20] We can easily roll it back to wmf.24 version [22:41:09] https://github.com/wikimedia/mediawiki-extensions-AbuseFilter/compare/wmf/1.31.0-wmf.24...wmf/1.31.0-wmf.25 [22:42:48] I think the blocking might be https://github.com/wikimedia/mediawiki-extensions-AbuseFilter/commit/2dd8d27c34f8fa583ad7d9a15a5994d18062e7d5 [22:43:22] should we revert it on the wmf branch? [22:43:35] If the problem is missing global... [22:44:00] Easy fix [22:44:07] Hmmm [22:44:29] making a patch [22:45:20] https://gerrit.wikimedia.org/r/419944 [22:49:40] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=55%) [22:54:33] (03CR) 10Bstorm: [C: 031] wiki replicas: add GlobalPreferences to maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/403833 (https://phabricator.wikimedia.org/T184666) (owner: 10MaxSem) [22:55:38] (03CR) 10Bstorm: [C: 032] wiki replicas: add GlobalPreferences to maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/403833 (https://phabricator.wikimedia.org/T184666) (owner: 10MaxSem) [22:58:52] !log reedy@tin Synchronized php-1.31.0-wmf.25/extensions/AbuseFilter/: add some missing globals (duration: 00m 58s) [22:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180315T2300). [23:00:04] ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:10] (03PS1) 10Madhuvishy: dumps: Open up nfs ports in labstore1006|7 to analytics_networks [puppet] - 10https://gerrit.wikimedia.org/r/419948 [23:01:14] i suppose if it's just me, i'll ship it [23:06:33] ebernhardson: http://www.humoar.com/if-it-fits-it-ships/ [23:10:04] (03PS2) 10Madhuvishy: dumps: Open up nfs ports in labstore1006|7 to analytics_networks [puppet] - 10https://gerrit.wikimedia.org/r/419948 [23:14:07] (03CR) 10Madhuvishy: [C: 032] dumps: Open up nfs ports in labstore1006|7 to analytics_networks [puppet] - 10https://gerrit.wikimedia.org/r/419948 (owner: 10Madhuvishy) [23:14:50] Reedy: Also a lot of "Notice: Undefined index: Q2189172817218912789217821873 in /srv/mediawiki/php-1.31.0-wmf.25/extensions/WikibaseQualityConstraints/src/Api/CachingResultsBuilder.php on line 242" [23:14:55] (Q number being random) [23:15:00] :( [23:15:38] Tons of Collection undefined indexes, but we've been playing whack a mole with that for years. [23:15:55] I filed a task for a couple of Collection bugs earlier this week [23:16:12] CollectionPageTemplate? https://phabricator.wikimedia.org/T189636 [23:17:11] ebernhardson: just fyi, there's a merged AbuseFilter change that's not deployed yet [23:18:13] Reedy: thanks for the heads up, i'll try not to accidentally full-scap :) [23:18:18] * ebernhardson has once or twice ... 
[23:18:24] You can deploy it, it's not a major issue [23:18:37] oh, jerkins said no [23:18:39] so it hasn't merged yet [23:20:42] !log ebernhardson@tin Synchronized php-1.31.0-wmf.25/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT: T187148: Turn off Cirrus AB test (duration: 00m 58s) [23:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:48] T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin - https://phabricator.wikimedia.org/T187148 [23:25:48] !log reedy@tin Synchronized php-1.31.0-wmf.25/extensions/AbuseFilter/: Fix display issues (duration: 00m 59s) [23:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:52] (03PS1) 10Madhuvishy: dumps: Add stat* to nfs exports [puppet] - 10https://gerrit.wikimedia.org/r/419952 (https://phabricator.wikimedia.org/T181431) [23:51:01] (03CR) 10Madhuvishy: [C: 032] dumps: Add stat* to nfs exports [puppet] - 10https://gerrit.wikimedia.org/r/419952 (https://phabricator.wikimedia.org/T181431) (owner: 10Madhuvishy) [23:55:38] no_justification: Filed a few more bugs with errors from the logs