[00:00:04] twentyafterfour: That opportune time is upon us again. Time for a Phabricator update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201001T0000).
[00:00:29] (CR) Krinkle: [C: +2] Reject ParserCache entries from the last wmf.11 deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/631318 (https://phabricator.wikimedia.org/T264257) (owner: Krinkle)
[00:01:06] (Merged) jenkins-bot: Reject ParserCache entries from the last wmf.11 deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/631318 (https://phabricator.wikimedia.org/T264257) (owner: Krinkle)
[00:03:37] (PS1) Krinkle: Reject ParserCache entries from the last wmf.11 deployment (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/631320 (https://phabricator.wikimedia.org/T263851)
[00:03:41] twentyafterfour: ^
[00:04:31] (CR) 20after4: [C: +2] Reject ParserCache entries from the last wmf.11 deployment (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/631320 (https://phabricator.wikimedia.org/T263851) (owner: Krinkle)
[00:09:23] (PS2) Krinkle: Reject ParserCache entries from the last wmf.11 deployment (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/631320 (https://phabricator.wikimedia.org/T263851)
[00:13:08] (PS1) 20after4: rolled back everything to 1.36.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/631323
[00:13:19] (CR) 20after4: [C: +2] rolled back everything to 1.36.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/631323 (owner: 20after4)
[00:13:44] (CR) Krinkle: [C: +2] Reject ParserCache entries from the last wmf.11 deployment (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/631320 (https://phabricator.wikimedia.org/T263851) (owner: Krinkle)
[00:13:57] (CR) 20after4: [C: +2] "For the record this happened a while ago and I forgot to push the change to gerrit, this is just getting things back in sync" [mediawiki-config] - https://gerrit.wikimedia.org/r/631323 (owner: 20after4)
[00:14:18] (Merged) jenkins-bot: rolled back everything to 1.36.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/631323 (owner: 20after4)
[00:14:34] Krinkle: merged
[00:14:49] Krinkle: I'll let you deploy the cache clearing patch?
[00:14:58] (Merged) jenkins-bot: Reject ParserCache entries from the last wmf.11 deployment (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/631320 (https://phabricator.wikimedia.org/T263851) (owner: Krinkle)
[00:15:01] ok
[00:34:23] Operations, Machine Learning Platform, ORES, Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (CDanis) Around 23:48 we got another user report in #wikimedia-ai of Redis exceptions when using the ORES API.
[00:43:32] (PS1) Krinkle: HACK/ParserCache: Forche cache-miss if mUsedOptions is undefined [core] (wmf/1.36.0-wmf.10) - https://gerrit.wikimedia.org/r/631324 (https://phabricator.wikimedia.org/T264257)
[00:48:36] (CR) DannyS712: HACK/ParserCache: Forche cache-miss if mUsedOptions is undefined (1 comment) [core] (wmf/1.36.0-wmf.10) - https://gerrit.wikimedia.org/r/631324 (https://phabricator.wikimedia.org/T264257) (owner: Krinkle)
[00:50:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:31] (CR) Krinkle: HACK/ParserCache: Forche cache-miss if mUsedOptions is undefined (1 comment) [core] (wmf/1.36.0-wmf.10) - https://gerrit.wikimedia.org/r/631324 (https://phabricator.wikimedia.org/T264257) (owner: Krinkle)
[00:51:36] (PS2) Krinkle: HACK/ParserCache: Force cache-miss if mUsedOptions is undefined [core] (wmf/1.36.0-wmf.10) - https://gerrit.wikimedia.org/r/631324 (https://phabricator.wikimedia.org/T264257)
[00:52:09] (CR) Krinkle: [V: +2 C: +2] HACK/ParserCache: Force cache-miss if mUsedOptions is undefined [core] (wmf/1.36.0-wmf.10) - https://gerrit.wikimedia.org/r/631324 (https://phabricator.wikimedia.org/T264257) (owner: Krinkle)
[00:52:25] (CR) Krinkle: "recheck" [core] (wmf/1.36.0-wmf.10) - https://gerrit.wikimedia.org/r/631324 (https://phabricator.wikimedia.org/T264257) (owner: Krinkle)
[00:53:20] (CR) Krinkle: "Will let Jenkins run async while we verify on mwdebug2001" [core] (wmf/1.36.0-wmf.10) - https://gerrit.wikimedia.org/r/631324 (https://phabricator.wikimedia.org/T264257) (owner: Krinkle)
[01:12:16] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: 1721d2aa0 - Reject ParserCache entries from the last wmf.11 deployment (duration: 05m 13s)
[01:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:38] !log krinkle@deploy1001 Synchronized php-1.36.0-wmf.10/includes/parser/: Ia3357b2f593c (duration: 00m 58s)
[01:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:02:07] (PS1) Ppchelko: Revert "Revert "Revert "Hard deprecate all public properties in CacheTime and ParserOutput""" [core] (wmf/1.36.0-wmf.11) - https://gerrit.wikimedia.org/r/631240 (https://phabricator.wikimedia.org/T264257)
[03:59:16] (PS1) Andrew Bogott: cloudvirt1017 to Buster and Ceph [puppet] - https://gerrit.wikimedia.org/r/631325 (https://phabricator.wikimedia.org/T259399)
[03:59:52] (PS2) Andrew Bogott: cloudvirt1017 to Buster and Ceph [puppet] - https://gerrit.wikimedia.org/r/631325 (https://phabricator.wikimedia.org/T259399)
[04:01:40] (PS3) Andrew Bogott: cloudvirt1017 to Buster and Ceph [puppet] - https://gerrit.wikimedia.org/r/631325 (https://phabricator.wikimedia.org/T259399)
[04:02:24] (CR) Andrew Bogott: [C: +2] cloudvirt1017 to Buster and Ceph [puppet] - https://gerrit.wikimedia.org/r/631325 (https://phabricator.wikimedia.org/T259399) (owner: Andrew Bogott)
[04:18:35] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[04:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:20:36] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[04:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:04] RECOVERY - Wikitech-static main page has content on cloudweb2001-dev is OK: HTTP OK: HTTP/1.1 200 OK - 25402 bytes in 0.253 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:00:04] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 25401 bytes in 0.245 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:01:12] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 25401 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:18:29] (PS1) Marostegui: mariadb: Repool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/631326
[05:18:45] (CR) Marostegui: [C: +2] mariadb: Repool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/631326 (owner: Marostegui)
[05:19:19] !log Repool labsdb1011
[05:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:29] Operations, ops-codfw, DBA, Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (Marostegui) ` root@es2026:~# megacli -PDRbld -ShowProg -physdrv[32:2] -aALL Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 62% in 875 Minutes. `
[05:29:28] !log Deploy schema change on s3 (testwikidatawiki) T264109
[05:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:33] T264109: Schema change to drop three indexes from wb_changes - https://phabricator.wikimedia.org/T264109
[05:33:43] (PS1) Marostegui: mariadb: Remove es2016 puppet entries [puppet] - https://gerrit.wikimedia.org/r/631327 (https://phabricator.wikimedia.org/T264156)
[05:34:33] (PS1) Marostegui: dns: Remove es2016 dns entries [dns] - https://gerrit.wikimedia.org/r/631328 (https://phabricator.wikimedia.org/T264156)
[05:35:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:13] (CR) Marostegui: [C: +2] mariadb: Remove es2016 puppet entries [puppet] - https://gerrit.wikimedia.org/r/631327 (https://phabricator.wikimedia.org/T264156) (owner: Marostegui)
[05:38:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[05:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:45] (CR) Marostegui: [C: +2] dns: Remove es2016 dns entries [dns] - https://gerrit.wikimedia.org/r/631328 (https://phabricator.wikimedia.org/T264156) (owner: Marostegui)
[05:40:06] Operations, ops-codfw, decommission-hardware: decommission es2016.codfw.wmnet - https://phabricator.wikimedia.org/T264156 (Marostegui)
[05:43:27] (PS1) Marostegui: es2011: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/631329 (https://phabricator.wikimedia.org/T264261)
[05:43:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2011 T264261', diff saved to https://phabricator.wikimedia.org/P12866 and previous config saved to /var/cache/conftool/dbconfig/20201001-054335-marostegui.json
[05:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:41] T264261: decommission es2011.codfw.wmnet - https://phabricator.wikimedia.org/T264261
[05:44:08] (CR) Marostegui: [C: +2] es2011: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/631329 (https://phabricator.wikimedia.org/T264261) (owner: Marostegui)
[05:45:51] !log Stop MySQL on es2011 T264261
[05:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:23] marostegui quick question about that task. Is `servee` in the description meant to be `server`? Or is `servee` intentional (https://en.wiktionary.org/wiki/servee)?
[05:48:05] DannyS712: about which task?
[05:48:11] https://phabricator.wikimedia.org/T264261
[05:49:02] DannyS712: servee - essentially assign it to the person on the DC where the server is located
[05:49:24] okay, wasn't sure if it was a typo or not. Thanks for confirming
[05:49:28] :)
[06:01:09] (PS1) Elukey: geoip::data::archive: run the systemd timer as analytics-privatedata [puppet] - https://gerrit.wikimedia.org/r/631330 (https://phabricator.wikimedia.org/T264152)
[06:05:38] (PS2) Elukey: geoip::data::archive: run the systemd timer as analytics-privatedata [puppet] - https://gerrit.wikimedia.org/r/631330 (https://phabricator.wikimedia.org/T264152)
[06:06:16] (CR) Elukey: [C: +2] geoip::data::archive: run the systemd timer as analytics-privatedata [puppet] - https://gerrit.wikimedia.org/r/631330 (https://phabricator.wikimedia.org/T264152) (owner: Elukey)
[06:08:39] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:14:49] (CR) Ayounsi: [C: +1] "IPs and syntax checked." [homer/public] - https://gerrit.wikimedia.org/r/631261 (https://phabricator.wikimedia.org/T252526) (owner: Dzahn)
[06:18:21] !log imported envoyproxy 1.15.1 to buster-wikimedia, stretch-wikimedia - T264157
[06:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:29] Operations, Machine Learning Platform, ORES, Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (Joe) The request hammering from okapi is continuing, we might need to ban the UA at the edge. @RBrounley_WMF can you ensure that the reque...
[06:25:41] Operations, ops-eqiad, Traffic, netops: lvs1016 enp5s0f0 interface errors - https://phabricator.wikimedia.org/T264227 (ayounsi) No errors on the switch side. `lang=bash lvs1016:~$ sudo ethtool -S enp5s0f0 | grep crc rx_crc_errors: 27387518 lvs1016:~$ sudo ethtool -S enp5s0f0 | grep crc...
[06:31:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Make es2033 master of es2 T261717', diff saved to https://phabricator.wikimedia.org/P12867 and previous config saved to /var/cache/conftool/dbconfig/20201001-063104-marostegui.json
[06:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:10] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:36:51] (PS1) JMeybohm: envoy: New upstream version 1.15.1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/631383 (https://phabricator.wikimedia.org/T264157)
[06:39:54] (PS1) JMeybohm: citoid: Update to envoy 1.15.1-1 [deployment-charts] - https://gerrit.wikimedia.org/r/631384 (https://phabricator.wikimedia.org/T264157)
[06:40:47] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[06:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:43] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[06:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:16] Operations, ops-eqiad, Analytics-Radar, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (elukey) Bootstrapped 12,15,16 - disk/partitions look good!
[07:03:28] Operations, Analytics-Radar, Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (MoritzMuehlenhoff) Open→Resolved Ack, closing this task.
[07:03:43] (PS1) Giuseppe Lavagetto: cache-text: add throttling for calls to ORES from the OKAPI [puppet] - https://gerrit.wikimedia.org/r/631385 (https://phabricator.wikimedia.org/T263910)
[07:10:04] (PS1) Muehlenhoff: Remove access for nathante, shiladsen [puppet] - https://gerrit.wikimedia.org/r/631387
[07:10:12] (CR) Volans: [C: -1] "There are some inaccuracies, details inline" (9 comments) [dns] - https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) (owner: CRusnov)
[07:11:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2083', diff saved to https://phabricator.wikimedia.org/P12869 and previous config saved to /var/cache/conftool/dbconfig/20201001-071155-marostegui.json
[07:11:56] (CR) Muehlenhoff: [C: +2] Remove access for nathante, shiladsen [puppet] - https://gerrit.wikimedia.org/r/631387 (owner: Muehlenhoff)
[07:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:23] !log restart hdfs namenodes on an-worker100[1,2] to pick up new hadoop workers settings
[07:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2083', diff saved to https://phabricator.wikimedia.org/P12870 and previous config saved to /var/cache/conftool/dbconfig/20201001-071241-marostegui.json
[07:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2086:3318', diff saved to https://phabricator.wikimedia.org/P12871 and previous config saved to /var/cache/conftool/dbconfig/20201001-071321-marostegui.json
[07:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2086:3318', diff saved to https://phabricator.wikimedia.org/P12872 and previous config saved to /var/cache/conftool/dbconfig/20201001-071347-marostegui.json
[07:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2091 ', diff saved to https://phabricator.wikimedia.org/P12873 and previous config saved to /var/cache/conftool/dbconfig/20201001-071413-marostegui.json
[07:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2091', diff saved to https://phabricator.wikimedia.org/P12874 and previous config saved to /var/cache/conftool/dbconfig/20201001-071442-marostegui.json
[07:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:52] Operations, DBA, Sustainability (Incident Followup), Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (akosiaris) >>! In T263842#6506360, @RhinosF1 wrote: > I just noticed the IR on wikitech says: >>duplicate key...
[07:17:07] Operations, Machine Learning Platform, ORES, Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (Joe) I've created the patch above, that can be used in case the okapi causes further problems to ORES. It's here as a stopgap measure if issu...
[07:22:32] Operations, SRE-swift-storage, User-fgiunchedi: ms-be2017 slower than the rest of the cluster while rebalancing - https://phabricator.wikimedia.org/T264270 (fgiunchedi)
[07:22:38] (CR) Giuseppe Lavagetto: [C: -1] "To be merged only in case problems arise again." [puppet] - https://gerrit.wikimedia.org/r/631385 (https://phabricator.wikimedia.org/T263910) (owner: Giuseppe Lavagetto)
[07:24:12] (CR) Ayounsi: Migrate ESAMS to Netbox Automation (3 comments) [dns] - https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) (owner: CRusnov)
[07:24:45] (CR) Filippo Giunchedi: [V: +2 C: +2] am: send resolved alerts [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/631162 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[07:26:29] Operations, Traffic, Patch-For-Review: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 (ema) Open→Resolved a:ema All production nodes are now running Varnish 6.0.6-1wm1. Closing!
[07:28:38] wow! --^
[07:45:49] (CR) Muehlenhoff: [C: +1] "LGTM. Given that the only remaining addon of bastion::pop over the generic bastions is the ipmi::mgmt class (which only installs a script " [puppet] - https://gerrit.wikimedia.org/r/629496 (https://phabricator.wikimedia.org/T252526) (owner: Dzahn)
[07:45:52] (PS6) Volans: Migrate ESAMS to Netbox Automation [dns] - https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) (owner: CRusnov)
[07:45:54] (CR) Volans: "addressed comments" (6 comments) [dns] - https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) (owner: CRusnov)
[07:49:09] (CR) Volans: [C: +1] "With the addressed comments LGTM" [dns] - https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) (owner: CRusnov)
[07:50:16] (PS1) Volans: scripts: dns, mark esams as migrated to Netbox [software/netbox-extras] - https://gerrit.wikimedia.org/r/631388 (https://phabricator.wikimedia.org/T258729)
[07:52:14] (PS1) Volans: Set esams as migrated to the DNS Netbox automation [cookbooks] - https://gerrit.wikimedia.org/r/631389 (https://phabricator.wikimedia.org/T258729)
[07:52:26] (PS2) Volans: scripts: dns, mark esams as migrated to Netbox [software/netbox-extras] - https://gerrit.wikimedia.org/r/631388 (https://phabricator.wikimedia.org/T258729)
[07:53:03] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[07:53:04] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:47] (CR) Alexandros Kosiaris: [C: +2] proton: remove conftool-data [puppet] - https://gerrit.wikimedia.org/r/627859 (https://phabricator.wikimedia.org/T255877) (owner: Giuseppe Lavagetto)
[07:54:53] Operations, Analytics, Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (elukey) I am not an expert in `perf` but I tried to do the following on cp5012: `sudo perf record -F 99 -p 29945 --call-graph dwarf sleep 10` (the pid is varnishkafka-webrequest) And I...
[07:56:25] (PS2) Alexandros Kosiaris: proton: remove the ganeti VMs from puppet [puppet] - https://gerrit.wikimedia.org/r/627860 (https://phabricator.wikimedia.org/T255877) (owner: Giuseppe Lavagetto)
[07:57:05] Operations, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (akosiaris)
[07:59:32] Operations, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (JMeybohm)
[08:03:14] (CR) Muehlenhoff: "The cumin aliases in modules/profile/templates/cumin/aliases.yaml can also go away." [puppet] - https://gerrit.wikimedia.org/r/627860 (https://phabricator.wikimedia.org/T255877) (owner: Giuseppe Lavagetto)
[08:03:49] (CR) Muehlenhoff: "nvm, that's in the subsequent patch already." [puppet] - https://gerrit.wikimedia.org/r/627860 (https://phabricator.wikimedia.org/T255877) (owner: Giuseppe Lavagetto)
[08:09:06] (PS1) Elukey: Add hadoop worker node role to an-worker1103 [puppet] - https://gerrit.wikimedia.org/r/631391
[08:09:30] (CR) jerkins-bot: [V: -1] Add hadoop worker node role to an-worker1103 [puppet] - https://gerrit.wikimedia.org/r/631391 (owner: Elukey)
[08:12:19] Operations, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (JMeybohm)
[08:13:03] (PS1) Elukey: Remove fake TLS keystore/truststores for hadoop clusters [labs/private] - https://gerrit.wikimedia.org/r/631392
[08:13:06] (PS1) Elukey: Add fake kerberos keytabs for new Hadoop workers [labs/private] - https://gerrit.wikimedia.org/r/631393
[08:13:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2109 ', diff saved to https://phabricator.wikimedia.org/P12875 and previous config saved to /var/cache/conftool/dbconfig/20201001-081308-marostegui.json
[08:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:41] Operations, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (JMeybohm)
[08:13:44] Operations, Data-Persistence-Backup, Epic, Goal: Plan WMF infrastructure for 100% coverage of data recovery - https://phabricator.wikimedia.org/T264272 (jcrespo)
[08:14:00] (CR) Alexandros Kosiaris: [C: +2] proton: remove the ganeti VMs from puppet [puppet] - https://gerrit.wikimedia.org/r/627860 (https://phabricator.wikimedia.org/T255877) (owner: Giuseppe Lavagetto)
[08:14:11] (PS2) Elukey: Add hadoop worker node role to an-worker1103 [puppet] - https://gerrit.wikimedia.org/r/631391 (https://phabricator.wikimedia.org/T255140)
[08:14:17] (CR) Elukey: [V: +2 C: +2] Remove fake TLS keystore/truststores for hadoop clusters [labs/private] - https://gerrit.wikimedia.org/r/631392 (owner: Elukey)
[08:14:49] (CR) Elukey: [V: +2 C: +2] Add fake kerberos keytabs for new Hadoop workers [labs/private] - https://gerrit.wikimedia.org/r/631393 (owner: Elukey)
[08:15:04] (PS2) Alexandros Kosiaris: proton: remove all puppet code, other references to the non-k8s service [puppet] - https://gerrit.wikimedia.org/r/627861 (https://phabricator.wikimedia.org/T255877) (owner: Giuseppe Lavagetto)
[08:16:35] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission
[08:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:44] Operations, Data-Persistence-Backup, Goal: Create a system/methodology to track WMF datasets and its current or planned procedure for recovery - https://phabricator.wikimedia.org/T264274 (jcrespo)
[08:20:44] Operations, SRE-swift-storage, User-fgiunchedi: ms-be2017 slower than the rest of the cluster while rebalancing - https://phabricator.wikimedia.org/T264270 (fgiunchedi) Also "tested" a reboot of ms-be2017 which had 400+ days of uptime, as expected that doesn't seem to have had a significant impact....
[08:21:18] Operations, Data-Persistence-Backup, Epic, Goal: Track all directly-owned SRE datasets into the new inventory system - https://phabricator.wikimedia.org/T264275 (jcrespo)
[08:21:40] Operations, Data-Persistence-Backup, Goal: Track all directly-owned SRE datasets into the new inventory system - https://phabricator.wikimedia.org/T264275 (jcrespo)
[08:22:43] Operations, Data-Persistence-Backup, SRE-swift-storage, Epic, Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (jcrespo)
[08:22:43] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[08:22:44] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission
[08:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:50] Operations, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: `proton1001.eqiad.wmnet` - proton1001.eqiad.wmnet (**...
[08:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:02] !log akosiaris@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97)
[08:25:02] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission
[08:25:04] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99)
[08:25:05] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission
[08:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:09] !log akosiaris@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97)
[08:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:21] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission
[08:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:05] volans: same issue as last time ^. https://phabricator.wikimedia.org/P12876, just so you aren't wondering
[08:27:46] (CR) Alexandros Kosiaris: [C: +2] proton: remove all puppet code, other references to the non-k8s service [puppet] - https://gerrit.wikimedia.org/r/627861 (https://phabricator.wikimedia.org/T255877) (owner: Giuseppe Lavagetto)
[08:27:57] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[08:27:57] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission
[08:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:04] Operations, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: `proton1002.eqiad.wmnet` - proton1002.eqiad.wmnet (**...
[08:29:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:30:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:30:14] Operations: Migrate puppetboard to Buster - https://phabricator.wikimedia.org/T264276 (MoritzMuehlenhoff)
[08:30:59] (PS3) JMeybohm: lvs: Remove mathoid non-TLS endpoint 1/2 [puppet] - https://gerrit.wikimedia.org/r/629328 (https://phabricator.wikimedia.org/T255875)
[08:33:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2109', diff saved to https://phabricator.wikimedia.org/P12877 and previous config saved to /var/cache/conftool/dbconfig/20201001-083321-marostegui.json
[08:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:41] (CR) JMeybohm: [C: +2] lvs: Remove mathoid non-TLS endpoint 1/2 [puppet] - https://gerrit.wikimedia.org/r/629328 (https://phabricator.wikimedia.org/T255875) (owner: JMeybohm)
[08:33:55] (PS1) Muehlenhoff: Enabled managed sources.list for all of production [puppet] - https://gerrit.wikimedia.org/r/631396 (https://phabricator.wikimedia.org/T158562)
[08:34:31] (PS1) Alexandros Kosiaris: decommission: Avoid matching some IPs in regexp [cookbooks] - https://gerrit.wikimedia.org/r/631397
[08:34:44] (PS6) JMeybohm: services: retire the ORES http endpoint (1/2) [puppet] - https://gerrit.wikimedia.org/r/628802 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[08:35:23] (CR) JMeybohm: [C: +2] services: retire the ORES http endpoint (1/2) [puppet] - https://gerrit.wikimedia.org/r/628802 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[08:35:39] (CR) jerkins-bot: [V: -1] decommission: Avoid matching some IPs in regexp [cookbooks] - https://gerrit.wikimedia.org/r/631397 (owner: Alexandros Kosiaris)
[08:36:16] (PS1) Alexandros Kosiaris: Remove proton{1,2}00{1,2} [dns] - https://gerrit.wikimedia.org/r/631398 (https://phabricator.wikimedia.org/T255877)
[08:37:30] (PS1) Vgutierrez: vcl: Use synthetic warning for 2% of ECDHE-ECDSA-AES128-SHA pageviews [puppet] - https://gerrit.wikimedia.org/r/631399 (https://phabricator.wikimedia.org/T258405)
[08:38:00] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[08:38:00] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission
[08:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:07] Operations, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: `proton2001.codfw.wmnet` - proton2001.codfw.wmnet (**...
[08:38:19] Operations, Machine Learning Platform, ORES, Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (Joe) I also noticed that the number of requests for `/v3/precache` has increased a lot around the time of the issue. This points to Change-Pr...
[08:40:06] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/25575/" [puppet] - https://gerrit.wikimedia.org/r/631391 (https://phabricator.wikimedia.org/T255140) (owner: Elukey)
[08:41:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:42:01] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/630661 (owner: Dzahn)
[08:43:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:44:44] Operations: Migrate puppetboard to Buster - https://phabricator.wikimedia.org/T264276 (Volans) @MoritzMuehlenhoff ping me when this work will start as we might want to upgrade puppetboard too. At the time I had to apply a couple of internal patches because not yet merged upstream. We should re-check the stat...
[08:46:40] (CR) Jbond: "See inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/631291 (owner: Dzahn)
[08:47:43] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/629439 (owner: Dzahn)
[08:49:52] (CR) Jbond: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/631261 (https://phabricator.wikimedia.org/T252526) (owner: Dzahn)
[08:50:05] Operations, SRE-swift-storage, User-fgiunchedi: ms-be2017 slower than the rest of the cluster while rebalancing - https://phabricator.wikimedia.org/T264270 (fgiunchedi) For comparison, the same strace running on ms-be2016's rsync (although this host has already finished the majority of the rebalancin...
[08:52:40] (CR) Jbond: [C: +1] "LGTM wonder if we should add this to cloud.yaml as well?" [puppet] - https://gerrit.wikimedia.org/r/631396 (https://phabricator.wikimedia.org/T158562) (owner: Muehlenhoff)
[08:53:13] (PS1) Muehlenhoff: Fix bug in detection of quarters in the past [puppet] - https://gerrit.wikimedia.org/r/631401
[08:53:35] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[08:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:41] Operations, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: `proton2002.codfw.wmnet` - proton2002.codfw.wmnet (**...
[08:53:43] (PS7) Jbond: profile::swift::proxy: pass swift parameters via profile namespace [puppet] - https://gerrit.wikimedia.org/r/631159
[08:53:55] (PS3) Jbond: swift: move swift parameters to the profile namespace [puppet] - https://gerrit.wikimedia.org/r/631169
[08:55:28] (CR) Jbond: [C: +2] profile::swift::proxy: pass swift parameters via profile namespace [puppet] - https://gerrit.wikimedia.org/r/631159 (owner: Jbond)
[08:55:54] !log adding buster host restbase1028-b to cassandra
[08:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:24] (PS2) Muehlenhoff: Fix bug in detection of quarters in the past [puppet] - https://gerrit.wikimedia.org/r/631401
[08:58:01] (CR) Jbond: [C: +2] swift: move swift parameters to the profile namespace [puppet] - https://gerrit.wikimedia.org/r/631169 (owner: Jbond)
[09:02:18] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/631396 (https://phabricator.wikimedia.org/T158562) (owner: Muehlenhoff)
[09:04:54] (CR) Hnowlan: [C: +1] citoid: Update to envoy 1.15.1-1 [deployment-charts] - https://gerrit.wikimedia.org/r/631384 (https://phabricator.wikimedia.org/T264157) (owner:
10JMeybohm) [09:05:03] 10Operations, 10Data-Persistence-Backup, 10Goal: Define a methodology to track WMF services backup requirements - https://phabricator.wikimedia.org/T264274 (10jcrespo) [09:05:49] (03CR) 10Hnowlan: [C: 03+1] envoy: New upstream version 1.15.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/631383 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [09:09:15] (03PS1) 10ArielGlenn: allow deployment-prep scap pull to work from instances with new dns names [puppet] - 10https://gerrit.wikimedia.org/r/631404 (https://phabricator.wikimedia.org/T245402) [09:12:06] (03PS1) 10JMeybohm: lvs: Remove wikifeeds non-TLS endpoint 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/631405 (https://phabricator.wikimedia.org/T255878) [09:12:09] (03PS1) 10JMeybohm: lvs: Remove wikifeeds non-TLS endpoint 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/631406 (https://phabricator.wikimedia.org/T255878) [09:12:15] (03CR) 10ArielGlenn: [C: 03+2] allow deployment-prep scap pull to work from instances with new dns names [puppet] - 10https://gerrit.wikimedia.org/r/631404 (https://phabricator.wikimedia.org/T245402) (owner: 10ArielGlenn) [09:40:53] 10Operations: Migrate puppetboard to Buster - https://phabricator.wikimedia.org/T264276 (10jbond) >>! In T264276#6507984, @Volans wrote: > @MoritzMuehlenhoff ping me when this work will start as we might want to upgrade puppetboard too. At the time I had to apply a couple of internal patches because not yet merg... [09:46:27] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10observability, 10serviceops: illegal_argument_exception - https://phabricator.wikimedia.org/T262429 (10MSantos) 05Open→03Resolved a:03MSantos I'm going to be bold and close this task because it looks resolved and we... [09:48:12] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: valid ROAs alert. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [09:48:25] (03PS3) 10Muehlenhoff: Fix bug in detection of quarters in the past [puppet] - 10https://gerrit.wikimedia.org/r/631401 [09:50:04] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10Services, 10Service-deployment-requests: New Service Request: Wikimedia push notification service - https://phabricator.wikimedia.org/T250452 (10Jgiannelos) 05Open→03Resolved a:03Jgiannelos [09:50:12] 10Operations, 10Maps: Migrate maps to Buster - https://phabricator.wikimedia.org/T264292 (10MoritzMuehlenhoff) [09:52:19] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: bootstrap repo for thirdparty/kubedam-k8s-1-17 [puppet] - 10https://gerrit.wikimedia.org/r/631410 (https://phabricator.wikimedia.org/T263284) [09:56:24] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [09:57:18] (03CR) 10Vgutierrez: trafficserver: replace hiera() with lookup() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631291 (owner: 10Dzahn) [10:00:04] mvolz: My dear minions, it's time we take the moon! Just kidding. Time for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201001T1000). [10:01:03] (03PS4) 10Muehlenhoff: Fix bug in detection of quarters in the past [puppet] - 10https://gerrit.wikimedia.org/r/631401 [10:01:45] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/631410 (https://phabricator.wikimedia.org/T263284) (owner: 10Arturo Borrero Gonzalez) [10:04:28] (03CR) 10Muehlenhoff: "Yeah, when this is applied to all of prod, I'll coordinate whether to apply this for cloud as well. 
I'd expect that some people might have" [puppet] - 10https://gerrit.wikimedia.org/r/631396 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff) [10:07:32] (03PS2) 10Muehlenhoff: Enabled managed sources.list for all of production [puppet] - 10https://gerrit.wikimedia.org/r/631396 (https://phabricator.wikimedia.org/T158562) [10:08:04] 10Operations, 10Maps: Migrate maps to Buster - https://phabricator.wikimedia.org/T264292 (10Gehel) [10:08:06] 10Operations, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Gehel) [10:12:48] (03PS1) 10Kormat: dbutil: Drop get_wikis() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631412 [10:13:39] (03CR) 10jerkins-bot: [V: 04-1] dbutil: Drop get_wikis() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631412 (owner: 10Kormat) [10:15:00] (03PS2) 10Kormat: dbutil: Drop get_wikis() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631412 [10:15:50] (03CR) 10Muehlenhoff: [C: 03+2] Enabled managed sources.list for all of production [puppet] - 10https://gerrit.wikimedia.org/r/631396 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff) [10:18:48] (03CR) 10Kormat: [C: 03+2] dbutil: Drop get_wikis() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631412 (owner: 10Kormat) [10:19:39] (03Merged) 10jenkins-bot: dbutil: Drop get_wikis() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631412 (owner: 10Kormat) [10:20:53] (03CR) 10Jbond: trafficserver: replace hiera() with lookup() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631291 (owner: 10Dzahn) [10:23:26] 10Operations, 10Maps, 10Wikimedia-Logstash, 10observability, and 3 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (10MSantos) a:03MSantos [10:24:52] (03PS1) 10Kormat: mypy: Move config to setup.cfg [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631415 [10:26:24] 
(03CR) 10Kormat: [C: 03+2] mypy: Move config to setup.cfg [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631415 (owner: 10Kormat) [10:27:15] (03Merged) 10jenkins-bot: mypy: Move config to setup.cfg [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631415 (owner: 10Kormat) [10:28:18] (03CR) 10Hnowlan: [C: 03+2] changeprop: lower log level to error [deployment-charts] - 10https://gerrit.wikimedia.org/r/631200 (https://phabricator.wikimedia.org/T264195) (owner: 10Hnowlan) [10:28:26] (03CR) 10Elukey: [C: 03+2] Add hadoop worker node role to an-worker1103 [puppet] - 10https://gerrit.wikimedia.org/r/631391 (https://phabricator.wikimedia.org/T255140) (owner: 10Elukey) [10:28:32] (03PS3) 10Elukey: Add hadoop worker node role to an-worker1103 [puppet] - 10https://gerrit.wikimedia.org/r/631391 (https://phabricator.wikimedia.org/T255140) [10:30:33] (03Merged) 10jenkins-bot: changeprop: lower log level to error [deployment-charts] - 10https://gerrit.wikimedia.org/r/631200 (https://phabricator.wikimedia.org/T264195) (owner: 10Hnowlan) [10:34:39] (03PS1) 10Kormat: dbutil: Pass mypy --strict [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631418 [10:36:33] (03CR) 10Muehlenhoff: [C: 03+2] Fix bug in detection of quarters in the past [puppet] - 10https://gerrit.wikimedia.org/r/631401 (owner: 10Muehlenhoff) [10:37:31] (03PS2) 10Kormat: dbutil: Pass mypy --strict [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631418 [10:39:42] (03CR) 10Kormat: [C: 03+2] dbutil: Pass mypy --strict [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631418 (owner: 10Kormat) [10:40:43] (03Merged) 10jenkins-bot: dbutil: Pass mypy --strict [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631418 (owner: 10Kormat) [10:40:54] (03CR) 10Kosta Harlan: [C: 04-1] "I generally don't favor target=blank; either the user wants to open in a new tab (in which case they probably know about control clicking " [puppet] - 10https://gerrit.wikimedia.org/r/631237 
(owner: 10Paladox) [10:47:03] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [10:47:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:16] !log kormat@cumin1001 dbctl commit (dc=all): 'db2119 depooling: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12878 and previous config saved to /var/cache/conftool/dbconfig/20201001-104716-kormat.json [10:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:21] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [10:58:34] jouncebot: next [10:58:34] In 0 hour(s) and 1 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201001T1100) [10:59:48] ACKNOWLEDGEMENT - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. ayounsi docker-reporter-releng-image: https://phabricator.wikimedia.org/T251918 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European mid-day backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201001T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:00:21] A: none? [11:00:50] ;) [11:02:55] 10Operations, 10Traffic, 10observability, 10User-fgiunchedi: Aggregated metrics for ats-tls <-> clients ttfb percentiles - https://phabricator.wikimedia.org/T263536 (10fgiunchedi) Added a panel to https://grafana.wikimedia.org/d/000000479/frontend-traffic to showcase the top p95 offenders: {F32369902} I'... 
[11:04:33] (03CR) 10Volans: [C: 03+1] "Arzhel: 2 more records to check for you 😊" (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [11:04:37] (03PS1) 10Urbanecm: kuwiktionary: Create Jinûvesazî namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631420 (https://phabricator.wikimedia.org/T262046) [11:05:26] (03CR) 10Urbanecm: [C: 03+2] kuwiktionary: Create Jinûvesazî namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631420 (https://phabricator.wikimedia.org/T262046) (owner: 10Urbanecm) [11:05:36] Lucas_WMDE: no patches, but I'm deploying anyway :D [11:06:09] (03Merged) 10jenkins-bot: kuwiktionary: Create Jinûvesazî namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631420 (https://phabricator.wikimedia.org/T262046) (owner: 10Urbanecm) [11:06:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: bootstrap repo for thirdparty/kubedam-k8s-1-17 [puppet] - 10https://gerrit.wikimedia.org/r/631410 (https://phabricator.wikimedia.org/T263284) (owner: 10Arturo Borrero Gonzalez) [11:07:00] (03PS2) 10Santhosh: wgSkipSkins: Exclude contenttranslation skin from skin options for users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628065 (https://phabricator.wikimedia.org/T263093) [11:07:48] (03PS1) 10Arturo Borrero Gonzalez: aptrepro: thirdparty/kubeadm-k8s-1-17: introduce helm3 package [puppet] - 10https://gerrit.wikimedia.org/r/631421 (https://phabricator.wikimedia.org/T264221) [11:08:15] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 58a8c8271d75ff477ce0507ac5021edcfc2f6453: kuwiktionary: Create Jinûvesazî namespace (T262046) (duration: 01m 01s) [11:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:22] T262046: Create a NAMESPACE with the title "Jinûvesazî" and it's talkpage "Gotûbêja jinûvesaziyê" on ku.wiktionary.org - https://phabricator.wikimedia.org/T262046 [11:09:26] (03CR) 
10Santhosh: "Now ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628065 (https://phabricator.wikimedia.org/T263093) (owner: 10Santhosh) [11:09:42] !log [urbanecm@mwmaint2001 ~]$ mwscript namespaceDupes.php --wiki=kuwiktionary --fix # T262046 [11:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:06] (03CR) 10Ayounsi: Migrate ESAMS to Netbox Automation (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [11:11:00] 10Operations: Migrate puppetboard to Buster - https://phabricator.wikimedia.org/T264276 (10Volans) The PRs I had there got eventually merged, at that time puppetboard was barely maintained so there was no release, but I guess they are included in the latest releases. Then there is https://github.com/voxpupuli/pu... [11:13:52] (03PS1) 10Urbanecm: Enable bot passwords at all fishbowl and private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631423 (https://phabricator.wikimedia.org/T258356) [11:14:26] !log pulling packages into reprepro for buster-wikimedia/thirdpardy/kubeadm-k8s-1-17 (T263284) [11:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:31] T263284: Upgrade Toolforge K8s to 1.17 - https://phabricator.wikimedia.org/T263284 [11:15:38] (03PS12) 10ArielGlenn: Swift object servers: enable object-expirer [puppet] - 10https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) (owner: 10Dave Pifke) [11:18:30] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: add missing update reference for thirdparty/kubeadm-k8s-1-17 [puppet] - 10https://gerrit.wikimedia.org/r/631424 (https://phabricator.wikimedia.org/T263284) [11:18:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: add missing update reference for thirdparty/kubeadm-k8s-1-17 [puppet] - 10https://gerrit.wikimedia.org/r/631424 (https://phabricator.wikimedia.org/T263284) (owner: 10Arturo Borrero 
Gonzalez) [11:20:41] (03PS2) 10Arturo Borrero Gonzalez: aptrepro: thirdparty/kubeadm-k8s-1-17: introduce helm3 package [puppet] - 10https://gerrit.wikimedia.org/r/631421 (https://phabricator.wikimedia.org/T264221) [11:24:28] (03CR) 10Volans: "one comment inline, thanks for the patch" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/631397 (owner: 10Alexandros Kosiaris) [11:25:26] (03PS1) 10Elukey: Generalize profile::statistics::gpu [puppet] - 10https://gerrit.wikimedia.org/r/631425 (https://phabricator.wikimedia.org/T255138) [11:26:27] (03CR) 10jerkins-bot: [V: 04-1] Generalize profile::statistics::gpu [puppet] - 10https://gerrit.wikimedia.org/r/631425 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey) [11:28:19] (03PS2) 10Elukey: Generalize profile::statistics::gpu [puppet] - 10https://gerrit.wikimedia.org/r/631425 (https://phabricator.wikimedia.org/T255138) [11:29:21] (03CR) 10jerkins-bot: [V: 04-1] Generalize profile::statistics::gpu [puppet] - 10https://gerrit.wikimedia.org/r/631425 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey) [11:31:16] Luca, why don't you run_ci_locally to avoid spamming? Yep will do :) [11:35:46] (03PS3) 10Elukey: Generalize profile::statistics::gpu [puppet] - 10https://gerrit.wikimedia.org/r/631425 (https://phabricator.wikimedia.org/T255138) [11:41:34] (03PS1) 10Muehlenhoff: Remove profile::ipmi::mgmt from role::bastionhost::pop [puppet] - 10https://gerrit.wikimedia.org/r/631430 [11:42:31] (03CR) 10Muehlenhoff: [C: 03+1] "Created 631430 for my last comment." 
[puppet] - 10https://gerrit.wikimedia.org/r/629496 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [11:43:57] (03PS4) 10Elukey: Generalize profile::statistics::gpu [puppet] - 10https://gerrit.wikimedia.org/r/631425 (https://phabricator.wikimedia.org/T255138) [11:53:59] 10Operations, 10netops, 10observability: active/active links monitoring - https://phabricator.wikimedia.org/T264300 (10ayounsi) p:05Triage→03Medium [11:54:16] !log kormat@cumin1001 dbctl commit (dc=all): 'db2119 (re)pooling @ 25%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12879 and previous config saved to /var/cache/conftool/dbconfig/20201001-115415-kormat.json [11:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:22] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [11:57:16] 10Operations, 10Traffic, 10netops, 10Epic: Capacity planning for (& optimization of) transport backhaul vs edge egress - https://phabricator.wikimedia.org/T263275 (10ayounsi) [11:57:21] 10Operations, 10netops: Consider balancing VRRP primaries to cr1/cr2 - https://phabricator.wikimedia.org/T263212 (10ayounsi) 05Open→03Resolved Monitoring discussion moved to T264300. Balancing is done. [11:57:26] (03PS5) 10Elukey: Generalize profile::statistics::gpu [puppet] - 10https://gerrit.wikimedia.org/r/631425 (https://phabricator.wikimedia.org/T255138) [11:59:33] too late to backport something? [11:59:47] Urbanecm: if you're still around? 
[12:00:07] kostajh: well we've a minute left, but if it's important for GE :) [12:00:10] jouncebot: next [12:00:10] In 0 hour(s) and 59 minute(s): Mediawiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201001T1300) [12:00:14] jouncebot: now [12:00:14] No deployments scheduled for the next 0 hour(s) and 59 minute(s) [12:00:27] Urbanecm: yes https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/631235 will prevent fatals on Special:Homepage on some wikis [12:00:36] okay, let's go ahead then [12:00:39] sorry, I thought it was backported yesterday [12:00:52] Urbanecm: thanks [12:00:57] (03CR) 10Urbanecm: [C: 03+2] Prevent returning the full templatelinks table in TemplateFilter [extensions/GrowthExperiments] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/631235 (https://phabricator.wikimedia.org/T264029) (owner: 10Catrope) [12:01:21] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/25581/" [puppet] - 10https://gerrit.wikimedia.org/r/631425 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey) [12:03:21] kostajh: i guess we want the wmf.11 version to get merged too, to prevent regression in wmf.11, right? [12:03:40] Urbanecm: yes, although maybe we will go straight to wmf.12 next week given the train status? not sure [12:03:59] Urbanecm: anyway, seems safer to merge that also [12:04:03] kostajh: ack [12:04:14] (03CR) 10Urbanecm: [C: 03+2] "to prevent this from regressing in wmf.11" [extensions/GrowthExperiments] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631236 (https://phabricator.wikimedia.org/T264029) (owner: 10Catrope) [12:04:15] i'll update the deployment calendar [12:04:18] thx [12:05:11] kostajh: I think this can't be really tested, right? [12:05:46] Urbanecm: yes I can test [12:06:15] ok, that's good. 
Will ping you once it's ready [12:08:10] (03PS1) 10Elukey: Set an-worker110[45] as Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/631434 (https://phabricator.wikimedia.org/T255140) [12:09:01] (03Merged) 10jenkins-bot: Prevent returning the full templatelinks table in TemplateFilter [extensions/GrowthExperiments] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/631235 (https://phabricator.wikimedia.org/T264029) (owner: 10Catrope) [12:09:19] !log kormat@cumin1001 dbctl commit (dc=all): 'db2119 (re)pooling @ 50%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12880 and previous config saved to /var/cache/conftool/dbconfig/20201001-120919-kormat.json [12:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:25] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [12:09:31] (03PS2) 10Elukey: Set an-worker110[45] as Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/631434 (https://phabricator.wikimedia.org/T255140) [12:10:26] kostajh: pulled onto mwdebug2001, can you test, please? 
[12:10:34] Urbanecm: yes looking [12:11:11] Urbanecm: success :) [12:11:16] thanks, syncing then :) [12:11:24] thanks very much [12:11:41] it looks like it was registered last night but didn't get done in the backport window, so I didn't completely imagine that this was happening yesterday :) [12:12:15] (03Merged) 10jenkins-bot: Prevent returning the full templatelinks table in TemplateFilter [extensions/GrowthExperiments] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631236 (https://phabricator.wikimedia.org/T264029) (owner: 10Catrope) [12:12:58] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.10/extensions/GrowthExperiments/includes/NewcomerTasks/TemplateFilter.php: 500d0c70c84936bcdecdd0927bcbb9ff7265afa9: Prevent returning the full templatelinks table in TemplateFilter (T264029) (duration: 01m 00s) [12:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:03] T264029: Special:Homepage runs out of memory - https://phabricator.wikimedia.org/T264029 [12:13:23] (03CR) 10Elukey: [C: 03+2] Set an-worker110[45] as Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/631434 (https://phabricator.wikimedia.org/T255140) (owner: 10Elukey) [12:13:48] kostajh: synced wmf.10, syncing wmf.11 too [12:14:06] Urbanecm: thanks again [12:15:23] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.11/extensions/GrowthExperiments/includes/NewcomerTasks/TemplateFilter.php: 500d0c70c84936bcdecdd0927bcbb9ff7265afa9: Prevent returning the full templatelinks table in TemplateFilter (T264029) (duration: 00m 59s) [12:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:33] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM. 
Deploying the udev rule seems totally favorable over using a backport of systemd" [puppet] - 10https://gerrit.wikimedia.org/r/631425 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey) [12:15:39] kostajh: it seems we're done now :) [12:23:38] PROBLEM - Check systemd state on an-worker1104 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:24:22] !log kormat@cumin1001 dbctl commit (dc=all): 'db2119 (re)pooling @ 75%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12881 and previous config saved to /var/cache/conftool/dbconfig/20201001-122422-kormat.json [12:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:28] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [12:24:38] RECOVERY - Check systemd state on an-worker1104 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:25:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:26:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:27:24] 10Operations, 10SRE-swift-storage, 10User-fgiunchedi: ms-be2017 slower than the rest of the cluster while rebalancing - https://phabricator.wikimedia.org/T264270 (10fgiunchedi) >>! In T264270#6507797, @fgiunchedi wrote: > Also "tested" a reboot of ms-be2017 which had 400+ days of uptime, as expected that doe... 
[12:37:42] (03PS3) 10Gehel: Add dsh groups config for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/630081 (https://phabricator.wikimedia.org/T252124) (owner: 10ZPapierski) [12:37:54] (03CR) 10Alexandros Kosiaris: [C: 03+1] lvs: Remove wikifeeds non-TLS endpoint 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/631405 (https://phabricator.wikimedia.org/T255878) (owner: 10JMeybohm) [12:38:09] (03CR) 10Alexandros Kosiaris: [C: 03+1] lvs: Remove wikifeeds non-TLS endpoint 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/631406 (https://phabricator.wikimedia.org/T255878) (owner: 10JMeybohm) [12:39:14] (03CR) 10Gehel: [C: 03+2] Add dsh groups config for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/630081 (https://phabricator.wikimedia.org/T252124) (owner: 10ZPapierski) [12:39:26] !log kormat@cumin1001 dbctl commit (dc=all): 'db2119 (re)pooling @ 100%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12882 and previous config saved to /var/cache/conftool/dbconfig/20201001-123925-kormat.json [12:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:32] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [12:39:34] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM – the default is MIGRATION_NEW in wmf.10 already." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/631431 (https://phabricator.wikimedia.org/T264286) (owner: 10Tobias Andersson) [12:41:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_proton_cluster_eqiad,swagger_check_restbase_esams} site={eqiad,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:41:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove proton{1,2}00{1,2} [dns] - 10https://gerrit.wikimedia.org/r/631398 (https://phabricator.wikimedia.org/T255877) (owner: 10Alexandros Kosiaris) [12:44:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:44:42] (03CR) 10Elukey: [C: 03+2] Generalize profile::statistics::gpu [puppet] - 10https://gerrit.wikimedia.org/r/631425 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey) [12:46:14] (03PS2) 10Vgutierrez: vcl: Use synthetic warning for 2% of ECDHE-ECDSA-AES128-SHA pageviews [puppet] - 10https://gerrit.wikimedia.org/r/631399 (https://phabricator.wikimedia.org/T258405) [12:49:15] (03PS2) 10Alexandros Kosiaris: decommission: Avoid matching some IPs in regexp [cookbooks] - 10https://gerrit.wikimedia.org/r/631397 [12:49:25] (03PS1) 10Filippo Giunchedi: citoid: stop using gelf for logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/631437 (https://phabricator.wikimedia.org/T219919) [12:50:05] 10Operations, 10Citoid, 10Wikimedia-Logstash, 10observability, and 3 others: Move citoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219919 (10fgiunchedi) >>! In T219919#6498391, @Mvolz wrote: >>>! In T219919#6478632, @fgiunchedi wrote: >> It looks like citoid is now on k8s but sti... 
[12:52:21] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (10akosiaris) [12:52:47] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10akosiaris) [12:53:06] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (10akosiaris) 05Open→03Resolved All old stuff has been removed, I 'll resolve this. [12:54:06] (03CR) 10Alexandros Kosiaris: "Actually, it's cpjobqueue that was causing the issues. We should change this value for it as well." [deployment-charts] - 10https://gerrit.wikimedia.org/r/631200 (https://phabricator.wikimedia.org/T264195) (owner: 10Hnowlan) [12:56:31] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/630661 (owner: 10Dzahn) [12:56:52] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [12:56:53] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:08] !log kormat@cumin1001 dbctl commit (dc=all): 'db2136 depooling: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12883 and previous config saved to /var/cache/conftool/dbconfig/20201001-125707-kormat.json [12:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:14] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [13:00:04] twentyafterfour and hashar: Dear deployers, time to do the Mediawiki train - American+European Version (secondary timeslot) deploy. Dont look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201001T1300). [13:00:48] (03PS1) 10Filippo Giunchedi: profile: remove rollout flag for rsyslog queues [puppet] - 10https://gerrit.wikimedia.org/r/631438 (https://phabricator.wikimedia.org/T226703) [13:00:50] I am still processing with the tasks blockers from yesterday [13:00:50] (03PS1) 10Filippo Giunchedi: hieradata: enable rsyslog kafka queues in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/631439 (https://phabricator.wikimedia.org/T226703) [13:02:18] (03PS3) 10Alexandros Kosiaris: decommission: Avoid matching some IPs in regexp [cookbooks] - 10https://gerrit.wikimedia.org/r/631397 [13:03:33] (03PS1) 10Muehlenhoff: Remove obsolete Hiera settings for Thanos [puppet] - 10https://gerrit.wikimedia.org/r/631440 [13:04:29] (03PS2) 10Gehel: cloudelastic: fully configure pool/depool scripts [puppet] - 10https://gerrit.wikimedia.org/r/629829 (https://phabricator.wikimedia.org/T263073) (owner: 10Ryan Kemper) [13:04:34] (03PS3) 10Gehel: cloudelastic: fully configure pool/depool scripts [puppet] - 10https://gerrit.wikimedia.org/r/629829 (https://phabricator.wikimedia.org/T263073) (owner: 10Ryan Kemper) [13:05:23] (03CR) 10Filippo Giunchedi: "noop effectively https://puppet-compiler.wmflabs.org/compiler1003/25584/" [puppet] - 10https://gerrit.wikimedia.org/r/631438 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [13:06:00] (03CR) 10Gehel: [C: 03+2] cloudelastic: fully configure pool/depool scripts [puppet] - 10https://gerrit.wikimedia.org/r/629829 (https://phabricator.wikimedia.org/T263073) (owner: 10Ryan Kemper) [13:06:44] (03CR) 10JMeybohm: [C: 03+2] lvs: Remove wikifeeds non-TLS endpoint 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/631405 (https://phabricator.wikimedia.org/T255878) (owner: 10JMeybohm) [13:07:35] (03PS3) 10JMeybohm: lvs: Remove zotero non-TLS endpoint 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/629338 (https://phabricator.wikimedia.org/T255869) [13:07:50] 
(03CR) 10Herron: [C: 03+1] "thanks for this, lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/631440 (owner: 10Muehlenhoff) [13:08:16] (03CR) 10JMeybohm: [C: 03+2] lvs: Remove zotero non-TLS endpoint 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/629338 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [13:11:44] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete Hiera settings for Thanos [puppet] - 10https://gerrit.wikimedia.org/r/631440 (owner: 10Muehlenhoff) [13:14:25] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Please replace Shannon Baileys SSH key - https://phabricator.wikimedia.org/T264127 (10herron) Hi @Sbailey as a security precaution, could you please use your existing shell access to upload the desired new ssh key onto one of the bastions (let's say ba... [13:14:32] 10Operations, 10Phatality, 10observability, 10Developer Productivity: Deploying "Phatality" plugin for Kibana invokes oom-killer on logstash::collector nodes - https://phabricator.wikimedia.org/T237706 (10fgiunchedi) Does Phatality work out of the box with Kibana 7 aka https://logstash-next.wikimedia.org ?... 
[13:22:24] !log installing curl security updates on stretch [13:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:21] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/631397 (owner: 10Alexandros Kosiaris) [13:28:28] (03PS1) 10Elukey: amd_rocm: add rock-dkms package [puppet] - 10https://gerrit.wikimedia.org/r/631444 (https://phabricator.wikimedia.org/T255138) [13:28:31] RECOVERY - Check systemd state on prometheus3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:35] (03CR) 10Elukey: [C: 03+2] amd_rocm: add rock-dkms package [puppet] - 10https://gerrit.wikimedia.org/r/631444 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey) [13:29:42] !log restarting mw canaries to pick up curl update [13:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:22] (03PS2) 10Filippo Giunchedi: hieradata: enable rsyslog kafka queues in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/631439 (https://phabricator.wikimedia.org/T226703) [13:31:24] (03PS1) 10Filippo Giunchedi: profile: look up rsyslog queue_size in scope [puppet] - 10https://gerrit.wikimedia.org/r/631446 (https://phabricator.wikimedia.org/T226703) [13:33:12] (03PS1) 10Elukey: Add an-worker1097 as hadoop worker [puppet] - 10https://gerrit.wikimedia.org/r/631447 (https://phabricator.wikimedia.org/T255138) [13:33:50] (03PS1) 10Jforrester: SpecialInvestigateBlock: Don't assume 'DisableUTEdit' exists [extensions/CheckUser] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631466 (https://phabricator.wikimedia.org/T264302) [13:35:01] (03CR) 10Elukey: [C: 03+2] Add an-worker1097 as hadoop worker [puppet] - 10https://gerrit.wikimedia.org/r/631447 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey) [13:39:21] (03CR) 10Ema: [C: 03+1] vcl: Use synthetic warning for 2% of ECDHE-ECDSA-AES128-SHA pageviews [puppet] - 
10https://gerrit.wikimedia.org/r/631399 (https://phabricator.wikimedia.org/T258405) (owner: 10Vgutierrez) [13:43:02] (03CR) 10Vgutierrez: [C: 03+2] vcl: Use synthetic warning for 2% of ECDHE-ECDSA-AES128-SHA pageviews [puppet] - 10https://gerrit.wikimedia.org/r/631399 (https://phabricator.wikimedia.org/T258405) (owner: 10Vgutierrez) [13:43:54] !log use synthetic warning for 2% of ECDHE-ECDSA-AES128-SHA pageviews - T258405 [13:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:59] T258405: Deprecate TLSv1.2 weak ciphersuites - https://phabricator.wikimedia.org/T258405 [13:44:49] PROBLEM - Hadoop DataNode on an-worker1096 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:45:47] PROBLEM - Hadoop NodeManager on an-worker1096 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:48:19] this is a reboot --^ [13:49:06] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime [13:49:07] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:29] elukey: should we look forward to the release of the new an-worker1096 origin story? 
[13:50:56] !log rebooting an-worker1096 for cluster maintenance [13:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:50] !log kormat@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 33%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12884 and previous config saved to /var/cache/conftool/dbconfig/20201001-135149-kormat.json [13:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:55] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [13:52:20] kormat: the what sorry ?? :D :D [13:52:53] elukey: https://en.wikipedia.org/wiki/Reboot_(fiction) [13:53:07] usually followed by an origin story movie, e.g. batman begins [13:53:29] And here I thought it was a dig at my novels [13:53:34] I thought that was going to be https://en.wikipedia.org/wiki/ReBoot and I got even more confused [13:53:40] * elukey goes in the shame corner for his ignorance [13:55:12] RECOVERY - Hadoop DataNode on an-worker1096 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:58:31] (03PS1) 10Alexandros Kosiaris: Add pytest and a simple test for decommission [cookbooks] - 10https://gerrit.wikimedia.org/r/631448 [13:59:36] !log installing nginx security updates on schema* [13:59:39] (03CR) 10jerkins-bot: [V: 04-1] Add pytest and a simple test for decommission [cookbooks] - 10https://gerrit.wikimedia.org/r/631448 (owner: 10Alexandros Kosiaris) [13:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:21] (03PS2) 10Alexandros Kosiaris: Add pytest and a simple test for decommission [cookbooks] - 10https://gerrit.wikimedia.org/r/631448 [14:02:23] (03CR) 10jerkins-bot: [V: 04-1] Add pytest and a simple test for decommission [cookbooks] - 10https://gerrit.wikimedia.org/r/631448 
(owner: 10Alexandros Kosiaris) [14:02:47] (03PS3) 10Filippo Giunchedi: hieradata: enable rsyslog kafka queues in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/631439 (https://phabricator.wikimedia.org/T226703) [14:03:51] (03PS3) 10Alexandros Kosiaris: Add pytest and a simple test for decommission [cookbooks] - 10https://gerrit.wikimedia.org/r/631448 [14:04:17] 10Operations, 10Data-Persistence-Backup, 10Goal: Define a methodology to track WMF services backup requirements - https://phabricator.wikimedia.org/T264274 (10LSobanski) [14:04:33] 10Operations, 10Data-Persistence-Backup, 10Goal: Define a methodology to track WMF services backup requirements - https://phabricator.wikimedia.org/T264274 (10LSobanski) [14:05:04] (03CR) 10jerkins-bot: [V: 04-1] Add pytest and a simple test for decommission [cookbooks] - 10https://gerrit.wikimedia.org/r/631448 (owner: 10Alexandros Kosiaris) [14:05:12] (03PS4) 10Filippo Giunchedi: hieradata: enable rsyslog kafka queues in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/631439 (https://phabricator.wikimedia.org/T226703) [14:06:53] !log kormat@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 67%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12885 and previous config saved to /var/cache/conftool/dbconfig/20201001-140653-kormat.json [14:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:59] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [14:07:05] 10Operations, 10Data-Persistence-Backup, 10Goal: Define a methodology to track WMF services backup requirements - https://phabricator.wikimedia.org/T264274 (10LSobanski) [14:07:13] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/25592/" [puppet] - 10https://gerrit.wikimedia.org/r/631439 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [14:07:46] RECOVERY - Hadoop NodeManager on an-worker1096 is OK: 
PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:07:46] (03PS4) 10Alexandros Kosiaris: Add pytest and a simple test for decommission [cookbooks] - 10https://gerrit.wikimedia.org/r/631448 [14:08:28] !log installing pillow security updates [14:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:38] (03PS1) 10Andrew Bogott: cloudvirt-wdqs* to Buster [puppet] - 10https://gerrit.wikimedia.org/r/631450 (https://phabricator.wikimedia.org/T259399) [14:09:46] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt-wdqs* to Buster [puppet] - 10https://gerrit.wikimedia.org/r/631450 (https://phabricator.wikimedia.org/T259399) (owner: 10Andrew Bogott) [14:09:52] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/25593/" [puppet] - 10https://gerrit.wikimedia.org/r/631446 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [14:09:54] (03PS2) 10Andrew Bogott: cloudvirt-wdqs* to Buster [puppet] - 10https://gerrit.wikimedia.org/r/631450 (https://phabricator.wikimedia.org/T259399) [14:12:31] !log enable puppet on mw2271 [14:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:33] (03PS1) 10Klausman: amd_rocm Prometheus script: hard-code Py3.7 usage [puppet] - 10https://gerrit.wikimedia.org/r/631452 (https://phabricator.wikimedia.org/T255138) [14:14:22] !log reimaging cloudvirt-wdqs1001 to buster [14:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:58] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:21:19] (03CR) 10Elukey: [C: 03+1] amd_rocm Prometheus script: hard-code Py3.7 usage (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/631452 (https://phabricator.wikimedia.org/T255138) (owner: 10Klausman) [14:21:35] (03CR) 10Klausman: [C: 03+2] amd_rocm Prometheus script: hard-code Py3.7 usage [puppet] - 10https://gerrit.wikimedia.org/r/631452 (https://phabricator.wikimedia.org/T255138) (owner: 10Klausman) [14:21:57] !log kormat@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 100%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12886 and previous config saved to /var/cache/conftool/dbconfig/20201001-142156-kormat.json [14:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:03] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [14:22:57] (03PS2) 10Klausman: amd_rocm Prometheus script: hard-code Py3.7 usage [puppet] - 10https://gerrit.wikimedia.org/r/631452 (https://phabricator.wikimedia.org/T255138) [14:23:00] PROBLEM - cassandra-c CQL 10.64.0.211:9042 on restbase1028 is CRITICAL: connect to address 10.64.0.211 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:23:00] PROBLEM - cassandra-a SSL 10.64.16.180:7001 on restbase1029 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [14:23:02] PROBLEM - cassandra-c CQL 10.64.48.236:9042 on restbase1030 is CRITICAL: connect to address 10.64.48.236 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:23:04] PROBLEM - cassandra-b CQL 10.64.16.181:9042 on restbase1029 is CRITICAL: connect to address 10.64.16.181 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:23:08] PROBLEM - cassandra-b CQL 10.64.48.235:9042 on restbase1030 is CRITICAL: connect to address 10.64.48.235 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:23:12] PROBLEM - cassandra-c service on restbase1030 is CRITICAL: CRITICAL - Expecting active but unit 
cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:23:20] PROBLEM - cassandra-c SSL 10.64.16.182:7001 on restbase1029 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [14:23:26] PROBLEM - cassandra-a CQL 10.64.16.180:9042 on restbase1029 is CRITICAL: connect to address 10.64.16.180 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:23:32] PROBLEM - cassandra-b SSL 10.64.16.181:7001 on restbase1029 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [14:23:40] PROBLEM - cassandra-b service on restbase1030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:23:46] PROBLEM - cassandra-c CQL 10.64.16.182:9042 on restbase1029 is CRITICAL: connect to address 10.64.16.182 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:23:50] PROBLEM - cassandra-a service on restbase1029 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:23:56] PROBLEM - cassandra-c service on restbase1029 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:23:56] PROBLEM - cassandra-a SSL 10.64.48.234:7001 on restbase1030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [14:24:02] PROBLEM - cassandra-b service on restbase1029 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:24:16] PROBLEM - cassandra-b SSL 10.64.48.235:7001 on restbase1030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused 
https://phabricator.wikimedia.org/T120662 [14:24:22] PROBLEM - cassandra-a service on restbase1030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:24:24] PROBLEM - cassandra-a CQL 10.64.48.234:9042 on restbase1030 is CRITICAL: connect to address 10.64.48.234 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:24:32] PROBLEM - cassandra-c SSL 10.64.48.236:7001 on restbase1030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [14:24:54] hnowlan: expired downtime? [14:25:42] (03CR) 10Klausman: [C: 03+2] amd_rocm Prometheus script: hard-code Py3.7 usage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631452 (https://phabricator.wikimedia.org/T255138) (owner: 10Klausman) [14:26:28] (03CR) 10Volans: [C: 04-1] "Some comments on the structure and minor ones on the implementation, see inline." 
(034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/631448 (owner: 10Alexandros Kosiaris) [14:27:09] (03PS1) 10Herron: admin: update dedcode account attributes and set expiry [puppet] - 10https://gerrit.wikimedia.org/r/631455 (https://phabricator.wikimedia.org/T263692) [14:27:43] 10Operations, 10Data-Persistence-Backup, 10Goal: Track all directly-owned SRE datasets into the new inventory system - https://phabricator.wikimedia.org/T264275 (10LSobanski) [14:27:57] (03PS1) 10Andrew Bogott: cloudvirt-wdqs*: change partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/631456 [14:28:14] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 23576 bytes in 0.260 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:28:34] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt-wdqs*: change partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/631456 (owner: 10Andrew Bogott) [14:28:48] moritzm: agh, yes [14:29:04] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [14:29:05] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:08] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [14:29:09] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:12] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [14:29:13] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:46] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (10herron) Since this is somewhat of an atypical access request (in that the account and group membership are pre-existing, but attributes... [14:35:33] !log Create bot_passwords table at all private wikis (T258356) [14:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:38] T258356: Allow users at all private/fishbowl wikis to use botpasswords - https://phabricator.wikimedia.org/T258356 [14:35:43] (03PS3) 10Arturo Borrero Gonzalez: aptrepo: thirdparty/kubeadm-k8s-1-17: introduce helm3 package [puppet] - 10https://gerrit.wikimedia.org/r/631421 (https://phabricator.wikimedia.org/T264221) [14:35:49] ACKNOWLEDGEMENT - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 2.168e+07 ge 2.592e+05 Gehel waiting on https://phabricator.wikimedia.org/T254014 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [14:36:58] (03PS1) 10Andrew Bogott: cloudvirt-wdqs* type fix for partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/631459 [14:37:38] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt-wdqs* type fix for partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/631459 (owner: 10Andrew Bogott) [14:37:39] 10Operations, 10serviceops, 10Patch-For-Review, 10User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (10jijiki) [14:39:12] (03CR) 10Mvolz: [C: 03+1] citoid: stop using gelf for logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/631437 (https://phabricator.wikimedia.org/T219919) (owner: 10Filippo Giunchedi) [14:40:09] (03PS6) 10JMeybohm: services: retire the ORES http endpoint (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/628803 
(https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [14:40:24] 10Operations, 10serviceops, 10Patch-For-Review, 10User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (10jijiki) [14:40:50] (03CR) 10JMeybohm: [C: 03+2] services: retire the ORES http endpoint (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/628803 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [14:41:13] (03PS2) 10JMeybohm: lvs: Remove wikifeeds non-TLS endpoint 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/631406 (https://phabricator.wikimedia.org/T255878) [14:41:58] (03CR) 10JMeybohm: [C: 03+2] lvs: Remove wikifeeds non-TLS endpoint 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/631406 (https://phabricator.wikimedia.org/T255878) (owner: 10JMeybohm) [14:42:25] (03PS1) 10Cwhite: mtail: move systemd unit customizations to override [puppet] - 10https://gerrit.wikimedia.org/r/631460 [14:42:42] !log running puppet on lvs servers - T244843 T255878 [14:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:48] T244843: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 [14:42:49] T255878: Move wikifeeds to use TLS only - https://phabricator.wikimedia.org/T255878 [14:43:23] godog: elukey: hnowlan: FYI, I can't make today's Cassandra thing... [14:43:28] (03CR) 10jerkins-bot: [V: 04-1] mtail: move systemd unit customizations to override [puppet] - 10https://gerrit.wikimedia.org/r/631460 (owner: 10Cwhite) [14:43:37] incoming pybal alerts [14:44:49] urandom: np, let's cancel it! 
I don't have much (I followed some interesting apachecon na talks about upgrading cassandra from 2 to 3, but nothing else) [14:46:13] (03PS2) 10Cwhite: mtail: move systemd unit customizations to override [puppet] - 10https://gerrit.wikimedia.org/r/631460 [14:46:50] (03CR) 10Addshore: [C: 04-1] "This patch should be split into 2 patches, so that there is less chance it will be deployed incorrectly." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631431 (https://phabricator.wikimedia.org/T264286) (owner: 10Tobias Andersson) [14:48:36] !log restarting pybal on lvs2010.codfw.wmnet - T244843 T255878 [14:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:42] T244843: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 [14:48:42] T255878: Move wikifeeds to use TLS only - https://phabricator.wikimedia.org/T255878 [14:48:48] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.47:8889, 10.2.1.10:8081]) https://wikitech.wikimedia.org/wiki/PyBal [14:49:11] urandom elukey same here, ok to cancel [14:49:19] (03PS1) 10Elukey: Set an-worker109[78] as Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/631461 (https://phabricator.wikimedia.org/T255138) [14:49:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me. After merging, make sure to update the LDAP membership from cn=wmf to cn=nda." [puppet] - 10https://gerrit.wikimedia.org/r/631455 (https://phabricator.wikimedia.org/T263692) (owner: 10Herron) [14:50:01] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10Cmjohnson) @Jclark-ctr I see that the task has been resolved in coupa but I don't see the servers anywhere and they're not in netbox. Where are they? 
[14:50:02] !log restarting pybal on lvs1015.eqiad.wmnet,lvs2009.codfw.wmnet - T244843 T255878 [14:50:06] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.47:8889, 10.2.1.10:8081]) https://wikitech.wikimedia.org/wiki/PyBal [14:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:37] (03CR) 10Elukey: [C: 03+2] Set an-worker109[78] as Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/631461 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey) [14:51:28] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.10:8081, 10.2.2.47:8889]) https://wikitech.wikimedia.org/wiki/PyBal [14:52:55] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [14:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:39] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Marostegui) The RAID finished correctly. @Papaul what do you want to do once the new disk arrives? Should we leave this old one in, or should we pull it out and insert the new one? `... 
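Editorial note: the "PyBal IPVS diff check" alerts above report services that are still configured in IPVS but that PyBal no longer knows about. The idea can be sketched as a plain set difference. This is a minimal illustration, not PyBal's or the Icinga check's actual code: the function names and the `10.2.1.1:443` entry are hypothetical, and only the two stale addresses and the `ipvsadm -D -t` form are taken from the log itself.

```python
def stale_ipvs_services(ipvs_services, pybal_services):
    """Return services present in IPVS but unknown to PyBal, sorted for stable output."""
    return sorted(set(ipvs_services) - set(pybal_services))


def removal_commands(stale):
    """Build one 'ipvsadm -D -t <addr:port>' command per stale TCP virtual service."""
    return [f"ipvsadm -D -t {svc}" for svc in stale]


# The two stale entries come from the alert text; 10.2.1.1:443 is a made-up
# service that both sides still know about, included only for contrast.
ipvs = {"10.2.1.47:8889", "10.2.1.10:8081", "10.2.1.1:443"}
pybal = {"10.2.1.1:443"}

stale = stale_ipvs_services(ipvs, pybal)
commands = removal_commands(stale)
```

The commands are only built as strings here; running them by hand is what the subsequent !log entries record.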
[14:53:48] !log running ipvsadm -D -t 10.2.1.10:8081; ipvsadm -D -t 10.2.1.47:8889 on lvs2010.codfw.wmnet,lvs2009.codfw.wmnet - T244843 T255878 [14:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:54] T244843: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 [14:53:55] T255878: Move wikifeeds to use TLS only - https://phabricator.wikimedia.org/T255878 [14:54:51] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:15] !log installing npm security updates on buster [14:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:22] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:55:34] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (10herron) [14:55:43] !log running ipvsadm -D -t 10.2.2.10:8081; ipvsadm -D -t 10.2.2.47:8889 on lvs1015.eqiad.wmnet - T244843 T255878 [14:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:19] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Marostegui) 05Open→03Resolved Going to close this as resolved. @Papaul let me know your thought from the above comment! 
Thank you [14:56:28] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:56:30] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:57:59] (03CR) 10Cwhite: "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1001/25594/" [puppet] - 10https://gerrit.wikimedia.org/r/631460 (owner: 10Cwhite) [15:01:14] PROBLEM - Disk space on Hadoop worker on an-worker1098 is CRITICAL: NRPE: Command check_disk_space_hadoop_worker not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [15:02:22] PROBLEM - Check systemd state on an-worker1098 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:36] (03PS1) 10Muehlenhoff: Add library hint for LLVM [puppet] - 10https://gerrit.wikimedia.org/r/631464 [15:03:26] RECOVERY - Disk space on Hadoop worker on an-worker1098 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [15:03:40] new nodes -^ [15:03:54] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for LLVM [puppet] - 10https://gerrit.wikimedia.org/r/631464 (owner: 10Muehlenhoff) [15:04:32] RECOVERY - Check systemd state on an-worker1098 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:53] (03CR) 10BBlack: [C: 03+1] Migrate ESAMS to Netbox Automation [dns] - 10https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [15:10:45] (03CR) 10Tchanders: [C: 03+1] SpecialInvestigateBlock: Don't assume 'DisableUTEdit' exists [extensions/CheckUser] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631466 (https://phabricator.wikimedia.org/T264302) (owner: 10Jforrester) [15:12:28] 
10Operations, 10ops-eqiad, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 (10Cmjohnson) [15:14:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10Cmjohnson) 05Open→03Resolved [15:21:24] (03CR) 10Filippo Giunchedi: [C: 03+1] mtail: move systemd unit customizations to override [puppet] - 10https://gerrit.wikimedia.org/r/631460 (owner: 10Cwhite) [15:28:37] (03CR) 10Bstorm: [C: 03+1] "I think this is a good idea. Keeping helm3 up to date (especially if we expand its usage in Toolforge) prevents a lot of problems." [puppet] - 10https://gerrit.wikimedia.org/r/631421 (https://phabricator.wikimedia.org/T264221) (owner: 10Arturo Borrero Gonzalez) [15:31:26] 10Operations, 10serviceops, 10Patch-For-Review, 10User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (10jijiki) I installed memcached on mwdebug1001 and configured mcrouter as is described in the task description. Functionality wise, I didn't see an... 
[15:31:56] (03PS2) 10Arturo Borrero Gonzalez: openstack: cloudgw: introduce native vlan for easier reimaging [puppet] - 10https://gerrit.wikimedia.org/r/630812 (https://phabricator.wikimedia.org/T263622) [15:33:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: cloudgw: introduce native vlan for easier reimaging [puppet] - 10https://gerrit.wikimedia.org/r/630812 (https://phabricator.wikimedia.org/T263622) (owner: 10Arturo Borrero Gonzalez) [15:36:52] (03PS1) 10Ahmon Dancy: .gitignore: Ignore wikiversions-*.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/631493 [15:37:26] (03CR) 10Ahmon Dancy: [C: 03+2] .gitignore: Ignore wikiversions-*.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/631493 (owner: 10Ahmon Dancy) [15:38:02] (03Merged) 10jenkins-bot: .gitignore: Ignore wikiversions-*.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/631493 (owner: 10Ahmon Dancy) [15:47:01] (03PS1) 10Dwisehaupt: Shift payments to codfw for testing 1_35 [dns] - 10https://gerrit.wikimedia.org/r/631494 (https://phabricator.wikimedia.org/T254298) [15:50:08] (03PS5) 10Alexandros Kosiaris: Add pytest and a simple test for decommission [cookbooks] - 10https://gerrit.wikimedia.org/r/631448 [15:50:10] (03CR) 10Alexandros Kosiaris: Add pytest and a simple test for decommission (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/631448 (owner: 10Alexandros Kosiaris) [15:51:33] (03CR) 10Jgreen: [C: 03+2] Shift payments to codfw for testing 1_35 [dns] - 10https://gerrit.wikimedia.org/r/631494 (https://phabricator.wikimedia.org/T254298) (owner: 10Dwisehaupt) [15:52:48] (03PS1) 10Jgreen: Revert "Shift payments to codfw for testing 1_35" [dns] - 10https://gerrit.wikimedia.org/r/631468 [15:53:42] (03CR) 10Jgreen: [C: 03+2] Revert "Shift payments to codfw for testing 1_35" [dns] - 10https://gerrit.wikimedia.org/r/631468 (owner: 10Jgreen) [15:53:44] !log lvs1016: re-disabled puppet with ticket ref in comment, downed 
interface enp5s0f0 since it's flapping furiously [15:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:51] !log lvs1016: re-disabled puppet with ticket ref in comment, downed interface enp5s0f0 since it's flapping furiously - T264227 [15:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:56] T264227: lvs1016 enp5s0f0 interface errors - https://phabricator.wikimedia.org/T264227 [15:54:00] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10LSobanski) [15:54:17] (03PS1) 10Filippo Giunchedi: am: ensure ends_at is sent with a timezone [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/631495 [15:54:24] (03PS3) 10Tobias Andersson: [DNM] Remove migration settings in Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631431 (https://phabricator.wikimedia.org/T264286) [15:54:52] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [15:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:31] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal: Research storage solutions for media backups - https://phabricator.wikimedia.org/T264190 (10LSobanski) [15:55:41] 10Operations, 10ops-eqiad, 10Traffic, 10netops: lvs1016 enp5s0f0 interface errors - https://phabricator.wikimedia.org/T264227 (10BBlack) The link has gotten worse and began flapping up and down rapidly since last update, causing a loss of routing to the row. I've downtimed the whole host now in icinga, di... 
[15:55:44] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:55:52] (03CR) 10Alexandros Kosiaris: Add pytest and a simple test for decommission (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/631448 (owner: 10Alexandros Kosiaris) [15:56:16] (03PS6) 10Alexandros Kosiaris: Add pytest and a simple test for decommission [cookbooks] - 10https://gerrit.wikimedia.org/r/631448 [15:56:51] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:03] (03CR) 10Filippo Giunchedi: "Results in a 400 from AM otherwise:" [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/631495 (owner: 10Filippo Giunchedi) [15:57:33] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal: Research storage solutions for media backups - https://phabricator.wikimedia.org/T264190 (10LSobanski) [15:59:12] (03PS1) 10Filippo Giunchedi: am: catch and report ApiException [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/631499 [15:59:40] (03CR) 10Cwhite: [C: 03+2] mtail: move systemd unit customizations to override [puppet] - 10https://gerrit.wikimedia.org/r/631460 (owner: 10Cwhite) [16:00:04] jbond42 and cdanis: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201001T1600). 
[16:00:50] nothing in Puppet request window [16:01:02] (03PS4) 10Alexandros Kosiaris: decommission: Avoid matching some IPs in regexp [cookbooks] - 10https://gerrit.wikimedia.org/r/631397 [16:01:59] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/25596/maps2004.codfw.wmnet/change.maps2004.codfw.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/629439 (owner: 10Dzahn) [16:04:48] (03PS1) 10Arturo Borrero Gonzalez: openstack: cloudgw: replace dots with colons when building host IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/631500 (https://phabricator.wikimedia.org/T261724) [16:05:03] (03PS1) 10Cwhite: mtail: upgrade mtail across the fleet to 3.0.0~rc35-3+wmf3 [puppet] - 10https://gerrit.wikimedia.org/r/631501 (https://phabricator.wikimedia.org/T263728) [16:05:31] (03CR) 10Elukey: "This is great! Can you add the Bug: T257297 to keep track of what has been done in here?" [cookbooks] - 10https://gerrit.wikimedia.org/r/631397 (owner: 10Alexandros Kosiaris) [16:05:40] (03PS2) 10Cwhite: mtail: upgrade mtail across the fleet to 3.0.0~rc35-3+wmf3 [puppet] - 10https://gerrit.wikimedia.org/r/631501 (https://phabricator.wikimedia.org/T263728) [16:07:14] (03PS6) 10Dzahn: maps: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/629439 [16:08:02] (03CR) 10Tobias Andersson: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/631496 (https://phabricator.wikimedia.org/T264286) (owner: 10Tobias Andersson) [16:08:10] (03CR) 10Elukey: [C: 03+1] decommission: Avoid matching some IPs in regexp [cookbooks] - 10https://gerrit.wikimedia.org/r/631397 (owner: 10Alexandros Kosiaris) [16:09:26] (03PS1) 1020after4: Revert "Drop scap plugins, moved into scap proper" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631469 [16:11:09] (03CR) 10Cwhite: [C: 03+2] backport patch: MarshalJSON bucket bounds as strings configure gbp to use pdebuild [debs/mtail] (debian/sid) - 10https://gerrit.wikimedia.org/r/631310 (owner: 10Cwhite) [16:11:58] (03CR) 10Lars Wirzenius: [C: 03+2] Revert "Drop scap plugins, moved into scap proper" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631469 (owner: 1020after4) [16:12:43] (03Merged) 10jenkins-bot: Revert "Drop scap plugins, moved into scap proper" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631469 (owner: 1020after4) [16:12:51] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/25597/maps2004.codfw.wmnet/index.html https://puppet-compiler.wmflabs.org/compiler1002/2" [puppet] - 10https://gerrit.wikimedia.org/r/629439 (owner: 10Dzahn) [16:13:32] (03PS1) 10Ebernhardson: envoy: Set appropriate service names for three level wikimedia.org domains [puppet] - 10https://gerrit.wikimedia.org/r/631503 (https://phabricator.wikimedia.org/T263073) [16:13:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:14:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:16:00] (03CR) 10Ebernhardson: envoy: Set appropriate service names for three level wikimedia.org domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631503 (https://phabricator.wikimedia.org/T263073) (owner: 10Ebernhardson) [16:19:19] (03CR) 10Ebernhardson: "pcc seems reasonable: https://puppet-compiler.wmflabs.org/compiler1001/25599/" [puppet] - 10https://gerrit.wikimedia.org/r/631503 (https://phabricator.wikimedia.org/T263073) (owner: 10Ebernhardson) [16:19:41] !log rebooting lvs1016 to a fresh state for interface config and error counters, etc - T264227 [16:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:46] T264227: lvs1016 enp5s0f0 interface errors - https://phabricator.wikimedia.org/T264227 [16:21:46] jouncebot: next [16:21:46] In 0 hour(s) and 38 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201001T1700) [16:21:52] jouncebot: now [16:21:52] For the next 0 hour(s) and 38 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201001T1600) [16:22:01] (03PS2) 10Arturo Borrero Gonzalez: openstack: cloudgw: replace dots with colons when building host IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/631500 (https://phabricator.wikimedia.org/T261724) [16:22:59] [FYI] chaomodus and I will migrate esams's DNS records to the Netbox-generated ones in ~10 minutes. Shout if we should hold for any reason! 
[16:25:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: cloudgw: replace dots with colons when building host IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/631500 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [16:30:43] (03PS1) 10RobH: updating sku listing [software] - 10https://gerrit.wikimedia.org/r/631507 [16:31:49] (03CR) 10RobH: [C: 03+2] updating sku listing [software] - 10https://gerrit.wikimedia.org/r/631507 (owner: 10RobH) [16:32:26] 10Operations, 10ops-eqiad, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 (10Cmjohnson) [16:33:40] (03CR) 10Cwhite: [C: 03+1] "LGTM" [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/631495 (owner: 10Filippo Giunchedi) [16:33:56] (03CR) 10Cwhite: [C: 03+1] am: catch and report ApiException [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/631499 (owner: 10Filippo Giunchedi) [16:40:59] (03PS1) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) [16:41:45] (03CR) 10Cwhite: [C: 03+1] profile: look up rsyslog queue_size in scope [puppet] - 10https://gerrit.wikimedia.org/r/631446 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [16:42:30] (03CR) 10Cwhite: [C: 03+1] profile: remove rollout flag for rsyslog queues [puppet] - 10https://gerrit.wikimedia.org/r/631438 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [16:42:36] RECOVERY - cassandra-c CQL 10.64.0.211:9042 on restbase1028 is OK: TCP OK - 0.000 second response time on 10.64.0.211 port 9042 https://phabricator.wikimedia.org/T93886 [16:42:54] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10elukey) There are two nodes marked with `fails to hit dhcp server, please 
check cable/port`, @Cmjohnson when you have a moment can you check? an-worker... [16:43:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_cxserver_cluster_eqiad,swagger_check_wikifeeds_codfw} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:44:04] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Krinkle) * timestamp: 2020-10-01T16:15:49 * host: mw2290 * message: ` [ab3d143d-d371-41e4-ab74-ae92255e704f] /w/api.php?action=query&prop=revis... [16:44:23] (03PS2) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) [16:44:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:45:03] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Please replace Shannon Baileys SSH key - https://phabricator.wikimedia.org/T264127 (10Sbailey) Ya, tried to do this, but do not have access. I might need to refresh my id_rsa.pub key as well. Not sure how this whole house of cards hangs together: wmf1... 
[16:46:34] !log migrating esams DNS records to the autogenerated ones from Netbox - T258729 [16:46:34] (03CR) 10Cwhite: [C: 03+1] hieradata: enable rsyslog kafka queues in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/631439 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [16:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:39] T258729: netbox DNS Automation Workflow checklist for Commissioning and Decommissioning 2020Q1 - https://phabricator.wikimedia.org/T258729 [16:47:08] (03PS7) 10Volans: Migrate ESAMS to Netbox Automation [dns] - 10https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [16:48:48] (03CR) 10CRusnov: [C: 03+2] Migrate ESAMS to Netbox Automation [dns] - 10https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [16:50:36] (03PS3) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) [16:53:09] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10ssastry) As an additional data point, note that in the merged {T264241} task from yesterday, it was a different class / file. 
[16:53:46] RECOVERY - cassandra-a SSL 10.64.16.180:7001 on restbase1029 is OK: SSL OK - Certificate restbase1029-a valid until 2022-09-29 10:16:45 +0000 (expires in 727 days) https://phabricator.wikimedia.org/T120662 [16:54:26] RECOVERY - cassandra-a service on restbase1029 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:57:55] (03PS4) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) [17:00:04] chrisalbon and accraze: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201001T1700). [17:00:49] 10Operations, 10ops-eqiad, 10Analytics-Radar: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (10RobH) a:05RobH→03Cmjohnson Reassigning to Chris, as I listed him as the contact on the self dispatch for the dell tech to contact and arrange a time for the onsite work. [17:06:46] (03PS1) 10Andrew Bogott: cloud-vps: update resolv.conf and associated domains for .wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/631511 [17:16:30] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:10] 10Operations, 10ops-eqiad, 10Traffic, 10netops: lvs1016 enp5s0f0 interface errors - https://phabricator.wikimedia.org/T264227 (10BBlack) 05Open→03Resolved a:03Cmjohnson @Cmjohnson replaced the SFPs on both ends of this link before my reboot above. Since the reboot, we don't seem to have any abnormal... 
[17:19:16] (03CR) 10Volans: [C: 03+2] Set esams as migrated to the DNS Netbox automation [cookbooks] - 10https://gerrit.wikimedia.org/r/631389 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [17:19:27] (03CR) 10Volans: [C: 03+2] scripts: dns, mark esams as migrated to Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/631388 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [17:19:39] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Krinkle) Yeah, that's a classic string one-byte flip issue which does seem to be random and rarely seen twice the same way. Some upstream bugs... [17:20:25] (03Merged) 10jenkins-bot: Set esams as migrated to the DNS Netbox automation [cookbooks] - 10https://gerrit.wikimedia.org/r/631389 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [17:20:51] (03PS5) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) [17:22:57] !log fdans@deploy1001 Started deploy [analytics/refinery@530b339]: Regular analytics weekly train 530b339 [17:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:16] !log etherpad1002 - attempted to upgrade Etherpad to newer version but wasn't working, reverted to previous one [17:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:31] !log fdans@deploy1001 Finished deploy [analytics/refinery@530b339]: Regular analytics weekly train 530b339 (duration: 01m 34s) [17:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:26:43] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:32:09] (03PS1) 10Cmjohnson: Adding production dns for maps servers [dns] - 10https://gerrit.wikimedia.org/r/631512 (https://phabricator.wikimedia.org/T260269) [17:33:24] (03CR) 10Cmjohnson: [C: 03+2] Adding production dns for maps servers [dns] - 10https://gerrit.wikimedia.org/r/631512 (https://phabricator.wikimedia.org/T260269) (owner: 10Cmjohnson) [17:35:36] !log fdans@deploy1001 Started deploy [analytics/refinery@530b339]: Regular analytics weekly train 530b339 [17:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:18] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [17:49:19] !log fdans@deploy1001 Finished deploy [analytics/refinery@530b339]: Regular analytics weekly train 530b339 (duration: 13m 42s) [17:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:34] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [17:51:53] (03CR) 10Herron: [C: 03+1] hieradata: enable rsyslog kafka queues in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/631439 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [17:57:18] PROBLEM - ensure kvm processes are running on cloudvirt-wdqs1003 
is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:58:50] (03CR) 10Herron: [C: 03+2] admin: update dedcode account attributes and set expiry [puppet] - 10https://gerrit.wikimedia.org/r/631455 (https://phabricator.wikimedia.org/T263692) (owner: 10Herron) [18:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201001T1800). [18:00:04] ebernhardson: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:32] RECOVERY - ensure kvm processes are running on cloudvirt-wdqs1003 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:01:30] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (10herron) [18:01:49] \o [18:01:53] i can ship [18:03:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:04:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:05:53] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (10herron) 05Open→03Resolved a:03herron The requested access has been enabled and will become active within the next 30 minutes. I'... [18:06:45] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Please replace Shannon Baileys SSH key - https://phabricator.wikimedia.org/T264127 (10herron) Is there another host in production where you have working access? Placing a file there would work too, just let me know where to check. Otherwise we can fi... [18:07:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10Maps, 10Patch-For-Review: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 (10Cmjohnson) [18:07:22] (03PS1) 10Dzahn: add etherpad1003.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/631514 [18:07:30] (03CR) 10jerkins-bot: [V: 04-1] add etherpad1003.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/631514 (owner: 10Dzahn) [18:08:11] (03PS2) 10Dzahn: add etherpad1003.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/631514 [18:08:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10Maps, 10Patch-For-Review: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 (10Cmjohnson) These are mostly ready to turn over, the h/w raid has not been setup. I am not sure which raid configuration is needed. @... 
[18:09:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10Cmjohnson) [18:10:33] (03CR) 10Dzahn: [C: 03+2] add etherpad1003.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/631514 (owner: 10Dzahn) [18:10:45] 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) >>! In T263910#6507925, @Joe wrote: > > It seems like ORES right now is operating almost "at-capacity", and the additional traff... [18:10:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10Cmjohnson) @marostegui and @bstorm I will be racking these in the next few days. Can you please review your racking plan and confirm that... [18:11:34] (03PS3) 10Ebernhardson: cirrus: Increase more_like cache from one to three days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631312 (https://phabricator.wikimedia.org/T264053) [18:11:43] (03CR) 10Ebernhardson: [C: 03+2] "backport window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631312 (https://phabricator.wikimedia.org/T264053) (owner: 10Ebernhardson) [18:12:31] (03Merged) 10jenkins-bot: cirrus: Increase more_like cache from one to three days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631312 (https://phabricator.wikimedia.org/T264053) (owner: 10Ebernhardson) [18:12:56] 10Operations: Change urbanecm's SSH production key - https://phabricator.wikimedia.org/T264345 (10Urbanecm) [18:13:38] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Cmjohnson) I am removing the ops-eqiad tag from this task. If you need an on-site task please create a new ticket. 
[18:14:08] (03PS1) 10Urbanecm: admin: Change urbanecm's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/631515 (https://phabricator.wikimedia.org/T264345) [18:14:16] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (10Cmjohnson) 05Open→03Resolved the pdu upgrade has been completed. [18:15:00] thcipriani: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/631469 is merged to gerrit but not merged to deployment. intentional? [18:15:34] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (10Isaac) @herron sorry was a bit late but the patch looks good to me. thanks! [18:16:26] 10Operations, 10ops-eqiad, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 (10wiki_willy) Hi @Cmjohnson - I think @RKemper might be the owner of these machines: >>! In T260269#6509877, @Cmjohnson wrote: > These are mostly ready to t... [18:19:08] liw: looks like you merged https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/631469 a few hours ago, but it doesn't seem to be deployed. Should it be? [18:22:00] 10Operations, 10Patch-For-Review: Change urbanecm's SSH production key - https://phabricator.wikimedia.org/T264345 (10Urbanecm) Verification of key authenticity: * Above uploaded patch comes from my Gerrit account * Signed version of the SSH key was uploaded to bast2002:/home/urbanecm/id_ed25519_wmnet_2020100... 
[18:23:28] 10Operations, 10ops-eqiad, 10Analytics-Radar: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (10Cmjohnson) the dell tech came today and replaced the board but did not bring new power supplies...anyway, swapped the board, and the power supplies still burned up [18:23:58] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Change urbanecm's SSH production key - https://phabricator.wikimedia.org/T264345 (10herron) [18:25:36] liw: thcipriani: since it's not deployed safest thing seems to revert the undeployed revert, meaning scap plugins stay dropped afaict [18:27:18] (03PS1) 10Ebernhardson: Revert "Revert "Drop scap plugins, moved into scap proper"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631472 [18:27:56] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Please replace Shannon Baileys SSH key - https://phabricator.wikimedia.org/T264127 (10Urbanecm) @Sbailey Note the right way to connect to bast1002 is `ssh bast1002.wikimedia.org`. That seems to be the reason why it failed for you. Everyone with any kin... 
[18:28:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10Cmjohnson) [18:28:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10Cmjohnson) [18:28:11] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Change urbanecm's SSH production key - https://phabricator.wikimedia.org/T264345 (10herron) The file bast2002:/home/urbanecm/id_ed25519_wmnet_20201001.pub.sig does indeed match the key in the description, and on the patch [18:28:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10Cmjohnson) 05Open→03Resolved updated em0 for both...resolving [18:29:03] ebernhardson: AFAIK this patch needs to be only fetched, it does not need to be synced [18:29:05] (03CR) 10Herron: [C: 03+1] "lgtm, verification details on task" [puppet] - 10https://gerrit.wikimedia.org/r/631515 (https://phabricator.wikimedia.org/T264345) (owner: 10Urbanecm) [18:30:04] Urbanecm: i suppose it won't affect the fleet, but it does change scap in ways that aren't clear to me. [18:30:38] ebernhardson: I think you should not merge the revert, as doing so would break production again [18:31:00] Urbanecm: since it's not deployed to the deploy host, that would imply production is currently broken? [18:31:17] (03CR) 10Urbanecm: [C: 04-2] "this would break scap in production again 😊." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631472 (owner: 10Ebernhardson) [18:31:36] ebernhardson: well, yes, right now, scap update-wikiversions won't work [18:31:55] (03CR) 10Ebernhardson: "This isn't deployed to the deploy host. 
It's not clear how that breaks anything." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631472 (owner: 10Ebernhardson) [18:33:02] Urbanecm: so, how does reverting it break production? That means it's currently broken [18:33:34] ebernhardson: ah, that's what you mean. Well, it doesn't break it _more_, right [18:33:43] Basically, my premise is that only deployed or about to be deployed code is allowed to be merged to mediawiki-config. If someone merged it and didn't deploy it, it must be reverted to match reality [18:35:30] ebernhardson: I have just run git rebase on the deploy host. This patch only affects the deployment host, and as such, it doesn't need to be synced (in fact, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/631469 wasn't synced either, but it was fetched) [18:35:42] I think you can now just sync your patch [18:36:03] (though yes, you're right, liw should've fetched it after merging :)) [18:36:26] (03Abandoned) 10Ebernhardson: Revert "Revert "Drop scap plugins, moved into scap proper"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631472 (owner: 10Ebernhardson) [18:36:49] ok, that works i suppose. 
I'm not comfortable putting code on deployment that i'm wholly unfamiliar with, but you seem to know more :) [18:37:10] sure, i totally understand :) [18:40:17] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: cirrus: increase more_like recommendation cache from one to three days T264053 (duration: 00m 59s) [18:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:25] T264053: Unsustainable increases in Elasticsearch cluster disk IO - https://phabricator.wikimedia.org/T264053 [18:44:23] (03PS6) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) [18:45:38] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1033 psu redundancy alert - https://phabricator.wikimedia.org/T263145 (10Cmjohnson) I submitted a ticket with Dell for a new power supply You have successfully submitted request SR1038434287. [18:46:27] (03PS7) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) [18:48:39] (03PS8) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) [18:52:53] (03CR) 10Ssingh: "OK so here's a recap: it seems like 0.0.0.0/0 is not considered a valid Stdlib::IP::Address address. 
This is confirmed by https://puppet-c" [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) (owner: 10Ssingh) [18:59:07] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10Cmjohnson) [18:59:11] 10Operations, 10ops-eqiad, 10decommission-hardware: Decommission asw2-a5-eqiad - https://phabricator.wikimedia.org/T217383 (10Cmjohnson) 05Open→03Resolved wiped and removed from the rack [19:00:04] twentyafterfour and hashar: #bothumor I � Unicode. All rise for Mediawiki train - American+European Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201001T1900). [19:01:02] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Cmjohnson) 05Open→03Resolved a:05Jclark-ctr→03Cmjohnson Resolving this task, the server has been sent for decommis... [19:01:43] 10Operations, 10ops-eqiad, 10DBA, 10netops, and 2 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10Cmjohnson) 05Open→03Resolved This has been completed [19:03:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10wiki_willy) @Andrew - just a heads up, I talked to our Dell rep about using that credit/seed server for you guys in Q3, si... [19:03:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:03:51] (03CR) 10Fdans: [C: 03+1] "Tables created, dirs created, job restarted, we're good to merge this." 
[puppet] - 10https://gerrit.wikimedia.org/r/629070 (https://phabricator.wikimedia.org/T258047) (owner: 10Joal) [19:03:59] Pchelolo: will you be available to help debug if anything else goes wrong with wmf.11 rollout? Still trying to decide if the train is unblocked [19:04:54] twentyafterfour: I'm around yes. I'm porting Krinkle's hack from wmf.10 to wmf.11 as well just to be sure [19:05:56] Pchelolo: cool, should I wait for that? [19:06:04] yes please. [19:06:10] ok thanks! [19:08:25] twentyafterfour: ok, first, we need to merge the rollback of the faulty ParserCache patch: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/631240 [19:08:36] can I do it now? [19:10:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:11:06] (03PS2) 10Ppchelko: Revert "Revert "Revert "Hard deprecate all public properties in CacheTime and ParserOutput""" [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631240 (https://phabricator.wikimedia.org/T264257) [19:12:07] (03PS2) 10Urbanecm: admin: Change urbanecm's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/631515 (https://phabricator.wikimedia.org/T264345) [19:13:01] (03CR) 10Ppchelko: "This change is ready for review." [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631473 (https://phabricator.wikimedia.org/T264257) (owner: 10Ppchelko) [19:13:39] twentyafterfour: ok. 
this is the chain of patches that moves wmf.11 to the same exact state as wmf.10 and master for the parserCache bug: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/631240/2 and https://gerrit.wikimedia.org/r/c/mediawiki/core/+/631473/2 [19:23:32] twentyafterfour: the backports of fixes are created: T264257#6510172 Jenkins doesn't seem to be picking them up though [19:23:33] T264257: Fix ParserOutput corruption wmf.10 -> wmf.11 - https://phabricator.wikimedia.org/T264257 [19:23:48] Pchelolo: checking [19:24:18] oh, no, I've looked at wrong queue [19:25:13] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10wiki_willy) @Cmjohnson - I just asked John and he says these are in shipping. Thanks Willy [19:25:34] (03CR) 1020after4: [C: 03+2] "hopefully unblocking the train" [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631240 (https://phabricator.wikimedia.org/T264257) (owner: 10Ppchelko) [19:27:21] (03CR) 1020after4: [C: 03+2] "hopefully unblocking the train." 
[core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631473 (https://phabricator.wikimedia.org/T264257) (owner: 10Ppchelko) [19:32:40] 10Operations, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx1001 & frdata1002 - https://phabricator.wikimedia.org/T260181 (10Jgreen) [19:46:24] (03PS1) 10Jgreen: add frmx1001.wikimedia.org A/PTR records [dns] - 10https://gerrit.wikimedia.org/r/631521 (https://phabricator.wikimedia.org/T257245) [19:48:28] (03CR) 10Jgreen: [C: 03+2] add frmx1001.wikimedia.org A/PTR records [dns] - 10https://gerrit.wikimedia.org/r/631521 (https://phabricator.wikimedia.org/T257245) (owner: 10Jgreen) [19:49:43] (03PS1) 10Dzahn: import IP address types from stdlib 5.2 to fix CIDR matching [puppet] - 10https://gerrit.wikimedia.org/r/631522 [19:49:50] oh wow [19:50:10] (03Merged) 10jenkins-bot: Revert "Revert "Revert "Hard deprecate all public properties in CacheTime and ParserOutput""" [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631240 (https://phabricator.wikimedia.org/T264257) (owner: 10Ppchelko) [19:50:16] (03Merged) 10jenkins-bot: HACK/ParserCache: Force cache-miss if mUsedOptions is undefined [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631473 (https://phabricator.wikimedia.org/T264257) (owner: 10Ppchelko) [19:50:57] gerrit is not very good at highlighting 1-char diffs :P [19:51:24] oh, they removed the [] around the isolated 2's [19:51:34] sorry gerrit, not actually your fault! 
[19:52:13] (03CR) 10BBlack: [C: 03+1] import IP address types from stdlib 5.2 to fix CIDR matching [puppet] - 10https://gerrit.wikimedia.org/r/631522 (owner: 10Dzahn) [19:52:31] (03PS2) 10Dzahn: import IP address type patterns from stdlib 5.2 [puppet] - 10https://gerrit.wikimedia.org/r/631522 [19:52:58] (03PS3) 10Dzahn: import IP address type patterns from stdlib 5.2 [puppet] - 10https://gerrit.wikimedia.org/r/631522 [19:55:01] 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Clarakosi) @Pchelolo and I looked at the logs and investigated the most recent incident reported by @SamW... [19:56:55] (03CR) 10Dzahn: [C: 03+2] import IP address type patterns from stdlib 5.2 [puppet] - 10https://gerrit.wikimedia.org/r/631522 (owner: 10Dzahn) [19:58:01] (03PS9) 10Dzahn: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) (owner: 10Ssingh) [19:58:13] (03PS10) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) [19:58:25] oops [20:01:09] ohh patches merged... [20:04:58] twentyafterfour: see backlog above around: [20:04:59] 18:33 < ebernhardson> Basically, my premise is that only deployed or about to be deployed code is allowed to be merged to mediawiki-config. If someone merged it and didn't deploy it, it [20:05:03] must be reverted to match reality [20:05:04] oops, bad paste.. but you get it [20:05:06] 18:35 < Urbanecm> ebernhardson: I have just run git rebase at deploy host. 
This patch only affects the deployment host, and as such, it doesn't need to be synced (in fact, [20:05:10] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/631469 wasn't synced too, but it was fetched) [20:05:46] * Urbanecm wonders how is that backlog related :) [20:06:02] uh it just merged a minute ago and it's the train window, during the train window I own mediawiki-staging [20:07:24] when even was that? hmm [20:08:11] that was before train window, I don't think it's an issue anymore is it? [20:08:13] !log twentyafterfour@deploy1001 Synchronized php-1.36.0-wmf.11/includes/parser/: sync ParserCache patches to unblock the train T264257 T263177 (duration: 00m 59s) [20:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:21] T263177: 1.36.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T263177 [20:08:21] T264257: Fix ParserOutput corruption wmf.10 -> wmf.11 - https://phabricator.wikimedia.org/T264257 [20:08:59] twentyafterfour: no, I git fetched it, verified scap knows all the plugins, and the backpot window was unblocked [20:09:21] (03PS11) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) [20:09:31] sorry about that but yeah that one had nothing to deploy so nobody thought to deploy it [20:09:49] (actually, deploying depended on that one merging) [20:12:22] RECOVERY - cassandra-a CQL 10.64.16.180:9042 on restbase1029 is OK: TCP OK - 0.000 second response time on 10.64.16.180 port 9042 https://phabricator.wikimedia.org/T93886 [20:12:52] ok Pchelolo still around? I'm gonna roll forward starting with group0 [20:12:59] I'm around twentyafterfour [20:13:28] anything in particular I need to watch out for? [20:14:08] I guess prod errors [20:14:50] nothing specific... 
the last parsercache bug manifested on a rollback [20:15:08] (03PS1) 1020after4: group0 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631524 [20:15:10] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631524 (owner: 1020after4) [20:15:56] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631524 (owner: 1020after4) [20:16:14] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:16:53] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) [20:16:56] 10Operations, 10ops-eqiad, 10decommission-hardware: Decommission asw2-a5-eqiad - https://phabricator.wikimedia.org/T217383 (10ayounsi) 05Resolved→03Open From the task description: > Update Netbox (inc. rename asw3-a5 to asw2-a5) Console still shows as connected as well: https://netbox.wikimedia.org/dcim... [20:18:45] 10Operations, 10ops-eqiad, 10DBA, 10netops, and 2 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10ayounsi) 05Resolved→03Open From the task description: > [DCops] Update Netbox At least the status and name are incorrect (should be asw2-d4 for consistency) > [D... [20:19:30] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.11 [20:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:06] ok so far so good [20:20:51] log re-deployed 1.36.0-wmf.11 to group0 wikis. Moving to group1 soon T263177 [20:20:52] T263177: 1.36.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T263177 [20:23:05] twentyafterfour anything I can do to test? 
[20:24:23] DannyS712: maybe not on group0, I think the real havoc happens on group1 and/or group2 [20:25:13] I guess I'm gonna go ahead with group1 because I can't really detect any changes in the error rate on group0 [20:25:43] DannyS712: I guess just look for general breakage on mediawiki.org or testwikis? [20:26:54] I'll go poke around mw.org [20:27:30] (03PS12) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) [20:28:56] (03PS1) 10Dwisehaupt: Shift payments to codfw for testing 1_35 [dns] - 10https://gerrit.wikimedia.org/r/631526 (https://phabricator.wikimedia.org/T254298) [20:29:07] 10Operations, 10serviceops, 10Patch-For-Review, 10User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (10jijiki) I installed memcached on a mw2271 appserver and configured mcrouter as above. This experiment was surely more representative since this i... [20:29:57] (03CR) 10Jgreen: [C: 03+2] Shift payments to codfw for testing 1_35 [dns] - 10https://gerrit.wikimedia.org/r/631526 (https://phabricator.wikimedia.org/T254298) (owner: 10Dwisehaupt) [20:30:26] (03PS1) 1020after4: group1 wikis to 1.36.0-wmf.11 refs T263177 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631527 [20:30:28] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.36.0-wmf.11 refs T263177 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631527 (owner: 1020after4) [20:31:10] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.11 refs T263177 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631527 (owner: 1020after4) [20:31:42] ok here goes [20:32:13] * Urbanecm sometimes wonders why Commons/Wikidata is group1 [20:32:28] Urbanecm: as apposed to group2? 
[20:32:34] yup [20:32:43] they're by far more important than most group2 wikis [20:33:14] (03PS1) 10Jgreen: Revert "Shift payments to codfw for testing 1_35" [dns] - 10https://gerrit.wikimedia.org/r/631475 [20:33:21] yeah but the time difference between us train and most of the wikidata dev team means that they wouldn't have a chance to fix things before we get into no-deploy days [20:33:27] (03PS13) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) [20:33:33] (03CR) 10Jgreen: [C: 03+2] Revert "Shift payments to codfw for testing 1_35" [dns] - 10https://gerrit.wikimedia.org/r/631475 (owner: 10Jgreen) [20:33:37] so every bug would have to be fixed in the next week's train [20:33:50] (03CR) 10Jgreen: [V: 03+2 C: 03+2] Revert "Shift payments to codfw for testing 1_35" [dns] - 10https://gerrit.wikimedia.org/r/631475 (owner: 10Jgreen) [20:34:03] ok kibana fail [20:34:09] getting "no results" for some reason [20:34:53] twentyafterfour: what's your query? [20:35:02] https://logstash.wikimedia.org/goto/3b47bc012bb06855fa2765a31c3e6ac0 works for me [20:35:32] just loading the mediawiki-NEW-errors dashboard. it worked finally on the 3rd try [20:35:54] (03PS1) 10Krinkle: docroot: add foundation.wikimedia.org docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631529 [20:35:55] I think kibana is too heavy weight for my "backup ISP" which is slow (primary ISP is broken currently ) [20:36:15] when mediacom cable is broken you get to wait 1 week for them to fix ti [20:36:36] :/ [20:36:45] (03PS2) 10Krinkle: docroot: add foundation.wikimedia.org docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631529 (https://phabricator.wikimedia.org/T261531) [20:36:54] new-errors works for me too fwiw [20:36:55] RevisionStore.php: Failed to load data blob from tt:9375723: Bad data in text row 9375723. Use findBadBlobs. [20:37:22] is this anything to worry about? 
it's for wmf.10 so presumably unrelated to train [20:37:26] but it wasn't there before [20:37:43] this is also new: ErrorException from line 421 of /srv/mediawiki/php-1.36.0-wmf.10/includes/api/ApiHelp.php: PHP Warning: count(): Parameter must be an array or an object that implements Countable [20:38:51] is there a trace with the parameter that is failing? [20:38:54] (03PS1) 10Krinkle: foundation.wikimedia.org: Add .well-known/matrix/server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631530 (https://phabricator.wikimedia.org/T261531) [20:39:41] DannyS712: https://www.irccloud.com/pastebin/j4aoTt23/ [20:40:32] that doesn't include the parameter unfortunately, but I'll try to look into it [20:40:41] table 'metawiki.localuser' doesn't exist [20:40:50] twentyafterfour: it is not supposed to exist [20:40:53] that table is from centralauth [20:41:11] (03PS14) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) [20:41:13] (03CR) 10Krinkle: [C: 03+2] docroot: add foundation.wikimedia.org docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631529 (https://phabricator.wikimedia.org/T261531) (owner: 10Krinkle) [20:41:15] (03CR) 10Krinkle: [C: 03+2] foundation.wikimedia.org: Add .well-known/matrix/server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631530 (https://phabricator.wikimedia.org/T261531) (owner: 10Krinkle) [20:41:16] well why is it looking for it? [20:41:23] twentyafterfour: idk, I'm looking into it [20:41:31] Function: CentralAuthUser::loadAttached [20:41:33] Query: SELECT lu_wiki FROM `localuser` WHERE lu_name = '172.16.5.48' [20:41:40] oops maybe I shouldn't [20:41:44] have pasted that ip :( [20:41:49] * Krinkle cancels [20:41:55] it's internal anyway?
[20:41:58] (03CR) 10jerkins-bot: [V: 04-1] foundation.wikimedia.org: Add .well-known/matrix/server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631530 (https://phabricator.wikimedia.org/T261531) (owner: 10Krinkle) [20:42:00] oh [20:42:00] localuser is a known issue for years [20:42:09] connections get mixed up [20:42:21] nothing new unless very frequent [20:42:21] so ignore? [20:42:23] twentyafterfour: one question, you said earlier "deploying depended on that one merging". The part I'm failing to understand is how anything deployed if the code never made it to the deployment host, are there other hosts we scap deploy from? [20:42:28] no not frequent [20:42:44] https://phabricator.wikimedia.org/T193565 [20:42:50] yeah ignore [20:42:57] ebernhardson: probably only train deployments [20:43:05] stuff releng does but backports don't use it [20:43:18] ahh, ok [20:43:24] I deployed by manually pulling the script out of git and putting it in my home directory under ~/.scap/plugins [20:44:22] ok going ahead with group1 then?
[20:45:03] all the errors currently seem to be wmf.10 still [20:45:17] (03PS15) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) [20:45:43] pulling the trigger [20:46:52] (03PS16) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) [20:47:09] and group1 is syncing apaches [20:47:16] looks all clear still [20:47:18] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.11 refs T263177 [20:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:25] T263177: 1.36.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T263177 [20:47:57] uh [20:48:01] LogicException from line 16 of /srv/mediawiki/php-1.36.0-wmf.11/includes/libs/NonSerializableTrait.php: Instances of User are not serializable! [20:48:25] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.11 refs T263177 (duration: 01m 06s) [20:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:42] already 7 of those logic exceptions [20:49:00] awww... that's Daniel's patch: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/625612 [20:49:14] should I be rolling back? [20:49:36] twentyafterfour where is it being serialized? Is there a trace available?
[20:49:57] DannyS712: https://www.irccloud.com/pastebin/PAw646wk/ [20:50:06] https://phabricator.wikimedia.org/T264363 [20:50:07] also the "ErrorException from line 421 of /srv/mediawiki/php-1.36.0-wmf.10/includes/api/ApiHelp.php: PHP Warning: count(): Parameter must be an array or an object that implements Countable" probably isn't new, but patch at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/631476 [20:50:41] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Please replace Shannon Baileys SSH key - https://phabricator.wikimedia.org/T264127 (10Sbailey) I cannot access bast1002 using ssh bast1002.wikimedia.org Keeps asking for Password: which I do not have. [20:51:00] just +2'ed that patch [20:51:06] cool [20:52:30] (03PS17) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) [20:52:32] DannyS712: ad non-serialization issue of the User object, removing use NonSerializableTrait; seems to be a working stop-gap solution [20:52:50] as it seems to happen a lot [20:53:09] yeah :-/ [20:53:49] (03CR) 10Ssingh: "Finally ready for review:" [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) (owner: 10Ssingh) [20:54:13] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020): CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10RLazarus) [20:54:46] but if those objects shouldn't be serialized then that doesn't really address the issue, right? [20:54:51] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10RLazarus) October 27 confirmed, and I just filed T264364. Meet you over there. :) [21:00:08] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Please replace Shannon Baileys SSH key - https://phabricator.wikimedia.org/T264127 (10Urbanecm) >>! 
In T264127#6510431, @Sbailey wrote: > I cannot access bast1002 using ssh bast1002.wikimedia.org > Keeps asking for Password: which I do not have. I r... [21:01:03] (03PS3) 10Krinkle: docroot: expand foundation.wikimedia.org docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631529 (https://phabricator.wikimedia.org/T261531) [21:01:05] (03PS2) 10Krinkle: foundation.wikimedia.org: Add .well-known/matrix/server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631530 (https://phabricator.wikimedia.org/T261531) [21:02:49] twentyafterfour: the new trait is making sure they are not serialized, but they were serialized before, so removing the trait will fix it [21:03:23] but them getting serialized is also an issue, right? [21:03:47] I see it was causing other errors with objects unable to be unserialized? [21:04:00] seems almost better to error at the serialize stage than the unserialize stage [21:04:20] but yeah for now we'll just paper over it ;) [21:06:04] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 116 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:06:41] uhm I can't easily submit a patch for mediawiki/core because my internet connection is too damn slow to pull the repo [21:07:14] anyone mind submitting a removal of NonSerializableTrait from /includes/user/User.php [21:07:36] submitting? I can +2 the patch [21:07:50] oh wait did I miss it? 
[21:07:55] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/631536 [21:08:02] twentyafterfour: sorry, my verbosity is low: https://gerrit.wikimedia.org/r/631536 [21:08:33] thanks [21:09:23] (03PS1) 1020after4: Remove NonSerializableTrait from User object [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631477 (https://phabricator.wikimedia.org/T264363) [21:09:56] * twentyafterfour is running on less than enough sleep and more than enough train-deployment stress so my brain is at like 80% right now [21:10:30] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Johan) [21:10:32] (03PS2) 10Dbarratt: SpecialInvestigateBlock: Don't assume 'DisableUTEdit' exists [extensions/CheckUser] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631466 (https://phabricator.wikimedia.org/T264302) (owner: 10Jforrester) [21:10:36] apparently caffeine alone isn't enough to sustain [21:11:19] (03CR) 1020after4: [C: 03+2] "unblock teh train, plz and thank you" [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631477 (https://phabricator.wikimedia.org/T264363) (owner: 1020after4) [21:11:43] 10Operations, 10ops-eqiad, 10DBA, 10netops, and 2 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10wiki_willy) Related to Arzhel's previous comment, getting these Netbox errors: test_missing_assets_from_accounting asw3-d4-eqiad Device with s/n TA3716160376 (WMF542... [21:11:52] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Johan) Whoever ends up handling this on the community side:: I'll add this to next week's Tech News and /44, so you don't have to worry ab... [21:12:04] twentyafterfour: nice +2 message :) [21:12:36] btw thanks for your help everybody!
I couldn't operate at 80% without you all to back me up,.,.,lol [21:13:47] ok this is gonna take a minute to merge I'm gonna grab some sort of instant food product from the kitchen brb [21:14:22] twentyafterfour: it failed jerkins? [21:14:53] oh, I need to remove _tests_ too [21:15:07] gimme a sec [21:15:53] Hello, metawiki no works: Original exception: [d40d943e-539e-48f4-906b-b68f1572c328] 2020-10-01 21:15:33: Fatal exception of type "LogicException" [21:16:17] Kizule known issue [21:16:31] wait, whole meta? [21:16:40] Yes [21:16:48] When I want to open meta.wikimedia.org it shows me that message [21:16:54] DannyS712: mind +2'ing https://gerrit.wikimedia.org/r/c/mediawiki/core/+/631536/ again? [21:16:58] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10Bstorm) Things look right to me, at least. @Marostegui? [21:17:03] MediaWiki internal error. [21:17:03] Original exception: [c686b32c-3ac4-47ee-ad2c-f6cdb97e80f1] 2020-10-01 21:16:56: Fatal exception of type "LogicException" [21:17:03] Exception caught inside exception handler. [21:17:03] Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information. [21:17:09] confirmed [21:17:29] for Main page, that [21:17:33] Urbanecm done [21:17:35] didn't know this trait thing took down all group2 wikis [21:17:37] *group1 [21:17:38] should I update the cherry pick as well? 
[21:17:38] thanks DannyS712 [21:17:48] (03CR) 10Urbanecm: [C: 04-2] "will fail jerkins" [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631477 (https://phabricator.wikimedia.org/T264363) (owner: 1020after4) [21:17:57] DannyS712: creating a new one [21:17:58] but eg for my discussion page, ok (logged in user) [21:18:00] ugh [21:18:12] the error rate wasn't that high [21:18:16] (03PS2) 10Urbanecm: Remove NonSerializableTrait from User object [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631477 (https://phabricator.wikimedia.org/T264363) (owner: 1020after4) [21:18:48] aha, didn't know cherry-pick updates it [21:18:53] (03CR) 10Urbanecm: [C: 03+2] "unblock train" [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631477 (https://phabricator.wikimedia.org/T264363) (owner: 1020after4) [21:18:56] Wait... When I go on https://meta.wikimedia.org/wiki/User:Kizule , it works.. https://meta.wikimedia.org/wiki/Main_Page no works... [21:19:01] Echo just said I was mentioned in an edit, but https://meta.wikimedia.org/w/index.php?title=Talk:Wikimedia_District_of_Columbia&oldid=prev&diff=20501150&safemode=1 doesn't mention me [21:19:13] Kizule: apergos: it's a known issue in either way [21:19:19] okey dokey [21:19:29] twentyafterfour: +2'ed the backport [21:19:35] (new version, i mean) [21:19:35] ok so what do I need to do rollback? [21:19:43] is the echo bug something known? 
[21:19:49] jenkins takes forever [21:19:54] twentyafterfour: I vote for V+2 [21:20:19] we shouldn't wait on jenkins in this case [21:20:41] (03PS1) 10Dbarratt: SpecialInvestigateBlock: Don't assume 'DisableUTEdit' exists [extensions/CheckUser] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/631478 (https://phabricator.wikimedia.org/T264302) [21:21:10] I say at least wait for the master tests to not fail before syncing/deploying, even if you manually merge before jenkins confirms [21:21:27] I can test on mwdebug if you want (once the patch is staged) [21:21:40] (03CR) 1020after4: [V: 03+2 C: 03+2] Remove NonSerializableTrait from User object [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631477 (https://phabricator.wikimedia.org/T264363) (owner: 1020after4) [21:21:42] ok [21:21:46] Wait wait twentyafterfour see pm [21:22:02] pm where? [21:22:06] oh [21:22:09] I see [21:22:26] (03PS1) 10Andrew Bogott: cloudvirt1018 to Buster/Ceph [puppet] - 10https://gerrit.wikimedia.org/r/631541 (https://phabricator.wikimedia.org/T259399) [21:22:28] what's happening? [21:23:17] uhm maybe need to roll back instead due to what DannyS712 just told me in PM [21:23:33] rollback should also fix this issue [21:23:38] "fix" [21:24:00] meta seems to be really down [21:24:34] rolling back [21:25:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:26:13] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: rollback group1 [21:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:26:30] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [21:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:26] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:41] DannyS712: check pm? [21:29:15] rolling back group0 as well [21:29:28] !log Manually created mediawiki/skins.git REL1_35 at 796693cb7a2ee3191fcbe19769d341bd0530bd4a for T264365 [21:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:33] T264365: mediawiki/extensions and mediawiki/skins missing a REL1_35 branch - https://phabricator.wikimedia.org/T264365 [21:29:41] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 7 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:29:53] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: rollback group0 as well T264363 [21:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:58] T264363: Instances of User are not serializable! - https://phabricator.wikimedia.org/T264363 [21:31:32] ok things appear to be back to normal for the most part [21:31:39] DannyS712: can you confirm? [21:32:47] 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10RBrounley_WMF) We’re working to patch up our end, switching to streams with querying the Ores api when streams fail. 
Sorry will update soon [21:38:03] 10Operations, 10Domains, 10Traffic: URL to redirect to upcoming Wikipedia Birthday page on wikimediafoundation.org - https://phabricator.wikimedia.org/T264367 (10hdothiduc) [21:40:11] (03Merged) 10jenkins-bot: Remove NonSerializableTrait from User object [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631477 (https://phabricator.wikimedia.org/T264363) (owner: 1020after4) [21:41:32] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1018 to Buster/Ceph [puppet] - 10https://gerrit.wikimedia.org/r/631541 (https://phabricator.wikimedia.org/T259399) (owner: 10Andrew Bogott) [21:54:17] (03PS1) 1020after4: rolled everything back to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631543 [21:54:19] (03CR) 1020after4: [C: 03+2] "this brings the repo in sync with the deployed state of the cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631543 (owner: 1020after4) [21:57:50] (03Merged) 10jenkins-bot: rolled everything back to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631543 (owner: 1020after4) [21:58:26] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [21:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:21] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:34] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:03:32] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:03:55] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [22:04:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:39] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [22:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:05] (03PS1) 10Elukey: Set debian buster for stat100[467] [puppet] - 10https://gerrit.wikimedia.org/r/631544 (https://phabricator.wikimedia.org/T255028) [22:18:59] (03PS1) 10Dzahn: Revert "add etherpad1003.eqiad.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/631479 [22:20:34] (03CR) 10Dzahn: [C: 03+2] Revert "add etherpad1003.eqiad.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/631479 (owner: 10Dzahn) [22:20:37] (03PS2) 10Dzahn: Revert "add etherpad1003.eqiad.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/631479 [22:23:43] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [22:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:36] (03CR) 10Tchanders: [C: 03+1] SpecialInvestigateBlock: Don't assume 'DisableUTEdit' exists [extensions/CheckUser] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/631478 (https://phabricator.wikimedia.org/T264302) (owner: 10Dbarratt) [22:35:29] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [22:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:01] !log Manually created mediawiki/extensions.git REL1_35 at 7ab9a74c9ebbb22ad9fb9b7c95c91b7fad8bf8c6 for T264365 [22:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:08] T264365: mediawiki/extensions and mediawiki/skins missing a REL1_35 branch - https://phabricator.wikimedia.org/T264365 [22:54:46] (03CR) 10Dzahn: [V: 03+1 C: 03+2] maps: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/629439 (owner: 10Dzahn) [22:54:52] (03PS7) 10Dzahn: maps: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/629439 [23:00:04] RoanKattouw, Niharika, and Urbanecm: How many 
deployers does it take to do Evening backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201001T2300). [23:00:04] davidwbarratt: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:20] I'm here! [23:06:16] Niharika or Urbanecm ? [23:07:06] 10Operations, 10Traffic: ATS-BE Lua mitigations for cacheable responses w/ Set-Cookie seemingly not working - https://phabricator.wikimedia.org/T264378 (10CDanis) [23:07:21] davidwbarratt: train's currently held - i know we're not terribly strict about this lately in general, but does your patch fall under "simple config changes"? https://wikitech.wikimedia.org/wiki/Deployments/Holding_the_train#What_happens_during_backport_windows_while_the_train_is_on_hold? [23:07:34] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [23:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:42] davidwbarratt: or "emergency fix" [23:10:00] brennen Urbanecm uhh, no? it's a production error, but it only affects a new page (we created) and only on itwiki https://phabricator.wikimedia.org/T264302 [23:10:23] only itwiki because they are the only wiki with that configuration [23:12:29] I can see a case to do it and to not do it [23:12:52] I'm fine with whatever you decide, but if you could comment that on the task that would be super helpful. Also with the next window to deploy the fix [23:13:03] the train is currently rolled back, and any backport can theoretically add a bug [23:13:16] yeah I totes get that [23:13:27] to me, the proposed backport looks simple [23:13:31] brennen: what do you think? [23:15:27] i think in practice it would probably be totally fine, but also it's been a rough week for deploys and we should probably get stricter about this policy as long as we're going to have it. 
i'd say if you can live with it over the weekend, please hold. [23:15:35] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@6101b56]: mjolnir: increase training memory overhead by 10% [23:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:59] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@6101b56]: mjolnir: increase training memory overhead by 10% (duration: 00m 24s) [23:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:31] brennen works for me, I'll leave a comment and cite the policy [23:16:45] thanks davidwbarratt, appreciate it. [23:19:47] no problem. :) [23:25:04] (03PS1) 10Dzahn: re-add etherpad1003 with IP picked by cookbook [dns] - 10https://gerrit.wikimedia.org/r/631551 [23:26:03] (03CR) 10Dzahn: [C: 03+2] re-add etherpad1003 with IP picked by cookbook [dns] - 10https://gerrit.wikimedia.org/r/631551 (owner: 10Dzahn) [23:31:53] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [23:33:06] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [23:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:44] (03PS1) 10Dzahn: DHCP: add MAC for etherpad1003 [puppet] - 10https://gerrit.wikimedia.org/r/631553 [23:37:09] (03CR) 10jerkins-bot: [V: 04-1] DHCP: add MAC for etherpad1003 [puppet] - 10https://gerrit.wikimedia.org/r/631553 (owner: 10Dzahn) [23:38:14] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@6101b56]: mjolnir: increase training memory overhead by 10% [23:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:49] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@6101b56]: mjolnir: increase training memory overhead by 10% (duration: 00m 34s) [23:38:49] 
(03PS2) 10Dzahn: DHCP: add MAC for etherpad1003 [puppet] - 10https://gerrit.wikimedia.org/r/631553 [23:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:36] (03CR) 10Dzahn: [C: 03+2] DHCP: add MAC for etherpad1003 [puppet] - 10https://gerrit.wikimedia.org/r/631553 (owner: 10Dzahn) [23:42:07] (03PS1) 10Dzahn: site: add etherpad1003 with role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/631554 [23:43:34] (03CR) 10Dzahn: [C: 03+2] site: add etherpad1003 with role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/631554 (owner: 10Dzahn) [23:45:03] davidwbarratt: are you done with the window? [23:45:53] would be nice to squeeze in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/607155/ [23:49:14] AaronSchulz: from what was said above (which may have changed in the meantime) the train's held, so the only deployments would be simple config changes or emergency fixes per https://wikitech.wikimedia.org/wiki/Deployments/Holding_the_train#What_happens_during_backport_windows_while_the_train_is_on_hold? [23:49:38] (03PS1) 10Dzahn: ATS/Etherpad: replace backend host name with discovery record [puppet] - 10https://gerrit.wikimedia.org/r/631555 [23:51:28] (03PS1) 10Dzahn: add etherpad.discovery.wmnet, point to etherpad1002 [dns] - 10https://gerrit.wikimedia.org/r/631557 [23:54:17] aye