[00:05:04] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:18] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 6445436616 and 767 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:42] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2655741168 and 173 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:02] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1195484048 and 85 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:46:26] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2712627088 and 172 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:46:50] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 6508073944 and 408 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:46:50] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3323034632 and 208 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:49:14] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3039822696 and 164 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:12] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 151376 and 213 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:42] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 72264 and 243 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:58] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 840 and 258 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:32] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 47088 and 352 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:44] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3656 and 366 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:24] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 24464 and 465 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:32] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 87464 and 474 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:37:10] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:42:08] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:33:02] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:44] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:13:58] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:18:42] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:06] Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Marostegui) @jijiki hosts arrived a few days ago to the DC and were racked and installed past week T267043...
[06:50:39] (PS1) Marostegui: install_server: Remove clouddb1020 from install list [puppet] - https://gerrit.wikimedia.org/r/649125
[06:51:27] (CR) Marostegui: [C: +2] install_server: Remove clouddb1020 from install list [puppet] - https://gerrit.wikimedia.org/r/649125 (owner: Marostegui)
[06:53:56] Operations, ops-eqiad, DBA: Degraded RAID on es1023 - https://phabricator.wikimedia.org/T268796 (Marostegui) RAID back to optimal ` root@es1023:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Sec...
[07:01:31] (CR) Marostegui: [C: +1] analytics-meta: Avoid replication of superset_staging db when running as replica [puppet] - https://gerrit.wikimedia.org/r/647358 (owner: Elukey)
[07:02:25] marostegui: <3
[07:03:55] (PS1) Elukey: cdh::hadoop: change settings for yarn.resourcemanager.webapp.address [puppet] - https://gerrit.wikimedia.org/r/649126 (https://phabricator.wikimedia.org/T269919)
[07:03:59] (CR) Marostegui: "Thanks for fixing this Brooke! <3" [puppet] - https://gerrit.wikimedia.org/r/647419 (https://phabricator.wikimedia.org/T269620) (owner: Bstorm)
[07:12:17] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27124/console" [puppet] - https://gerrit.wikimedia.org/r/649126 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[07:20:04] (PS2) Elukey: cdh::hadoop: change settings for yarn.resourcemanager.webapp.address [puppet] - https://gerrit.wikimedia.org/r/649126 (https://phabricator.wikimedia.org/T269919)
[07:23:53] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27125/console" [puppet] - https://gerrit.wikimedia.org/r/649126 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[07:29:01] (CR) Elukey: [V: +1 C: +2] cdh::hadoop: change settings for yarn.resourcemanager.webapp.address [puppet] - https://gerrit.wikimedia.org/r/649126 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[07:38:26] (PS1) Elukey: role::analytics_test_cluster::client: disable notifications [puppet] - https://gerrit.wikimedia.org/r/649127
[07:39:02] (CR) Elukey: [C: +2] role::analytics_test_cluster::client: disable notifications [puppet] - https://gerrit.wikimedia.org/r/649127 (owner: Elukey)
[07:56:25] (CR) Elukey: [C: +2] analytics-meta: Avoid replication of superset_staging db when running as replica [puppet] - https://gerrit.wikimedia.org/r/647358 (owner: Elukey)
[08:03:04] (CR) Elukey: [C: +2] mjolnir: Migrate hiera() to lookup() [puppet] - https://gerrit.wikimedia.org/r/648349 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:03:41] (CR) Elukey: [C: +2] kafkatee: Migrate hiera() to lookup() and set data type [puppet] - https://gerrit.wikimedia.org/r/648348 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:25:21] hashar: I don't know, um, if it's what you're currently doing, but um, cloud vps is currently spamming a *lot* of emails about puppet failures
[08:26:03] marktraceur: merely fixed up a single instance in the devtools project :]
[08:26:32] OK well I dunno, I'm flagging it then, there's a lot of puppet failure notices coming to my inbox like every minute
[08:26:49] It's *well* past my bedtime so I can't do more than that
[08:26:50] maybe, but which project?
[08:27:00] Seems like deployment-prep
[08:27:06] if it is all of labs, I guess that is an issue with puppet.git maybe due to elukey patch above
[08:27:12] will check
[08:27:19] marktraceur: sleep well and thank you! :]
[08:27:27] Best of luck
[08:27:50] <_joe_> I doubt that's the case, this is just the time when the emails get sent
[08:28:32] <_joe_> that luca's change caused this
[08:28:51] <_joe_> I think it's something that changed in cloud VPS and wasn't fixed in puppet as well
[08:29:18] _joe_: i dunno, i'm perfectly happy to believe that elukey broke stuff, regardless of the facts
[08:31:13] <_joe_> in the case of deployment-prep, the problem seems to be with puppetdb, but I'm not going to debug that further
[08:31:19] <_joe_> definitely *not* my job
[08:31:38] _joe_: kormat: that is the puppetdb on deployment-prep that went oom
[08:31:54] anyway does not seem related to any change made in puppet.git
[08:32:15] sorry I was afk briefly, didn't read what happened
[08:32:51] seems like the email notifications are sent way after the actual failure
[08:33:11] kormat: <3
[08:33:15] <_joe_> it doesn't "seems", that's how it works
[08:34:35] anyway I have fixed it by restarting the puppetdb process
[08:40:13] !log swift eqiad-prod: add weight to ms-be106[0-3] - T268435
[08:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:17] T268435: Add ms-be106[0-3] to swift - https://phabricator.wikimedia.org/T268435
[08:42:39] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/647281 (https://phabricator.wikimedia.org/T269563) (owner: Jbond)
[08:46:47] Operations, SRE-swift-storage, Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (fgiunchedi)
[08:48:53] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:53:39] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:55:46] (PS1) Ladsgroup: Avoid loading the whole item in every client page view [extensions/Wikibase] (wmf/1.36.0-wmf.21) - https://gerrit.wikimedia.org/r/648283 (https://phabricator.wikimedia.org/T269960)
[09:08:53] (CR) JMeybohm: [C: +1] "LGTM" [debs/calico] - https://gerrit.wikimedia.org/r/648363 (owner: Alexandros Kosiaris)
[09:14:21] (CR) Ladsgroup: [C: +2] "Backporting this" [extensions/Wikibase] (wmf/1.36.0-wmf.21) - https://gerrit.wikimedia.org/r/648283 (https://phabricator.wikimedia.org/T269960) (owner: Ladsgroup)
[09:14:33] Backporting this ^
[09:16:03] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:37] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui)
[09:17:55] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:20:55] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:24:52] (CR) jerkins-bot: [V: -1] Avoid loading the whole item in every client page view [extensions/Wikibase] (wmf/1.36.0-wmf.21) - https://gerrit.wikimedia.org/r/648283 (https://phabricator.wikimedia.org/T269960) (owner: Ladsgroup)
[09:28:47] (CR) Jcrespo: [C: +1] icinga::raid_handler: add support for ssacli [puppet] - https://gerrit.wikimedia.org/r/647281 (https://phabricator.wikimedia.org/T269563) (owner: Jbond)
[09:31:06] (PS1) Ladsgroup: Fix some PSR12.Properties.ConstantVisibility.NotFound [extensions/Wikibase] (wmf/1.36.0-wmf.21) - https://gerrit.wikimedia.org/r/648284 (https://phabricator.wikimedia.org/T253169)
[09:32:02] (PS1) Ema: ATS: change connect timeouts [puppet] - https://gerrit.wikimedia.org/r/649278 (https://phabricator.wikimedia.org/T264398)
[09:35:04] (CR) JMeybohm: [V: +2 C: +2] Also ship the following plugins which are included in the release [debs/calico] - https://gerrit.wikimedia.org/r/648363 (owner: Alexandros Kosiaris)
[09:35:49] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:36:48] (PS2) Ema: ATS: change connect timeouts [puppet] - https://gerrit.wikimedia.org/r/649278 (https://phabricator.wikimedia.org/T264398)
[09:37:00] (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/649278 (https://phabricator.wikimedia.org/T264398) (owner: Ema)
[09:37:10] (CR) jerkins-bot: [V: -1] ATS: change connect timeouts [puppet] - https://gerrit.wikimedia.org/r/649278 (https://phabricator.wikimedia.org/T264398) (owner: Ema)
[09:37:56] (PS3) Ema: ATS: change connect timeouts [puppet] - https://gerrit.wikimedia.org/r/649278 (https://phabricator.wikimedia.org/T264398)
[09:41:51] (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/649278 (https://phabricator.wikimedia.org/T264398) (owner: Ema)
[09:44:39] (PS1) Alexandros Kosiaris: k8s-staging-codfw: --service-account-key-file is a public key [puppet] - https://gerrit.wikimedia.org/r/649280
[09:44:41] (PS1) Alexandros Kosiaris: prometheus: Turn on codfw prometheus/k8s-staging [puppet] - https://gerrit.wikimedia.org/r/649281
[09:44:43] (PS1) Marostegui: mariadb: Add codfw x2 hosts [puppet] - https://gerrit.wikimedia.org/r/649282 (https://phabricator.wikimedia.org/T269324)
[09:45:02] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on cloudvirt1024.eqiad.wmnet with reason: T269419
[09:45:02] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on cloudvirt1024.eqiad.wmnet with reason: T269419
[09:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:05] T269419: cloudvirt1024: /srv full 99% - https://phabricator.wikimedia.org/T269419
[09:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:13] (CR) Alexandros Kosiaris: [C: +2] "And we are going for it today. Merging. Thanks for the +1 !" [homer/public] - https://gerrit.wikimedia.org/r/641704 (owner: Alexandros Kosiaris)
[09:47:47] (Merged) jenkins-bot: Add k8s-staging-codfw cluster [homer/public] - https://gerrit.wikimedia.org/r/641704 (owner: Alexandros Kosiaris)
[09:47:50] (PS2) Matthias Mullie: Add Media Search survey [mediawiki-config] - https://gerrit.wikimedia.org/r/640934 (https://phabricator.wikimedia.org/T258419)
[09:48:42] (CR) Matthias Mullie: [C: +1] Add Media Search survey [mediawiki-config] - https://gerrit.wikimedia.org/r/640934 (https://phabricator.wikimedia.org/T258419) (owner: Matthias Mullie)
[09:48:45] (CR) Alexandros Kosiaris: [C: +2] k8s-staging-codfw: --service-account-key-file is a public key [puppet] - https://gerrit.wikimedia.org/r/649280 (owner: Alexandros Kosiaris)
[09:49:37] wow the new downtime IRC log is way more useful than before!
[09:49:44] cc volans
[09:50:43] thx arturo :) It's possible to customize the !log for every cookbook now if they get migrated to the new class API
[09:51:07] (PS4) Ema: ATS: change ats-be connect timeouts [puppet] - https://gerrit.wikimedia.org/r/649278 (https://phabricator.wikimedia.org/T264398)
[09:51:08] awesome
[09:51:10] volans: <3
[09:51:12] !log swift codfw-prod: more weight to ms-be20[58-61] - T269337
[09:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:15] T269337: Add ms-be20[58-61] to swift - https://phabricator.wikimedia.org/T269337
[09:51:33] (CR) Vgutierrez: [C: +1] ATS: change ats-be connect timeouts [puppet] - https://gerrit.wikimedia.org/r/649278 (https://phabricator.wikimedia.org/T264398) (owner: Ema)
[09:51:44] (PS2) Alexandros Kosiaris: prometheus: Turn on codfw prometheus/k8s-staging [puppet] - https://gerrit.wikimedia.org/r/649281
[09:51:46] (PS1) Alexandros Kosiaris: Fix typo with k8s-staging-cdofw ServiceAccount [puppet] - https://gerrit.wikimedia.org/r/649283
[09:52:20] (CR) Alexandros Kosiaris: [C: +2] Fix typo with k8s-staging-cdofw ServiceAccount [puppet] - https://gerrit.wikimedia.org/r/649283 (owner: Alexandros Kosiaris)
[09:52:56] (CR) Ema: [C: +2] ATS: change ats-be connect timeouts [puppet] - https://gerrit.wikimedia.org/r/649278 (https://phabricator.wikimedia.org/T264398) (owner: Ema)
[09:53:02] (CR) Ladsgroup: [V: +2 C: +2] "The failure is unrelated. Tested on https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-noselenium-docker/53253/console" [extensions/Wikibase] (wmf/1.36.0-wmf.21) - https://gerrit.wikimedia.org/r/648283 (https://phabricator.wikimedia.org/T269960) (owner: Ladsgroup)
[09:56:47] (CR) jerkins-bot: [V: -1] Fix some PSR12.Properties.ConstantVisibility.NotFound [extensions/Wikibase] (wmf/1.36.0-wmf.21) - https://gerrit.wikimedia.org/r/648284 (https://phabricator.wikimedia.org/T253169) (owner: Ladsgroup)
[09:58:01] Operations, SRE-tools, serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (Volans) p:Triage→Medium
[09:58:41] (CR) Jbond: [C: +2] "LGTM will merge" [puppet] - https://gerrit.wikimedia.org/r/648321 (owner: Legoktm)
[09:58:42] confirming it's working as intended https://performance.wikimedia.org/xhgui/run/symbol?id=5fd736ac7a51fa04264ab55f&symbol=Wikibase%5CClient%5CHooks%5CSkinAfterBottomScriptsHandler%3A%3AgetDescription
[09:59:57] (Abandoned) Ladsgroup: Fix some PSR12.Properties.ConstantVisibility.NotFound [extensions/Wikibase] (wmf/1.36.0-wmf.21) - https://gerrit.wikimedia.org/r/648284 (https://phabricator.wikimedia.org/T253169) (owner: Ladsgroup)
[10:01:53] (CR) Jbond: "thanx ill see if i can add a check for this in CI" [puppet] - https://gerrit.wikimedia.org/r/648970 (https://phabricator.wikimedia.org/T269930) (owner: ArielGlenn)
[10:03:12] Puppet, User-jbond: Add check for ssh key type in admin module CI - https://phabricator.wikimedia.org/T270073 (jbond) p:Triage→Medium
[10:03:13] !log ladsgroup@deploy1001 Scap failed!: 4/9 canaries failed their endpoint checks(https://en.wikipedia.org)
[10:03:13] this will cause a jump in errors during the sync
[10:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:15] ugh
[10:05:06] (CR) Kormat: "Do these need to be added to conftool-data/dbconfig-instance/instances.yaml yet?" [puppet] - https://gerrit.wikimedia.org/r/649282 (https://phabricator.wikimedia.org/T269324) (owner: Marostegui)
[10:06:12] (CR) Marostegui: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/649282 (https://phabricator.wikimedia.org/T269324) (owner: Marostegui)
[10:06:52] (CR) Kormat: [C: +1] "> I am not handling that yet, as I prefer to set up, the hosts, replication, grants etc before even touching dbctl" [puppet] - https://gerrit.wikimedia.org/r/649282 (https://phabricator.wikimedia.org/T269324) (owner: Marostegui)
[10:07:11] (PS1) Alexandros Kosiaris: Specify k8s-stage codfw AS number [homer/public] - https://gerrit.wikimedia.org/r/649286
[10:07:19] Operations, Technical-blog-posts, Traffic: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T270074 (ema)
[10:07:35] I'm --forcing it, can't find a way to deploy it
[10:07:43] without a bit of noise
[10:07:58] (CR) Alexandros Kosiaris: [C: +2] Specify k8s-stage codfw AS number [homer/public] - https://gerrit.wikimedia.org/r/649286 (owner: Alexandros Kosiaris)
[10:08:19] (PS4) Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: collapse ssh calls for same package checks [puppet] - https://gerrit.wikimedia.org/r/648211
[10:08:25] (Merged) jenkins-bot: Specify k8s-stage codfw AS number [homer/public] - https://gerrit.wikimedia.org/r/649286 (owner: Alexandros Kosiaris)
[10:08:39] (CR) Jbond: "LGTM also added a question" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/649077 (owner: ArielGlenn)
[10:09:14] (CR) Marostegui: [C: +2] mariadb: Add codfw x2 hosts [puppet] - https://gerrit.wikimedia.org/r/649282 (https://phabricator.wikimedia.org/T269324) (owner: Marostegui)
[10:09:41] (CR) Jbond: [C: +2] icinga::raid_handler: add support for ssacli [puppet] - https://gerrit.wikimedia.org/r/647281 (https://phabricator.wikimedia.org/T269563) (owner: Jbond)
[10:10:39] Operations, SRE-tools, observability: HP RAID failed on ms-be1054 didn't open a task - https://phabricator.wikimedia.org/T269563 (jbond) I have merged a patch which should fix this but it is untested
[10:12:22] I committed the revert, in case I need to revert it right away
[10:12:52] !log ladsgroup@deploy1001 Synchronized php-1.36.0-wmf.21/extensions/Wikibase/client/includes: [[gerrit:648283|Avoid loading the whole item in every client page view (T269960)]] (duration: 00m 25s)
[10:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:55] T269960: Schema properties in client code loads the whole item in every page view - https://phabricator.wikimedia.org/T269960
[10:15:12] (PS1) Alexandros Kosiaris: calico: Add network-admin RBAC config [deployment-charts] - https://gerrit.wikimedia.org/r/649287
[10:15:14] (PS1) Alexandros Kosiaris: calico/typha: 1 replica is enough as a default [deployment-charts] - https://gerrit.wikimedia.org/r/649288
[10:15:41] (CR) jerkins-bot: [V: -1] calico: Add network-admin RBAC config [deployment-charts] - https://gerrit.wikimedia.org/r/649287 (owner: Alexandros Kosiaris)
[10:15:43] (CR) jerkins-bot: [V: -1] calico/typha: 1 replica is enough as a default [deployment-charts] - https://gerrit.wikimedia.org/r/649288 (owner: Alexandros Kosiaris)
[10:16:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2131 to clone db2142', diff saved to https://phabricator.wikimedia.org/P13542 and previous config saved to /var/cache/conftool/dbconfig/20201214-101611-marostegui.json
[10:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:23] (PS1) Elukey: profile::swap: allow to override the kerberos credential cache location [puppet] - https://gerrit.wikimedia.org/r/649289
[10:16:45] (PS1) Alexandros Kosiaris: Adding AS64604 policy rules [homer/public] - https://gerrit.wikimedia.org/r/649290
[10:17:34] !log Stop mysql on db2131 to clone db2142
[10:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:26] (CR) Alexandros Kosiaris: [C: +2] Adding AS64604 policy rules [homer/public] - https://gerrit.wikimedia.org/r/649290 (owner: Alexandros Kosiaris)
[10:20:54] (Merged) jenkins-bot: Adding AS64604 policy rules [homer/public] - https://gerrit.wikimedia.org/r/649290 (owner: Alexandros Kosiaris)
[10:21:27] (PS2) Alexandros Kosiaris: calico: Add network-admin RBAC config [deployment-charts] - https://gerrit.wikimedia.org/r/649287
[10:21:29] (PS2) Alexandros Kosiaris: calico/typha: 1 replica is enough as a default [deployment-charts] - https://gerrit.wikimedia.org/r/649288
[10:22:13] Operations, Prod-Kubernetes, serviceops, Kubernetes, User-fsero: Upgrade Calico - https://phabricator.wikimedia.org/T207804 (JMeybohm)
[10:25:01] (CR) Filippo Giunchedi: "+Stevie as the author of "reuse" partman" [puppet] - https://gerrit.wikimedia.org/r/647815 (https://phabricator.wikimedia.org/T266199) (owner: Bstorm)
[10:25:33] godog: feeling adventurous i see :)
[10:25:45] (PS2) Elukey: profile::swap: allow to override the kerberos credential cache location [puppet] - https://gerrit.wikimedia.org/r/649289
[10:25:49] ahahhh
[10:26:05] godog: ah, it's bstorm's CR
[10:26:35] feeling adventurous for others is the best feeling :-P
[10:26:43] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/648193 (https://phabricator.wikimedia.org/T244335) (owner: JMeybohm)
[10:27:21] kormat: let's find a way to nerd-snipe Riccardo this morning, to have a good start of the week
[10:27:45] elukey: ideally with something that will keep him busy until i come back in January ;)
[10:27:49] (CR) Filippo Giunchedi: [C: -1] "LVM LVs and filesystems need to be created on prometheus codfw hosts before enabling the instance" [puppet] - https://gerrit.wikimedia.org/r/648192 (https://phabricator.wikimedia.org/T244335) (owner: JMeybohm)
[10:28:21] kormat: yes it really is :D
[10:28:36] (PS3) Elukey: profile::swap: allow to override the kerberos credential cache location [puppet] - https://gerrit.wikimedia.org/r/649289
[10:32:12] !log Adding kubernetes codfw staging cluster configuration to cr*-codfw
[10:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:08] (PS4) Elukey: profile::swap: allow to override the kerberos credential cache location [puppet] - https://gerrit.wikimedia.org/r/649289
[10:33:38] (PS5) Elukey: profile::swap: allow to override the kerberos credential cache location [puppet] - https://gerrit.wikimedia.org/r/649289
[10:34:34] !log add 100G to prometheus 'global' in codfw
[10:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:18] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27129/console" [puppet] - https://gerrit.wikimedia.org/r/649289 (owner: Elukey)
[10:36:33] (CR) Kormat: "> The only thing I have to add (maybe obvious) is that when you invoke this it will need to be with" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/647815 (https://phabricator.wikimedia.org/T266199) (owner: Bstorm)
[10:36:45] (CR) Arturo Borrero Gonzalez: [C: +2] kubeadm: wmcs-k8s-node-upgrade.py: collapse ssh calls for same package checks [puppet] - https://gerrit.wikimedia.org/r/648211 (owner: Arturo Borrero Gonzalez)
[10:37:37] everything seems fine, dropped the revert
[10:40:13] RECOVERY - very high load average likely xfs on ms-be2018 is OK: OK - load average: 61.00, 64.59, 75.46 https://wikitech.wikimedia.org/wiki/Swift
[10:43:05] (PS1) Jbond: O:pki: add certificate using testing CA [puppet] - https://gerrit.wikimedia.org/r/649291
[10:45:44] (CR) Jbond: [C: +2] O:pki: add certificate using testing CA [puppet] - https://gerrit.wikimedia.org/r/649291 (owner: Jbond)
[10:48:22] (PS1) Jbond: pki: fix hosts value [puppet] - https://gerrit.wikimedia.org/r/649294
[10:49:19] (CR) Jbond: [C: +2] pki: fix hosts value [puppet] - https://gerrit.wikimedia.org/r/649294 (owner: Jbond)
[10:55:12] Operations, Prod-Kubernetes, serviceops, Kubernetes, User-fsero: Upgrade Calico - https://phabricator.wikimedia.org/T207804 (JMeybohm)
[10:55:48] (PS1) Jbond: pki: add ip address [puppet] - https://gerrit.wikimedia.org/r/649299
[10:56:38] (CR) Jbond: [C: +2] pki: add ip address [puppet] - https://gerrit.wikimedia.org/r/649299 (owner: Jbond)
[11:04:21] jouncebot: refresh please
[11:04:22] I refreshed my knowledge about deployments.
[11:10:57] (PS1) Alexandros Kosiaris: Fix kubestage2002 IPv6 address [homer/public] - https://gerrit.wikimedia.org/r/649302
[11:12:04] akosiaris: I'm pretty sure we've already discussed this but I can't recall the outcome, can't we grab those mappings from netbox directly?
[11:12:16] (CR) Alexandros Kosiaris: [C: +2] Fix kubestage2002 IPv6 address [homer/public] - https://gerrit.wikimedia.org/r/649302 (owner: Alexandros Kosiaris)
[11:13:03] (Merged) jenkins-bot: Fix kubestage2002 IPv6 address [homer/public] - https://gerrit.wikimedia.org/r/649302 (owner: Alexandros Kosiaris)
[11:13:45] (PS3) Alexandros Kosiaris: calico: Add network-admin RBAC config [deployment-charts] - https://gerrit.wikimedia.org/r/649287
[11:13:47] (PS3) Alexandros Kosiaris: calico/typha: 1 replica is enough as a default [deployment-charts] - https://gerrit.wikimedia.org/r/649288
[11:13:49] (PS1) Alexandros Kosiaris: calico: Enable IPv6 [deployment-charts] - https://gerrit.wikimedia.org/r/649303
[11:14:34] volans: "those" mappings? What are you referring to? Sorry, deep in making sure this works, not following easily
[11:15:13] akosiaris: no prob we can chat later on, I was referring to the k8s_neighbors in homer/public
[11:15:31] It would be awesome if we did. I have no idea right now how though
[11:16:41] (CR) JMeybohm: [V: +2 C: +2] calico: Enable IPv6 [deployment-charts] - https://gerrit.wikimedia.org/r/649303 (owner: Alexandros Kosiaris)
[11:17:25] (CR) JMeybohm: [C: +2] calico: Add network-admin RBAC config [deployment-charts] - https://gerrit.wikimedia.org/r/649287 (owner: Alexandros Kosiaris)
[11:17:36] (CR) JMeybohm: [C: +2] calico/typha: 1 replica is enough as a default [deployment-charts] - https://gerrit.wikimedia.org/r/649288 (owner: Alexandros Kosiaris)
[11:17:36] it would also be awesome if we could do a step in CI to verify that at least whatever homer generates is valid. I got bitten by 2 very easy to catch mistakes
[11:17:54] no, sorry. That homer *manages* to generate something.
[11:18:03] Not that the output is valid
[11:18:11] (Merged) jenkins-bot: calico: Enable IPv6 [deployment-charts] - https://gerrit.wikimedia.org/r/649303 (owner: Alexandros Kosiaris)
[11:18:44] (Merged) jenkins-bot: calico: Add network-admin RBAC config [deployment-charts] - https://gerrit.wikimedia.org/r/649287 (owner: Alexandros Kosiaris)
[11:19:18] (Merged) jenkins-bot: calico/typha: 1 replica is enough as a default [deployment-charts] - https://gerrit.wikimedia.org/r/649288 (owner: Alexandros Kosiaris)
[11:24:25] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/649304 (https://phabricator.wikimedia.org/T128546)
[11:27:51] akosiaris: ack, the current issue is that CI would need API access to Netbox to do that and some bits in Netbox are considered private. So we'd need to create a dedicated API user/token that has access to all the public bits needed in homer but not the private ones, and that's a bit tricky because of the current Netbox ACL setup IIRC.
[11:29:22] volans: Well, the issues I met were with the public part. I mean basic jinja2 stuff failing cause I had not added the required stanzas in the public repo
[11:29:51] sure but to run "homer generate" we need already data from netbox
[11:29:54] maybe we can ship some fake private data in ci ?
[11:30:04] jan_drewniak: May I have your attention please! Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201214T1130)
[11:30:11] perhaps some fixtures just for CI in the public repo
[11:30:32] dunno if it's worth, the netbox data is most of the config nowadays
[11:30:57] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/649304 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[11:31:01] as for gathering data from netbox is not hard, there are all the bits, the only problem is how to pick the list of hosts given that Netbox has no "cluster" concept
[11:31:16] tags?
[11:31:23] and the closest best is doing kubernetes* or kubestage*
[11:31:35] tags... long standing open discussion on when/how to use them
[11:31:43] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/649304 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[11:31:45] also how to assign them, manually? seems pretty error prone
[11:33:45] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:649304| Bumping portals to master (T128546)]] (duration: 00m 56s)
[11:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:49] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[11:34:40] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:649304| Bumping portals to master (T128546)]] (duration: 00m 54s)
[11:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:11] PROBLEM - SSH on ms-be2059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:38:13] (PS1) Elukey: kerberos: move credential cache under /run on stat1004 [puppet] - https://gerrit.wikimedia.org/r/649305
[11:39:17] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27130/console" [puppet] - https://gerrit.wikimedia.org/r/649305 (owner: Elukey)
[11:39:37] RECOVERY - SSH on ms-be2059 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:43:20] (PS1) Hashar: doc: allow changing WMF_DOC_PATH from Hiera [puppet] - https://gerrit.wikimedia.org/r/649306
[11:49:59] (PS2) Hashar: doc: fix fallback to WMF_DOC_PATH files [puppet] - https://gerrit.wikimedia.org/r/648248 (https://phabricator.wikimedia.org/T149924)
[11:51:02] (CR) Hashar: "Cherry picked on the puppet master. Once merged, in Horizon we can clean out the setting that has been applied on both instances." [puppet] - https://gerrit.wikimedia.org/r/649275 (https://phabricator.wikimedia.org/T268964) (owner: Hashar)
[11:51:28] (CR) Hashar: "Cherry picked on the devtools puppet master and it does work as expected there ;)" [puppet] - https://gerrit.wikimedia.org/r/649306 (owner: Hashar)
[11:52:51] (CR) Hashar: "Rebased on top of https://gerrit.wikimedia.org/r/c/operations/puppet/+/649306/ which lets us change the WMF_DOC_PATH via Hiera and is used" [puppet] - https://gerrit.wikimedia.org/r/648248 (https://phabricator.wikimedia.org/T149924) (owner: Hashar)
[12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201214T1200).
[12:00:04] matthiasmullie and Lucas_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:11] o/
[12:00:40] o/
[12:00:44] I can deploy!
[12:01:28] ok sure :) [12:02:42] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "LGTM (all the referenced messages exist on Commons)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640934 (https://phabricator.wikimedia.org/T258419) (owner: 10Matthias Mullie) [12:02:46] unless you want to do it? [12:02:52] should’ve asked first ^^ [12:03:09] nah that's ok [12:03:19] glad to do a little less work :D [12:03:32] ok ^^ [12:03:34] (03Merged) 10jenkins-bot: Add Media Search survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640934 (https://phabricator.wikimedia.org/T258419) (owner: 10Matthias Mullie) [12:03:56] matthiasmullie: can you test it on mwdebug1001? [12:04:42] (03PS7) 10Hnowlan: sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) [12:06:07] I’m not sure how these quick surveys work, but I’m not getting a survey on https://commons.wikimedia.org/w/index.php?curid=90736687 with x-wikimedia-debug [12:06:13] doesn't seem to work, but nothing's breaking either :p [12:07:04] mh [12:07:07] should I try syncing it? [12:07:23] probably no harm in syncing - go ahead! 
[12:08:31] ok [12:09:55] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:640934|Add Media Search survey (T258419)]] (duration: 00m 55s) [12:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:59] T258419: Survey users about mediasearch on commons - https://phabricator.wikimedia.org/T258419 [12:10:11] poor Lucas [12:10:19] Lucas_WMDE: commons doesn't even seem to have QuickSurveys installed :o [12:10:56] Lucas_WMDE: your repeating service says [12:10:57] 13:10 Lucas_WMDE: commons doesn't even seem to have QuickSurveys installed :o [12:11:02] :O [12:11:02] Lucas_WMDE: commons doesn't even seem to have QuickSurveys installed :o [12:11:05] (thanks) [12:11:40] should we revert or leave the currently-useless config around? [12:12:17] I should've checked that [12:12:39] would probably be a good idea to add a note to the bottom of the wgQuickSurveysConfig array in iS.php [12:12:42] *IS.php [12:12:47] +1 [12:12:50] (03PS1) 10Matthias Mullie: Enable QuickSurveys on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649307 (https://phabricator.wikimedia.org/T258419) [12:12:51] “make sure the wiki also has wmgUseQuickSurveys => true” [12:12:57] so yeah, no harm in syncing - I'll enable quicksurveys on commons separately [12:13:12] I think we can do that now [12:13:28] there should be enough time [12:13:30] any objections? 
[12:13:38] (03PS2) 10Matthias Mullie: Enable QuickSurveys on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649307 (https://phabricator.wikimedia.org/T258419) [12:13:40] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable QuickSurveys on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649307 (https://phabricator.wikimedia.org/T258419) (owner: 10Matthias Mullie) [12:13:41] +1 to enabling [12:13:59] ok then let’s go [12:14:21] it doesn't seem to create any database tables, to my surprise [12:14:27] (03PS3) 10Matthias Mullie: Enable QuickSurveys on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649307 (https://phabricator.wikimedia.org/T258419) [12:14:46] just prepping patch [12:14:52] added the note as well [12:15:29] looks like https://phabricator.wikimedia.org/T232525#5525567 didn’t require anything beyond a sync-file to enable quicksurveys on plwiki [12:15:58] Lucas_WMDE: yup, the code has no SQL files or anything, so just sync [12:16:00] and eventlogging is the only dependency afaict, which is enabled as well [12:16:24] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable QuickSurveys on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649307 (https://phabricator.wikimedia.org/T258419) (owner: 10Matthias Mullie) [12:16:28] oh, that's where it stores responses [12:16:30] * Urbanecm was wondering [12:16:39] (03PS1) 10Alexandros Kosiaris: calico: Force controller to use the host's resolv.conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/649308 [12:16:56] out of mere curiosity, where does it store responses? 
[12:17:29] (03Merged) 10jenkins-bot: Enable QuickSurveys on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649307 (https://phabricator.wikimedia.org/T258419) (owner: 10Matthias Mullie) [12:17:35] (03CR) 10jerkins-bot: [V: 04-1] calico: Force controller to use the host's resolv.conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/649308 (owner: 10Alexandros Kosiaris) [12:17:46] jynus: eventlogging, https://meta.wikimedia.org/wiki/Schema:QuickSurveysResponses [12:18:07] alright, pulled to mwdebug1001 [12:18:09] let’s test [12:18:27] thanks, Urbanecm [12:18:55] np jynus [12:19:18] well, the survey is there [12:19:23] but the text overflows :| [12:20:34] I'd say that's worse [12:21:02] yeah, I wouldn’t sync that [12:21:24] we can revert the survey for now (while keeping the extension enabled, because why not, I guess) [12:21:41] or maybe override the messages on-wiki, in the MediaWiki namespace [12:21:51] if matthiasmullie has ideas for how to phrase them without overflow ^^ [12:25:09] (also google is apparently having an oops right now, hopefully shouldn’t affect us too badly) [12:25:40] matthiasmullie: ping? [12:27:26] okay, I suppose this also works https://commons.wikimedia.org/w/index.php?title=Commons:Structured_data/Media_search&diff=518744789&oldid=518741314&diffmode=source [12:28:34] but I think I still don’t want to leave the broken survey deployed [12:28:40] so I would still revert the config change [12:28:58] (the first one) [12:32:23] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Add Media Search survey" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/648285 [12:32:35] (03CR) 10jerkins-bot: [V: 04-1] Revert "Add Media Search survey" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/648285 (owner: 10Lucas Werkmeister (WMDE)) [12:34:01] (03CR) 10Matthias Mullie: [C: 04-1] "Not sure if you can read my messages on irc. This does not need to be undeployed over the overflow thing." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/648285 (owner: 10Lucas Werkmeister (WMDE)) [12:34:22] matthiasmullie: I’m not seeing any messages from you [12:34:27] but I don’t think it’s my internet this time? [12:34:34] the logs at https://wm-bot.wmflabs.org/browser/index.php?start=12%2F14%2F2020&end=12%2F14%2F2020&display=%23wikimedia-operations also don’t show anything from you [12:34:36] I cant see anything either [12:36:06] (03CR) 10Matthias Mullie: [C: 04-1] "The survey is not meant to be displayed yet. I blanked out the relevant section from wikitext." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/648285 (owner: 10Lucas Werkmeister (WMDE)) [12:37:06] okay, then I guess I’ll sync it… [12:37:23] (03CR) 10Matthias Mullie: [C: 04-1] "The survey will likely be enabled during deployment holiday break, simply by adding back that DOM element it attaches to (that I have now " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/648285 (owner: 10Lucas Werkmeister (WMDE)) [12:38:16] (03CR) 10Matthias Mullie: [C: 04-1] "If text overflow was the only issue, then it's fine to leave this patch up IMO (and the current state of not having the survey visible ATM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/648285 (owner: 10Lucas Werkmeister (WMDE)) [12:38:35] (03Abandoned) 10Lucas Werkmeister (WMDE): Revert "Add Media Search survey" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/648285 (owner: 10Lucas Werkmeister (WMDE)) [12:39:07] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:649307|Enable QuickSurveys on commonswiki (T258419)]] (duration: 00m 55s) [12:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:12] T258419: Survey users about mediasearch on commons - https://phabricator.wikimedia.org/T258419 [12:39:19] (03CR) 10Matthias Mullie: "(weird how I can see (at least some of) your (and other) messages on IRC, but mine are no longer getting 
through)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/648285 (owner: 10Lucas Werkmeister (WMDE)) [12:40:00] (03CR) 10Matthias Mullie: "Thanks for sticking along for this rather weird deployment session!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/648285 (owner: 10Lucas Werkmeister (WMDE)) [12:40:02] (03PS5) 10Lucas Werkmeister (WMDE): Add log channel Wikibase.IdGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643874 (https://phabricator.wikimedia.org/T268625) [12:40:08] :D [12:40:50] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add log channel Wikibase.IdGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643874 (https://phabricator.wikimedia.org/T268625) (owner: 10Lucas Werkmeister (WMDE)) [12:41:34] (the internet over here is bad apparently, yay) [12:41:53] (03Merged) 10jenkins-bot: Add log channel Wikibase.IdGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643874 (https://phabricator.wikimedia.org/T268625) (owner: 10Lucas Werkmeister (WMDE)) [12:42:18] I’ll sync this first config change directly, there’s no way to test it [12:43:55] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:643874|Add log channel Wikibase.IdGenerator (T268625)]] (duration: 00m 54s) [12:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:58] T268625: [20h] Investigate the significant number of skipped Item IDs for newly created Wikidata items - https://phabricator.wikimedia.org/T268625 [12:44:21] (03PS2) 10Lucas Werkmeister (WMDE): Enable Wikibase Repo ID generator logging on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644564 (https://phabricator.wikimedia.org/T268625) [12:45:06] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:643874|Add log channel Wikibase.IdGenerator (T268625)]] (Beta-only sync to avoid drift) (duration: 00m 55s) [12:45:11] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:17] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable Wikibase Repo ID generator logging on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644564 (https://phabricator.wikimedia.org/T268625) (owner: 10Lucas Werkmeister (WMDE)) [12:47:09] (03Merged) 10jenkins-bot: Enable Wikibase Repo ID generator logging on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644564 (https://phabricator.wikimedia.org/T268625) (owner: 10Lucas Werkmeister (WMDE)) [12:47:24] okay, this one I should be able to test on mwdebug1001 [12:47:41] pulled, now testing [12:50:54] (03PS1) 10Effie Mouzeli: hiera: add shard17 and shard18 to sessions redis [puppet] - 10https://gerrit.wikimedia.org/r/649314 (https://phabricator.wikimedia.org/T213089) [12:51:42] alright, that seems to be working [12:51:46] syncing [12:53:16] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:644564|Enable Wikibase Repo ID generator logging on Test Wikidata (T268625)]] (1/2) (duration: 00m 54s) [12:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:19] T268625: [20h] Investigate the significant number of skipped Item IDs for newly created Wikidata items - https://phabricator.wikimedia.org/T268625 [12:53:54] (03PS1) 10JMeybohm: admin_ng: Setup namespaces after calico and coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/649315 [12:54:23] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:644564|Enable Wikibase Repo ID generator logging on Test Wikidata (T268625)]] (2/2) (duration: 00m 54s) [12:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:24] alright, I think that’s it [12:55:28] !log EU backport+config window done [12:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:42] (I’ll probably 
enable this Wikibase Repo ID generator logging on non-Test Wikidata soon as well) [12:56:20] (03PS1) 10Effie Mouzeli: install_server: switch all mc* hosts to buster [puppet] - 10https://gerrit.wikimedia.org/r/649316 (https://phabricator.wikimedia.org/T213089) [12:59:49] PROBLEM - SSH on ms-be2059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:02:53] RECOVERY - SSH on ms-be2059 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:11:04] (03PS2) 10Alexandros Kosiaris: calico: Force controller to use the host's resolv.conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/649308 [13:13:22] 10Operations, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10jijiki) @Marostegui @WDoranWMF @Gilles Thank you all for your help! [13:14:16] 10Operations, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10jijiki) 05Stalled→03Open [13:14:18] 10Operations, 10Platform Engineering, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Phasing out "redis_sessions" MediaWiki cluster - https://phabricator.wikimedia.org/T267581 (10jijiki) [13:14:21] 10Operations, 10Performance-Team, 10Platform Engineering, 10Goal: Decommission the "session redis" cluster - https://phabricator.wikimedia.org/T243520 (10jijiki) [13:18:26] (03PS1) 10Jbond: pki: enable pki1001 [puppet] - 10https://gerrit.wikimedia.org/r/649320 [13:28:08] (03CR) 10Giuseppe Lavagetto: [C: 03+1] install_server: switch all mc* hosts to buster [puppet] - 10https://gerrit.wikimedia.org/r/649316 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [13:28:48] (03CR) 10Jbond: [C: 03+2] pki: enable pki1001 
[puppet] - 10https://gerrit.wikimedia.org/r/649320 (owner: 10Jbond) [13:29:53] (03CR) 10JMeybohm: [C: 03+2] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/649308 (owner: 10Alexandros Kosiaris) [13:31:24] (03Merged) 10jenkins-bot: calico: Force controller to use the host's resolv.conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/649308 (owner: 10Alexandros Kosiaris) [13:35:32] (03CR) 10Ladsgroup: [C: 03+2] "This change is ready for review." [extensions/Wikibase] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/649317 (https://phabricator.wikimedia.org/T269608) (owner: 10Michael Große) [13:36:48] (03PS1) 10Lucas Werkmeister (WMDE): Enable Wikibase Repo ID generator logging on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649321 (https://phabricator.wikimedia.org/T268625) [13:37:25] (03PS1) 10Jbond: P:pki::server: open firewall [puppet] - 10https://gerrit.wikimedia.org/r/649322 [13:39:13] (03PS2) 10Effie Mouzeli: hiera: add shard17 and shard18 to sessions redis [puppet] - 10https://gerrit.wikimedia.org/r/649314 (https://phabricator.wikimedia.org/T213089) [13:39:18] Amir1: do you think it’s okay to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/649321 after your backport is done? [13:40:02] Lucas_WMDE: don't worry about my backport, it takes 20 minutes ish to merge [13:40:14] I'd say that one should go in first [13:40:16] oh, right ^^ [13:40:18] ok! 
[13:40:20] (your patch I mean) [13:40:25] (03CR) 10Jbond: [C: 03+2] P:pki::server: open firewall [puppet] - 10https://gerrit.wikimedia.org/r/649322 (owner: 10Jbond) [13:40:25] then I’ll do that now [13:40:51] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable Wikibase Repo ID generator logging on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649321 (https://phabricator.wikimedia.org/T268625) (owner: 10Lucas Werkmeister (WMDE)) [13:41:01] (03PS3) 10Effie Mouzeli: hiera: add shard17 and shard18 to sessions redis [puppet] - 10https://gerrit.wikimedia.org/r/649314 (https://phabricator.wikimedia.org/T213089) [13:41:46] (03Merged) 10jenkins-bot: Enable Wikibase Repo ID generator logging on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649321 (https://phabricator.wikimedia.org/T268625) (owner: 10Lucas Werkmeister (WMDE)) [13:42:21] testing on mwdebug1001 [13:43:02] (03PS1) 10Jbond: P:pki::server: fix typo proto != protoc [puppet] - 10https://gerrit.wikimedia.org/r/649324 [13:44:22] yup, works [13:44:48] (03PS1) 10Ladsgroup: mjolnir: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/649325 (https://phabricator.wikimedia.org/T209953) [13:45:22] (03CR) 10Jbond: [C: 03+2] P:pki::server: fix typo proto != protoc [puppet] - 10https://gerrit.wikimedia.org/r/649324 (owner: 10Jbond) [13:45:39] (03CR) 10Ladsgroup: "This one is straightforward." 
[puppet] - 10https://gerrit.wikimedia.org/r/649325 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [13:45:44] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:649321|Enable Wikibase Repo ID generator logging on Wikidata (T268625)]] (duration: 00m 55s) [13:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:49] T268625: [20h] Investigate the significant number of skipped Item IDs for newly created Wikidata items - https://phabricator.wikimedia.org/T268625 [13:47:55] (03PS1) 10David Caro: [wmcs][rbd2backy2] Add a tag for full/diff backup [puppet] - 10https://gerrit.wikimedia.org/r/649326 (https://phabricator.wikimedia.org/T267195) [13:48:56] (03PS2) 10David Caro: [wmcs][rbd2backy2] Add a tag for full/diff backup [puppet] - 10https://gerrit.wikimedia.org/r/649326 (https://phabricator.wikimedia.org/T267195) [13:50:05] (03PS3) 10David Caro: wmcs:rbd2backy2: Add a tag for full/diff backup [puppet] - 10https://gerrit.wikimedia.org/r/649326 (https://phabricator.wikimedia.org/T267195) [13:51:48] (03PS4) 10David Caro: wmcs:rbd2backy2: Add a tag for full/diff backup [puppet] - 10https://gerrit.wikimedia.org/r/649326 (https://phabricator.wikimedia.org/T267195) [13:52:30] (03PS1) 10Jbond: cfssl::ocsp: use correct flag for responses file [puppet] - 10https://gerrit.wikimedia.org/r/649328 [13:52:57] (03CR) 10jerkins-bot: [V: 04-1] cfssl::ocsp: use correct flag for responses file [puppet] - 10https://gerrit.wikimedia.org/r/649328 (owner: 10Jbond) [13:54:11] (03PS1) 10KartikMistry: Update cxserver to 2020-12-12-101743-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/649329 (https://phabricator.wikimedia.org/T268309) [13:58:58] PROBLEM - Check systemd state on rpki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:59:46] (03PS2) 10Jbond: cfssl::ocsp: use correct flag for responses file [puppet] - 10https://gerrit.wikimedia.org/r/649328 [14:00:29] (03PS2) 10JMeybohm: admin_ng: Setup namespaces after calico and coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/649315 [14:00:31] (03PS1) 10JMeybohm: Prevent calico from creating default ip pools [deployment-charts] - 10https://gerrit.wikimedia.org/r/649330 [14:02:05] (03CR) 10Jbond: [C: 03+2] cfssl::ocsp: use correct flag for responses file [puppet] - 10https://gerrit.wikimedia.org/r/649328 (owner: 10Jbond) [14:03:37] (03PS6) 10Elukey: profile::swap: allow to override the kerberos credential cache location [puppet] - 10https://gerrit.wikimedia.org/r/649289 (https://phabricator.wikimedia.org/T255262) [14:03:39] (03PS2) 10Elukey: kerberos: move credentials cache under /run for an-test-client1001 [puppet] - 10https://gerrit.wikimedia.org/r/649305 (https://phabricator.wikimedia.org/T255262) [14:04:47] (03Merged) 10jenkins-bot: Skip HtmlPageLinkRendererEndHookHandlerTest [extensions/Wikibase] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/649317 (https://phabricator.wikimedia.org/T269608) (owner: 10Michael Große) [14:04:50] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27133/console" [puppet] - 10https://gerrit.wikimedia.org/r/649305 (https://phabricator.wikimedia.org/T255262) (owner: 10Elukey) [14:06:59] (03CR) 10Ottomata: [C: 03+1] kerberos: move credentials cache under /run for an-test-client1001 [puppet] - 10https://gerrit.wikimedia.org/r/649305 (https://phabricator.wikimedia.org/T255262) (owner: 10Elukey) [14:07:09] 10Operations, 10CirrusSearch, 10Elasticsearch, 10Discovery-Search (Current work): Search is currently too busy - https://phabricator.wikimedia.org/T262694 (10Gehel) 05Open→03Resolved [14:08:08] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - 
degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:57] !log upload modified golang-cfssl to apt [14:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:06] (03CR) 10Elukey: [C: 03+2] mjolnir: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/649325 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [14:13:10] PROBLEM - Check systemd state on rpki1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:30] (03CR) 10Elukey: [C: 03+2] profile::swap: allow to override the kerberos credential cache location [puppet] - 10https://gerrit.wikimedia.org/r/649289 (https://phabricator.wikimedia.org/T255262) (owner: 10Elukey) [14:13:40] (03CR) 10Elukey: [V: 03+1 C: 03+2] kerberos: move credentials cache under /run for an-test-client1001 [puppet] - 10https://gerrit.wikimedia.org/r/649305 (https://phabricator.wikimedia.org/T255262) (owner: 10Elukey) [14:14:03] 10Operations, 10ops-codfw: RMA failed codfw C7 switch - WMF6114 - https://phabricator.wikimedia.org/T267950 (10Papaul) Dear Juniper Networks Customer, Replacement unit has been shipped from our distribution center in Netherlands to United States. Transit time is approximately 4 business days pending customs c... 
[14:20:28] (03PS1) 10Jbond: cfssl: fix refresh script on first run [puppet] - 10https://gerrit.wikimedia.org/r/649347 [14:20:31] (03PS1) 10Elukey: jupyterhub: add missing import to jupyterhub_config.py [puppet] - 10https://gerrit.wikimedia.org/r/649348 (https://phabricator.wikimedia.org/T255262) [14:21:10] (03CR) 10Jbond: [C: 03+2] cfssl: fix refresh script on first run [puppet] - 10https://gerrit.wikimedia.org/r/649347 (owner: 10Jbond) [14:21:17] (03CR) 10Elukey: [C: 03+2] jupyterhub: add missing import to jupyterhub_config.py [puppet] - 10https://gerrit.wikimedia.org/r/649348 (https://phabricator.wikimedia.org/T255262) (owner: 10Elukey) [14:21:36] elukey: you happy for me to merge [14:21:40] +1 [14:21:58] (thanks :) [14:22:02] np, done [14:32:31] (03PS3) 10ArielGlenn: add platform engineering folks to snapshot and dumpsdata server access [puppet] - 10https://gerrit.wikimedia.org/r/649077 [14:33:08] (03CR) 10ArielGlenn: add platform engineering folks to snapshot and dumpsdata server access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/649077 (owner: 10ArielGlenn) [14:37:23] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Setup namespaces after calico and coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/649315 (owner: 10JMeybohm) [14:37:25] (03CR) 10JMeybohm: [C: 03+2] Prevent calico from creating default ip pools [deployment-charts] - 10https://gerrit.wikimedia.org/r/649330 (owner: 10JMeybohm) [14:38:38] (03Merged) 10jenkins-bot: admin_ng: Setup namespaces after calico and coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/649315 (owner: 10JMeybohm) [14:38:40] (03Merged) 10jenkins-bot: Prevent calico from creating default ip pools [deployment-charts] - 10https://gerrit.wikimedia.org/r/649330 (owner: 10JMeybohm) [14:46:35] !log root@cumin1001 START - Cookbook sre.dns.netbox [14:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:31] (03PS1) 10Elukey: kerberos: move credendial cache under /run on 
stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/649355 (https://phabricator.wikimedia.org/T255262) [14:47:59] (03PS2) 10Elukey: kerberos: move credential cache under /run on stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/649355 (https://phabricator.wikimedia.org/T255262) [14:48:22] RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:24] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:36] (03CR) 10Elukey: [C: 03+2] kerberos: move credential cache under /run on stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/649355 (https://phabricator.wikimedia.org/T255262) (owner: 10Elukey) [14:52:20] !log root@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:48] !log root@cumin1001 START - Cookbook sre.dns.netbox [14:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:14] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:50] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:56] !log root@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:50] (03PS1) 10Mholloway: Add event stream analytics.mediawiki.mediasearch_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649357 (https://phabricator.wikimedia.org/T258183) [15:05:08] (03PS1) 10JMeybohm: Fix typo in calico-kube-controllers dnsPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/649358 [15:05:24] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Fix typo in calico-kube-controllers dnsPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/649358 (owner: 10JMeybohm) [15:06:48] (03Merged) 10jenkins-bot: Fix typo in calico-kube-controllers dnsPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/649358 (owner: 10JMeybohm) [15:07:29] (03CR) 10Ppchelko: Remove wgParserCacheUseJson setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644317 (https://phabricator.wikimedia.org/T263579) (owner: 10Ppchelko) [15:09:05] (03PS3) 10Ppchelko: Remove wgParserCacheUseJson setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644317 (https://phabricator.wikimedia.org/T263579) [15:12:39] (03PS1) 10Ppchelko: group1: Enable OldRevisionParserCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649359 (https://phabricator.wikimedia.org/T268075) [15:13:32] (03PS1) 10Ottomata: eventgate-main - increase replicas from 3 to 5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/649360 (https://phabricator.wikimedia.org/T249745) [15:21:02] PROBLEM - Disk space on dumpsdata1001 is CRITICAL: DISK CRITICAL - free space: /data 898507 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1001&var-datasource=eqiad+prometheus/ops [15:23:47] (03CR) 10Ppchelko: [C: 03+1] eventgate-main - increase replicas from 3 to 5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/649360 (https://phabricator.wikimedia.org/T249745) (owner: 10Ottomata) [15:28:30] PROBLEM - Check systemd state on ms-be2059 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:28:40] (03PS2) 10Razzi: kafka: Add kafka-test1008 - 1010 [puppet] - 10https://gerrit.wikimedia.org/r/648342 (https://phabricator.wikimedia.org/T268202) [15:32:44] (03CR) 10Razzi: [C: 03+2] kafka: Add kafka-test1008 - 1010 [puppet] - 10https://gerrit.wikimedia.org/r/648342 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [15:33:13] (03CR) 10Ottomata: Add event stream analytics.mediawiki.mediasearch_interaction (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649357 (https://phabricator.wikimedia.org/T258183) (owner: 10Mholloway) [15:33:22] RECOVERY - Check systemd state on ms-be2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:40] (03PS3) 10Alexandros Kosiaris: prometheus: Turn on codfw prometheus/k8s-staging [puppet] - 10https://gerrit.wikimedia.org/r/649281 [15:35:42] (03PS1) 10Alexandros Kosiaris: calico-typha: Allow 5473 from kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/649369 [15:35:54] (03CR) 10Andrew Bogott: [C: 03+1] wmcs:rbd2backy2: Add a tag for full/diff backup [puppet] - 10https://gerrit.wikimedia.org/r/649326 (https://phabricator.wikimedia.org/T267195) (owner: 10David Caro) [15:36:29] (03CR) 10Andrew Bogott: "I don't think it's documented, but adding me is a reasonable plan :)" [labs/private] - 10https://gerrit.wikimedia.org/r/635859 (https://phabricator.wikimedia.org/T262962) (owner: 10Dave 
Pifke) [15:36:37] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] webperf: add fake keys for WebPageTest [labs/private] - 10https://gerrit.wikimedia.org/r/635859 (https://phabricator.wikimedia.org/T262962) (owner: 10Dave Pifke) [15:38:30] (03CR) 10JMeybohm: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/649369 (owner: 10Alexandros Kosiaris) [15:38:47] (03PS4) 10Effie Mouzeli: hiera: add shard17 and shard18 to sessions redis [puppet] - 10https://gerrit.wikimedia.org/r/649314 (https://phabricator.wikimedia.org/T213089) [15:40:22] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:10] (03CR) 10David Caro: [C: 03+2] wmcs:rbd2backy2: Add a tag for full/diff backup [puppet] - 10https://gerrit.wikimedia.org/r/649326 (https://phabricator.wikimedia.org/T267195) (owner: 10David Caro) [15:42:31] (03PS1) 10RLazarus: administrative: Add getters for the other Reason fields. [software/spicerack] - 10https://gerrit.wikimedia.org/r/649371 [15:46:14] (03CR) 10jerkins-bot: [V: 04-1] administrative: Add getters for the other Reason fields. [software/spicerack] - 10https://gerrit.wikimedia.org/r/649371 (owner: 10RLazarus) [15:53:47] (03CR) 10Effie Mouzeli: [C: 03+2] install_server: switch all mc* hosts to buster [puppet] - 10https://gerrit.wikimedia.org/r/649316 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [15:56:07] (03PS1) 10RLazarus: tox: Remove '--skip B322' from Bandit config. 
[software/spicerack] - https://gerrit.wikimedia.org/r/649374 [15:56:46] (CR) Alexandros Kosiaris: [C: +2] calico-typha: Allow 5473 from kubernetes nodes [puppet] - https://gerrit.wikimedia.org/r/649369 (owner: Alexandros Kosiaris) [15:56:53] (PS2) Alexandros Kosiaris: calico-typha: Allow 5473 from kubernetes nodes [puppet] - https://gerrit.wikimedia.org/r/649369 [15:56:59] (CR) Alexandros Kosiaris: [V: +2 C: +2] calico-typha: Allow 5473 from kubernetes nodes [puppet] - https://gerrit.wikimedia.org/r/649369 (owner: Alexandros Kosiaris) [15:59:03] (PS2) RLazarus: tox: Remove '--skip B322' from Bandit config. [software/spicerack] - https://gerrit.wikimedia.org/r/649374 [15:59:42] Operations, SRE-tools, serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (fgiunchedi) > CNAMEs: we only have one right now for swift that points to ms-fe.svc.$dc.wmnet. @fgiunchedi are both records needed? Yes, although swift.svc can be an A too if t... [16:00:58] Operations, Release-Engineering-Team-TODO, serviceops, Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (jijiki) [16:04:09] Operations, ops-codfw, serviceops, Patch-For-Review, Sustainability (Incident Followup): (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (akosiaris) [16:05:05] Operations, ops-eqsin, DC-Ops: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 (RobH) I am not sure about how to disable rancid alerts, perhaps either @ayounsi or @cdanis knows how? Updating this task: * new parts have not been dispatched yet, I was off thursday-friday, and follo...
[16:06:11] Operations, ops-codfw, serviceops, Patch-For-Review, Sustainability (Incident Followup): (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (akosiaris) [16:06:23] (CR) Lucas Werkmeister (WMDE): "This is ready to go now, Wikibase doesn’t read the propagatePageDeletion setting anymore:" [mediawiki-config] - https://gerrit.wikimedia.org/r/643734 (owner: Itamar Givon) [16:06:29] (CR) Lucas Werkmeister (WMDE): [C: +1] Remove propgatePageDeletion setting [mediawiki-config] - https://gerrit.wikimedia.org/r/643734 (owner: Itamar Givon) [16:06:43] Operations, ops-codfw, serviceops: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (akosiaris) [16:07:24] Operations, ops-codfw, serviceops, Patch-For-Review, Sustainability (Incident Followup): (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (akosiaris) Open→Resolved And finally being now use... [16:08:52] (CR) Volans: [C: +2] tox: Remove '--skip B322' from Bandit config. [software/spicerack] - https://gerrit.wikimedia.org/r/649374 (owner: RLazarus) [16:09:02] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:09:06] (PS2) Lucas Werkmeister (WMDE): Remove propagatePageDeletion setting [mediawiki-config] - https://gerrit.wikimedia.org/r/643734 (owner: Itamar Givon) [16:11:11] (CR) Volans: [C: +1] "LGTM, thx" [software/spicerack] - https://gerrit.wikimedia.org/r/649371 (owner: RLazarus) [16:13:06] (Merged) jenkins-bot: tox: Remove '--skip B322' from Bandit config.
[software/spicerack] - https://gerrit.wikimedia.org/r/649374 (owner: RLazarus) [16:13:31] (CR) RLazarus: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/649371 (owner: RLazarus) [16:13:33] (PS2) Volans: administrative: Add getters for the other Reason fields. [software/spicerack] - https://gerrit.wikimedia.org/r/649371 (owner: RLazarus) [16:13:37] lol [16:13:42] rzl: ^^^ [16:13:52] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:57] I rebased you re-checked [16:13:59] haha good [16:14:06] the recheck would have failed ;) [16:14:12] oh right of course [16:21:53] Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693 (CBogen) [16:25:15] (CR) RLazarus: [C: +2] administrative: Add getters for the other Reason fields. [software/spicerack] - https://gerrit.wikimedia.org/r/649371 (owner: RLazarus) [16:27:52] (PS8) Hnowlan: sockpuppet-api: Create basic chart and service config [deployment-charts] - https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) [16:29:12] (Merged) jenkins-bot: administrative: Add getters for the other Reason fields.
[software/spicerack] - https://gerrit.wikimedia.org/r/649371 (owner: RLazarus) [16:30:36] (CR) Mholloway: [C: -1] Add event stream analytics.mediawiki.mediasearch_interaction (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/649357 (https://phabricator.wikimedia.org/T258183) (owner: Mholloway) [16:32:27] (PS1) Jdlrobson: Revert "Remove title attributes at init" [extensions/Popups] (wmf/1.36.0-wmf.21) - https://gerrit.wikimedia.org/r/649408 (https://phabricator.wikimedia.org/T269297) [16:38:28] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:44:57] Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693 (CBogen) [16:46:29] (PS1) Jbond: cfssl: update ocsp refresh to work on multi master [puppet] - https://gerrit.wikimedia.org/r/649410 [16:48:52] (CR) jerkins-bot: [V: -1] cfssl: update ocsp refresh to work on multi master [puppet] - https://gerrit.wikimedia.org/r/649410 (owner: Jbond) [16:49:54] (PS3) Ahmon Dancy: Reorganized setup.sh and added db wait loop [deployment-charts] - https://gerrit.wikimedia.org/r/647842 [16:49:55] (PS5) Ahmon Dancy: New utility macros in templates/_mediawiki-common.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/647843 [16:51:08] (PS5) Ahmon Dancy: 0.2.0: Use a Job to set up the database [deployment-charts] - https://gerrit.wikimedia.org/r/647844 [16:52:59] (PS7) Ahmon Dancy: 0.3.0: add manually recached l10n CDB support [deployment-charts] - https://gerrit.wikimedia.org/r/648304 [16:54:56] Operations, SRE-tools, serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (akosiaris) > * DNS Records with non-standard TTL.
We have just one for oresrdb that has a 5M TTL instead of the default 1H and that's not currently supported by the Netbox autom... [16:55:03] (CR) Ahmon Dancy: "> Patch Set 6: Code-Review-1" [deployment-charts] - https://gerrit.wikimedia.org/r/648304 (owner: Ahmon Dancy) [16:55:10] !log jayme@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [16:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:47] !log jayme@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [16:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:05] akosiaris: ^^ \o/ [16:58:10] (CR) Ahmon Dancy: [C: -1] "broken" [deployment-charts] - https://gerrit.wikimedia.org/r/648304 (owner: Ahmon Dancy) [17:00:52] (PS8) Ahmon Dancy: 0.3.0: add manually recached l10n CDB support [deployment-charts] - https://gerrit.wikimedia.org/r/648304 [17:01:00] (CR) Dzahn: [C: +2] "thanks, this was on my list but I did not get to it on Friday" [puppet] - https://gerrit.wikimedia.org/r/649275 (https://phabricator.wikimedia.org/T268964) (owner: Hashar) [17:01:05] (PS2) Mholloway: Add analytics event stream mediawiki.mediasearch_interaction [mediawiki-config] - https://gerrit.wikimedia.org/r/649357 (https://phabricator.wikimedia.org/T258183) [17:01:09] (PS2) Dzahn: doc: allow changing WMF_DOC_PATH from Hiera [puppet] - https://gerrit.wikimedia.org/r/649306 (owner: Hashar) [17:01:13] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/27134/doc1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/649306 (owner: Hashar) [17:01:17] (PS2) Jbond: cfssl: update ocsp refresh to work on multi master [puppet] - https://gerrit.wikimedia.org/r/649410 [17:09:04] (CR) Eric Gardner: Add analytics event stream mediawiki.mediasearch_interaction (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/649357
(https://phabricator.wikimedia.org/T258183) (owner: Mholloway) [17:10:07] (CR) Eric Gardner: [C: +1] Add analytics event stream mediawiki.mediasearch_interaction (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/649357 (https://phabricator.wikimedia.org/T258183) (owner: Mholloway) [17:14:17] Operations, netops, observability, User-fgiunchedi: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations - https://phabricator.wikimedia.org/T267018 (CDanis) Does this mean we can deprecate the [[ https://gerrit.wikimedia.org/g/operations/puppet/+/prod... [17:14:21] (PS1) Elukey: kerberos: explicitly set KRB5CCNAME [puppet] - https://gerrit.wikimedia.org/r/649415 (https://phabricator.wikimedia.org/T255262) [17:14:45] (CR) Ottomata: "One nit from me, haven't read previous comments." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) (owner: Razzi) [17:16:27] (CR) Elukey: "I am not sure if this is the right move, but it works nicely on stat1004 where we have the default kerberos cache for users set to /run/us" [puppet] - https://gerrit.wikimedia.org/r/649415 (https://phabricator.wikimedia.org/T255262) (owner: Elukey) [17:19:41] (PS2) Ottomata: eventgate-main - increase replicas from 3 to 5 and mem limit to 600Mi [deployment-charts] - https://gerrit.wikimedia.org/r/649360 (https://phabricator.wikimedia.org/T249745) [17:23:41] (CR) Ottomata: kerberos: explicitly set KRB5CCNAME (1 comment) [puppet] - https://gerrit.wikimedia.org/r/649415 (https://phabricator.wikimedia.org/T255262) (owner: Elukey) [17:25:11] (CR) Dzahn: "[deploy1001:~] $ httpbb --hosts doc1001.eqiad.wmnet /srv/deployment/httpbb-tests/doc/test_doc.yaml" [puppet] - https://gerrit.wikimedia.org/r/649306 (owner: Hashar) [17:27:49] (CR) Elukey: kerberos: explicitly set KRB5CCNAME (1 comment) [puppet] -
https://gerrit.wikimedia.org/r/649415 (https://phabricator.wikimedia.org/T255262) (owner: Elukey) [17:30:34] (CR) Alexandros Kosiaris: [C: -1] "`" (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/649360 (https://phabricator.wikimedia.org/T249745) (owner: Ottomata) [17:37:53] (CR) RLazarus: [C: +1] hiera: add shard17 and shard18 to sessions redis [puppet] - https://gerrit.wikimedia.org/r/649314 (https://phabricator.wikimedia.org/T213089) (owner: Effie Mouzeli) [17:42:15] (PS9) Hnowlan: sockpuppet-api: Create basic chart and service config [deployment-charts] - https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) [17:43:02] (PS1) RLazarus: cross-validate-accounts: Don't crash when splitting an SSH key fails. [puppet] - https://gerrit.wikimedia.org/r/649421 [17:48:50] Operations, Technical-blog-posts, Traffic: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T270074 (srodlund) @ema Awesome! Let me know when your first draft is ready. Looking forward to reading and editing this! Just a reminder... [17:50:01] (CR) Hashar: "Thank you for the Puppet compile and httpbb test!" [puppet] - https://gerrit.wikimedia.org/r/649306 (owner: Hashar) [17:51:37] (PS1) Andrew Bogott: Cinder: use a default volume type named 'standard' [puppet] - https://gerrit.wikimedia.org/r/649422 (https://phabricator.wikimedia.org/T269511) [17:58:37] (PS2) Razzi: yarn: aggregate logs every hour for long-running jobs [puppet] - https://gerrit.wikimedia.org/r/647805 (https://phabricator.wikimedia.org/T269616) [17:58:45] (CR) Dzahn: "@hashar So this is the one that should be next? I see some other open changes but you know better which order you wanted."
[puppet] - https://gerrit.wikimedia.org/r/648248 (https://phabricator.wikimedia.org/T149924) (owner: Hashar) [17:59:15] !log depooled mw1265 for reimaging [17:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:33] cscott I'm here [17:59:38] looking now [18:00:04] ryankemper: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201214T1800). [18:00:10] (CR) Dzahn: "I'll wait for the tests on the doc machine in devtools since that is hopefully unblocked due to the puppetmaster name fix. and thanks for " [puppet] - https://gerrit.wikimedia.org/r/648248 (https://phabricator.wikimedia.org/T149924) (owner: Hashar) [18:02:01] (CR) Razzi: [C: +2] yarn: aggregate logs every hour for long-running jobs [puppet] - https://gerrit.wikimedia.org/r/647805 (https://phabricator.wikimedia.org/T269616) (owner: Razzi) [18:03:42] !log crusnov@deploy1001 Started deploy [netbox/deploy@2fc439e]: Redeploy Netbox 2.8 to netbox-next T266488 p1 [18:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:47] T266488: Upgrade netbox-next to 2.9 series - https://phabricator.wikimedia.org/T266488 [18:04:15] !log crusnov@deploy1001 Finished deploy [netbox/deploy@2fc439e]: Redeploy Netbox 2.8 to netbox-next T266488 p1 (duration: 00m 33s) [18:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:18] !log crusnov@deploy1001 Started deploy [netbox/deploy@2fc439e]: Redeploy Netbox 2.8 to netbox-next T266488 p2 [18:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:23] !log crusnov@deploy1001 Finished deploy [netbox/deploy@2fc439e]: Redeploy Netbox 2.8 to netbox-next T266488 p2 (duration: 00m 05s) [18:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:42] !log hnowlan@puppetmaster1001
conftool action : set/pooled=inactive; selector: name=mw1265.eqiad.wmnet [18:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:35] Operations, Diff-blog, Traffic, HTTPS: Send HSTS header on diff.wikimedia.org - https://phabricator.wikimedia.org/T270034 (Dzahn) diff.wikimedia.org is an alias for blog-wikimedia-org.go-vip.net. ^ This is hosted outside WMF infrastructure, so Operations can't do much about this. This would nee... [18:07:38] (PS2) Ryan Kemper: wdqs: Counters now must end in _total [puppet] - https://gerrit.wikimedia.org/r/646888 (https://phabricator.wikimedia.org/T269204) [18:07:57] (CR) Effie Mouzeli: [C: +2] hiera: add shard17 and shard18 to sessions redis [puppet] - https://gerrit.wikimedia.org/r/649314 (https://phabricator.wikimedia.org/T213089) (owner: Effie Mouzeli) [18:08:22] (CR) DCausse: [C: +1] wdqs: Counters now must end in _total [puppet] - https://gerrit.wikimedia.org/r/646888 (https://phabricator.wikimedia.org/T269204) (owner: Ryan Kemper) [18:11:01] (PS3) Ryan Kemper: wdqs: Switch lag metric to be a gauge [puppet] - https://gerrit.wikimedia.org/r/646888 (https://phabricator.wikimedia.org/T269204) [18:12:25] (CR) Ryan Kemper: [C: +2] wdqs: Switch lag metric to be a gauge [puppet] - https://gerrit.wikimedia.org/r/646888 (https://phabricator.wikimedia.org/T269204) (owner: Ryan Kemper) [18:18:14] (CR) Andrew Bogott: [C: +2] Cinder: use a default volume type named 'standard' [puppet] - https://gerrit.wikimedia.org/r/649422 (https://phabricator.wikimedia.org/T269511) (owner: Andrew Bogott) [18:23:43] (PS2) Dzahn: puppetmaster: remove code to remove crons, replaced by timer [puppet] - https://gerrit.wikimedia.org/r/648327 (https://phabricator.wikimedia.org/T265138) [18:24:18] (PS1) Effie Mouzeli: hiera: upgrade mc1031, mc2031 to buster [puppet] - https://gerrit.wikimedia.org/r/649425 (https://phabricator.wikimedia.org/T213089)
[18:24:22] (CR) Dzahn: "@John, more fyi, enough time should have passed now that the old cron is removed on all masters and this has been switched to timers" [puppet] - https://gerrit.wikimedia.org/r/648327 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn) [18:25:21] !log T269204 Restarting `wdqs-blazegraph` prometheus exporter across all wdqs instances:`sudo cumin -b 12 'P{wdqs*}' 'sudo systemctl restart prometheus-blazegraph-exporter-wdqs-blazegraph.service'` [18:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:26] T269204: Some wdqs metrics changed when switching to python3 - https://phabricator.wikimedia.org/T269204 [18:26:01] (CR) Dzahn: [C: +2] "removing code that absented crons" [puppet] - https://gerrit.wikimedia.org/r/648327 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn) [18:28:38] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:29:02] (CR) Dzahn: "Thanks for this QChris! So yea, @Paladox we need to merge your changes into a single one.
Are you still interested in these and would pick" [puppet] - https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: Paladox) [18:45:35] (PS16) Paladox: gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) [18:45:58] (Abandoned) Paladox: Gerrit: Rename ssh_host_key to ssh_host_rsa_key [puppet] - https://gerrit.wikimedia.org/r/556265 (https://phabricator.wikimedia.org/T240266) (owner: Paladox) [18:46:30] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: Paladox) [18:48:09] (PS1) Elukey: Port the Spicerack interactive module [software/pywmflib] - https://gerrit.wikimedia.org/r/649426 [18:49:52] (CR) jerkins-bot: [V: -1] Port the Spicerack interactive module [software/pywmflib] - https://gerrit.wikimedia.org/r/649426 (owner: Elukey) [18:50:28] (CR) Ottomata: [C: +1] kerberos: explicitly set KRB5CCNAME [puppet] - https://gerrit.wikimedia.org/r/649415 (https://phabricator.wikimedia.org/T255262) (owner: Elukey) [18:51:13] (CR) Ottomata: eventgate-main - increase replicas from 3 to 5 and mem limit to 600Mi (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/649360 (https://phabricator.wikimedia.org/T249745) (owner: Ottomata) [18:56:47] Operations, ops-codfw: RMA failed codfw C7 switch - WMF6114 - https://phabricator.wikimedia.org/T267950 (Papaul) Your replacement part associated with RMA R200322630 Item # 100 has been successfully shipped. Details of which are provided below. Replacement Serial Number: TA3716420362 Replacement Line It... [18:57:40] PROBLEM - Check systemd state on ms-be2059 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:58:56] (PS2) Elukey: Port the Spicerack interactive module [software/pywmflib] - https://gerrit.wikimedia.org/r/649426 [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201214T1900). [19:00:04] Pchelolo and nray: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:19] o/ here and ready [19:00:21] (CR) jerkins-bot: [V: -1] Port the Spicerack interactive module [software/pywmflib] - https://gerrit.wikimedia.org/r/649426 (owner: Elukey) [19:00:25] howdy [19:00:35] wanna go first nray? [19:00:44] works for me [19:00:46] PROBLEM - mediawiki-installation DSH group on mw1265 is CRITICAL: Host mw1265 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:00:48] (PS3) Elukey: Port the Spicerack interactive module [software/pywmflib] - https://gerrit.wikimedia.org/r/649426 [19:00:53] ping me when done please [19:02:14] (CR) jerkins-bot: [V: -1] Port the Spicerack interactive module [software/pywmflib] - https://gerrit.wikimedia.org/r/649426 (owner: Elukey) [19:03:11] (PS4) Elukey: Port the Spicerack interactive module [software/pywmflib] - https://gerrit.wikimedia.org/r/649426 [19:03:32] (CR) Ottomata: eventgate-main - increase replicas from 3 to 5 and mem limit to 600Mi (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/649360 (https://phabricator.wikimedia.org/T249745) (owner: Ottomata) [19:03:47] (PS3) Ottomata: eventgate-main - increase replicas from 3 to 5 and mem limit to 600Mi [deployment-charts] - https://gerrit.wikimedia.org/r/649360 (https://phabricator.wikimedia.org/T249745) [19:05:40] PROBLEM - SSH on
ms-be2059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:08:54] PROBLEM - Check systemd state on ms-be2059 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:09:35] Operations, Diff-blog, Traffic, HTTPS: Send HSTS header on diff.wikimedia.org - https://phabricator.wikimedia.org/T270034 (BBlack) We probably should reach out to them and push on this, though. We do have standards that apply ( https://wikitech.wikimedia.org/wiki/HTTPS ), it's just been a while... [19:09:37] nray: are you deploying or waiting for me? I feel like there might be some confusion happening :) [19:10:12] Sorry, I was waiting for a deployer (I don't have deploy rights) [19:10:20] RECOVERY - SSH on ms-be2059 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:10:23] oh.. Ok, I can be a deployer then [19:10:34] do you have a way to test it? [19:10:39] yes I can test it [19:10:43] (CR) Ppchelko: [C: +2] Revert "Remove title attributes at init" [extensions/Popups] (wmf/1.36.0-wmf.21) - https://gerrit.wikimedia.org/r/649408 (https://phabricator.wikimedia.org/T269297) (owner: Jdlrobson) [19:10:44] (and thank you!) [19:11:05] ok, will ping you once on mwdebug [19:11:10] cool [19:13:33] (CR) RLazarus: [C: +1] hiera: upgrade mc1031, mc2031 to buster [puppet] - https://gerrit.wikimedia.org/r/649425 (https://phabricator.wikimedia.org/T213089) (owner: Effie Mouzeli) [19:15:48] Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693 (RLazarus) Open→Resolved Yep, the alert has cleared. Thanks!
[19:19:05] (Merged) jenkins-bot: Revert "Remove title attributes at init" [extensions/Popups] (wmf/1.36.0-wmf.21) - https://gerrit.wikimedia.org/r/649408 (https://phabricator.wikimedia.org/T269297) (owner: Jdlrobson) [19:19:06] PROBLEM - Long running screen/tmux on centrallog1001 is CRITICAL: CRIT: Long running tmux process. (user: cdanis PID: 12365, 1742320s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [19:20:24] nray: your change is on mwdebug1002 [19:20:33] cool, thank you, will test now [19:20:51] Pchelolo: thanks for leading this B&C :) [19:21:18] Urbanecm: no problem. 66.6666(6)% of it is my patches.. [19:22:08] not if I add sth :) [19:22:29] things look good Pchelolo , you may proceed :) [19:22:46] oki, going [19:23:18] (PS1) Urbanecm: zhwikinews: Grant suppressredirect to autoconfirmed [mediawiki-config] - https://gerrit.wikimedia.org/r/649430 (https://phabricator.wikimedia.org/T270023) [19:24:52] !log ppchelko@deploy1001 Synchronized php-1.36.0-wmf.21/extensions/Popups: Backport gerrit:649408 Revert Remove title attributes at init (duration: 00m 59s) [19:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:57] all done nray [19:25:05] thanks Pchelolo ! [19:25:13] (CR) Ppchelko: [C: +2] Remove wgParserCacheUseJson setting [mediawiki-config] - https://gerrit.wikimedia.org/r/644317 (https://phabricator.wikimedia.org/T263579) (owner: Ppchelko) [19:25:34] Pchelolo: would you please ping me once done, so I can do some config stuff? [19:25:41] yup.
[19:25:47] shouldn't take long [19:25:48] thank you :) [19:26:10] (Merged) jenkins-bot: Remove wgParserCacheUseJson setting [mediawiki-config] - https://gerrit.wikimedia.org/r/644317 (https://phabricator.wikimedia.org/T263579) (owner: Ppchelko) [19:28:01] (PS1) Urbanecm: hrwiki: Add draft namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/649431 (https://phabricator.wikimedia.org/T268740) [19:28:14] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: gerrit:644317 Remove wgParserCacheUseJson setting (duration: 00m 56s) [19:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:23] (CR) Ppchelko: [C: +2] group1: Enable OldRevisionParserCache [mediawiki-config] - https://gerrit.wikimedia.org/r/649359 (https://phabricator.wikimedia.org/T268075) (owner: Ppchelko) [19:29:15] (CR) jerkins-bot: [V: -1] hrwiki: Add draft namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/649431 (https://phabricator.wikimedia.org/T268740) (owner: Urbanecm) [19:29:17] (Merged) jenkins-bot: group1: Enable OldRevisionParserCache [mediawiki-config] - https://gerrit.wikimedia.org/r/649359 (https://phabricator.wikimedia.org/T268075) (owner: Ppchelko) [19:29:45] (PS2) Urbanecm: hrwiki: Add draft namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/649431 (https://phabricator.wikimedia.org/T268740) [19:31:03] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: gerrit:649359 group1: Enable OldRevisionParserCache (duration: 00m 55s) [19:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:14] Urbanecm: I'm done, it's all yours now [19:31:17] thank you [19:31:30] (PS3) Urbanecm: hrwiki: Add draft namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/649431 (https://phabricator.wikimedia.org/T268740) [19:31:37] (CR) Urbanecm: [C: +2] hrwiki: Add draft namespace [mediawiki-config] -
https://gerrit.wikimedia.org/r/649431 (https://phabricator.wikimedia.org/T268740) (owner: Urbanecm) [19:32:28] (Merged) jenkins-bot: hrwiki: Add draft namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/649431 (https://phabricator.wikimedia.org/T268740) (owner: Urbanecm) [19:34:15] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: cf36ad6e89acd71ca0bc985eb5399fecec64fc5f: hrwiki: Add draft namespace (T268740) (duration: 00m 56s) [19:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:19] T268740: Create Draft: namespace on hrwiki - https://phabricator.wikimedia.org/T268740 [19:34:36] RECOVERY - Check systemd state on ms-be2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:34:36] (PS1) Ppchelko: Enable old revision parser cache on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/649432 (https://phabricator.wikimedia.org/T268075) [19:35:11] (PS2) Ppchelko: Enable old revision parser cache on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/649432 (https://phabricator.wikimedia.org/T268075) [19:37:10] (PS2) Urbanecm: zhwikinews: Grant suppressredirect to autoconfirmed [mediawiki-config] - https://gerrit.wikimedia.org/r/649430 (https://phabricator.wikimedia.org/T270023) [19:40:14] (CR) Urbanecm: [C: +2] zhwikinews: Grant suppressredirect to autoconfirmed [mediawiki-config] - https://gerrit.wikimedia.org/r/649430 (https://phabricator.wikimedia.org/T270023) (owner: Urbanecm) [19:41:08] (Merged) jenkins-bot: zhwikinews: Grant suppressredirect to autoconfirmed [mediawiki-config] - https://gerrit.wikimedia.org/r/649430 (https://phabricator.wikimedia.org/T270023) (owner: Urbanecm) [19:41:50] Operations, Gerrit-Privilege-Requests: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (Legoktm) [19:43:27] !log
urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 3b5974ff7f57d19732cd1e7f7f492b778daf6cfc: zhwikinews: Grant suppressredirect to autoconfirmed (T270023) (duration: 00m 55s) [19:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:31] T270023: Autoconfirmed users get suppressredirect permission in zhwikinews - https://phabricator.wikimedia.org/T270023 [19:44:24] (PS1) Jdlrobson: wgMinervaCountErrors config was removed [mediawiki-config] - https://gerrit.wikimedia.org/r/649434 (https://phabricator.wikimedia.org/T266359) [19:45:04] * Urbanecm done [19:45:26] !log mwdebug1003 - removing zero.wikimedia.org include for testing [19:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:08] (PS19) CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [19:47:39] (CR) jerkins-bot: [V: -1] netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: CRusnov) [19:50:35] !log disable puppet on mc1031, mc2031 to install buster [19:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:50] (PS1) Dzahn: httpbb: remove test for 404 on zero.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/649435 (https://phabricator.wikimedia.org/T187716) [19:52:08] (PS20) CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [19:52:47] (CR) Dzahn: [C: +2] httpbb: remove test for 404 on zero.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/649435 (https://phabricator.wikimedia.org/T187716) (owner: Dzahn) [19:52:49] (CR) Effie Mouzeli: [C: +2] hiera: upgrade mc1031, mc2031 to buster [puppet] -
https://gerrit.wikimedia.org/r/649425 (https://phabricator.wikimedia.org/T213089) (owner: Effie Mouzeli) [19:53:48] (CR) jerkins-bot: [V: -1] netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: CRusnov) [19:55:15] Operations, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc2031.codfw.wmnet ` The log can be... [19:55:23] Operations, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc1031.eqiad.wmnet ` The log can be... [19:56:41] (PS21) CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [19:57:50] Operations, SRE-Access-Requests: [WIP] Requesting access to deployment group for STran - https://phabricator.wikimedia.org/T270125 (Tchanders) [19:58:08] (CR) Jforrester: [C: +1] httpbb: remove test for 404 on zero.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/649435 (https://phabricator.wikimedia.org/T187716) (owner: Dzahn) [20:05:47] Operations, fundraising-tech-ops, netops: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802 (Dwisehaupt) Yeah, our local VM setups are definitely hub like so that makes sense. We aren't currently using vSwitch and manage our different networks by using a host that simulate...
[20:05:56] (CR) CRusnov: "After discussion and testing here we are:" [puppet] - https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: CRusnov) [20:06:24] (CR) CRusnov: [C: -1] "-1 until we deploy Netbox 2.9" [puppet] - https://gerrit.wikimedia.org/r/649436 (https://phabricator.wikimedia.org/T266488) (owner: CRusnov) [20:09:10] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1031.eqiad.wmnet with reason: REIMAGE [20:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:12] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1031.eqiad.wmnet with reason: REIMAGE [20:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:34] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2031.codfw.wmnet with reason: REIMAGE [20:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:52] (PS6) Razzi: sqoop: Ensure /tmp/sqoop-jars/ is present [puppet] - https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) [20:22:35] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2031.codfw.wmnet with reason: REIMAGE [20:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:27] (CR) Razzi: sqoop: Ensure /tmp/sqoop-jars/ is present (2 comments) [puppet] - https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) (owner: Razzi) [20:25:23] (CR) RLazarus: "Thanks for sending this, I like the direction it's going."
[puppet] - https://gerrit.wikimedia.org/r/648385 (owner: Dzahn) [20:29:20] (PS1) Dzahn: http: stop including zero.wikimedia.org config [puppet] - https://gerrit.wikimedia.org/r/649442 (https://phabricator.wikimedia.org/T187716) [20:30:11] (CR) Volans: [C: +1] "LGTM, couple of possible improvements inline. It's also ok if you prefer in a separate patch." (3 comments) [software/pywmflib] - https://gerrit.wikimedia.org/r/649426 (owner: Elukey) [20:33:55] Operations, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2031.codfw.wmnet'] ` and were **ALL** successful. [20:34:00] (CR) RLazarus: [C: +1] http: stop including zero.wikimedia.org config [puppet] - https://gerrit.wikimedia.org/r/649442 (https://phabricator.wikimedia.org/T187716) (owner: Dzahn) [20:44:37] Operations, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1031.eqiad.wmnet'] ` and were **ALL** successful. [20:59:19] (PS3) Mholloway: Add analytics event stream mediawiki.mediasearch_interaction [mediawiki-config] - https://gerrit.wikimedia.org/r/649357 (https://phabricator.wikimedia.org/T258183) [21:00:04] chrisalbon and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201214T2100).
[21:01:40] (CR) Mholloway: [C: +2] Add analytics event stream mediawiki.mediasearch_interaction [mediawiki-config] - https://gerrit.wikimedia.org/r/649357 (https://phabricator.wikimedia.org/T258183) (owner: Mholloway) [21:02:57] (Merged) jenkins-bot: Add analytics event stream mediawiki.mediasearch_interaction [mediawiki-config] - https://gerrit.wikimedia.org/r/649357 (https://phabricator.wikimedia.org/T258183) (owner: Mholloway) [21:05:34] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add analytics event stream mediawiki.mediasearch_interaction T258183 (duration: 00m 56s) [21:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:38] T258183: [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 [21:05:52] (CR) Jeena Huneidi: [C: -1] "Cool, I like it" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/647844 (owner: Ahmon Dancy) [21:08:56] (CR) Jeena Huneidi: New utility macros in templates/_mediawiki-common.tpl (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/647843 (owner: Ahmon Dancy) [21:09:52] (CR) Jeena Huneidi: 0.3.0: add manually recached l10n CDB support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/648304 (owner: Ahmon Dancy) [21:13:58] (CR) Dzahn: [C: +2] http: stop including zero.wikimedia.org config [puppet] - https://gerrit.wikimedia.org/r/649442 (https://phabricator.wikimedia.org/T187716) (owner: Dzahn) [21:16:44] (CR) Dzahn: "on mwdebug1002 apache2ctl -S shows the namevhost is gone afterwards (no manual restart involved) and all tests still pass" [puppet] - https://gerrit.wikimedia.org/r/649442 (https://phabricator.wikimedia.org/T187716) (owner: Dzahn) [21:34:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:40:35] Operations, SRE-Access-Requests: [WIP] Requesting access to deployment group for STran - https://phabricator.wikimedia.org/T270125 (RLazarus) p:Triage→Medium a:Tchanders Hi @STran, welcome! @Tchanders From the "WIP" in the title, I'm guessing this isn't ready for SRE to work on yet, so I'm a... [22:00:04] Reedy and sbassett: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201214T2200). [22:03:40] (PS2) Dzahn: httpbb: add tests for parsoid servers [puppet] - https://gerrit.wikimedia.org/r/648383 (https://phabricator.wikimedia.org/T268524) [22:05:03] (PS1) RLazarus: admin: Add mattcleinman to ldap_only_users. [puppet] - https://gerrit.wikimedia.org/r/649457 (https://phabricator.wikimedia.org/T269696) [22:09:32] (PS4) Ahmon Dancy: Reorganized setup.sh and added db wait loop [deployment-charts] - https://gerrit.wikimedia.org/r/647842 [22:09:34] (PS6) Ahmon Dancy: New utility macros in templates/_mediawiki-common.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/647843 [22:09:36] (PS6) Ahmon Dancy: 0.2.0: Use a Job to set up the database [deployment-charts] - https://gerrit.wikimedia.org/r/647844 [22:09:38] (PS9) Ahmon Dancy: 0.3.0: add manually recached l10n CDB support [deployment-charts] - https://gerrit.wikimedia.org/r/648304 [22:10:08] (CR) Dzahn: [C: +1] "matches LDAP info on both mwmaint1002 (Wikitech user) and ldap-corp1001 (full time employee)" [puppet] - https://gerrit.wikimedia.org/r/649457 (https://phabricator.wikimedia.org/T269696) (owner: RLazarus) [22:10:15] (CR) Ahmon Dancy: New utility macros in templates/_mediawiki-common.tpl (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/647843 (owner: Ahmon Dancy) [22:10:44] (CR) Ahmon Dancy: 0.2.0: Use a Job
to set up the database (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/647844 (owner: Ahmon Dancy) [22:11:15] (CR) RLazarus: [C: +2] admin: Add mattcleinman to ldap_only_users. [puppet] - https://gerrit.wikimedia.org/r/649457 (https://phabricator.wikimedia.org/T269696) (owner: RLazarus) [22:11:20] (CR) Ahmon Dancy: 0.3.0: add manually recached l10n CDB support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/648304 (owner: Ahmon Dancy) [22:15:54] Operations, serviceops, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (Dzahn) @jijiki @ssastry I wrote the following httpbb tests based on that: ` # tests for parsoid appservers # hosts: parse*, wtp* http://en.wikipedia.org:... [22:17:05] (CR) Ottomata: [C: +1] sqoop: Ensure /tmp/sqoop-jars/ is present [puppet] - https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) (owner: Razzi) [22:17:07] Operations, LDAP-Access-Requests, Patch-For-Review: LDAP access to wmf group for Matt Cleinman - https://phabricator.wikimedia.org/T269696 (RLazarus) Open→Resolved a:RLazarus @MattCleinman Thanks for the update! I've added you to the wmf group. ` rzl@mwmaint1002:~$ ldapsearch -x cn=wmf |... [22:17:16] PROBLEM - Check systemd state on mc1031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:17:49] ^ hmm, we just reimaged mc1031 to buster today [22:18:06] Hey all - I was going to deploy an updated security patch for T120883 [22:19:44] (PS10) Ahmon Dancy: 0.3.0: add manually recached l10n CDB support [deployment-charts] - https://gerrit.wikimedia.org/r/648304 [22:20:23] (CR) Ahmon Dancy: 0.3.0: add manually recached l10n CDB support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/648304 (owner: Ahmon Dancy) [22:23:34] PROBLEM - Check systemd state on mc2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:26:11] Operations, serviceops, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (Dzahn) [22:34:12] PROBLEM - Check systemd state on ms-be1043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:39:11] !log Deployed security patch for T120883 (v8) to wmf.21 [22:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:38] PROBLEM - Check systemd state on wdqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:40:44] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:40:46] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:40:50] PROBLEM - Check systemd state on wdqs2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:18] PROBLEM - Check systemd state on wdqs2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:33] ^ Downtime expired, nothing to be concerned about, renewing the downtime for another 2 days [22:41:36] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:36] PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:46] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:54] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:58] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:07] (Got a fix rolling out either today or tomorrow that will solve the actual issue) [22:45:03] Downtime set [22:45:09] ryankemper: cool, thanks! 
[22:45:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:46:31] (PS8) Dzahn: Remove apache config for zero.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: MaxSem) [22:48:41] (CR) Dzahn: "I did most of this in separate steps and now rebased the _actual_ removal on top of all that." [puppet] - https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: MaxSem) [22:49:40] (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1001/27136/mwdebug1003.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: MaxSem) [22:49:50] (CR) Dzahn: [V: +1] "not included anymore since https://gerrit.wikimedia.org/r/c/operations/puppet/+/649442" [puppet] - https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: MaxSem) [22:49:52] (CR) Dzahn: [V: +1 C: +2] Remove apache config for zero.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: MaxSem) [22:51:18] Operations, Research, Wikimedia-Apache-configuration, Patch-For-Review: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (Aklapper) Ping - how to get someone to make a decision (e.g. +2) for a trivial patc... [23:00:51] (CR) RLazarus: [C: +1] httpbb: add tests for parsoid servers [puppet] - https://gerrit.wikimedia.org/r/648383 (https://phabricator.wikimedia.org/T268524) (owner: Dzahn) [23:02:33] (CR) Dzahn: "Volans is probably right that we should generate them and I shouldn't unilaterally merge something like this before the holidays."
[dns] - https://gerrit.wikimedia.org/r/637849 (https://phabricator.wikimedia.org/T152882) (owner: Ladsgroup) [23:08:19] Operations, LDAP-Access-Requests: NDA for Superset Request from WMDE Employee - Mohammed Sadat - https://phabricator.wikimedia.org/T269843 (RLazarus) I've taken over SRE clinic duty from @jbond for the week -- @KFrancis I've emailed you his address. Once that's all set, I can go ahead with granting acces... [23:09:49] (PS4) Dzahn: jenkins: support changing $JAVA_HOME [puppet] - https://gerrit.wikimedia.org/r/645075 (https://phabricator.wikimedia.org/T269354) (owner: Hashar) [23:10:03] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/27138/" [puppet] - https://gerrit.wikimedia.org/r/645075 (https://phabricator.wikimedia.org/T269354) (owner: Hashar) [23:10:57] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (RLazarus) @toan In addition to the agreement you signed with WMDE, you'll need to sign another with WMF -- @KFrancis will get you set up. Once that's taken... [23:11:02] Operations, SRE-Access-Requests: Requesting access to Analytics Data for toan - https://phabricator.wikimedia.org/T269678 (RLazarus) NDA discussion is happening over at T269777, then this will be ready to go. [23:16:14] RECOVERY - Check systemd state on ms-be1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:19:44] Operations, Data-Persistence-Backup, SRE-swift-storage, Epic, Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (Ottomata) Ping @Miriam, we might be able to piggy back on this project to get access to image data outside of Sw... [23:48:44] (CR) Krinkle: [C: -1] "It's done in redirects.dat, the same we use for all other prod redirects."
[puppet] - https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: Jforrester) [23:52:04] (PS11) Ahmon Dancy: 0.3.0: add manually recached l10n CDB support [deployment-charts] - https://gerrit.wikimedia.org/r/648304 [23:52:06] (PS1) Ahmon Dancy: fix typo in README.md [deployment-charts] - https://gerrit.wikimedia.org/r/649470