[00:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201218T0000). [00:00:04] nray, tgr, and Seddon: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:37] o/ here and ready. We can head into the new year with some code removal :) [00:00:56] Here and o/ [00:01:28] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:32] I can deploy today [00:02:01] thanks Urbanecm ! [00:02:26] (03PS1) 10Dzahn: site: add doc1002 and doc2001 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/650305 (https://phabricator.wikimedia.org/T269977) [00:02:48] (03CR) 10Urbanecm: [C: 03+2] Revert "vue: Log component errors" [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/649924 (owner: 10Phuedx) [00:03:09] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releases1002/2002 for jhuneidi, brennen - https://phabricator.wikimedia.org/T270350 (10RLazarus) 05Open→03Resolved a:03RLazarus Merged! It'll be rolled out everywhere in 30 minutes, feel free to reopen if you need anything else. [00:03:30] (03CR) 10Urbanecm: [C: 03+2] Undeploy graphoid for arwiki. Phase 4. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/650269 (https://phabricator.wikimedia.org/T270443) (owner: 10Seddon) [00:03:53] o/ [00:04:05] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releases1002/2002 for jhuneidi, brennen - https://phabricator.wikimedia.org/T270350 (10RLazarus) [00:04:13] (03CR) 10Urbanecm: [C: 03+2] [beta] Get rid of GrowthExperiments morelike mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650241 (https://phabricator.wikimedia.org/T266020) (owner: 10Gergő Tisza) [00:04:23] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releases1002/2002 for jhuneidi, brennen - https://phabricator.wikimedia.org/T270350 (10RLazarus) [00:05:50] (03Merged) 10jenkins-bot: Undeploy graphoid for arwiki. Phase 4. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650269 (https://phabricator.wikimedia.org/T270443) (owner: 10Seddon) [00:05:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Dzahn) 05Open→03Resolved per Willy the remaining ones are also listed on T267065 which is a wider task about the same thing. Suggesting to... [00:06:12] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:06:19] Seddon: pulled onto mwdebug1001, please test. 
[00:06:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=gerrit site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:06:26] (03Merged) 10jenkins-bot: [beta] Get rid of GrowthExperiments morelike mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650241 (https://phabricator.wikimedia.org/T266020) (owner: 10Gergő Tisza) [00:06:36] tgr_: and yours will be auto-deployed soon :-) [00:08:36] (03PS1) 10Dzahn: scap: add doc1002/doc2001 to dsh [puppet] - 10https://gerrit.wikimedia.org/r/650306 [00:09:18] nray: https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php72-docker/42593/console shows an error for your backport, could you have a look, please? [00:09:43] yes looking [00:10:15] 10Operations, 10Wikimedia-Mailing-lists: Create a mailing list for North-West Russia Wiki-Historians UG - https://phabricator.wikimedia.org/T270392 (10RLazarus) a:03RLazarus [00:10:35] Urbanecm: ALL GOOD [00:10:40] thanks Seddon, syncing [00:11:14] !log urbanecm@deploy1001 sync-file aborted: (no justification provided) (duration: 00m 00s) [00:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:11:18] looks like "stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/WikimediaMessages/': The requested URL returned error: 502'". I'm not really sure why that patch would cause that.. [00:12:10] hm, I'll try to restart the job then [00:12:40] Yeah, I'm hoping that is just a fluke. 
That patch should be pretty safe (knock on wood) [00:12:53] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 3cc5caa12fc4143295560a375d6be70819e4daad: Undeploy graphoid for arwiki. Phase 4. (T270443) (duration: 00m 55s) [00:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:57] T270443: Undeploy graphoid for phase 4 wiki's - https://phabricator.wikimedia.org/T270443 [00:13:00] Seddon: deployed [00:13:11] (03CR) 10jerkins-bot: [V: 04-1] Revert "vue: Log component errors" [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/649924 (owner: 10Phuedx) [00:13:34] (03CR) 10Urbanecm: [C: 03+2] Revert "vue: Log component errors" [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/649924 (owner: 10Phuedx) [00:13:38] let's try again [00:13:50] Thanks Urbanecm [00:13:56] happy to help, Seddon [00:15:04] (03PS8) 10Bstorm: wikireplicas: Work toward a proxy setup on multi-instance replicas [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) [00:15:27] Gerrit is OOMing [00:15:42] we should probably restart it once the deploy window is over and the security release is done [00:16:23] legoktm: right now, i just want CI to merge nray's backport, if gerrit restart will help, I'm for :D [00:16:51] uh, I don't want to restart it literally in the middle of the window [00:16:54] once you're done is better [00:17:48] (03PS1) 10Gerrit maintenance bot: Add nia to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/650310 (https://phabricator.wikimedia.org/T270409) [00:18:30] 10Operations, 10Gerrit: Gerrit is OOMing - https://phabricator.wikimedia.org/T270451 (10Legoktm) p:05Triage→03Unbreak! 
[00:18:32] (03CR) 10Urbanecm: [C: 03+1] Add nia to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/650310 (https://phabricator.wikimedia.org/T270409) (owner: 10Gerrit maintenance bot) [00:19:05] (03PS1) 10Gerrit maintenance bot: Add nia to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/650311 (https://phabricator.wikimedia.org/T270408) [00:19:22] you don't need to submit it twice :/ [00:21:21] (03Abandoned) 10Gerrit maintenance bot: Add nia to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/650311 (https://phabricator.wikimedia.org/T270408) (owner: 10Gerrit maintenance bot) [00:22:21] (03CR) 10Dzahn: [C: 03+2] "https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wiktionary_Nias" [dns] - 10https://gerrit.wikimedia.org/r/650310 (https://phabricator.wikimedia.org/T270409) (owner: 10Gerrit maintenance bot) [00:22:28] thank you mutante ! [00:22:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=gerrit site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:23:27] !log DNS - new project language 'nia' added - The Nias language is an Austronesian language spoken on Nias Island and the Batu Islands off the west coast of Sumatra in Indonesia. [00:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:25:25] (03PS10) 10Cwhite: profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) [00:26:04] Urbanecm: yw. soon I will need to go through the new projects and do the former "add to wikistats" subtasks. 
somehow they got dropped from the checklist [00:26:16] oh [00:26:20] or one ticket that combines them [00:26:41] 10Operations, 10Wikimedia-Mailing-lists: Create a mailing list for North-West Russia Wiki-Historians UG - https://phabricator.wikimedia.org/T270392 (10RLazarus) 05Open→03Resolved You're all set! Nikolay, you should have received an automated email with the admin password. Yekaterina won't have received tha... [00:26:54] mutante: we can make the bot to automatically create a task for you [00:27:27] Urbanecm: that would be nice to have, it did remind me in the past to not forget any [00:27:31] https://github.com/ladsgroup/Phabricator-maintenance-bot -- pull requests are welcomed :) [00:28:35] Urbanecm: wow, it has "patch makers" too [00:28:39] impressive [00:28:52] mutante: yeah, that's how the dns patches are done :) [00:30:07] !log reset email for Sutton12 [00:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=gerrit site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:31:55] Urbanecm: can you ping me when you're done? and then I'll look into restarting Gerrit [00:32:29] legoktm: sure. Right now, I'm just waiting on CI, I need https://gerrit.wikimedia.org/r/c/mediawiki/core/+/649924 to get merged, which is my last patch [00:32:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:32:49] Urbanecm might need to add one more patch - see #wikimedia-security [00:40:05] (03PS7) 10CRusnov: Make scripts and reports compatible with Netbox 2.9 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/643444 (https://phabricator.wikimedia.org/T266487) (owner: 10Ayounsi) [00:40:47] (03Merged) 10jenkins-bot: Revert "vue: Log component errors" [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/649924 (owner: 10Phuedx) [00:41:00] \o/ [00:42:27] (03PS1) 10Urbanecm: SECURITY: Act like users don't exist if hidden from viewer [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/650195 (https://phabricator.wikimedia.org/T120883) [00:42:29] (03PS9) 10Bstorm: wikireplicas: Work toward a proxy setup on multi-instance replicas [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) [00:43:12] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.963 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [00:43:24] (03CR) 10Urbanecm: [C: 03+2] SECURITY: Act like users don't exist if hidden from viewer [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/650195 (https://phabricator.wikimedia.org/T120883) (owner: 10Urbanecm) [00:44:17] nray: pulled onto mwdebug1001, please test [00:44:24] thank you, testing [00:44:55] DannyS712: please stand by, I'll ping you once ready to test [00:45:00] okay [00:46:03] (03PS10) 10Bstorm: wikireplicas: Work toward a proxy setup on multi-instance replicas [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) [00:48:07] looks good, you can proceed [00:48:17] Urbanecm: ^^ [00:48:27] thanks, syncing [00:49:53] where is logmsgbot 
[00:49:58] nray: done [00:50:13] thank you for your help Urbanecm ! [00:50:21] (syncing once more to make the bot log it properly) [00:50:24] np nray [00:51:00] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.22/resources/src/vue/index.js: ed8212bfbe1854cc92a9f1cb33b5661cd0a8382c: Revert "vue: Log component errors" (duration: 00m 55s) [00:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:33] Urbanecm standing by to test, just need to know which debug host it'll be on [00:51:57] DannyS712: mwdebug1001, but it's not there yet -- waiting on CI (tm) [00:52:44] (03PS11) 10Cwhite: profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) [00:54:53] (03CR) 10Jforrester: "This'll need to wait for wmf.26 to be branched and pulled onto the deployment host, sadly (otherwise it'll break the i18n build step)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521233 (https://phabricator.wikimedia.org/T181217) (owner: 10Reedy) [00:59:06] (03PS8) 10CRusnov: Make scripts and reports compatible with Netbox 2.9 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/643444 (https://phabricator.wikimedia.org/T266487) (owner: 10Ayounsi) [01:00:10] (03CR) 10CRusnov: "As suggested by Arzhel we'll take over this patch." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/643444 (https://phabricator.wikimedia.org/T266487) (owner: 10Ayounsi) [01:07:51] (03PS1) 10Dzahn: DHCP: add doc1002 and doc2001 [puppet] - 10https://gerrit.wikimedia.org/r/650318 (https://phabricator.wikimedia.org/T269977) [01:11:26] (03CR) 10Dzahn: [C: 03+2] DHCP: add doc1002 and doc2001 [puppet] - 10https://gerrit.wikimedia.org/r/650318 (https://phabricator.wikimedia.org/T269977) (owner: 10Dzahn) [01:11:30] Urbanecm any idea how long it'll be? 
[01:12:00] DannyS712: no, https://integration.wikimedia.org/zuul/ says only one job is pending, and estimates 0 mins to the end [01:13:17] (03Merged) 10jenkins-bot: SECURITY: Act like users don't exist if hidden from viewer [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/650195 (https://phabricator.wikimedia.org/T120883) (owner: 10Urbanecm) [01:13:23] finally [01:13:45] DannyS712: pulled onto mwdebug1001, please test [01:15:00] (03PS2) 10Dzahn: site: add doc1002 and doc2001 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/650305 (https://phabricator.wikimedia.org/T269977) [01:16:44] doesn't appear to be working - last time this was because we miscommunicated about the debug host (easy typos) - can you confirm it is 1001? [01:16:54] (03CR) 10Dzahn: [C: 03+2] site: add doc1002 and doc2001 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/650305 (https://phabricator.wikimedia.org/T269977) (owner: 10Dzahn) [01:17:57] Urbanecm [01:18:21] DannyS712: yes, mwdebug1001, and I made sure the code is there [01:18:39] hmm [01:19:14] DannyS712: which user are you testing this on? 
[01:19:54] suppressed user name -> replied in ->security [01:22:46] !log signing puppet certs and installing buster on doc1002/doc2001 with "insetup" role [01:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:30] syncing per -security conversation [01:27:15] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.22/includes/EditPage.php: 4c224bb88e968e885befd9e201ff96c29b976f11: SECURITY: Act like users dont exist if hidden from viewer (T120883) (duration: 00m 53s) [01:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:09] legoktm: I'm finally done [01:28:23] woot [01:28:35] I will wait a few more minutes to make sure things don't explode [01:29:22] ack [01:33:18] I'm restarting now [01:33:22] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on gerrit1001.wikimedia.org with reason: OOM [01:33:23] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on gerrit1001.wikimedia.org with reason: OOM [01:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:33:29] 10Operations, 10Gerrit: Gerrit is OOMing - https://phabricator.wikimedia.org/T270451 (10ops-monitoring-bot) Icinga downtime set by legoktm@cumin1001 for 0:10:00 1 host(s) and their services with reason: OOM ` gerrit1001.wikimedia.org ` [01:34:53] !log restarted gerrit (T270451) [01:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:57] T270451: Gerrit is OOMing - https://phabricator.wikimedia.org/T270451 [02:53:13] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM - https://phabricator.wikimedia.org/T247653 (10Dzahn) [02:53:22] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 
10Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM - https://phabricator.wikimedia.org/T247653 (10Dzahn) [02:53:26] 10Operations, 10vm-requests: eqiad: 1 VM request for doc (doc1002) - https://phabricator.wikimedia.org/T269977 (10Dzahn) 05Open→03Resolved a:03Dzahn VM created, OS installed, puppet "insetup". [02:53:33] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM - https://phabricator.wikimedia.org/T247653 (10Dzahn) a:03Dzahn [02:53:35] 10Operations, 10vm-requests: codfw: 1 VM request for doc.wikimedia.org (doc2001) - https://phabricator.wikimedia.org/T269978 (10Dzahn) 05Open→03Resolved VM created, OS installed, puppet "insetup". [02:54:01] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) [05:03:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:05:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:23:31] 10Operations, 10Gerrit: Gerrit is OOMing - https://phabricator.wikimedia.org/T270451 (10Legoktm) 05Open→03Resolved a:03Legoktm {F33951320} Will file a follow-up for better monitoring. 
[06:53:23] !log Compress clouddb1020:3315 clouddb1016:3315 T270473 [06:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:28] T270473: Ensure InnoDB is compressed on the new clouddb hosts - https://phabricator.wikimedia.org/T270473 [06:59:39] (03PS1) 10Marostegui: instances.yaml: Remove es1013 [puppet] - 10https://gerrit.wikimedia.org/r/650409 (https://phabricator.wikimedia.org/T268436) [07:00:09] !log Compress clouddb1019:3316 clouddb1015:3316 T270473 [07:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:13] T270473: Ensure InnoDB is compressed on the new clouddb hosts - https://phabricator.wikimedia.org/T270473 [07:02:00] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove es1013 [puppet] - 10https://gerrit.wikimedia.org/r/650409 (https://phabricator.wikimedia.org/T268436) (owner: 10Marostegui) [07:02:19] so the ores logrotate seems to have worked without reload! [07:02:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es1013 from dbctl T268436', diff saved to https://phabricator.wikimedia.org/P13600 and previous config saved to /var/cache/conftool/dbconfig/20201218-070235-marostegui.json [07:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:40] T268436: decommission es1013.eqiad.wmnet - https://phabricator.wikimedia.org/T268436 [07:03:22] even the netbox one [07:06:40] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:06:48] !log Stop mysql on db1124:3313 T268742 [07:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:51] T268742: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 [07:06:52] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, 
interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:19:00] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:19:12] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 79, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:34:32] PROBLEM - Check systemd state on kafka-test1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:48:24] 10Operations, 10SRE-Access-Requests: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (10toan) >>! In T269777#6696028, @RLazarus wrote: > @KFrancis Thank you! > > @toan Your releasers-wikibase access should be taken care of (may take up to 30 min to roll out everywh... [07:54:08] RECOVERY - Check systemd state on kafka-test1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:54:44] RECOVERY - Check systemd state on kafka-test1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:54:55] !log on kafka-test10[08-10] - "ip addr flush dev ens5; systemctl restart ifup@ens5.service" [07:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:04] RECOVERY - Check systemd state on kafka-test1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201218T0800) [08:26:15] !log temporarily taking db2102 offline for mysql testing [08:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:26] !log Compress clouddb1018:3317 clouddb1014:3317 T270473 [09:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:31] T270473: Ensure InnoDB is compressed on the new clouddb hosts - https://phabricator.wikimedia.org/T270473 [09:36:47] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Mike Pham - https://phabricator.wikimedia.org/T270438 (10elukey) @RLazarus reporting a question from IRC: do we need to have L3 or something similar to be signed? [09:46:24] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:26] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:06] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:38] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:10:48] <_joe_> uhm [10:11:32] <_joe_> volans: Dec 18 09:58:29 cumin1001 check-cumin-aliases[19488]: Alias spare matched 0 hosts [10:11:39] <_joe_> maybe we should allow that? 
[10:14:58] maybe [10:15:10] there are patches pending from mo.ritz to convert back it to email [10:15:11] IIRC [10:15:28] it was converted from cron to systemd timer and hence from cron email to icinga check [10:15:32] has never alarmed before [10:15:41] <_joe_> it did two days ago [10:15:49] T268369 [10:15:49] T268369: how to deal with cumin alias alerts - https://phabricator.wikimedia.org/T268369 [10:16:23] *before the migration to systemd timer I meant [10:16:32] that was done by daniel IIRC [10:20:20] !log hashar@deploy1001 Started deploy [integration/docroot@1166384]: noop: clear out proper env variable in tests [10:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:27] !log hashar@deploy1001 Finished deploy [integration/docroot@1166384]: noop: clear out proper env variable in tests (duration: 00m 07s) [10:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:13] (03PS1) 10JMeybohm: Allow the kube-controller-manager to run without superuser permissions [puppet] - 10https://gerrit.wikimedia.org/r/650469 (https://phabricator.wikimedia.org/T228967) [10:35:51] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: etcd: introduce systemd-based higher priority scheduling policy [puppet] - 10https://gerrit.wikimedia.org/r/650470 (https://phabricator.wikimedia.org/T267966) [10:40:18] (03PS2) 10JMeybohm: Allow the kube-controller-manager to run without superuser permissions [puppet] - 10https://gerrit.wikimedia.org/r/650469 (https://phabricator.wikimedia.org/T228967) [10:41:00] (03PS2) 10Arturo Borrero Gonzalez: kubeadm: etcd: introduce systemd-based higher priority scheduling policy [puppet] - 10https://gerrit.wikimedia.org/r/650470 (https://phabricator.wikimedia.org/T267966) [10:44:15] (03PS1) 10JMeybohm: k8s_infrastructure_users: Add system:kube-controller-manager [labs/private] - 10https://gerrit.wikimedia.org/r/650473 (https://phabricator.wikimedia.org/T228967) [10:45:28] (03CR) 10JMeybohm: [V: 03+2 C: 
03+2] k8s_infrastructure_users: Add system:kube-controller-manager [labs/private] - 10https://gerrit.wikimedia.org/r/650473 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [10:47:15] (03CR) 10Jbond: [C: 03+1] "> Patch Set 1: -Code-Review" [puppet] - 10https://gerrit.wikimedia.org/r/650298 (https://phabricator.wikimedia.org/T270438) (owner: 10RLazarus) [10:47:40] (03CR) 10Jcrespo: "Hey, @jbond and others, not sure if this change affected it, or it was an existing issue, but it seems that "enable-puppet --force" seems " [puppet] - 10https://gerrit.wikimedia.org/r/645104 (owner: 10Jbond) [10:50:23] (03CR) 10Arturo Borrero Gonzalez: "PCC https://puppet-compiler.wmflabs.org/compiler1001/27204/" [puppet] - 10https://gerrit.wikimedia.org/r/650470 (https://phabricator.wikimedia.org/T267966) (owner: 10Arturo Borrero Gonzalez) [10:51:12] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Mike Pham - https://phabricator.wikimedia.org/T270438 (10jbond) [10:53:19] (03PS3) 10JMeybohm: Allow the kube-controller-manager to run without superuser permissions [puppet] - 10https://gerrit.wikimedia.org/r/650469 (https://phabricator.wikimedia.org/T228967) [10:53:20] !log returning db2102 to its original state [10:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:23] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Mike Pham - https://phabricator.wikimedia.org/T270438 (10jbond) >>! In T270438#6700941, @elukey wrote: > @RLazarus reporting a question from IRC: do we need to have L3 or something similar to be signed? L3 is all abou... 
[10:54:35] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27205/console" [puppet] - 10https://gerrit.wikimedia.org/r/650469 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [10:55:50] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Mike Pham - https://phabricator.wikimedia.org/T270438 (10jbond) [10:59:00] !log starting test swift backup of enwiki on a single thread towards dbstore2003 T264189 [10:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:16] ^godog [11:00:04] T264189: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 [11:01:01] (03PS1) 10Jbond: enable-puppet: force logic is inverted [puppet] - 10https://gerrit.wikimedia.org/r/650477 [11:04:15] (03CR) 10Jcrespo: [C: 03+1] enable-puppet: force logic is inverted [puppet] - 10https://gerrit.wikimedia.org/r/650477 (owner: 10Jbond) [11:11:30] (03CR) 10Jbond: [C: 03+2] enable-puppet: force logic is inverted [puppet] - 10https://gerrit.wikimedia.org/r/650477 (owner: 10Jbond) [11:12:34] (03PS1) 10Elukey: profile::kerberos::client: change alternative ccache location [puppet] - 10https://gerrit.wikimedia.org/r/650480 (https://phabricator.wikimedia.org/T255262) [11:17:03] (03CR) 10Jbond: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/645104 (owner: 10Jbond) [11:17:21] fyi jynus the enable-puppet PS has been merged [11:17:29] thanks for reporting [11:17:58] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Epic, and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10Pginer-WMF) [11:18:35] no, thank you for fixing it! 
[11:18:41] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Mike Pham - https://phabricator.wikimedia.org/T270438 (10jbond) [11:24:36] (03CR) 10Jcrespo: "> I have pushed a fix now let me know if you still see an issue" [puppet] - 10https://gerrit.wikimedia.org/r/645104 (owner: 10Jbond) [11:30:44] there was minutes ago a spike of lag on enwiki [11:38:35] 10Operations, 10Wikimedia-Mailing-lists: Have a regular cronjob which alerts about (potentially unadministrated) mailing lists with large (or aged?) moderation queues - https://phabricator.wikimedia.org/T270368 (10Aklapper) [11:48:03] (03PS1) 10Jbond: phab: remove user from block list [puppet] - 10https://gerrit.wikimedia.org/r/650485 (https://phabricator.wikimedia.org/T270184) [11:49:47] (03CR) 10Jbond: [C: 03+2] phab: remove user from block list [puppet] - 10https://gerrit.wikimedia.org/r/650485 (https://phabricator.wikimedia.org/T270184) (owner: 10Jbond) [12:05:27] (03PS8) 10Jbond: P:phabricator: migrate banlist to abuse-networks [puppet] - 10https://gerrit.wikimedia.org/r/649882 (https://phabricator.wikimedia.org/T270285) [12:14:02] 10Operations, 10Wikidata, 10Wikidata Query UI, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Ladsgroup) What can we do to move this forward? [12:43:14] 10Operations, 10Technical-blog-posts, 10Traffic: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T270074 (10ema) >>! In T270074#6699924, @srodlund wrote: > @ema I published this Thanks! > Will you look it over and let me know if you s... 
[12:50:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] kubeadm and paws: tuning options for stacked control plane [puppet] - 10https://gerrit.wikimedia.org/r/650280 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm) [12:50:19] (03CR) 10JMeybohm: [C: 03+1] Revert "vcl: do not stream responses to docker" [puppet] - 10https://gerrit.wikimedia.org/r/650191 (https://phabricator.wikimedia.org/T270270) (owner: 10Ema) [12:53:17] (03CR) 10Ema: [C: 03+2] Revert "vcl: do not stream responses to docker" [puppet] - 10https://gerrit.wikimedia.org/r/650191 (https://phabricator.wikimedia.org/T270270) (owner: 10Ema) [12:53:58] (03CR) 10Ladsgroup: [C: 03+1] etcd: add data types, replace hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/649708 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [12:57:47] 10Operations, 10Wikimedia-Mailing-lists: Have a regular cronjob which alerts about (potentially unadministrated) mailing lists with large (or aged?) moderation queues - https://phabricator.wikimedia.org/T270368 (10Aklapper) [13:10:59] (03PS1) 10Effie Mouzeli: hiera: upgrade mc1021, mc2021 to buster [puppet] - 10https://gerrit.wikimedia.org/r/650491 (https://phabricator.wikimedia.org/T213089) [13:13:58] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: upgrade mc1021, mc2021 to buster [puppet] - 10https://gerrit.wikimedia.org/r/650491 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [13:15:22] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc1021.eqiad.wmnet ` The log can be... 
[13:15:40] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc2021.codfw.wmnet ` The log can be... [13:17:08] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition={1,3,4,5} prometheus=ops site=eqiad topic={rsyslog-notice,udp_localhost-info} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [13:17:28] (03PS1) 10Ladsgroup: druid: Migrate hiera() to lookup() and setting datatype [puppet] - 10https://gerrit.wikimedia.org/r/650492 (https://phabricator.wikimedia.org/T209953) [13:19:04] (03CR) 10jerkins-bot: [V: 04-1] druid: Migrate hiera() to lookup() and setting datatype [puppet] - 10https://gerrit.wikimedia.org/r/650492 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [13:22:14] !log Compress clouddb1018:3312 clouddb1014:3312 T270473 [13:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:20] T270473: Ensure InnoDB is compressed on the new clouddb hosts - https://phabricator.wikimedia.org/T270473 [13:25:47] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [13:29:10] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1021.eqiad.wmnet with reason: REIMAGE [13:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:21] (03PS1) 10Jbond: varnish: ratelimit vscode-phabricator plugin [puppet] - 10https://gerrit.wikimedia.org/r/650494 (https://phabricator.wikimedia.org/T270482) [13:30:25] (03PS2) 10Ladsgroup: druid: Migrate hiera() to lookup() and setting datatype [puppet] - 10https://gerrit.wikimedia.org/r/650492 (https://phabricator.wikimedia.org/T209953) [13:31:02] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2021.codfw.wmnet with reason: REIMAGE [13:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:14] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1021.eqiad.wmnet with reason: REIMAGE [13:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:20] (03PS2) 10Jbond: varnish: ratelimit vscode-phabricator plugin [puppet] - 10https://gerrit.wikimedia.org/r/650494 (https://phabricator.wikimedia.org/T270482) [13:33:24] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2021.codfw.wmnet with reason: REIMAGE [13:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:08] (03CR) 10Jbond: varnish: ratelimit vscode-phabricator plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/650494 (https://phabricator.wikimedia.org/T270482) (owner: 10Jbond) [13:35:45] (03PS3) 10David Caro: [wmcs][backups] Add project and vm info [puppet] - 10https://gerrit.wikimedia.org/r/650141 (https://phabricator.wikimedia.org/T267195) 
[13:35:47] (03PS1) 10David Caro: [wmcs][backups] Add cli see where a project/vm is backed up [puppet] - 10https://gerrit.wikimedia.org/r/650496 (https://phabricator.wikimedia.org/T267195) [13:35:49] (03PS1) 10David Caro: [wmcs][backup] Added command to show a project [puppet] - 10https://gerrit.wikimedia.org/r/650497 (https://phabricator.wikimedia.org/T267195) [13:36:55] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition=1 prometheus=ops site=eqiad topic={rsyslog-info,rsyslog-notice,rsyslog-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [13:41:04] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1021.eqiad.wmnet'] ` and were **ALL** successful. [13:43:49] (03CR) 10Elukey: [C: 04-1] "This doesn't work, kinit runs as the user that executes it and /run/kerberos at this stage does not allow any user to create files in it." [puppet] - 10https://gerrit.wikimedia.org/r/650480 (https://phabricator.wikimedia.org/T255262) (owner: 10Elukey) [13:44:33] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [13:56:11] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: Upgrade Calico - https://phabricator.wikimedia.org/T207804 (10JMeybohm) [13:56:16] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Refactor calico deploy strategy - https://phabricator.wikimedia.org/T267653 (10JMeybohm) 05Open→03Resolved This is done with calico deployed now via puppet (CNI plugins and calicoctl) as well as helm3 (helmfile.d/admin_n... [13:57:18] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition=1 prometheus=ops site=eqiad topic={rsyslog-info,rsyslog-notice,rsyslog-warning,udp_localhost-info} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [13:57:29] (03PS1) 10Elukey: kerberos: rollback settings for ccache in /run/user/$uid [puppet] - 10https://gerrit.wikimedia.org/r/650499 (https://phabricator.wikimedia.org/T255262) [13:59:35] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: Upgrade Calico - https://phabricator.wikimedia.org/T207804 (10JMeybohm) We packaged and deployed calico 3.16 to staging-codfw but currently we lack full IPv6 support due to the fact that we now require kubernetes to dual-stack/IPv...
[14:02:50] (03CR) 10Elukey: [C: 03+2] kerberos: rollback settings for ccache in /run/user/$uid [puppet] - 10https://gerrit.wikimedia.org/r/650499 (https://phabricator.wikimedia.org/T255262) (owner: 10Elukey) [14:05:52] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [14:06:17] (03CR) 10Elukey: "To keep archives happy: I rolled back this change due to some weird corner cases, check the task for more info" [puppet] - 10https://gerrit.wikimedia.org/r/649415 (https://phabricator.wikimedia.org/T255262) (owner: 10Elukey) [14:10:04] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2021.codfw.wmnet'] ` and were **ALL** successful. 
[14:15:45] (03PS1) 10Ema: varnish: add 37-docker-registry-cl-head.vtc [puppet] - 10https://gerrit.wikimedia.org/r/650502 (https://phabricator.wikimedia.org/T270270) [14:19:43] (03CR) 10Ema: [C: 03+2] varnish: add 37-docker-registry-cl-head.vtc [puppet] - 10https://gerrit.wikimedia.org/r/650502 (https://phabricator.wikimedia.org/T270270) (owner: 10Ema) [14:21:54] (03PS1) 10JMeybohm: Don't register /var/run/kubernetes (as it's unused) [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/650505 (https://phabricator.wikimedia.org/T270298) [14:25:41] (03PS2) 10Elukey: profile::kerberos::client: change alternative ccache location [puppet] - 10https://gerrit.wikimedia.org/r/650480 (https://phabricator.wikimedia.org/T255262) [14:28:28] 10Operations, 10MediaWiki-Docker, 10Traffic, 10serviceops, and 2 others: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (10ema) 05Open→03Resolved >>! In T270270#6699809, @ema wrote: > I think we can revert the no-s... [14:31:01] (03PS1) 10Elukey: service::uwsgi: fix logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/650508 [14:32:05] (03CR) 10Elukey: [C: 03+2] service::uwsgi: fix logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/650508 (owner: 10Elukey) [14:39:16] (03CR) 10CDanis: "LGTM overall but nits" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/650494 (https://phabricator.wikimedia.org/T270482) (owner: 10Jbond) [14:44:03] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Mike Pham - https://phabricator.wikimedia.org/T270438 (10RLazarus) >>! In T270438#6700941, @elukey wrote: > @RLazarus reporting a question from IRC: do we need to have L3 or something similar to be signed? I agree wi... 
[14:47:48] (03PS3) 10Awight: Add a job for TemplateWizard metrics aggregation [puppet] - 10https://gerrit.wikimedia.org/r/649662 (https://phabricator.wikimedia.org/T262209) [15:03:51] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal, 10Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10jcrespo) The proof of concept, which is not focused on performance an... [15:05:36] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Move labstore1005 to 10Gbps rack and ethernet - https://phabricator.wikimedia.org/T266199 (10Andrew) [15:07:37] 10Operations, 10MediaWiki-Docker, 10Traffic, 10serviceops, and 2 others: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (10hashar) Thank you @ema for the full explanation (and for the fix of course)! [15:08:38] (03PS1) 10Elukey: Remove stat100[5,8] hiera host configs [puppet] - 10https://gerrit.wikimedia.org/r/650510 [15:09:58] (03CR) 10Elukey: [C: 03+1] admin: Add mttp to analytics-privatedata-users, but with no SSH. [puppet] - 10https://gerrit.wikimedia.org/r/650298 (https://phabricator.wikimedia.org/T270438) (owner: 10RLazarus) [15:12:45] (03PS2) 10RLazarus: admin: Add mttp to analytics-privatedata-users, but with no SSH. [puppet] - 10https://gerrit.wikimedia.org/r/650298 (https://phabricator.wikimedia.org/T270438) [15:12:49] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27207/console" [puppet] - 10https://gerrit.wikimedia.org/r/650510 (owner: 10Elukey) [15:13:02] (03CR) 10Elukey: [V: 03+1 C: 03+2] Remove stat100[5,8] hiera host configs [puppet] - 10https://gerrit.wikimedia.org/r/650510 (owner: 10Elukey) [15:13:44] (03CR) 10RLazarus: [C: 03+2] admin: Add mttp to analytics-privatedata-users, but with no SSH. 
[puppet] - 10https://gerrit.wikimedia.org/r/650298 (https://phabricator.wikimedia.org/T270438) (owner: 10RLazarus) [15:13:46] 10Operations, 10ops-eqiad, 10Analytics-Clusters: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (10Cmjohnson) The part did not arrive in time on Monday for the tech to get here and then a snow/ice storm delayed the tech. We rescheduled this for this coming Mond... [15:21:18] (03PS1) 10Bstorm: cloud nfs: change primary interface of labstore1005 to 10G [puppet] - 10https://gerrit.wikimedia.org/r/650513 (https://phabricator.wikimedia.org/T266199) [15:21:58] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Mike Pham - https://phabricator.wikimedia.org/T270438 (10RLazarus) 05Open→03Resolved @MPhamWMF You're all set! After we discussed a bit more, consensus is that you //don't// need to sign L3. It may take up to 30 mi... [15:22:40] (03CR) 10Andrew Bogott: [C: 03+1] cloud nfs: change primary interface of labstore1005 to 10G [puppet] - 10https://gerrit.wikimedia.org/r/650513 (https://phabricator.wikimedia.org/T266199) (owner: 10Bstorm) [15:28:10] (03CR) 10Bstorm: [C: 03+2] cloud nfs: change primary interface of labstore1005 to 10G [puppet] - 10https://gerrit.wikimedia.org/r/650513 (https://phabricator.wikimedia.org/T266199) (owner: 10Bstorm) [15:29:44] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:41:38] (03PS3) 10Razzi: role::analytics_cluster::ui::dashboards: Add superset to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/650179 (https://phabricator.wikimedia.org/T268219) [15:43:10] (03CR) 10Mforns: "@elukey, this can be merged now, if you agree!" 
[puppet] - 10https://gerrit.wikimedia.org/r/649660 (https://phabricator.wikimedia.org/T262209) (owner: 10Awight) [15:46:14] (03CR) 10Bstorm: [C: 03+1] "Aggressive! It might be a good idea though. We don't run anything else on them." [puppet] - 10https://gerrit.wikimedia.org/r/650470 (https://phabricator.wikimedia.org/T267966) (owner: 10Arturo Borrero Gonzalez) [15:49:18] (03PS1) 10Elukey: Port the decorators.py module from Spicerack [software/pywmflib] - 10https://gerrit.wikimedia.org/r/650518 (https://phabricator.wikimedia.org/T257905) [15:50:39] (03CR) 10SBassett: "> This'll need to wait for wmf.26 to be branched and pulled onto the deployment host" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521233 (https://phabricator.wikimedia.org/T181217) (owner: 10Reedy) [15:51:40] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:39] (03PS1) 10JMeybohm: Restart systemd units on package upgrade [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/650521 (https://phabricator.wikimedia.org/T270302) [16:00:54] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:50] (03CR) 10Elukey: [C: 03+2] Add a job for some visualeditor metrics aggregation [puppet] - 10https://gerrit.wikimedia.org/r/649660 (https://phabricator.wikimedia.org/T262209) (owner: 10Awight) [16:03:46] (03PS1) 10Razzi: superset: Switch traffic from analytics-tool1004 to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/650522 [16:08:46] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:15:20] (03CR) 10Mforns: "I merged the dependency patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/649662 (https://phabricator.wikimedia.org/T262209) (owner: 10Awight) [16:19:48] !log restart logstash on logstash2004 [16:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:05] (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/27208/" [puppet] - 10https://gerrit.wikimedia.org/r/650492 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [16:27:20] (03CR) 10Volans: [C: 03+1] "LGTM, it remains to find a good way to use this in spicerack keeping the dry-run automagic reset of tries to 1 in DRY-RUN mode without ful" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/650518 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [16:28:54] (03CR) 10Elukey: [C: 03+2] Port the decorators.py module from Spicerack [software/pywmflib] - 10https://gerrit.wikimedia.org/r/650518 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [16:45:13] (03PS1) 10Andrew Bogott: cloudweb2001-dev: move from Stretch to Buster [puppet] - 10https://gerrit.wikimedia.org/r/650529 (https://phabricator.wikimedia.org/T269004) [16:46:50] (03CR) 10Andrew Bogott: [C: 03+2] cloudweb2001-dev: move from Stretch to Buster [puppet] - 10https://gerrit.wikimedia.org/r/650529 (https://phabricator.wikimedia.org/T269004) (owner: 10Andrew Bogott) [16:53:11] 10Operations, 10Diff-blog, 10Traffic, 10HTTPS: Send HSTS header on diff.wikimedia.org - https://phabricator.wikimedia.org/T270034 (10RLazarus) a:03RLazarus Emailed Comms about it, will route this appropriately when I hear back. [16:55:57] 10Operations, 10Technical-blog-posts, 10Traffic: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T270074 (10srodlund) @ema these should all be fixed now. :-) I'll send out an announcement today. 
[16:57:46] (03PS1) 10Ssingh: dnsdist: update configuration variables in dnsdist.conf [puppet] - 10https://gerrit.wikimedia.org/r/650532 (https://phabricator.wikimedia.org/T252132) [16:59:54] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27209/console" [puppet] - 10https://gerrit.wikimedia.org/r/650532 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [17:01:03] (03CR) 10Bstorm: P:toolforge: migrate to ensure_packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/639826 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [17:02:22] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsdist: update configuration variables in dnsdist.conf [puppet] - 10https://gerrit.wikimedia.org/r/650532 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [17:02:47] andrewbogott: ok to merge your change? [17:03:16] yes please [17:03:19] thanks [17:05:25] (03CR) 10Bstorm: [C: 03+1] "Seems like a good idea before the weekend." [puppet] - 10https://gerrit.wikimedia.org/r/650178 (https://phabricator.wikimedia.org/T269419) (owner: 10David Caro) [17:11:14] 10Operations, 10Diff-blog, 10Traffic, 10HTTPS: Send HSTS header on diff.wikimedia.org - https://phabricator.wikimedia.org/T270034 (10RLazarus) a:05RLazarus→03Varnent Thanks @Varnent for offering to look at this, as our primary contact with VIP. It turns out two other VIP-hosted domains, techblog.wikime... 
[17:11:44] 10Operations, 10Diff-blog, 10Traffic, 10HTTPS: Send HSTS header on all VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10RLazarus) p:05Triage→03Medium [17:15:12] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 10.57 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [17:16:47] (03CR) 10David Caro: [C: 03+2] [wmcs] Move some heavy backups to cloudvirt1026 [puppet] - 10https://gerrit.wikimedia.org/r/650178 (https://phabricator.wikimedia.org/T269419) (owner: 10David Caro) [17:18:00] (03CR) 10David Caro: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/650178 (https://phabricator.wikimedia.org/T269419) (owner: 10David Caro) [17:26:00] (03PS4) 10David Caro: [wmcs][backups] Add project and vm info [puppet] - 10https://gerrit.wikimedia.org/r/650141 (https://phabricator.wikimedia.org/T267195) [17:26:02] (03PS2) 10David Caro: [wmcs][backups] Add cli see where a project/vm is backed up [puppet] - 10https://gerrit.wikimedia.org/r/650496 (https://phabricator.wikimedia.org/T267195) [17:26:04] (03PS2) 10David Caro: [wmcs][backup] Added command to show a project [puppet] - 10https://gerrit.wikimedia.org/r/650497 (https://phabricator.wikimedia.org/T267195) [17:26:06] (03PS1) 10David Caro: [wmcs][backup] Add command to remove/print dangling snapshots [puppet] - 10https://gerrit.wikimedia.org/r/650535 (https://phabricator.wikimedia.org/T270478) [17:26:44] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudweb2001-dev.wikimedia.org with reason: REIMAGE [17:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:46] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudweb2001-dev.wikimedia.org with reason: REIMAGE [17:28:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:19] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install rdb101[12] - https://phabricator.wikimedia.org/T266724 (10Cmjohnson) [17:33:43] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install rdb101[12] - https://phabricator.wikimedia.org/T266724 (10Cmjohnson) a:05Cmjohnson→03RobH @robh these are ready for you [17:34:53] 10Operations, 10ops-eqiad: Interface errors on cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T267672 (10Cmjohnson) 05Open→03Resolved No new errors since I replace the fiber. Resolving this [17:36:06] 10Operations, 10ops-eqiad, 10decommission-hardware: Reclaim torrelay1001 to spares - https://phabricator.wikimedia.org/T243390 (10Cmjohnson) 05Open→03Resolved disks wiped [17:40:25] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:03] (03CR) 10Tchanders: Add IPInfo extension config to InitialiseSettings-labs.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [17:44:07] (03PS14) 10Jcrespo: [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 [17:54:00] (03PS1) 10David Caro: [wmcs][backup] Remove all temp files after usage [puppet] - 10https://gerrit.wikimedia.org/r/650542 (https://phabricator.wikimedia.org/T270478) [17:59:58] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:13] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on labstore1005.eqiad.wmnet with reason: REIMAGE [18:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:14] (03PS1) 10Elukey: Port IRCSocketHandler from Spickerack and create 
irc_utils.py [software/pywmflib] - 10https://gerrit.wikimedia.org/r/650546 (https://phabricator.wikimedia.org/T257905) [18:07:17] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on labstore1005.eqiad.wmnet with reason: REIMAGE [18:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:21] (03CR) 10Elukey: "@Volans: naming of the module etc.. can be changed of course, this is my proposal but lemme know your thoughts :)" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/650546 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [18:08:41] (03PS3) 10Elukey: admin: remove access for user dstrine [puppet] - 10https://gerrit.wikimedia.org/r/645275 (https://phabricator.wikimedia.org/T268801) [18:09:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission es1019.eqiad.wmnet - https://phabricator.wikimedia.org/T270159 (10Cmjohnson) 05Open→03Resolved Removed from rack [18:10:03] (03CR) 10Elukey: [C: 03+2] druid: Migrate hiera() to lookup() and setting datatype [puppet] - 10https://gerrit.wikimedia.org/r/650492 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [18:10:06] 10Operations, 10ops-eqiad, 10Analytics-Clusters: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson Dell tech arrived today, swapped the raid controller. All disks are now online. resolving [18:11:25] (03PS2) 10Razzi: superset: Switch traffic from analytics-tool1004 to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/650522 [18:11:34] 10Operations, 10ops-eqiad, 10DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (10Cmjohnson) @ayounsi is this still needed now that we switched to using netbox and homer? [18:11:39] (03PS1) 10RLazarus: cumin: Don't complain if A:spare matches no hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/650547 [18:13:20] (03CR) 10jerkins-bot: [V: 04-1] cumin: Don't complain if A:spare matches no hosts. [puppet] - 10https://gerrit.wikimedia.org/r/650547 (owner: 10RLazarus) [18:14:25] (03PS2) 10RLazarus: cumin: Don't complain if A:spare matches no hosts. [puppet] - 10https://gerrit.wikimedia.org/r/650547 [18:15:29] (03CR) 10RLazarus: [C: 03+1] admin: remove access for user dstrine [puppet] - 10https://gerrit.wikimedia.org/r/645275 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [18:16:16] (03CR) 10Elukey: [C: 03+2] admin: remove access for user dstrine [puppet] - 10https://gerrit.wikimedia.org/r/645275 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [18:18:53] ACKNOWLEDGEMENT - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. RLazarus cumin-check-aliases.service: False alarm, to be fixed in https://gerrit.wikimedia.org/r/650547/ https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:53] ACKNOWLEDGEMENT - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
RLazarus cumin-check-aliases.service: False alarm, to be fixed in https://gerrit.wikimedia.org/r/650547/ https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:19:37] (03PS7) 10Razzi: sqoop: Ensure /tmp/sqoop-jars/ is present [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) [18:20:59] (03PS1) 10Dduvall: pipeline: Define wmf-publish image build pipeline [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/650363 (https://phabricator.wikimedia.org/T269617) [18:21:33] (03CR) 10Dduvall: "check experimental" [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/650363 (https://phabricator.wikimedia.org/T269617) (owner: 10Dduvall) [18:23:43] * razzi afk for lunch [18:36:05] !log andrew@deploy1001 Started deploy [horizon/deploy@89b308c]: update codfw1dev deploy [18:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:13] !log andrew@deploy1001 Finished deploy [horizon/deploy@89b308c]: update codfw1dev deploy (duration: 00m 09s) [18:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:52] !log andrew@deploy1001 Started deploy [horizon/deploy@89b308c]: update codfw1dev deploy [18:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:17] 10Operations, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10RLazarus) p:05Triage→03Medium [18:38:09] 10Operations, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10RLazarus) [18:38:14] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10RLazarus) [18:38:47] ACKNOWLEDGEMENT - PHP opcache health on mw1265 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% RLazarus T270517 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:38:47] ACKNOWLEDGEMENT - PHP opcache health on mwdebug1003 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% RLazarus T270517 https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:39:09] !log andrew@deploy1001 Finished deploy [horizon/deploy@89b308c]: update codfw1dev deploy (duration: 02m 17s) [18:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:04] 10Operations, 10Discovery-Search (Current work): Reshard commonswiki_file elasticsearch index - https://phabricator.wikimedia.org/T260083 (10RLazarus) [18:49:23] (03CR) 10Dduvall: [C: 03+2] pipeline: Define wmf-publish image build pipeline [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/650363 (https://phabricator.wikimedia.org/T269617) (owner: 10Dduvall) [18:51:01] (03PS3) 10Bstorm: etcd: make snapshot interval configurable [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) [18:52:45] (03CR) 10jerkins-bot: [V: 04-1] etcd: make snapshot interval configurable [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm) [18:54:46] (03PS1) 10Andrew Bogott: Change labstore1005 to role(insetup) so that we can log in [puppet] - 10https://gerrit.wikimedia.org/r/650556 (https://phabricator.wikimedia.org/T266199) [18:56:32] !log nskaggs@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [18:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:13] (03CR) 10Bstorm: [C: 03+1] "This seems like a valid approach, thought I'd like to check the console first." 
[puppet] - 10https://gerrit.wikimedia.org/r/650556 (https://phabricator.wikimedia.org/T266199) (owner: 10Andrew Bogott) [19:00:05] (03CR) 10Andrew Bogott: [C: 03+2] Change labstore1005 to role(insetup) so that we can log in [puppet] - 10https://gerrit.wikimedia.org/r/650556 (https://phabricator.wikimedia.org/T266199) (owner: 10Andrew Bogott) [19:07:23] !log nskaggs@cumin1001 Added views for new wiki: madwiki T269440 [19:07:23] !log nskaggs@cumin1001 END (PASS) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=0) [19:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:26] T269440: Prepare and check storage layer for madwiki - https://phabricator.wikimedia.org/T269440 [19:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:26] (03PS1) 10Andrew Bogott: Revert "Change labstore1005 to role(insetup) so that we can log in" [puppet] - 10https://gerrit.wikimedia.org/r/650566 [19:13:03] (03CR) 10Bstorm: [C: 03+1] Revert "Change labstore1005 to role(insetup) so that we can log in" [puppet] - 10https://gerrit.wikimedia.org/r/650566 (owner: 10Andrew Bogott) [19:13:17] (03CR) 10Jforrester: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521233 (https://phabricator.wikimedia.org/T181217) (owner: 10Reedy) [19:13:56] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Change labstore1005 to role(insetup) so that we can log in" [puppet] - 10https://gerrit.wikimedia.org/r/650566 (owner: 10Andrew Bogott) [19:17:27] hmm.. no IRC updates for Phab ticket comments? 
[19:19:21] (03PS1) 10Bstorm: cloud nfs: correct interfaces for 10G change [puppet] - 10https://gerrit.wikimedia.org/r/650562 (https://phabricator.wikimedia.org/T266199) [19:21:33] wikibooogs might've just lost the phab parser connection [19:22:04] they can be restarted separately [19:22:15] !log bug: wikibugs stopped reporting bugs, attempting to restart bug bot to continue reporting bugs [19:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:33] (03CR) 10Andrew Bogott: [C: 03+1] cloud nfs: correct interfaces for 10G change [puppet] - 10https://gerrit.wikimedia.org/r/650562 (https://phabricator.wikimedia.org/T266199) (owner: 10Bstorm) [19:22:53] (03CR) 10Bstorm: [C: 03+2] cloud nfs: correct interfaces for 10G change [puppet] - 10https://gerrit.wikimedia.org/r/650562 (https://phabricator.wikimedia.org/T266199) (owner: 10Bstorm) [19:23:06] (03Merged) 10jenkins-bot: pipeline: Define wmf-publish image build pipeline [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/650363 (https://phabricator.wikimedia.org/T269617) (owner: 10Dduvall) [19:23:44] (03PS1) 10Dduvall: pipeline: Fix malformed pipeline config [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/650567 [19:24:15] (03CR) 10Dduvall: "check experimental" [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/650567 (owner: 10Dduvall) [19:24:50] !log tools.wikibugs@tools-sgebastion-07:~/wikibugs2$ qdel 1766104 [19:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:34] marcaur: how to restart just the one job? The docs say to kill them all and wait and then restart them all [19:25:50] let me fetch mutante [19:25:55] but I guess it's not important either? [19:26:05] I have killed all 3 before multiple times. [19:26:48] python3 manage.py restart_job wb2-phab [19:27:06] thanks!
doing that [19:27:24] or you can drop the nuke and restart everything [19:27:27] denied: host "tools-sgebastion-07.tools.eqiad.wmflabs" is not an admin host [19:27:30] :p [19:27:42] but it did not deny me when i killed it [19:27:47] interesting combo [19:27:48] pkill? [19:27:54] or qdel ? [19:27:59] qdel [19:28:22] ok, going the known route as described on wikitech [19:28:26] the most updated docs seem to be https://www.mediawiki.org/wiki/Wikibugs [19:28:28] mutante: ^ [19:28:35] qstat/qdel ..wait .. restart all [19:28:41] not wikitech [19:28:47] or was? [19:28:49] idk [19:28:56] ack, yea, I was confused because I just got redirected from A to B [19:28:56] not my problem anymore :D [19:29:37] I dunno why but there seems to be a trend to move stuff to mediawiki.org [19:30:21] there it goes after I used qdel on the remaining 2 jobs [19:30:32] !log nskaggs@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [19:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:10] !log restarted wikibugs (phab, gerrit and irc jobs) [19:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:27] 10Operations, 10Wikibugs: wikibugs test bug part II - https://phabricator.wikimedia.org/T90594 (10Dzahn) [19:33:58] 10Operations, 10Wikibugs: wikibugs test bug part II - https://phabricator.wikimedia.org/T90594 (10Dzahn) using test bug to test bug bot [19:34:13] marcaur: ^ works, thanks [19:35:32] * marcaur bows [19:36:03] (03CR) 10Volans: "I agree with the functionality port, a comment to make it slightly more general inline." (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/650546 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [19:37:29] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the addition" [puppet] - 10https://gerrit.wikimedia.org/r/650547 (owner: 10RLazarus) [19:38:00] (03CR) 10RLazarus: [C: 03+2] "Thanks for the quick review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/650547 (owner: 10RLazarus) [19:39:13] (03PS1) 10Bstorm: cloud nfs: fix broken file reference [puppet] - 10https://gerrit.wikimedia.org/r/650586 (https://phabricator.wikimedia.org/T266199) [19:40:25] (03PS1) 10Andrew Bogott: Add backend entry for labtesttoolsadmin [puppet] - 10https://gerrit.wikimedia.org/r/650587 (https://phabricator.wikimedia.org/T269004) [19:40:54] (03CR) 10Andrew Bogott: [C: 03+1] cloud nfs: fix broken file reference [puppet] - 10https://gerrit.wikimedia.org/r/650586 (https://phabricator.wikimedia.org/T266199) (owner: 10Bstorm) [19:41:04] (03CR) 10Bstorm: [C: 03+2] cloud nfs: fix broken file reference [puppet] - 10https://gerrit.wikimedia.org/r/650586 (https://phabricator.wikimedia.org/T266199) (owner: 10Bstorm) [19:41:19] !log nskaggs@cumin1001 Added views for new wiki: skrwiki T268412 [19:41:19] !log nskaggs@cumin1001 END (PASS) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=0) [19:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:22] T268412: Prepare and check storage layer for skrwiki - https://phabricator.wikimedia.org/T268412 [19:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:30] 10Operations, 10Traffic: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (10RLazarus) I can't repro the CORS issue exactly, but I am getting a 503 from Varnish for `https://upload.wikimedia.org/wikipedia/commons/thumb/b/ba/Circuit_de_la_Sarthe_track_map.svg/2880px-Circuit_de... [20:08:30] 10Operations, 10Traffic: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (10RLazarus) >>! In T270209#6702440, @RLazarus wrote: > only on the 2880px- URL -- the smaller ones work fine. Meant to add -- that's also the reason for this: >>! In T270209#6697248, @RoySmith wrote... 
[20:20:59] 10Operations, 10Traffic: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (10RoySmith) Yeah, I can repro that here. On the command line: curl -v https://upload.wikimedia.org/wikipedia/commons/thumb/b/ba/Circuit_de_la_Sarthe_track_map.svg/2560px-Circuit_de_la_Sarthe_track_ma... [20:25:25] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/compiler1001/27211/" [puppet] - 10https://gerrit.wikimedia.org/r/650587 (https://phabricator.wikimedia.org/T269004) (owner: 10Andrew Bogott) [20:26:35] (03CR) 10Andrew Bogott: [C: 03+2] Add backend entry for labtesttoolsadmin [puppet] - 10https://gerrit.wikimedia.org/r/650587 (https://phabricator.wikimedia.org/T269004) (owner: 10Andrew Bogott) [20:43:35] (03PS1) 10Nikki Nikkhoui: Enable gitiles named anchor [puppet] - 10https://gerrit.wikimedia.org/r/650598 (https://phabricator.wikimedia.org/T269300) [20:44:15] (03CR) 10Nikki Nikkhoui: "Very unsure how to actually test this works, just my best attempt!" 
[puppet] - 10https://gerrit.wikimedia.org/r/650598 (https://phabricator.wikimedia.org/T269300) (owner: 10Nikki Nikkhoui) [20:45:22] (03PS2) 10Nikki Nikkhoui: Enable gitiles named anchor [puppet] - 10https://gerrit.wikimedia.org/r/650598 (https://phabricator.wikimedia.org/T269300) [20:45:51] (03CR) 10jenkins-bot: [V: 04-1] Enable gitiles named anchor [puppet] - 10https://gerrit.wikimedia.org/r/650598 (https://phabricator.wikimedia.org/T269300) (owner: 10Nikki Nikkhoui) [20:47:20] (03PS3) 10Nikki Nikkhoui: Enable gitiles named anchor [puppet] - 10https://gerrit.wikimedia.org/r/650598 (https://phabricator.wikimedia.org/T269300) [20:47:50] (03PS1) 10Bstorm: cloud nfs: fix custom fact output while syncing drbd [puppet] - 10https://gerrit.wikimedia.org/r/650599 (https://phabricator.wikimedia.org/T266199) [20:48:38] (03CR) 10Bstorm: "locally, this produces the expected output in IRB" [puppet] - 10https://gerrit.wikimedia.org/r/650599 (https://phabricator.wikimedia.org/T266199) (owner: 10Bstorm) [20:55:02] (03CR) 10Bstorm: [C: 03+2] cloud nfs: fix custom fact output while syncing drbd [puppet] - 10https://gerrit.wikimedia.org/r/650599 (https://phabricator.wikimedia.org/T266199) (owner: 10Bstorm) [20:59:09] 10Operations, 10Thumbor, 10serviceops: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (10RLazarus) Thanks to @CDanis for digging into this with me. There are a couple of different things going on. # The thumbor service has a lot of trouble producing a 2880-pixel version... [21:13:39] 10Operations, 10Traffic: Set CORS headers on error pages? - https://phabricator.wikimedia.org/T270526 (10RLazarus) p:05Triage→03Medium [21:13:56] 10Operations, 10Traffic: Set CORS headers on error pages? 
- https://phabricator.wikimedia.org/T270526 (10RLazarus) [21:13:58] 10Operations, 10Thumbor, 10serviceops: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (10RLazarus) [21:29:53] 10Operations, 10Thumbor, 10serviceops: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (10RLazarus) [21:30:30] 10Operations, 10Thumbor, 10serviceops: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (10AntiCompositeNumber) > When I fetched it directly, the request hung for a minute or so before coming back as a 503, likely because of a timeout somewhere in the stack. Yup, Thumbor g... [21:34:02] 10Operations, 10Thumbor, 10serviceops: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (10RLazarus) Ah, cheers @AntiCompositeNumber, I thought it sounded familiar. So it sounds like this will get better with T216815, which I think I've heard mutterings about doing Soon™. [21:45:29] 10Operations, 10Thumbor, 10serviceops: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (10RLazarus) So, since the root cause here is a known issue, I'm duping this ticket over to the SVG rendering issue, but thanks @RoySmith for filing -- the subtasks are still open and wi... 
[21:46:03] 10Operations, 10Thumbor, 10serviceops: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (10RLazarus) [21:52:35] !log flushing gerrit project cache [21:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:32] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Traffic: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (10AntiCompositeNumber) [21:54:03] 10Operations, 10Thumbor, 10serviceops: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (10AntiCompositeNumber) [22:43:51] (03CR) 10Thcipriani: [C: 03+1] Enable gitiles named anchor [puppet] - 10https://gerrit.wikimedia.org/r/650598 (https://phabricator.wikimedia.org/T269300) (owner: 10Nikki Nikkhoui) [23:14:42] (03PS1) 10Ladsgroup: druid: Migrate hiera() to lookup() and setting datatype in middlemanager [puppet] - 10https://gerrit.wikimedia.org/r/650617 (https://phabricator.wikimedia.org/T209953) [23:18:27] (03PS1) 10Dzahn: doc: don't include envoy when in cloud, allowing tests [puppet] - 10https://gerrit.wikimedia.org/r/650618 [23:20:57] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/27212/" [puppet] - 10https://gerrit.wikimedia.org/r/650618 (owner: 10Dzahn) [23:22:46] (03CR) 10Dzahn: "I wonder how you were able to use this on "doc" before?" [puppet] - 10https://gerrit.wikimedia.org/r/650618 (owner: 10Dzahn) [23:24:08] (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/27213/" [puppet] - 10https://gerrit.wikimedia.org/r/650617 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [23:24:42] (03CR) 10Dzahn: "ah.. you used the profile class directly .. 
I see" [puppet] - 10https://gerrit.wikimedia.org/r/650618 (owner: 10Dzahn) [23:25:37] (03CR) 10Dzahn: "The point of this is "the doc role works just fine on buster, there is no need to keep this on stretch and switch is easy".https://phabric" [puppet] - 10https://gerrit.wikimedia.org/r/650618 (owner: 10Dzahn) [23:34:54] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/27217/" [puppet] - 10https://gerrit.wikimedia.org/r/649708 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [23:35:23] 10Operations, 10Thumbor, 10serviceops: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (10RLazarus) Doh, thanks @AntiCompositeNumber. :) [23:37:08] (03CR) 10Dzahn: "noop on conf2001, conf1004" [puppet] - 10https://gerrit.wikimedia.org/r/649708 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [23:42:19] (03CR) 10Dzahn: [C: 03+2] Enable gitiles named anchor [puppet] - 10https://gerrit.wikimedia.org/r/650598 (https://phabricator.wikimedia.org/T269300) (owner: 10Nikki Nikkhoui) [23:50:36] (03CR) 10Dduvall: [C: 03+2] pipeline: Fix malformed pipeline config [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/650567 (owner: 10Dduvall) [23:51:16] (03PS1) 10Dzahn: site: apply doc role on doc1002 [puppet] - 10https://gerrit.wikimedia.org/r/650620 (https://phabricator.wikimedia.org/T247653) [23:52:10] (03PS2) 10Dzahn: site: apply doc role on doc1002 [puppet] - 10https://gerrit.wikimedia.org/r/650620 (https://phabricator.wikimedia.org/T247653) [23:52:36] (03CR) 10Dzahn: [C: 03+2] site: apply doc role on doc1002 [puppet] - 10https://gerrit.wikimedia.org/r/650620 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)