[11:49:04] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:00] 10SRE, 10ops-eqiad, 10Discovery, 10Discovery-Search: elastic1060 reported errors in getsel - https://phabricator.wikimedia.org/T278630 (10Gehel) [11:56:28] (03PS1) 10Zabe: Enable assignment of importupload on enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675498 (https://phabricator.wikimedia.org/T278683) [11:57:51] !log cp4027: re-enable JIT compilation in normalize-path.lua -- https://github.com/apache/trafficserver/issues/7423 [11:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:22] (03CR) 10Kosta Harlan: [C: 03+1] api-gateway: make discovery service timeouts configurable per service [deployment-charts] - 10https://gerrit.wikimedia.org/r/674562 (https://phabricator.wikimedia.org/T277297) (owner: 10Hnowlan) [12:14:10] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:15:55] (03PS1) 10ArielGlenn: Rename PerWorkerJobBatches to JobBatch [dumps] - 10https://gerrit.wikimedia.org/r/675499 (https://phabricator.wikimedia.org/T252396) [12:16:50] (03CR) 10jerkins-bot: [V: 04-1] Rename PerWorkerJobBatches to JobBatch [dumps] - 10https://gerrit.wikimedia.org/r/675499 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [12:26:51] (03PS2) 10ArielGlenn: Rename PerWorkerJobBatches to JobBatch [dumps] - 10https://gerrit.wikimedia.org/r/675499 (https://phabricator.wikimedia.org/T252396) [12:32:33] (03PS1) 10ArielGlenn: suppress stdout/error for test unless new option --verbose is passed [dumps] - 10https://gerrit.wikimedia.org/r/675500 [12:35:00] (03CR) 10Elukey: [C: 03+1] profile::aqs_next: add stub password [labs/private] - 10https://gerrit.wikimedia.org/r/675174 
(https://phabricator.wikimedia.org/T249755) (owner: 10Hnowlan) [12:36:15] !log cp4027: re-enable JIT compilation in all ats-be lua scripts -- https://github.com/apache/trafficserver/issues/7423 [12:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:38] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10hashar) [12:41:12] (03PS1) 10ArielGlenn: display errors from failed commands correctly for recompress job [dumps] - 10https://gerrit.wikimedia.org/r/675501 [12:42:24] (03PS1) 10Urbanecm: urbanecm's dotfiles: Remove wikimedia.org from no_proxy [puppet] - 10https://gerrit.wikimedia.org/r/675502 [12:47:50] (03PS1) 10Majavah: beta: Switch traffic to deployment-mediawiki11 [puppet] - 10https://gerrit.wikimedia.org/r/675503 [13:03:09] !log cp4027: rollback luajit experiment https://github.com/apache/trafficserver/issues/7423#issuecomment-809354214 [13:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:35] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts... 
[13:06:36] (03PS1) 10Volans: tests: link documentation on failure [homer/public] - 10https://gerrit.wikimedia.org/r/675505 [13:08:05] (03CR) 10Volans: "Example output on failure:" [homer/public] - 10https://gerrit.wikimedia.org/r/675505 (owner: 10Volans) [13:08:32] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [13:08:34] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [13:10:04] (03CR) 10Jbond: [C: 03+2] beta: Switch traffic to deployment-mediawiki11 [puppet] - 10https://gerrit.wikimedia.org/r/675503 (owner: 10Majavah) [13:11:30] (03PS1) 10Effie Mouzeli: install_server: switch parsoid servers to buster [puppet] - 10https://gerrit.wikimedia.org/r/675506 (https://phabricator.wikimedia.org/T245757) [13:13:52] (03CR) 10Jbond: [C: 03+2] Switch firejail-convert to python3 [puppet] - 10https://gerrit.wikimedia.org/r/674684 (https://phabricator.wikimedia.org/T247364) (owner: 10RhinosF1) [13:15:46] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/675506 (https://phabricator.wikimedia.org/T245757) (owner: 10Effie Mouzeli) [13:16:50] (03CR) 10Zabe: [C: 04-1] "in discussion" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675498 (https://phabricator.wikimedia.org/T278683) (owner: 10Zabe) [13:17:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:19:11] (03CR) 10Effie Mouzeli: [C: 03+2] install_server: switch parsoid servers to buster [puppet] - 10https://gerrit.wikimedia.org/r/675506 (https://phabricator.wikimedia.org/T245757) (owner: 10Effie Mouzeli) [13:19:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:24:36] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2001.codfw.wmnet with reason: REIMAGE [13:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:40] effie: ^ I'm assuming that means that it should be safe to update deployment-prep parsoid servers to buster? [13:26:33] (03CR) 10Ema: [C: 03+2] Allow access to the Maps service from MediaWiki-Vagrant [puppet] - 10https://gerrit.wikimedia.org/r/674009 (owner: 10Andrew-WMDE) [13:26:40] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2001.codfw.wmnet with reason: REIMAGE [13:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:07] (03PS1) 10Phuedx: vector: Disable WVUI search widget treatment A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675509 (https://phabricator.wikimedia.org/T276917) [13:30:58] (03PS1) 10Ema: varnish: add Mediawiki-Vagrant case to 21-maps.vtc [puppet] - 10https://gerrit.wikimedia.org/r/675510 [13:44:10] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] profile::aqs_next: add stub password [labs/private] - 10https://gerrit.wikimedia.org/r/675174 (https://phabricator.wikimedia.org/T249755) (owner: 10Hnowlan) [13:48:41] (03CR) 10Ema: [C: 03+2] varnish: add Mediawiki-Vagrant case to 21-maps.vtc [puppet] - 10https://gerrit.wikimedia.org/r/675510 (owner: 10Ema) 
[14:13:17] Majavah: sorry I didn't see that earlier, yes, we do not expect any issues [14:13:49] effie: ack, thanks! copying from -sre, is it expected that newly installed parsoid nodes have parsoid/js installed and running? I thought everything was already migrated to parsoid/php [14:16:35] I guess that would be https://gerrit.wikimedia.org/r/c/operations/puppet/+/577044/? [14:18:50] (03PS1) 10Hashar: contint: serve compressed json as application/json [puppet] - 10https://gerrit.wikimedia.org/r/675515 (https://phabricator.wikimedia.org/T249268) [14:18:51] to my knowledge they are running php only [14:20:00] (03CR) 10Hashar: "I don't know how to test it." [puppet] - 10https://gerrit.wikimedia.org/r/675515 (https://phabricator.wikimedia.org/T249268) (owner: 10Hashar) [14:20:48] I just installed deployment-parsoid12 with role::parsoid and parsoid.service is definitely running on nodejs, but that puppet patch suggests that no-one has yet got around to removing js [14:48:49] (03PS2) 10Phuedx: ULS: Remove unused ULSEventLogging variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670224 (https://phabricator.wikimedia.org/T275894) [15:15:39] (03CR) 10Jbond: [C: 03+2] urbanecm's dotfiles: Remove wikimedia.org from no_proxy [puppet] - 10https://gerrit.wikimedia.org/r/675502 (owner: 10Urbanecm) [15:16:14] thanks jbond42 [15:20:15] (03CR) 10Ema: [C: 03+2] varnish::instance: drop use of array_concat [puppet] - 10https://gerrit.wikimedia.org/r/661769 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [15:20:53] Urbanecm: np [15:25:48] 10SRE, 10ops-eqiad, 10Discovery, 10Discovery-Search (Current work): elastic1060 reported errors in getsel - https://phabricator.wikimedia.org/T278630 (10Gehel) [15:25:57] (03PS5) 10Ema: Remove comment about "Same regex as above in https_recv_redirect" [puppet] - 10https://gerrit.wikimedia.org/r/606457 (owner: 10Reedy) [15:26:24] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 
others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10jijiki) I have reimaged parse2001 as a test, and it appears that puppet is unable to run successful... [15:26:45] (03CR) 10Ema: [C: 03+2] Remove comment about "Same regex as above in https_recv_redirect" [puppet] - 10https://gerrit.wikimedia.org/r/606457 (owner: 10Reedy) [15:26:54] PROBLEM - parsoid on parse2001 is CRITICAL: connect to address 10.192.0.182 and port 8000: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [15:35:09] 10SRE, 10ops-eqiad, 10Discovery, 10Discovery-Search (Current work): elastic1060 reported errors in getsel - https://phabricator.wikimedia.org/T278630 (10Gehel) [15:35:31] (03PS6) 10CRusnov: Add CAS authentication support [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) [15:35:33] (03CR) 10CRusnov: Add CAS authentication support (039 comments) [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [15:35:50] PROBLEM - mediawiki-installation DSH group on parse2001 is CRITICAL: Host parse2001 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:38:45] (03PS7) 10CRusnov: Add CAS authentication support [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) [15:39:27] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [15:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:06] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [15:41:10] PROBLEM 
- mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [15:45:37] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:24] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [15:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:41] (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron: conntrackd: disable startup resync [puppet] - 10https://gerrit.wikimedia.org/r/675530 (https://phabricator.wikimedia.org/T268335) [15:48:48] (03CR) 10CRusnov: "> It did log me in again with a different user (Volans vs volans). I noticed that IDP allow me to login with both." 
[software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [15:53:27] (03CR) 10CRusnov: "> As far as the url with the login ticket in it, this is what I've gotten when I cut and paste the URL after a successful login to another" [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [15:54:00] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: neutron: conntrackd: disable startup resync [puppet] - 10https://gerrit.wikimedia.org/r/675530 (https://phabricator.wikimedia.org/T268335) (owner: 10Arturo Borrero Gonzalez) [15:58:54] (03CR) 10Paladox: [C: 03+1] gerrit: escape remarkup for Phabricator comments [2] [puppet] - 10https://gerrit.wikimedia.org/r/675479 (https://phabricator.wikimedia.org/T93331) (owner: 10Hashar) [16:00:32] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:06:16] (03PS1) 10ArielGlenn: Only abort a fragment in a batch so many times before we fail it [dumps] - 10https://gerrit.wikimedia.org/r/675543 [16:06:50] (03PS2) 10ArielGlenn: Only abort a fragment in a batch so many times before we fail it [dumps] - 10https://gerrit.wikimedia.org/r/675543 (https://phabricator.wikimedia.org/T252396) [16:07:15] (03PS1) 10Gergő Tisza: Run GrowthExperiments listTaskCounts.php script every hour [puppet] - 10https://gerrit.wikimedia.org/r/675544 (https://phabricator.wikimedia.org/T278411) 
[16:07:51] (03CR) 10Volans: "> Patch Set 7:" (033 comments) [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [16:08:20] (03CR) 10jerkins-bot: [V: 04-1] Run GrowthExperiments listTaskCounts.php script every hour [puppet] - 10https://gerrit.wikimedia.org/r/675544 (https://phabricator.wikimedia.org/T278411) (owner: 10Gergő Tisza) [16:09:24] (03PS4) 10Zabe: Add 'editautoreviewprotected' and 'templateeditor' protection level to hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667306 (https://phabricator.wikimedia.org/T275076) [16:11:44] !log depooled aqs1004 for transfer of large tables to aqs1010 [16:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:30] (03CR) 10Jbond: "> Patch Set 7:" [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [16:15:49] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=aqs1004.eqiad.wmnet [16:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:16] (03CR) 10Jbond: "> Patch Set 7:" [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [16:18:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10aborrero) 05Resolved→03Open Would you mind if I leave the ticket open until the following are addressed in netbox? * Inform... [16:26:22] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:32:33] (03CR) 10Volans: "question inline" (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/675116 (owner: 10Ayounsi) [16:38:35] (03CR) 10Lars Wirzenius: [C: 03+1] "I don't see any problems in the Python code, but I admit I don't understand what this actually does, so I'm only CR+1'ing it." [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/675325 (owner: 1020after4) [16:46:38] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: introduce eqiad1 service implementation [puppet] - 10https://gerrit.wikimedia.org/r/675556 (https://phabricator.wikimedia.org/T270704) [16:46:40] (03CR) 10Jbond: "> > > The username bit I think a configuration change has tweaked, but I'm trying to reproduce it. Both volans and Volans should be the sa" [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [16:49:44] (03PS2) 10Gergő Tisza: Run GrowthExperiments listTaskCounts.php script every hour [puppet] - 10https://gerrit.wikimedia.org/r/675544 (https://phabricator.wikimedia.org/T278411) [16:53:54] (03CR) 10Ayounsi: alertmanager: open tasks for librenms alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675129 (https://phabricator.wikimedia.org/T225140) (owner: 10Filippo Giunchedi) [16:55:56] (03CR) 10Ayounsi: [C: 03+1] tests: link documentation on failure [homer/public] - 10https://gerrit.wikimedia.org/r/675505 (owner: 10Volans) [16:57:00] (03CR) 10Ayounsi: Look for a VC match before a device match (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/675116 (owner: 10Ayounsi) [16:57:43] 10SRE, 10DBA, 10Platform Engineering, 10Wikimedia-Incident: Appservers latency spike / parser cache 
growth 2021-03-28 - https://phabricator.wikimedia.org/T278655 (10jijiki) I am not sure if it is related, but I saw a pattern that fits the pattern in the task description. I looked for slow requests, since t... [17:00:35] (03PS1) 10Elukey: WIP - Idea about how to segment values.yaml between teams [deployment-charts] - 10https://gerrit.wikimedia.org/r/675558 (https://phabricator.wikimedia.org/T278208) [17:01:03] (03CR) 10Hnowlan: "> Patch Set 4: Code-Review+1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/674562 (https://phabricator.wikimedia.org/T277297) (owner: 10Hnowlan) [17:01:05] (03CR) 10Hnowlan: [C: 03+2] api-gateway: make discovery service timeouts configurable per service [deployment-charts] - 10https://gerrit.wikimedia.org/r/674562 (https://phabricator.wikimedia.org/T277297) (owner: 10Hnowlan) [17:02:35] (03Merged) 10jenkins-bot: api-gateway: make discovery service timeouts configurable per service [deployment-charts] - 10https://gerrit.wikimedia.org/r/674562 (https://phabricator.wikimedia.org/T277297) (owner: 10Hnowlan) [17:03:25] (03PS1) 10Majavah: beta: Add new buster jobrunner and parsoid, remove old appservers [puppet] - 10https://gerrit.wikimedia.org/r/675559 [17:03:45] (03PS1) 10Majavah: beta: add deployment-parsoid12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675560 [17:05:04] (03CR) 10Volans: [C: 03+2] tests: link documentation on failure [homer/public] - 10https://gerrit.wikimedia.org/r/675505 (owner: 10Volans) [17:06:00] (03Merged) 10jenkins-bot: tests: link documentation on failure [homer/public] - 10https://gerrit.wikimedia.org/r/675505 (owner: 10Volans) [17:06:28] (03CR) 10Elukey: "Alex/Janis - As I wrote in the task, this may be an idea (if I am not totally off about how helm works) to allow our teams to have shared " [deployment-charts] - 10https://gerrit.wikimedia.org/r/675558 (https://phabricator.wikimedia.org/T278208) (owner: 10Elukey) [17:09:14] (03PS1) 10Hnowlan: api-gateway: Fix bad variable injection 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/675562 [17:17:53] (03PS2) 10Majavah: beta: Add new buster jobrunner and parsoid, remove old appservers [puppet] - 10https://gerrit.wikimedia.org/r/675559 [17:19:55] 10SRE, 10SRE-Access-Requests: Request for adding Ladsgroup to mailman-admins group - https://phabricator.wikimedia.org/T278616 (10jijiki) p:05Triage→03Medium [17:20:09] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Clare Ming - https://phabricator.wikimedia.org/T278265 (10jijiki) p:05Triage→03Medium [17:20:18] 10SRE, 10Wikimedia-Mailing-lists, 10Mobile: Mailman on lists.wikimedia.org is not mobile friendly - https://phabricator.wikimedia.org/T190055 (10jijiki) p:05Triage→03Medium [17:21:54] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:22:39] chaomodus: anything WIP that might cause it? ^^^ [17:22:55] Not to my knowledge [17:23:19] * volans running the cookbook in test moe [17:23:23] rog [17:23:28] *mode [17:24:02] volans: i added some dns for some ganeti hosts earlier, i thought i pushed them all but if you see them then that's me [17:24:32] (03PS1) 10Elukey: WIP - kubernetes: segment the infrastructure_users puppet variable [puppet] - 10https://gerrit.wikimedia.org/r/675566 (https://phabricator.wikimedia.org/T278224) [17:26:19] 10SRE, 10Icinga, 10observability: icinga login case mismatch - https://phabricator.wikimedia.org/T275920 (10jijiki) [17:26:40] (03CR) 10Elukey: "Alex/Janis - This is a very high level idea about a possible way to keep the users config centralized, allowing a little segmentation base" [puppet] - 10https://gerrit.wikimedia.org/r/675566 (https://phabricator.wikimedia.org/T278224) (owner: 10Elukey) [17:26:44] 10SRE, 10cloud-services-team (Kanban): cloudvirt2003-dev: debian installer partman recipe prompts for actions - https://phabricator.wikimedia.org/T277014 (10jijiki) 
p:05Triage→03Medium [17:27:12] 10SRE, 10DBA, 10Wikimedia-Incident: 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (10jijiki) p:05Triage→03Medium [17:27:35] jbond42: no it's not you, it's +virt.cloudgw.eqiad1 [17:27:42] cc XioNoX [17:27:51] ack [17:28:23] it's virt.cloudgw.eqiad1.wikimediacloud.org to be precise, another not yet decided if should be driven by netbox or not zonefile [17:28:41] volans: CC arturo [17:28:46] (03PS5) 10DharmrajRathod98: Improved: timestamp validation in cli/recover-dump [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) [17:28:53] I was looking at the changelog in netbox right now [17:29:05] a.rturo is already off for today, per -cloud-admin [17:35:17] (03CR) 10Jbond: "> I had a quick look at the django code and i would say this is a bug upstream." [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [17:35:59] (03CR) 10Jbond: [C: 03+2] beta: Add new buster jobrunner and parsoid, remove old appservers [puppet] - 10https://gerrit.wikimedia.org/r/675559 (owner: 10Majavah) [17:36:49] (03CR) 10CRusnov: "> Patch Set 7:" [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [17:37:08] I'll merge the change in dns given that it's not included [17:37:17] so it's a noop [17:37:41] !log volans@cumin1001 START - Cookbook sre.dns.netbox [17:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:54] (03Abandoned) 10Legoktm: Add 5 second "greet pause" delay to lists.wikimedia.org SMTP [puppet] - 10https://gerrit.wikimedia.org/r/371958 (https://phabricator.wikimedia.org/T173143) (owner: 10Herron) [17:47:14] (03PS1) 10Gergő Tisza: GrowthExperiments: enable link recommendations backend on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675567 
(https://phabricator.wikimedia.org/T278710) [17:47:16] (03PS1) 10Gergő Tisza: GrowthExperiments: enable link recommendations backend on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675568 (https://phabricator.wikimedia.org/T278710) [17:47:18] (03PS1) 10Gergő Tisza: GrowthExperiments: enable link recommendations backend on group 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675569 (https://phabricator.wikimedia.org/T278710) [17:47:38] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:08] 10SRE, 10DBA, 10Platform Engineering, 10Wikimedia-Incident: Appservers latency spike / parser cache growth 2021-03-28 - https://phabricator.wikimedia.org/T278655 (10Pchelolo) ParserCache is split on action=render, so none of the action=render pages are usually cached. See T263581 for more details. We've t... [17:53:06] (03CR) 10Ppchelko: [C: 03+2] api-gateway: Fix bad variable injection [deployment-charts] - 10https://gerrit.wikimedia.org/r/675562 (owner: 10Hnowlan) [17:53:25] 10SRE, 10Wikimedia-Mailing-lists: Mailman sends bounce notification messages to list-admins with a subject line in Chinese language - https://phabricator.wikimedia.org/T278574 (10Legoktm) @aklapper can you provide a fully copy of the message either in a paste or private email? It would also be helpful if you c... 
[17:54:23] (03Merged) 10jenkins-bot: api-gateway: Fix bad variable injection [deployment-charts] - 10https://gerrit.wikimedia.org/r/675562 (owner: 10Hnowlan) [18:06:00] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:26:32] (03PS3) 10ArielGlenn: Only abort a fragment in a batch so many times before we fail it [dumps] - 10https://gerrit.wikimedia.org/r/675543 (https://phabricator.wikimedia.org/T252396) [18:27:15] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: db2135 crashed - https://phabricator.wikimedia.org/T278408 (10LSobanski) Wonder if this could have also been the reason for T272614. We still have 34 hosts running 10.4.13, should these be fast-tracked for an upgrade? [18:41:02] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [18:41:04] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [19:06:29] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=aqs1004.eqiad.wmnet [19:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:49:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:55:32] PROBLEM - Long running screen/tmux on centrallog1001 is CRITICAL: CRIT: Long running tmux process. (user: cdanis PID: 32451, 1730404s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [20:04:25] (03Abandoned) 10Matthias Mullie: Enable media change tags on wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674882 (https://phabricator.wikimedia.org/T266067) (owner: 10Matthias Mullie) [20:26:27] (03PS2) 10Legoktm: mailman3: Add remove_from_lists helper [puppet] - 10https://gerrit.wikimedia.org/r/675353 [20:26:29] (03PS2) 10Legoktm: mailman3: Add rsync for mailman2 archives for importing [puppet] - 10https://gerrit.wikimedia.org/r/675354 (https://phabricator.wikimedia.org/T278609) [20:26:31] (03PS2) 10Legoktm: mailman3: Use Stdlib::Fqdn type where possible [puppet] - 10https://gerrit.wikimedia.org/r/675355 [20:26:33] (03PS3) 10Legoktm: mailman3: Add discard_held_messages script and timer [puppet] - 10https://gerrit.wikimedia.org/r/675356 [20:26:35] (03PS1) 10Legoktm: mailman3: Add documentation to classes, merge hyperkitty into web [puppet] - 10https://gerrit.wikimedia.org/r/675584 [20:26:37] (03PS1) 10Legoktm: mailman3: Explicitly don't use dbconfig-mysql system [puppet] - 10https://gerrit.wikimedia.org/r/675585 (https://phabricator.wikimedia.org/T278499) [20:27:26] (03CR) 10jerkins-bot: [V: 04-1] mailman3: Add remove_from_lists helper [puppet] - 10https://gerrit.wikimedia.org/r/675353 (owner: 10Legoktm) [20:28:01] (03CR) 10Legoktm: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/675353 (owner: 10Legoktm) [20:28:57] (03CR) 10jerkins-bot: [V: 04-1] mailman3: Add discard_held_messages script and timer [puppet] - 10https://gerrit.wikimedia.org/r/675356 (owner: 10Legoktm) [20:32:56] (03PS3) 10Legoktm: mailman3: Add remove_from_lists helper [puppet] - 
10https://gerrit.wikimedia.org/r/675353 [20:32:58] (03PS3) 10Legoktm: mailman3: Add rsync for mailman2 archives for importing [puppet] - 10https://gerrit.wikimedia.org/r/675354 (https://phabricator.wikimedia.org/T278609) [20:33:00] (03PS3) 10Legoktm: mailman3: Use Stdlib::Fqdn type where possible [puppet] - 10https://gerrit.wikimedia.org/r/675355 [20:33:02] (03PS4) 10Legoktm: mailman3: Add discard_held_messages script and timer [puppet] - 10https://gerrit.wikimedia.org/r/675356 [20:33:04] (03PS2) 10Legoktm: mailman3: Add documentation to classes, merge hyperkitty into web [puppet] - 10https://gerrit.wikimedia.org/r/675584 [20:33:06] (03PS2) 10Legoktm: mailman3: Explicitly don't use dbconfig-mysql system [puppet] - 10https://gerrit.wikimedia.org/r/675585 (https://phabricator.wikimedia.org/T278499) [20:49:05] 10SRE, 10GitLab (Initialization): Offboard Oly Kalinichenko (Speed & Function) - https://phabricator.wikimedia.org/T278475 (10Legoktm) [21:00:03] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10jijiki) p:05Triage→03Medium [21:00:37] 10SRE, 10Wikimedia-IRC-RC-Server: Set up spare irc1001.wikimedia.org in eqiad - https://phabricator.wikimedia.org/T278255 (10jijiki) p:05Triage→03Medium [21:00:47] 10SRE, 10Wikimedia-Mailing-lists, 10observability: Setup monitoring for mailman3 - https://phabricator.wikimedia.org/T278280 (10jijiki) p:05Triage→03Medium [21:01:55] 10SRE, 10serviceops: bring 35 new mediawiki appserver in codfw into production (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10jijiki) p:05Triage→03High [21:02:41] 10SRE, 10Wikimedia-Mailing-lists: lists-next: “confirm” and “welcome” emails lack List-Id header - https://phabricator.wikimedia.org/T278431 (10jijiki) p:05Triage→03Medium [21:03:02] 10SRE, 10Wikimedia-Mailing-lists: lists-next: no clickable link in “confirm” email - https://phabricator.wikimedia.org/T278432 (10jijiki) p:05Triage→03Medium [21:03:16] PROBLEM - 
Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 45.42 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:03:16] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: lists-next: bad name in “welcome” email - https://phabricator.wikimedia.org/T278433 (10jijiki) p:05Triage→03Medium [21:03:44] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Improve workflow for mailman database bootstrapping and updates - https://phabricator.wikimedia.org/T278499 (10jijiki) p:05Triage→03Medium [21:04:42] 10SRE, 10Platform Engineering, 10Services, 10Wikimedia-Mailing-lists: Decide on future of public services@ mailing list (which has no maintainers) - https://phabricator.wikimedia.org/T278516 (10jijiki) p:05Triage→03Low [21:05:08] 10SRE, 10Wikimedia-Mailing-lists: Mailman sends bounce notification messages to list-admins with a subject line in Chinese language - https://phabricator.wikimedia.org/T278574 (10jijiki) p:05Triage→03Medium [21:05:36] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 84.47 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:07:06] 10SRE, 10Wikimedia-Mailing-lists: Mailman sends bounce notification messages to list-admins with a subject line in Chinese language - https://phabricator.wikimedia.org/T278574 (10Peachey88) > or was this a one-off @Nemo_bis raised this was happening the other day on irc as well (-tech iirc) [21:13:15] 10SRE, 10Wikimedia-Mailing-lists: Install mailman3 on lists1001.wikimedia.org - https://phabricator.wikimedia.org/T278610 (10jijiki) p:05Triage→03Medium [21:13:29] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10jijiki) 
p:Triage→Medium
[21:13:44] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (jijiki) p:Triage→Medium
[21:14:17] SRE, ops-eqiad, Discovery, Discovery-Search (Current work): elastic1060 reported errors in getsel - https://phabricator.wikimedia.org/T278630 (jijiki) p:Triage→High
[21:14:35] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (jijiki) p:Triage→Medium
[21:15:49] SRE, Beta-Cluster-Infrastructure: cannot curl to wiki from beta mw appservers - https://phabricator.wikimedia.org/T278599 (jijiki) p:Triage→Medium
[21:16:33] SRE, Wikimedia-Mailing-lists: Figure out plan for mailman IP situation - https://phabricator.wikimedia.org/T278495 (jijiki) p:Triage→Medium
[21:17:04] SRE, GitLab (Initialization): Offboard Oly Kalinichenko (Speed & Function) - https://phabricator.wikimedia.org/T278475 (jijiki) p:Triage→High
[21:18:06] SRE, Icinga, observability: icinga login case mismatch - https://phabricator.wikimedia.org/T275920 (jijiki) p:Triage→Medium
[21:30:48] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[21:38:06] (CR) Kosta Harlan: Run GrowthExperiments listTaskCounts.php script every hour (1 comment) [puppet] - https://gerrit.wikimedia.org/r/675544 (https://phabricator.wikimedia.org/T278411) (owner: Gergő Tisza)
[21:42:16] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals
cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[22:04:58] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:05:22] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:05:30] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:10:56] SRE, SRE-tools, IPv6, User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (crusnov) It appears the kafka-main2* cluster is indeed listening on ipv6, it just seems to need DNS (especially in the face of the eqiad ones already having this D...
[22:28:32] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 59, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:04:46] (CR) Gergő Tisza: Run GrowthExperiments listTaskCounts.php script every hour (1 comment) [puppet] - https://gerrit.wikimedia.org/r/675544 (https://phabricator.wikimedia.org/T278411) (owner: Gergő Tisza)
[23:07:52] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:08:18] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:20:06] SRE, Beta-Cluster-Infrastructure: cannot curl to wiki from beta mw appservers - https://phabricator.wikimedia.org/T278599 (Krinkle) Using the `Host` doesn't work for various reasons as `http://localhost` is served by health-check rather than MediaWiki. To approach MW, one can use `--connect-to` instead,...
[23:39:07] (CR) CRusnov: [C: +1] "LGTM. I think the only notable thing would be to document who is responsible for any given outcome of these reports." [software/netbox-extras] - https://gerrit.wikimedia.org/r/674977 (https://phabricator.wikimedia.org/T222931) (owner: Ayounsi)
[23:44:56] SRE, Wikimedia-Mailing-lists, cloud-services-team (Kanban): auto-subscribe cloud-vps and/or toolforge users to cloud-announce - https://phabricator.wikimedia.org/T278361 (Legoktm) The easiest solution is to implement a `everyone@wmcloud.org` alias, (with some only allowing mails via cloud-announce@li...