[01:08:20] PROBLEM - Thanos sidecar is failing to upload blocks on alert1001 is CRITICAL: cluster=prometheus instance=prometheus1004 job=thanos-sidecar prometheus=ops site=eqiad https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar
[01:27:00] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:30:20] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 234 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:32:04] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:36:02] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 52 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:27:42] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:30:32] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 149 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:32:46] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:36:18] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6
is OK: OK - failed 57 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:03:58] RECOVERY - snapshot of s7 in codfw on alert1001 is OK: Last snapshot for s7 at codfw (db2100.codfw.wmnet:3317) taken on 2020-11-02 01:31:32 (1021 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[03:18:01] Operations, MassMessage, WMF-JobQueue, Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Tgr) >>! In T93049#6512485, @Pchelolo wrote: > As soon as this happens again, please add an example here and we will invest...
[03:54:21] (PS1) Hoo man: Do weekly dumps of Wikidata Lexeme [puppet] - https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883)
[03:54:46] (CR) jerkins-bot: [V: -1] Do weekly dumps of Wikidata Lexeme [puppet] - https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: Hoo man)
[03:57:04] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:57:06] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:02:06] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:02:10] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:27:12] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:27:16] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:32:20] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:24] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:57:02] tgr|away are you around?
[05:10:38] DannyS712: o/
[05:32:41] Operations, CheckUser, Traffic: Log source port for anonymous users and expose it for sysops/checkusers - https://phabricator.wikimedia.org/T181368 (Ladsgroup) I think this shouldn't go in mw side of things, it should be part of the analytics data lake ([[https://wikitech.wikimedia.org/wiki/Analytics...
[05:42:46] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:03:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:04:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:09:03] !log oblivian@cumin1001 START - Cookbook sre.network.cf
[06:09:03] !log oblivian@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[06:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:14] !log oblivian@cumin1001 START - Cookbook sre.network.cf
[06:09:16] !log oblivian@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[06:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:19] (CR) ArielGlenn: Do weekly dumps of Wikidata Lexeme (1 comment) [puppet] - https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: Hoo man)
[07:16:18] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:24:16] Operations, serviceops, Growth-Team (Current Sprint), Patch-For-Review, and 2 others: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 (elukey) >>! In T252391#6592606, @jijiki wrote: > * `mc2036.codfw.wmnet` has been reimaged to buster without redis-server...
[07:31:17] Operations, ops-eqiad, Analytics-Clusters, Patch-For-Review, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (elukey) `firmware-bnx2x` installed manually on kafka-jumbo1006, we can retry the switch anytime to see if it works.
[07:34:24] Operations, ops-eqiad, DBA, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (wiki_willy) Hi @Marostegui - @Jclark-ctr is in charge of gathering up all the decom'd hardware for recycling, so we can have him check this week for any spare drives lying around. We should...
[07:52:09] Operations, ops-eqiad, DC-Ops: eqiad: Spare Drive Onsite for db1091 - https://phabricator.wikimedia.org/T266988 (wiki_willy)
[07:53:16] Operations, ops-eqiad, DC-Ops: eqiad: Spare Drive Onsite for db1091 - https://phabricator.wikimedia.org/T266988 (wiki_willy)
[07:53:21] Operations, ops-eqiad, DBA, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (wiki_willy)
[07:53:32] Operations, ops-eqiad, DBA, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (wiki_willy) Open→Resolved
[08:09:09] (CR) Hashar: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/636084 (https://phabricator.wikimedia.org/T266024) (owner: Dzahn)
[08:10:22] (PS2) Ladsgroup: [WIP] varnish: Improve wording of the browser security error a bit [puppet] - https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656)
[08:10:59] (CR) Muehlenhoff: [C: +1] "Looks great, two comments inline."
(2 comments) [debs/orchestrator] - https://gerrit.wikimedia.org/r/637672 (https://phabricator.wikimedia.org/T266763) (owner: Kormat)
[08:12:56] (PS3) Ladsgroup: [WIP] varnish: Improve wording of the browser security error a bit [puppet] - https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656)
[08:33:08] Operations, SRE-swift-storage, Patch-For-Review, User-fgiunchedi: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 (fgiunchedi) Open→Resolved a:fgiunchedi Host is fully in service now
[08:39:26] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[08:40:03] (CR) Hashar: [C: +1] "Will look at deploying it today" [mediawiki-config] - https://gerrit.wikimedia.org/r/636083 (https://phabricator.wikimedia.org/T266024) (owner: Legoktm)
[08:40:44] !log upgrade thanos to 0.16 in codfw/eqiad - T261281
[08:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:51] T261281: Improve performance of Thanos (+ Prometheus) - https://phabricator.wikimedia.org/T261281
[08:41:06] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting.
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[08:41:37] !log installing openldap security updates on LDAP replicas
[08:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:06] (CR) Ayounsi: [C: +2] Add uRPF strict mode to Customers links [homer/public] - https://gerrit.wikimedia.org/r/636653 (https://phabricator.wikimedia.org/T266561) (owner: Ayounsi)
[08:42:39] (Merged) jenkins-bot: Add uRPF strict mode to Customers links [homer/public] - https://gerrit.wikimedia.org/r/636653 (https://phabricator.wikimedia.org/T266561) (owner: Ayounsi)
[08:44:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:46:16] !log add uRPF strict to ulsfo office links - T266561
[08:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:22] T266561: Apply uRPF strict mode on Customer links - https://phabricator.wikimedia.org/T266561
[08:46:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:46:57] Operations, netops, Patch-For-Review: Apply uRPF strict mode on Customer links - https://phabricator.wikimedia.org/T266561 (ayounsi) Nothing more in the logs.
[08:47:19] Operations, netops, Patch-For-Review: Apply uRPF strict mode on Customer links - https://phabricator.wikimedia.org/T266561 (ayounsi) Open→Resolved
[08:51:37] Operations, LDAP-Access-Requests, Discovery-Search (Current work): Give Trey jones access necessary to support Search Platform Airflow jobs - https://phabricator.wikimedia.org/T266995 (Gehel)
[08:52:59] (PS1) Gehel: admin: Trey Jones needs access to support Search Platform Airflow jobs [puppet] - https://gerrit.wikimedia.org/r/638019 (https://phabricator.wikimedia.org/T266995)
[08:53:10] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.456e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[08:54:31] Operations, LDAP-Access-Requests, Discovery-Search (Current work), Patch-For-Review: Give Trey jones access necessary to support Search Platform Airflow jobs - https://phabricator.wikimedia.org/T266995 (Gehel) a:RKemper
[08:55:59] (CR) Volans: "If the mapping is 1:1 between desktop and mobile records, I'm wondering if we should instead take advantage of the fact that those are tem" [dns] - https://gerrit.wikimedia.org/r/637849 (https://phabricator.wikimedia.org/T152882) (owner: Ladsgroup)
[08:56:54] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:59:41] (PS4) Nikerabbit: Stop defining wmgULSCompactLinksForNewAccounts and wmgULSCompactLinksEnableAnon [mediawiki-config] - https://gerrit.wikimedia.org/r/634224
[08:59:43] (PS1) Nikerabbit: Stop reading wmgULSCompactLinksForNewAccounts and wmgULSCompactLinksEnableAnon [mediawiki-config] - https://gerrit.wikimedia.org/r/638020
[09:01:56] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:04:47] (CR) JMeybohm: [C: -1] "Thanks! One issue and a nit 😊" (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/637753 (owner: Jeena Huneidi)
[09:06:06] PROBLEM - Check systemd state on thanos-fe2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:09:25] Operations, netops, cloud-services-team (Kanban): Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 (ayounsi) As explained previously on IRC, `208.80.155.88/29` is part of the eqiad IP space, `185.15.56.240/29` is part of the WMCS IP space. When a "customer" connects to...
[09:14:58] RECOVERY - Thanos compact has not run on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:15:04] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:00] RECOVERY - Thanos sidecar is failing to upload blocks on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar
[09:19:30] RECOVERY - Check systemd state on thanos-fe2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:23:38] (CR) Volans: [C: +1] "LGTM, thanks for the fix!"
[software/netbox-extras] - https://gerrit.wikimedia.org/r/637728 (https://phabricator.wikimedia.org/T266767) (owner: Ayounsi)
[09:24:06] (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/637734 (https://phabricator.wikimedia.org/T265340) (owner: Ayounsi)
[09:25:10] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.456e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:27:12] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:29:42] (PS2) Kormat: Initial (re)packaging [debs/orchestrator] - https://gerrit.wikimedia.org/r/637672 (https://phabricator.wikimedia.org/T266763)
[09:31:28] (PS1) JMeybohm: Lint the chart _scaffold by creating a dummy chart [deployment-charts] - https://gerrit.wikimedia.org/r/638025
[09:31:58] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:04] (PS2) JMeybohm: Lint the chart _scaffold by creating a dummy chart [deployment-charts] - https://gerrit.wikimedia.org/r/638025
[09:32:14] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:33:32] RECOVERY - Thanos compact has not run on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:34:21] (CR) jerkins-bot: [V: -1] Lint the chart _scaffold by creating a dummy chart [deployment-charts] - https://gerrit.wikimedia.org/r/638025 (owner: JMeybohm)
[09:36:24] (CR) JMeybohm: [C: +2] "Thanks!"
[deployment-charts] - https://gerrit.wikimedia.org/r/636905 (owner: Kosta Harlan)
[09:36:26] PROBLEM - Check systemd state on thanos-fe1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:39:08] (Merged) jenkins-bot: Define scaffold_version before attempting to use it [deployment-charts] - https://gerrit.wikimedia.org/r/636905 (owner: Kosta Harlan)
[09:39:30] (CR) Ayounsi: [C: +2] PuppetDB import: don't do empty saves [software/netbox-extras] - https://gerrit.wikimedia.org/r/637728 (https://phabricator.wikimedia.org/T266767) (owner: Ayounsi)
[09:41:11] (CR) Ayounsi: [C: +2] PuppetDB import, set interface type when renaming ##PRIMARY## [software/netbox-extras] - https://gerrit.wikimedia.org/r/637734 (https://phabricator.wikimedia.org/T265340) (owner: Ayounsi)
[09:45:16] (PS3) JMeybohm: Lint the chart _scaffold by creating a dummy chart [deployment-charts] - https://gerrit.wikimedia.org/r/638025
[09:46:16] (CR) JMeybohm: "Currently expected to fail, should be fine after Id735ce4bb2619a6814a97968cec3295689cc0050 is merged" [deployment-charts] - https://gerrit.wikimedia.org/r/638025 (owner: JMeybohm)
[09:47:38] (CR) jerkins-bot: [V: -1] Lint the chart _scaffold by creating a dummy chart [deployment-charts] - https://gerrit.wikimedia.org/r/638025 (owner: JMeybohm)
[09:48:12] RECOVERY - Check systemd state on thanos-fe1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:53:40] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.456e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:55:42] PROBLEM - Thanos sidecar is failing to upload blocks on alert1001 is CRITICAL: cluster=prometheus instance=prometheus1004
job=thanos-sidecar prometheus=ops site=eqiad https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar
[10:23:25] !log installing openldap security updates on corp LDAP replicas
[10:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:33] (PS1) Itamar Givon: Revert JS parser commits [extensions/Wikibase] (wmf/1.36.0-wmf.14) - https://gerrit.wikimedia.org/r/637801 (https://phabricator.wikimedia.org/T266671)
[10:27:34] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:27:36] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:28:38] !log oblivian@cumin1001 START - Cookbook sre.network.cf
[10:28:38] !log oblivian@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[10:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:45] !log oblivian@cumin1001 START - Cookbook sre.network.cf
[10:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:47] !log oblivian@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[10:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:40] (CR) Muehlenhoff: [C: +1] "Looks good" [debs/orchestrator] - https://gerrit.wikimedia.org/r/637672 (https://phabricator.wikimedia.org/T266763) (owner: Kormat)
[10:32:38] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:32:40] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:33:38] <_joe_> hnowlan: ^^
[10:33:47] <_joe_> cassandra on maps2002 keeps failing
[10:34:07] (PS3) Kormat: Initial (re)packaging [debs/orchestrator] - https://gerrit.wikimedia.org/r/637672 (https://phabricator.wikimedia.org/T266763)
[10:34:09] (PS3) Kormat: debian: add user/group + systemd service [debs/orchestrator] - https://gerrit.wikimedia.org/r/637683 (https://phabricator.wikimedia.org/T266763)
[10:36:38] (PS1) Filippo Giunchedi: package_builder: use --no-cowdancer-update when updating chroots [puppet] - https://gerrit.wikimedia.org/r/638032
[10:37:04] (CR) jerkins-bot: [V: -1] package_builder: use --no-cowdancer-update when updating chroots [puppet] - https://gerrit.wikimedia.org/r/638032 (owner: Filippo Giunchedi)
[10:38:44] (PS2) Filippo Giunchedi: package_builder: use --no-cowdancer-update when updating chroots [puppet] - https://gerrit.wikimedia.org/r/638032
[10:41:43] (CR) Muehlenhoff: debian: add user/group + systemd service (1 comment) [debs/orchestrator] - https://gerrit.wikimedia.org/r/637683 (https://phabricator.wikimedia.org/T266763) (owner: Kormat)
[10:45:54] (PS2) Filippo Giunchedi: prometheus: re-enable compaction by default [puppet] - https://gerrit.wikimedia.org/r/636362 (https://phabricator.wikimedia.org/T261281)
[10:45:56] (PS1) Filippo Giunchedi: thanos: use systemd overrides for query/store/compact [puppet] - https://gerrit.wikimedia.org/r/638036 (https://phabricator.wikimedia.org/T261281)
[10:50:19] !log jmm@cumin2001 START - Cookbook sre.hosts.decommission
[10:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:33] (CR) Physikerwelt: "I appreciate this as an intermediate solution for
T266673" [extensions/Wikibase] (wmf/1.36.0-wmf.14) - https://gerrit.wikimedia.org/r/637801 (https://phabricator.wikimedia.org/T266671) (owner: Itamar Givon)
[10:57:40] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[10:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:45] Operations, Patch-For-Review: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: `ldap-replica2001.wikimedia.org` - ldap-replica2001.wikimedia.org (**WARN**) - **Failed downtime host on I...
[10:58:06] RECOVERY - Thanos sidecar is failing to upload blocks on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar
[10:59:53] !log jmm@cumin2001 START - Cookbook sre.hosts.decommission
[10:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:28] _joe_: whoa what, that's bad. codfw maps shouldn't be affected by changes at all
[11:00:39] looking
[11:00:43] <_joe_> thanks :)
[11:06:00] !log upgrade thanos to 0.16.0 on prometheus hosts - T261281
[11:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:06] T261281: Improve performance of Thanos (+ Prometheus) - https://phabricator.wikimedia.org/T261281
[11:07:04] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:09:33] (PS1) ArielGlenn: update worker scripts to loop in secondary batch worker mode [dumps] - https://gerrit.wikimedia.org/r/638043 (https://phabricator.wikimedia.org/T252396)
[11:12:20] (PS1) Filippo Giunchedi: Add Debian directory [debs/kthxbye] - https://gerrit.wikimedia.org/r/638044 (https://phabricator.wikimedia.org/T266535)
[11:12:41] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/638045 (https://phabricator.wikimedia.org/T128546)
[11:13:40] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[11:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:45] Operations, Patch-For-Review: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: `ldap-replica2002.wikimedia.org` - ldap-replica2002.wikimedia.org (**WARN**) - **Failed downtime host on I...
[11:14:28] (PS1) Muehlenhoff: Remove ldap-replica2001/2002 from DNS [dns] - https://gerrit.wikimedia.org/r/638067 (https://phabricator.wikimedia.org/T264388)
[11:14:44] (CR) Gilles: "Sounds good!"
[puppet] - https://gerrit.wikimedia.org/r/636024 (https://phabricator.wikimedia.org/T266155) (owner: Gilles)
[11:15:44] (CR) Vgutierrez: [C: +1] Remove ldap-eqiad-replica0[12] from acmechief config [puppet] - https://gerrit.wikimedia.org/r/637500 (https://phabricator.wikimedia.org/T264388) (owner: Muehlenhoff)
[11:16:17] (CR) Muehlenhoff: [C: +2] Remove ldap-replica2001/2002 from DNS [dns] - https://gerrit.wikimedia.org/r/638067 (https://phabricator.wikimedia.org/T264388) (owner: Muehlenhoff)
[11:18:30] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[11:18:30] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:16] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:23:18] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:24:11] Operations, Puppet, observability, Patch-For-Review, and 2 others: Puppet: get row/rack info from Netbox - https://phabricator.wikimedia.org/T229397 (Volans) We discussed it a bit during the Infrastructure Foundation last meeting on Wed. I'll try to summarize the outcome of it, please correct me...
[11:24:48] (PS1) Muehlenhoff: Remove ldap-replica2001/2002 from Puppet [puppet] - https://gerrit.wikimedia.org/r/638069
[11:25:01] (CR) jerkins-bot: [V: -1] Remove ldap-replica2001/2002 from Puppet [puppet] - https://gerrit.wikimedia.org/r/638069 (owner: Muehlenhoff)
[11:25:28] (PS2) Muehlenhoff: Remove ldap-replica2001/2002 from Puppet [puppet] - https://gerrit.wikimedia.org/r/638069
[11:29:46] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886
[11:30:04] jan_drewniak: Dear deployers, time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201102T1130).
[11:30:59] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/638045 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[11:31:01] (PS1) Volans: CHANGELOG: add changelogs for release v0.0.4 [software/pywmflib] - https://gerrit.wikimedia.org/r/638072
[11:31:37] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/638045 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[11:32:09] (CR) Muehlenhoff: [C: +2] Remove ldap-replica2001/2002 from Puppet [puppet] - https://gerrit.wikimedia.org/r/638069 (owner: Muehlenhoff)
[11:32:33] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.0.4 [software/pywmflib] - https://gerrit.wikimedia.org/r/638072 (owner: Volans)
[11:33:46] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:638045| Bumping portals to master (T128546)]] (duration: 01m 00s)
[11:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:53] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.0.4
[software/pywmflib] - https://gerrit.wikimedia.org/r/638072 (owner: Volans)
[11:33:53] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[11:34:44] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:638045| Bumping portals to master (T128546)]] (duration: 00m 58s)
[11:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:01] (CR) Muehlenhoff: [C: +2] Remove ldap-eqiad-replica0[12] from acmechief config [puppet] - https://gerrit.wikimedia.org/r/637500 (https://phabricator.wikimedia.org/T264388) (owner: Muehlenhoff)
[11:42:37] Operations, Wikidata, Wikidata Query UI, User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (Addshore) So I now see that the custom-config used to be in the build repo in the production branch. It was removed in https://gerrit.wikimedia.org/r/c/wikidata/query/gu...
[11:46:08] (PS1) Volans: Upstream release v0.0.4 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/638073
[11:48:17] (PS1) Muehlenhoff: Remove ldap-replica2001/2002 from acmechief config [puppet] - https://gerrit.wikimedia.org/r/638075
[11:48:46] (CR) Vgutierrez: [C: +1] Remove ldap-replica2001/2002 from acmechief config [puppet] - https://gerrit.wikimedia.org/r/638075 (owner: Muehlenhoff)
[11:49:34] (CR) Muehlenhoff: [C: +2] Remove ldap-replica2001/2002 from acmechief config [puppet] - https://gerrit.wikimedia.org/r/638075 (owner: Muehlenhoff)
[11:50:31] (PS1) Muehlenhoff: Remove obsolete Hiera files [puppet] - https://gerrit.wikimedia.org/r/638076
[11:50:38] (CR) Volans: [C: +2] Upstream release v0.0.4 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/638073 (owner: Volans)
[11:51:17] !log disable thumbor on thumbor1001 and thumbor1002 to test 636024
[11:51:21] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:32] agr
[11:51:41] (Merged) jenkins-bot: Upstream release v0.0.4 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/638073 (owner: Volans)
[11:51:58] !log disable puppet on thumbor1001 and thumbor1002 to test 636024
[11:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:53] (CR) Muehlenhoff: [C: +2] Remove obsolete Hiera files [puppet] - https://gerrit.wikimedia.org/r/638076 (owner: Muehlenhoff)
[11:53:18] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.03049 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[11:55:37] Operations, Patch-For-Review: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388 (MoritzMuehlenhoff) Open→Resolved ldap-replica1001/1002/2003/2004 are now running Buster, old Stretch instances have been removed.
[12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: May I have your attention please! European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201102T1200)
[12:00:05] matthiasmullie, hashar, Nikerabbit, and ItamarWMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:11] o/
[12:00:18] o/
[12:00:23] (PS2) Hnowlan: Enable replication in eqiad [puppet] - https://gerrit.wikimedia.org/r/608726 (https://phabricator.wikimedia.org/T254014) (owner: Ryan Kemper)
[12:00:30] I can deploy today
[12:00:54] hi.
I will do my config change to ExtensionDistributor later ( https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/636083/ ) [12:01:25] o7 [12:01:46] and one probably want to +2 the pending Wikibase right now [12:01:48] ( https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/637801 ) [12:02:11] !log uploaded python3-wmflib_0.0.4 to apt.wikimedia.org buster-wikimedia [12:02:14] o/ [12:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:22] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert JS parser commits [extensions/Wikibase] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/637801 (https://phabricator.wikimedia.org/T266671) (owner: 10Itamar Givon) [12:02:26] hashar: good point :) [12:03:27] (03PS2) 10Lucas Werkmeister (WMDE): Fix array depth for properties array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637778 (https://phabricator.wikimedia.org/T266835) (owner: 10Matthias Mullie) [12:03:34] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Fix array depth for properties array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637778 (https://phabricator.wikimedia.org/T266835) (owner: 10Matthias Mullie) [12:05:13] (03Merged) 10jenkins-bot: Fix array depth for properties array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637778 (https://phabricator.wikimedia.org/T266835) (owner: 10Matthias Mullie) [12:05:26] Lucas_WMDE: that one doesn't have to go to mwdebug [12:05:36] okay [12:06:17] lunch & [12:07:40] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:637778|Fix array depth for properties array (T266835)]] (duration: 00m 59s) [12:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:48] T266835: [betalabs] MediaSearch - Internal error MediaQueryBuilder.php: Unsupported operand types - https://phabricator.wikimedia.org/T266835 [12:08:35] (03PS2) 10Lucas Werkmeister (WMDE): Stop reading 
wmgULSCompactLinksForNewAccounts and wmgULSCompactLinksEnableAnon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/638020 (owner: 10Nikerabbit) [12:08:46] Thanks, Lucas_WMDE! [12:08:48] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Stop reading wmgULSCompactLinksForNewAccounts and wmgULSCompactLinksEnableAnon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/638020 (owner: 10Nikerabbit) [12:09:03] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:637778|Fix array depth for properties array (T266835)]], Beta part (prod no-op) (duration: 00m 58s) [12:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:17] Nikerabbit: your changes also look like they can’t really be tested on mwdebug, right? [12:09:27] since the wmf.14 code already doesn’t read the config settings anymore [12:09:46] Lucas_WMDE: I can check, but not expecting to see any difference. Only labs wikis will change [12:09:52] ok [12:10:11] well, would have changed, but like you said ULS does not read them anymore [12:10:17] so no difference there expected either [12:11:06] (03Merged) 10jenkins-bot: Stop reading wmgULSCompactLinksForNewAccounts and wmgULSCompactLinksEnableAnon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/638020 (owner: 10Nikerabbit) [12:12:23] Nikerabbit: first change is on mwdebug… [12:12:25] wait [12:12:31] sorry, I’m still on mwdebug2001 [12:12:43] but should probably return to mwdebug100* now that the DC switch is over [12:13:29] well, I guess you can still target mwdebug2001 from the x-wikimedia-debug extension? 
[12:13:43] yeah I can [12:14:21] no change visible logged in or logged out [12:14:31] ok, syncing [12:15:11] (03PS5) 10Lucas Werkmeister (WMDE): Stop defining wmgULSCompactLinksForNewAccounts and wmgULSCompactLinksEnableAnon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634224 (owner: 10Nikerabbit) [12:15:29] !log upgraded python3-wmflib to 0.0.4 on cumin[12]001 [12:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:38] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Stop defining wmgULSCompactLinksForNewAccounts and wmgULSCompactLinksEnableAnon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634224 (owner: 10Nikerabbit) [12:15:58] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:638020|Stop reading wmgULSCompactLinksForNewAccounts and wmgULSCompactLinksEnableAnon]] (duration: 00m 58s) [12:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:32] (03Merged) 10jenkins-bot: Stop defining wmgULSCompactLinksForNewAccounts and wmgULSCompactLinksEnableAnon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634224 (owner: 10Nikerabbit) [12:17:10] Nikerabbit: second change is also on mwdebug2001 now [12:17:54] still looks good. I assume fatal-monitor is quiet on this? 
[12:18:04] yeah, looks like it [12:18:33] good [12:19:28] syncing [12:19:52] meanwhile, the Wikibase gate-and-submit build had a random composer error in the wikibase-client-docker job, apparently :( [12:20:22] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:634224|Stop defining wmgULSCompactLinksForNewAccounts and wmgULSCompactLinksEnableAnon]], 1/2 (production) (duration: 01m 02s) [12:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:36] but the deployment calendar is otherwise fairly empty today, so I’d say if we need a second gate-and-submit and overrun the backport window it’s not a big problem [12:21:44] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:634224|Stop defining wmgULSCompactLinksForNewAccounts and wmgULSCompactLinksEnableAnon]], 2/2 (Beta) (duration: 00m 57s) [12:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:07] okay, I think that was all the config changes [12:22:12] so now we wait for Wikibase CI :) [12:22:35] :) [12:22:40] I think I’ll cancel the four remaining jobs so we can retry immediately and wait a bit less long [12:23:03] (03CR) 10jerkins-bot: [V: 04-1] Revert JS parser commits [extensions/Wikibase] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/637801 (https://phabricator.wikimedia.org/T266671) (owner: 10Itamar Givon) [12:23:40] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "Let’s try again (wikibase-client-docker had a random-looking composer error so I aborted the remaining jobs to save time)." [extensions/Wikibase] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/637801 (https://phabricator.wikimedia.org/T266671) (owner: 10Itamar Givon) [12:24:16] * itamarWMDE opens up zuul to watch [12:26:50] (03CR) 10Muehlenhoff: [C: 03+1] "Seems fine. 
If the performance loss is notable we could also switch to a backport of cowbuilder 0.89, but it's probably not needed." [puppet] - 10https://gerrit.wikimedia.org/r/638032 (owner: 10Filippo Giunchedi) [12:34:00] * Lucas_WMDE whistles the Jeopardy! theme [12:39:22] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 610 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:39:56] eeeeek [12:40:00] that’s a large spike [12:40:06] * Lucas_WMDE saunters over to logstash [12:40:29] not seeing anything in logspam-watch yet though [12:40:43] [{exception_id}] {exception_url} ErrorException from line 820 of /srv/mediawiki/php-1.36.0-wmf.14/vendor/wikimedia/parsoid/src/Config/Env.php: PHP Notice: Undefined index: mwf1 [12:40:45] (03PS1) 10Giuseppe Lavagetto: Add base php cli image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/638095 (https://phabricator.wikimedia.org/T265324) [12:40:55] from logstash it looks like it was a temporary job queue issue? [12:40:56] that sounds...weird? 
[12:40:59] mainly on commonswiki [12:41:02] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 11 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:41:06] several exceptions in JobQueueEventBus.php [12:41:12] “Could not enqueue jobs from stream …” [12:41:29] with RecordLintJob, wikibase-addUsagesForPage, cirrusSearchElasticaWrite, cirrusSearchLinksUpdate as the main culprits [12:41:37] but I suspect those are just the most common jobs overall [12:42:10] but it also seems to have recovered already [12:42:37] yeah, seems to be fine https://usercontent.irccloud-cdn.com/file/rJf0rFA7/image.png [12:42:38] (ok, there was also a large volume of deferred updates that failed to run) [12:43:17] Lucas_WMDE: I take it that you're waiting for Wikibase CI? [12:43:22] yeah [12:43:22] if so, would you mind syncing https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/637819 for me? 
[12:43:32] * Lucas_WMDE looks [12:43:56] 'otrs_wikiwiki' might be the worst dbname I’ve encountered yet [12:44:03] + [12:44:09] +1 [12:44:18] but we can't change that, so...we'll have to live with that [12:44:22] yeah [12:44:26] quick grep shows that’s its name indeed [12:44:32] (it kinda matches the wiki's URL though, otrs-wiki.wikimedia.org) [12:44:55] (I can’t see the discussion linked in the Phab task so I just have to trust you :P ) [12:45:19] hehe [12:45:25] Lucas_WMDE OTRS agent here, can confirm community consensus for the change [12:45:30] ok thanks [12:45:42] lol I can’t even run action=query&siprop=namespaces to see if it’s ns100 [12:45:42] np [12:46:06] but I can see that in IS.php [12:46:11] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add Response namespace at otrs_wikiwiki to namespaces searched by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637819 (https://phabricator.wikimedia.org/T266917) (owner: 10Urbanecm) [12:46:17] (03PS2) 10Lucas Werkmeister (WMDE): Add Response namespace at otrs_wikiwiki to namespaces searched by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637819 (https://phabricator.wikimedia.org/T266917) (owner: 10Urbanecm) [12:46:21] it is Lucas_WMDE https://usercontent.irccloud-cdn.com/file/xgFPZwuf/image.png [12:46:42] though I always look the IDs up via IS.php [12:46:57] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/637819 (https://phabricator.wikimedia.org/T266917) (owner: 10Urbanecm) [12:47:18] It took me longer since the actual query is action=query&meta=siteinfo&siprop=namespaces but can confirm [12:47:34] Though I think we can all trust Urbanecm [12:47:50] thank you :) [12:47:53] (03Merged) 10jenkins-bot: Add Response namespace at otrs_wikiwiki to namespaces searched by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637819 (https://phabricator.wikimedia.org/T266917) (owner: 10Urbanecm) [12:48:02] DannyS712: yeah I was just too lazy to type out the full URL ^^ [12:48:27] Urbanecm: want to quickly test it on mwdebug2001? [12:48:31] Lucas_WMDE: sure [12:48:34] Urbanecm if there is time now, can we do the security patch instead of later? [12:48:35] though I'd prefer 1002 [12:48:41] (or 1001) [12:48:45] Lucas_WMDE: we're post-switchover [12:48:49] yeah, I hadn’t reset my script to eqiad yet [12:48:55] done now but old terminals are still open ^^ [12:49:21] DannyS712: from my side, sure - though Lucas_WMDE leads this window [12:49:38] @loc [12:49:40] not sure we have the time for that [12:49:51] woops [12:49:51] Lucas_WMDE 2001 isn't showing the change for me [12:49:58] hm [12:50:03] oh, now it is [12:50:10] just had to refresh a dozen times [12:50:10] ah okay [12:51:04] > not sure we have the time for that [12:51:04] Okay, just wondering since I figured I was around now [12:51:17] confirmed it works Lucas_WMDE [12:51:21] okay, syncing [12:52:21] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:637819|Add Response namespace at otrs_wikiwiki to namespaces searched by default (T266917)]] (duration: 00m 58s) [12:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:28] T266917: Add Response namespace at otrswiki to namespaces searched by default - https://phabricator.wikimedia.org/T266917 [12:52:33] thank you Lucas_WMDE ! 
[12:52:37] np [12:56:02] (03Merged) 10jenkins-bot: Revert JS parser commits [extensions/Wikibase] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/637801 (https://phabricator.wikimedia.org/T266671) (owner: 10Itamar Givon) [12:57:25] \o/ [12:57:27] itamarWMDE: the revert is on mwdebug2001 now [12:58:24] huh, test edit says the wiki in read-only mode? [12:58:44] is that because I can’t edit in codfw? [12:58:51] (03PS4) 10Kormat: Initial (re)packaging [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/637672 (https://phabricator.wikimedia.org/T266763) [12:59:07] trying mwdebug1001 instead [12:59:50] (03PS5) 10Kormat: Initial (re)packaging [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/637672 (https://phabricator.wikimedia.org/T266763) [12:59:53] guess that was it, yeah [13:00:10] https://www.wikidata.org/w/index.php?title=Q4115189&diff=1301636953&oldid=1301615072 [13:00:24] \o/ [13:00:59] \o/ Yhanks Lucas_WMDE [13:01:14] With a T even, thanks ;) [13:01:42] ok, syncing [13:02:49] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.36.0-wmf.14/extensions/Wikibase: Backport: [[gerrit:637801|Revert JS parser commits (T266671)]] (duration: 01m 09s) [13:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:56] T266671: Revert commit 7f430f142d from `Malformed input error on text which is not malformed` - https://phabricator.wikimedia.org/T266671 [13:02:57] (03CR) 10Hnowlan: [C: 03+2] Enable replication in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/608726 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper) [13:03:47] !log EU backport&config window done [13:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:55] Lucas_WMDE: ftr, yes, codfw servers are read-only to avoid issues :-) [13:04:07] yeah, makes sense :) [13:07:33] (03CR) 10Filippo Giunchedi: [C: 03+2] package_builder: use --no-cowdancer-update when updating chroots [puppet] - 
10https://gerrit.wikimedia.org/r/638032 (owner: 10Filippo Giunchedi) [13:15:18] (03PS2) 10Hnowlan: maps: add maps(200[5-9]|2010) as maps hosts [puppet] - 10https://gerrit.wikimedia.org/r/637554 (https://phabricator.wikimedia.org/T266820) [13:22:36] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Spare Drive Onsite for db1091 - https://phabricator.wikimedia.org/T266988 (10Hermann) [13:22:38] 10Operations, 10LDAP-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Give Trey jones access necessary to support Search Platform Airflow jobs - https://phabricator.wikimedia.org/T266995 (10Hermann) [13:25:56] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:26:19] (03PS4) 10Kormat: debian: add user/group + systemd service [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/637683 (https://phabricator.wikimedia.org/T266763) [13:27:00] 10Operations, 10LDAP-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Give Trey jones access necessary to support Search Platform Airflow jobs - https://phabricator.wikimedia.org/T266995 (10DannyS712) [13:27:02] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Spare Drive Onsite for db1091 - https://phabricator.wikimedia.org/T266988 (10DannyS712) [13:27:28] (03CR) 10Kormat: "Made a few small fixes and done some basic testing of the completed pacakge." 
(032 comments) [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/637672 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat) [13:27:53] (03CR) 10Kormat: debian: add user/group + systemd service (031 comment) [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/637683 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat) [13:32:32] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 23586 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:35:24] (03PS1) 10Elukey: dumps::web::html: fix pageview-complete's settings [puppet] - 10https://gerrit.wikimedia.org/r/638102 [13:37:09] (03CR) 10Elukey: [C: 03+2] dumps::web::html: fix pageview-complete's settings [puppet] - 10https://gerrit.wikimedia.org/r/638102 (owner: 10Elukey) [13:40:03] (03CR) 10Muehlenhoff: "Looks good, two comments inline." (032 comments) [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/637683 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat) [13:40:17] !log elukey@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper [13:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:30] 10Operations, 10observability: VictorOps ~5min delay from email received to incident paging - https://phabricator.wikimedia.org/T266800 (10fgiunchedi) This is now a case with VO support. They'll be following up with their transactional email provider. 
[13:40:51] !log roll restart zookeeper ok an-conf* to pick up new openjdk upgrades [13:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:21] (03PS5) 10Kormat: debian: add user/group + systemd service [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/637683 (https://phabricator.wikimedia.org/T266763) [13:43:22] (03PS2) 10Hashar: Remove $wgExtDistListFile, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636083 (https://phabricator.wikimedia.org/T266024) (owner: 10Legoktm) [13:44:15] (03CR) 10Kormat: debian: add user/group + systemd service (032 comments) [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/637683 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat) [13:45:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/637683 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat) [13:46:34] !log elukey@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) [13:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:44] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) | Host | hit-front rate 2020-09-10 -> 2020-09-23| hit-front rate 2020-10-29 -> 2020-11-02 | | cp4027 | 70.3% | 64.4% | | cp4028 | 72.... 
[13:49:32] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [13:50:47] (03CR) 10Muehlenhoff: [C: 03+1] "Changes between PS3 and PS5 also LGTM" [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/637672 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat) [13:51:10] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:51:29] (03CR) 10Kormat: [V: 03+2 C: 03+2] Initial (re)packaging [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/637672 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat) [13:51:40] (03CR) 10Kormat: [V: 03+2 C: 03+2] debian: add user/group + systemd service [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/637683 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat) [13:53:36] (03PS1) 10Fdans: dumps::web::html Change location of pageview complete landing to readme.html [puppet] - 10https://gerrit.wikimedia.org/r/638105 [13:55:19] (03CR) 10ArielGlenn: [C: 03+1] "OK by me, anyone else need to weigh in on this?" [puppet] - 10https://gerrit.wikimedia.org/r/638105 (owner: 10Fdans) [13:56:54] (03CR) 10Elukey: [C: 03+2] dumps::web::html Change location of pageview complete landing to readme.html [puppet] - 10https://gerrit.wikimedia.org/r/638105 (owner: 10Fdans) [13:57:04] (03CR) 10Hashar: [C: 03+2] "Configuration cleanup. The file is gone from Gerrit." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/636083 (https://phabricator.wikimedia.org/T266024) (owner: 10Legoktm) [13:57:23] ^ late config change [13:57:52] (03Merged) 10jenkins-bot: Remove $wgExtDistListFile, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636083 (https://phabricator.wikimedia.org/T266024) (owner: 10Legoktm) [13:58:17] (03PS5) 10Effie Mouzeli: Switch Thumbor haproxy load balancing to IP hash [puppet] - 10https://gerrit.wikimedia.org/r/636024 (https://phabricator.wikimedia.org/T266155) (owner: 10Gilles) [14:01:20] !log hashar@deploy1001 Synchronized wmf-config/CommonSettings.php: Remove $wgExtDistListFile, unused - T266024 (duration: 00m 58s) [14:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:27] T266024: Phase out https://gerrit.wikimedia.org/mediawiki-extensions.txt - https://phabricator.wikimedia.org/T266024 [14:03:20] (03PS1) 10Kormat: debian: Fix release name [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/638108 [14:03:47] (03CR) 10Kormat: [V: 03+2 C: 03+2] debian: Fix release name [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/638108 (owner: 10Kormat) [14:08:38] 10Operations, 10observability: VictorOps ~5min delay from email received to incident paging - https://phabricator.wikimedia.org/T266800 (10Volans) @fgiunchedi should we consider converting out transport from email to API calls at this point? Should give us an immediate feedback that we relayed the alert to VO... 
[14:09:28] (03PS6) 10Effie Mouzeli: Switch Thumbor haproxy load balancing to IP hash [puppet] - 10https://gerrit.wikimedia.org/r/636024 (https://phabricator.wikimedia.org/T266155) (owner: 10Gilles) [14:17:53] !log uploaded orchestrator 3.2.3-1 to apt [14:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:01] (03PS1) 10Effie Mouzeli: swift: pass the 'X-Client-IP' header to thumbor [puppet] - 10https://gerrit.wikimedia.org/r/638109 (https://phabricator.wikimedia.org/T266155) [14:19:44] (03CR) 10Effie Mouzeli: "I tried "balance hdr(X-Client-IP)", which didn't work at all! What happened was that we have configured "http-request set-header X-Client-" [puppet] - 10https://gerrit.wikimedia.org/r/636024 (https://phabricator.wikimedia.org/T266155) (owner: 10Gilles) [14:20:37] (03PS2) 10Kormat: orchestrator: Support running as non-root [puppet] - 10https://gerrit.wikimedia.org/r/637693 (https://phabricator.wikimedia.org/T266763) [14:20:39] (03PS9) 10Kormat: orchestrator: Support sqlite backend [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) [14:21:19] 10Operations, 10DBA, 10Orchestrator, 10Patch-For-Review: Repackage orchestrator - https://phabricator.wikimedia.org/T266763 (10Kormat) Repackaging is done, now just need https://gerrit.wikimedia.org/r/c/operations/puppet/+/637693 merged so it can be deployed. 
[14:21:20] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [14:24:32] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [14:25:34] (03PS1) 10Filippo Giunchedi: thanos: configure memcached size via hiera [puppet] - 10https://gerrit.wikimedia.org/r/638110 (https://phabricator.wikimedia.org/T261281) [14:28:40] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/26248/thanos-fe1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/638110 (https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi) [14:34:58] !log rolling restart of cassandra in restbase-dev to pick up Java security updates [14:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:34] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [14:38:39] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [14:38:40] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:36] (03CR) 10Gilles: [C: 03+1] swift: pass the 'X-Client-IP' header to thumbor [puppet] - 10https://gerrit.wikimedia.org/r/638109 (https://phabricator.wikimedia.org/T266155) (owner: 10Effie Mouzeli) [14:42:10] 10Operations, 10Puppet: Puppet Proposal to remove require_package - https://phabricator.wikimedia.org/T266479 (10akosiaris) The idea was indeed to just make sure that the packages are installed before anything else in the class happens. These days, if one puts `ensure_packages()` at the top of the manifest, we... 
[14:46:38] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [14:50:28] (03CR) 10Cparle: Generation of json dumps for wikimedia commons (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle) [15:12:52] (03PS1) 10JMeybohm: Remove kubernetes sources [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/638114 [15:12:54] (03PS1) 10JMeybohm: Package binary kubernetes releases [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/638115 [15:13:56] (03PS2) 10JMeybohm: Package binary kubernetes releases [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/638115 (https://phabricator.wikimedia.org/T266766) [15:13:58] (03PS1) 10Ottomata: Produce canary events every 15 minutes [puppet] - 10https://gerrit.wikimedia.org/r/638116 (https://phabricator.wikimedia.org/T266573) [15:19:10] (03PS2) 10Ottomata: Produce canary events every 15 minutes [puppet] - 10https://gerrit.wikimedia.org/r/638116 (https://phabricator.wikimedia.org/T266573) [15:21:12] (03CR) 10Ottomata: [C: 03+2] Produce canary events every 15 minutes [puppet] - 10https://gerrit.wikimedia.org/r/638116 (https://phabricator.wikimedia.org/T266573) (owner: 10Ottomata) [15:22:05] (03PS1) 10Filippo Giunchedi: thanos: add query-frontend [puppet] - 10https://gerrit.wikimedia.org/r/638119 (https://phabricator.wikimedia.org/T261281) [15:22:08] (03PS1) 10Filippo Giunchedi: prometheus: add thanos query-frontend jobs [puppet] - 10https://gerrit.wikimedia.org/r/638120 (https://phabricator.wikimedia.org/T261281) [15:22:10] (03PS1) 10Filippo Giunchedi: role: add query_frontend to thanos frontend [puppet] - 10https://gerrit.wikimedia.org/r/638121 (https://phabricator.wikimedia.org/T261281) [15:22:12] (03PS1) 10Filippo Giunchedi: pontoon: use frontends for query_frontend memcache [puppet] - 
10https://gerrit.wikimedia.org/r/638122 (https://phabricator.wikimedia.org/T261281) [15:35:48] (03CR) 10Filippo Giunchedi: "The end result of this patch series is PCC-ed here: https://puppet-compiler.wmflabs.org/compiler1002/26250/thanos-fe1001.eqiad.wmnet/fulld" [puppet] - 10https://gerrit.wikimedia.org/r/638121 (https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi) [15:36:18] !log imported php-excimer/php-luasandbox to component/php72 for buster-wikimedia [15:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:46] (03CR) 10Gehel: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/637554 (https://phabricator.wikimedia.org/T266820) (owner: 10Hnowlan) [15:40:08] (03PS4) 10Mforns: Add ::profile::analytics::refinery::network_region_config [puppet] - 10https://gerrit.wikimedia.org/r/637559 (https://phabricator.wikimedia.org/T254332) [15:41:30] (03CR) 10jerkins-bot: [V: 04-1] Add ::profile::analytics::refinery::network_region_config [puppet] - 10https://gerrit.wikimedia.org/r/637559 (https://phabricator.wikimedia.org/T254332) (owner: 10Mforns) [15:47:21] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10RobH) a:05Cmjohnson→03RobH >>! In T260370#6593013, @Marostegui wrote: > es1032 has RAID0 instead of RAID10. > Can we get that one re-done with RAI... 
[15:47:30] (03PS5) 10Mforns: Add ::profile::analytics::refinery::network_region_config [puppet] - 10https://gerrit.wikimedia.org/r/637559 (https://phabricator.wikimedia.org/T254332) [15:48:53] (03CR) 10Mforns: Add ::profile::analytics::refinery::network_region_config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637559 (https://phabricator.wikimedia.org/T254332) (owner: 10Mforns) [15:58:31] (03PS1) 10Hnowlan: maps: add maps100[5-8] and maps1010 [puppet] - 10https://gerrit.wikimedia.org/r/638125 [16:04:15] (03CR) 10Hnowlan: [C: 03+2] maps: add maps(200[5-9]|2010) as maps hosts [puppet] - 10https://gerrit.wikimedia.org/r/637554 (https://phabricator.wikimedia.org/T266820) (owner: 10Hnowlan) [16:04:47] (03PS2) 10Hnowlan: maps: add maps100[5-8] and maps1010 [puppet] - 10https://gerrit.wikimedia.org/r/638125 [16:08:55] (03PS3) 10Hnowlan: maps: add maps(200[5-9]|2010) as maps hosts [puppet] - 10https://gerrit.wikimedia.org/r/637554 (https://phabricator.wikimedia.org/T266820) [16:20:58] 10Operations, 10SRE-Access-Requests: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (10dcaro) [16:23:19] PROBLEM - Check systemd state on maps2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:39] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10RobH) [16:28:54] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10RobH) [16:29:52] 10Operations, 10SRE-Access-Requests: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (10nskaggs) +1 from me [16:36:03] PROBLEM - Maps HTTPS on maps2005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:37:00] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [16:37:00] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:37:03] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [16:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:06] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:14] sorry, that's me [16:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:24] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [16:37:24] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:53] RECOVERY - Check systemd state on maps2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:44:57] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) 
rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [16:45:07] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [16:45:28] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Spare Drive Onsite for db1091 - https://phabricator.wikimedia.org/T266988 (10Cmjohnson) @wiki_willy I do not have any spare SSDs that would match what is in that server now. [16:45:46] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [16:45:51] (03CR) 10Meno25: "* Updated Arabic (ar) translation to match the current English source" [puppet] - 10https://gerrit.wikimedia.org/r/638055 (owner: 10Meno25) [16:46:12] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [16:48:10] 10Operations, 10ops-eqsin, 10serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10RobH) [16:50:32] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Spare Drive Onsite for db1091 - https://phabricator.wikimedia.org/T266988 (10wiki_willy) Thanks for checking @Cmjohnson, just a heads the refresh for this server should be onsite towards the end of November via T264336. @Marostegui - are you ok with still having t... [16:57:36] (03CR) 10Meno25: "This is the Arabic (ar) translation of the new text:" [puppet] - 10https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) (owner: 10Ladsgroup) [17:00:22] 10Operations, 10ops-eqiad, 10DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (10ayounsi) Fyi: `lang=diff ayounsi@asw2-b-eqiad# show | compare [edit interfaces interface-range disabled] member xe-2/0/21 { ... } + member ge-5/0/10; + member ge-5/0/9; + member ge-8/0/22; +... 
[17:02:41] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) [17:07:24] 10Operations, 10SRE-Access-Requests: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (10aborrero) [17:11:01] 10Operations, 10SRE-Access-Requests: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (10aborrero) p:05Triage→03High [17:13:50] (03PS1) 10Hnowlan: postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) [17:14:25] (03CR) 10jerkins-bot: [V: 04-1] postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) (owner: 10Hnowlan) [17:14:33] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 127 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:16:07] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 10 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:16:21] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) @wiki_willy and @elukey I do not have enough 10G rack space to fit 24 2U servers, Currently, I have 17 2U spaces in 10G racks. This is a... 
[17:21:23] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 57.19 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:24:25] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [17:25:13] (03PS2) 10Hnowlan: postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) [17:26:00] 10Operations, 10Traffic, 10Upstream: OCSP Stapling for Intermediates - https://phabricator.wikimedia.org/T148134 (10Aklapper) [17:26:33] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 82.89 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:26:37] (03CR) 10jerkins-bot: [V: 04-1] postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) (owner: 10Hnowlan) [17:27:28] 10Operations, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Pchelolo) Ok, it did execute the job twice: Once on 27th: ` 2020-10-27 19:52:28 [34499b04-8b9a-4cd1-95dd-9229906705c7] mw... [17:27:53] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [17:29:17] (03PS3) 10Hnowlan: postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) [17:30:39] (03CR) 10jerkins-bot: [V: 04-1] postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) (owner: 10Hnowlan) [17:35:38] 10Operations, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Pchelolo) Ok, a bit more: ` 16:06 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobque... [17:36:03] (03PS4) 10Hnowlan: postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) [17:37:21] (03CR) 10jerkins-bot: [V: 04-1] postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) (owner: 10Hnowlan) [17:38:14] (03PS1) 10Ahmon Dancy: set cpu_model_extra_flags = vmx,pcid [puppet] - 10https://gerrit.wikimedia.org/r/638146 [17:44:10] (03PS5) 10Hnowlan: postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) [17:45:32] (03CR) 10jerkins-bot: [V: 04-1] postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) (owner: 10Hnowlan) [17:46:22] (03PS1) 10Elukey: profile::analytics::database::meta: specify max_connections for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/638148 [17:48:13] (03CR) 10BryanDavis: "andrewbogott: do we have a way to test 
this in isolation in the codfw cluster?" [puppet] - 10https://gerrit.wikimedia.org/r/638146 (owner: 10Ahmon Dancy) [17:48:43] 10Operations, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: decommission ganeti100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T255553 (10ayounsi) FYI: `lang=diff [edit interfaces interface-range disabled] member ge-7/0/11 { ... } + member ge-4/0/22; + member ge-4/0/23; +... [17:50:49] (03PS6) 10Hnowlan: postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) [17:52:13] (03CR) 10jerkins-bot: [V: 04-1] postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) (owner: 10Hnowlan) [17:53:10] 10Operations, 10DNS, 10Traffic, 10serviceops, 10Services (watching): nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability - https://phabricator.wikimedia.org/T162818 (10Aklapper) 05Stalled→03Open The previous comments don't... [17:54:20] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-deploy100[1-4] - https://phabricator.wikimedia.org/T267050 (10RobH) [17:54:43] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-deploy100[1-4] - https://phabricator.wikimedia.org/T267050 (10RobH) [17:54:50] (03PS1) 10Jgreen: switch payments-listener to codfw for final testing [dns] - 10https://gerrit.wikimedia.org/r/638149 (https://phabricator.wikimedia.org/T265688) [17:56:35] (03CR) 10Jgreen: [C: 03+2] switch payments-listener to codfw for final testing [dns] - 10https://gerrit.wikimedia.org/r/638149 (https://phabricator.wikimedia.org/T265688) (owner: 10Jgreen) [18:00:04] ryankemper: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikidata Query Service weekly deploy . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201102T1800). [18:02:07] (03PS7) 10Hnowlan: postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) [18:05:06] 10Operations, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Pchelolo) So, we have found it: the same exact job has been executed twice. I have deployed change-prop for jobqueue right... [18:10:03] (03PS8) 10Hnowlan: postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) [18:11:45] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [18:13:25] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [18:14:20] !log push new pfw policies - T267051 [18:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:30] !bash < hnowlan> the cookie licks you [18:14:30] bd808: Stored quip at https://bash.toolforge.org/quip/xlAqinUBpU87LSFJ7Mx- [18:15:01] (03PS6) 10Ottomata: Add ::profile::analytics::refinery::network_region_config [puppet] - 10https://gerrit.wikimedia.org/r/637559 (https://phabricator.wikimedia.org/T254332) (owner: 10Mforns) [18:17:33] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [18:17:35] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:34] (03CR) 
10Ottomata: [C: 03+2] Add ::profile::analytics::refinery::network_region_config [puppet] - 10https://gerrit.wikimedia.org/r/637559 (https://phabricator.wikimedia.org/T254332) (owner: 10Mforns) [18:29:54] (03CR) 10Hnowlan: "pcc output: https://puppet-compiler.wmflabs.org/compiler1001/26262/" [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) (owner: 10Hnowlan) [18:31:26] (03CR) 10Giuseppe Lavagetto: [C: 03+1] systemd::timer: fix TODO of adding type definition for timer job [puppet] - 10https://gerrit.wikimedia.org/r/633853 (owner: 10Dzahn) [18:37:43] (03PS1) 10Jgreen: flip payments-listener back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/638155 [18:40:44] ACKNOWLEDGEMENT - Disk space on maps2002 is CRITICAL: DISK CRITICAL - free space: /srv 268 MB (0% inode=99%): Hnowlan cassandra issues being investigated https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps2002&var-datasource=codfw+prometheus/ops [18:40:44] ACKNOWLEDGEMENT - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 272514488256 and 367629 seconds Hnowlan cassandra issues being investigated https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:41:22] (03PS1) 10Jgreen: flip payments-listener back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/638156 (https://phabricator.wikimedia.org/T265688) [18:43:25] (03CR) 10Jgreen: [C: 03+2] flip payments-listener back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/638156 (https://phabricator.wikimedia.org/T265688) (owner: 10Jgreen) [18:47:34] (03PS1) 10Dzahn: decom testvm1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/638159 (https://phabricator.wikimedia.org/T245757) [18:49:36] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [18:49:38] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:49:40] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:43] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [18:49:43] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:21] (03PS2) 10Dzahn: decom testvm1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/638159 (https://phabricator.wikimedia.org/T245757) [18:52:15] (03CR) 10Dzahn: [C: 03+2] decom testvm1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/638159 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn) [18:53:41] jouncebot: next [18:53:41] In 0 hour(s) and 6 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201102T1900) [18:54:24] here [18:54:52] hello DannyS712 [18:55:00] I'm currently doing prep work [18:58:48] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [18:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:02] !log decom'ing testvm1001 [18:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:07] !log added dcaro to ops and wmf ldap groups [18:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201102T1900). [19:00:04] DannyS712: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:10] Still here [19:00:19] and is the sticker cool? 
[19:00:22] I can deploy today! [19:01:33] (03CR) 10Ottomata: [C: 03+1] "Looks like no-op in main cluster. +1" [puppet] - 10https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [19:01:40] let me know when it's ready to test Urbanecm [19:01:52] DannyS712: available at mwdebug1002 now [19:02:57] confirmed to work - page DOM and links reflect desired change [19:03:03] great [19:03:04] s/links/buttons [19:03:16] let me try to test it for real (not disclosing further) [19:04:26] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [19:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:06] DannyS712: confirmed it works, syncing [19:07:20] !log Deployed security fix for T205908 [19:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:30] DannyS712: should work w/o mwdebug now [19:07:48] confirmed [19:07:51] great [19:07:56] anything else? [19:08:12] nope [19:12:25] volans: I have a case of "decom cookbook Failed to run the sre.dns.netbox cookbook: Cumin execution failed" on a ganeti VM in eqiad. should i worry and/or make a ticket? 
it happens after most (or all) other decom steps worked fine [19:15:46] [ERROR clustershell.py:431 in _failed_commands_report] [19:15:57] from the extended log [19:17:14] (03PS9) 10Hnowlan: postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) [19:20:03] PROBLEM - Maps HTTPS on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:21:11] RECOVERY - Maps HTTPS on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:25:47] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:26:02] (03PS1) 10Urbanecm: abusefilter.php: Enable wgAbuseFilterNotificationsPrivate by default for WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/638162 (https://phabricator.wikimedia.org/T266298) [19:26:53] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [19:27:53] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:27:57] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.034 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [19:30:47] (03PS1) 10Aklapper: phabricator weekly changes email: List stalled task stalled for years [puppet] - 10https://gerrit.wikimedia.org/r/638163 (https://phabricator.wikimedia.org/T252522) [19:31:50] (03CR) 10Gehel: [C: 03+1] "LTGM" [puppet] - 10https://gerrit.wikimedia.org/r/638125 (owner: 10Hnowlan) [19:34:44] (03CR) 10Dzahn: "back in the days there was always the "is MobileFrontend extension 
enabled or not" to determine if a wiki is "ready for mobile", as far as" [dns] - 10https://gerrit.wikimedia.org/r/637849 (https://phabricator.wikimedia.org/T152882) (owner: 10Ladsgroup) [19:37:40] (03CR) 10Dzahn: "why is "nyc" special in this patch? It seems to work both mobile and not mobile but it's not in the regular place?" [dns] - 10https://gerrit.wikimedia.org/r/637849 (https://phabricator.wikimedia.org/T152882) (owner: 10Ladsgroup) [19:38:20] (03CR) 10Andrew Bogott: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/638146 (owner: 10Ahmon Dancy) [19:38:37] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10JAnstee_WMF) We are working to get our Director onboarded to phabricator and will hopefully be able to add to the card soon for approval! [19:41:25] (03PS2) 10Dzahn: Update email address for Nuria [puppet] - 10https://gerrit.wikimedia.org/r/636936 (owner: 10Muehlenhoff) [19:45:12] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [19:45:27] (03CR) 10Dzahn: [C: 03+2] Update email address for Nuria [puppet] - 10https://gerrit.wikimedia.org/r/636936 (owner: 10Muehlenhoff) [19:46:19] 10Operations, 10Analytics-Radar, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10Dzahn) email address changed: https://gerrit.wikimedia.org/r/c/operations/puppet/+/636936 [19:46:50] mutante: just run the sre.dns.netbox cookbook manually [19:47:54] volans: just like [cumin1001:~] $ sudo -i cookbook sre.dns.netbox 'sync after decom of ganeti VM' [19:47:58] ? 
[19:48:24] I'd refer to the hostname in the message [19:48:29] but yes that's the gist [19:48:37] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [19:48:38] the message that the decom would have used is: [19:48:39] {hosts} decommissioned, removing all IPs except the asset tag one [19:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:50] 'sync after decom of ganeti VM testvm1001' [19:48:55] that actually for VMs is even inaccurate [19:48:58] k [19:49:28] the clustershell error is probably known to you then [19:49:32] it's running [19:51:16] yeah it's actually a netbox api error, I have a patch to add automatic retry on all netbox api calls [19:51:32] need to update it to use the new wmflib.requests module that abstracts those bits [19:52:07] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:52:09] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:54:24] mutante: the icinga above is yours, but it will recover after the cookbook runs (takes a bit) [19:55:13] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 71%, RTA = 6577.61 ms [19:55:14] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 71%, RTA = 6584.63 ms [19:55:21] volans: ACK, i am at the "done" prompt and confirming it now. 
thank you [19:56:16] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:59] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:03:06] np anytime [20:03:23] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:05:25] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [20:07:17] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:09:06] doesnt look good but since it's the management router it's not UBN and maybe maintenance [20:11:57] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [20:12:11] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 38, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:12:27] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 301.22 ms [20:12:35] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 231.04 ms [20:13:09] looks like it indeed [20:17:44] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Server Moves to Free up 7x 2u Spaces on 10g Racks - https://phabricator.wikimedia.org/T267065 (10wiki_willy) [20:18:52] 10Operations, 
10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10wiki_willy) [20:18:58] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Server Moves to Free up 7x 2u Spaces on 10g Racks - https://phabricator.wikimedia.org/T267065 (10wiki_willy) [20:19:03] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10wiki_willy) Thanks for the heads up @Cmjohnson . @elukey - do you have any servers on existing servers on 10g switches, that you might be able to d... [20:29:33] (03CR) 10Ahmon Dancy: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/638146 (owner: 10Ahmon Dancy) [20:36:53] (03CR) 10Andrew Bogott: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/638146 (owner: 10Ahmon Dancy) [20:43:41] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290 (10Aklapper) @vgutierrez: Is https://gerrit.wikimedia.org/r/c/operations/debs/pybal/+/418866 still wanted? What exactly (task, person?) is this task [stalled](https... [20:43:47] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Server Moves to Free up Space on 10g Racks - https://phabricator.wikimedia.org/T267065 (10wiki_willy) [20:48:52] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Server Moves to Free up Space on 10g Racks - https://phabricator.wikimedia.org/T267065 (10wiki_willy) [20:57:14] (03CR) 10Huji: [C: 04-1] "See phab." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/638162 (https://phabricator.wikimedia.org/T266298) (owner: 10Urbanecm) [21:00:04] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201102T2100). 
[21:03:04] (03PS4) 10Jeena Huneidi: linkrecommendation: Add deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [21:04:43] (03CR) 10jerkins-bot: [V: 04-1] linkrecommendation: Add deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [21:09:07] (03PS2) 10Jeena Huneidi: Scaffold improvements [deployment-charts] - 10https://gerrit.wikimedia.org/r/637753 [21:10:53] (03CR) 10Huji: "Redacted" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/638162 (https://phabricator.wikimedia.org/T266298) (owner: 10Urbanecm) [21:11:37] (03CR) 10Urbanecm: "> Patch Set 1: -Code-Review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/638162 (https://phabricator.wikimedia.org/T266298) (owner: 10Urbanecm) [21:12:40] (03CR) 10Huji: [C: 03+1] abusefilter.php: Enable wgAbuseFilterNotificationsPrivate by default for WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/638162 (https://phabricator.wikimedia.org/T266298) (owner: 10Urbanecm) [21:13:57] (03PS3) 10Jeena Huneidi: Scaffold improvements [deployment-charts] - 10https://gerrit.wikimedia.org/r/637753 [21:14:54] (03CR) 10Jeena Huneidi: Scaffold improvements (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/637753 (owner: 10Jeena Huneidi) [21:19:00] (03CR) 10Huji: [C: 03+1] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/638162 (https://phabricator.wikimedia.org/T266298) (owner: 10Urbanecm) [21:19:30] (03CR) 10Urbanecm: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/638162 (https://phabricator.wikimedia.org/T266298) (owner: 10Urbanecm) [21:24:37] 10Operations, 10DBA, 10Performance-Team, 10Platform Engineering, 10User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (10Krinkle) [21:24:53] 10Operations, 10DBA, 10Performance-Team, 
10Platform Engineering, 10User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (10Krinkle) a:05Krinkle→03nnikkhoui [21:26:18] (03PS5) 10Jeena Huneidi: linkrecommendation: Add deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [21:29:09] RECOVERY - Maps - OSM synchronization lag - eqiad on alert1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 1.601e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [21:31:29] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [21:36:21] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [21:50:51] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:51:29] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [21:51:39] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:52:36] (03CR) 10Ahmon Dancy: "> I probably don't want to flip on a feature on a hypervisor that's running a bunch of different user VMs. 
Can you tell me specifically w" [puppet] - 10https://gerrit.wikimedia.org/r/638146 (owner: 10Ahmon Dancy) [21:52:38] 10Operations, 10Commons, 10SRE-swift-storage: Recently more broken files (premature end of file at 5MB size) that were cross-wiki uploaded to Commons - https://phabricator.wikimedia.org/T266903 (10Draceane) p:05High→03Unbreak! [21:57:33] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:58:11] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [21:58:23] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:00:04] Reedy and sbassett: May I have your attention please! Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201102T2200) [22:03:52] !log applied 113a244a66 on phab1001 to hotfix T240862 [22:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:01] T240862: Can't do shallow clone from phabricator - https://phabricator.wikimedia.org/T240862 [22:07:33] 10Operations, 10DBA, 10Performance-Team, 10Platform Engineering, 10User-Kormat: Document remaining database load groups - https://phabricator.wikimedia.org/T267077 (10nnikkhoui) [22:19:22] !log restart php7.3-fpm on phab1001 [22:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:00] 10Operations, 10Performance-Team, 10Traffic, 10serviceops, 10Performance Issue: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10Dzahn) . 
[22:37:13] (03CR) 10Razzi: "PCC diff: https://puppet-compiler.wmflabs.org/compiler1001/26265/" [puppet] - 10https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [22:37:18] 10Operations, 10DBA, 10Performance-Team, 10Platform Engineering, 10User-Kormat: Document remaining database load groups - https://phabricator.wikimedia.org/T267077 (10ArielGlenn) I can say something about the "dump" group if someone points me at a location and tells me an appropriate format. [22:42:56] (03PS1) 10Razzi: nginx: Remove profile::tlsproxy::service [puppet] - 10https://gerrit.wikimedia.org/r/638185 (https://phabricator.wikimedia.org/T240439) [22:50:09] 10Operations, 10ops-eqiad, 10Reading Epics (Analytics): an-coord1001 ram upgrade - https://phabricator.wikimedia.org/T266709 (10wiki_willy) a:03Cmjohnson [23:03:44] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install ms-be106[0-3] - https://phabricator.wikimedia.org/T265093 (10wiki_willy)