[00:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210223T0000).
[00:00:05] No GERRIT patches in the queue for this window AFAICS.
[00:01:22] RECOVERY - SSH on analytics1058.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:01:43] (PS1) Cwhite: profile, elasticsearch: add option to configure systemd Before= value [puppet] - https://gerrit.wikimedia.org/r/666231 (https://phabricator.wikimedia.org/T275405)
[00:10:35] mutante: thanks, yeah `kibana.service` is known
[00:11:35] (CR) Razzi: "Hi @Bstorm, I see profile::wmcs::db::wikireplicas::dedicated::analytics is included in the new role::wmcs::db::wikireplicas::dedicated::an" [puppet] - https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) (owner: Razzi)
[00:12:38] ACKNOWLEDGEMENT - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
daniel_zahn known https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:12:50] ryankemper: yep, I sent an ACK so that it's not an unhandled alert
[00:17:20] RECOVERY - Check systemd state on webperf2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:18:25] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (RobH) Open→Resolved
[00:19:50] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:24:01] (PS5) Razzi: wikireplicas: Add configuration for clouddb1021 [puppet] - https://gerrit.wikimedia.org/r/661529 (https://phabricator.wikimedia.org/T269211)
[00:30:11] SRE, WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (Legoktm) p:Triage→High
[00:30:39] https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?viewPanel=15&orgId=1&from=now-6h&to=now looks like it's going down, ever so slightly
[00:39:22] (CR) Razzi: [C: -1] "The approach is good and it has all the users / groups that are needed that I know of.
Requesting a very small change, only changing a comm" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/666133 (https://phabricator.wikimedia.org/T231067) (owner: Elukey)
[00:57:52] (CR) Bstorm: "> Patch Set 5:" [puppet] - https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) (owner: Razzi)
[01:00:10] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 15763437280 and 1466 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:43] (PS2) Legoktm: docker_registry_ha: Have restricted/ images that are limited read/write (try 2) [puppet] - https://gerrit.wikimedia.org/r/664683 (https://phabricator.wikimedia.org/T273521)
[01:10:44] (CR) Legoktm: "> Patch Set 1: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/664683 (https://phabricator.wikimedia.org/T273521) (owner: Legoktm)
[01:20:36] (PS6) Legoktm: k8s: Add docker-registry credentials to pull restricted images [puppet] - https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521)
[01:21:00] (PS1) Legoktm: Update docker-registry hiera keys again [labs/private] - https://gerrit.wikimedia.org/r/666238
[01:21:18] (PS3) Legoktm: docker_registry_ha: Have restricted/ images that are limited read/write (try 2) [puppet] - https://gerrit.wikimedia.org/r/664683 (https://phabricator.wikimedia.org/T273521)
[01:21:20] (PS7) Legoktm: k8s: Add docker-registry credentials to pull restricted images [puppet] - https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521)
[01:22:44] (CR) Legoktm: "Rebased on the "(try 2)" patch which changes the account name.
I also included a brief summary of the potential security implications of t" [puppet] - https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) (owner: Legoktm)
[02:07:33] (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.32 [core] (wmf/1.36.0-wmf.32) - https://gerrit.wikimedia.org/r/666243
[02:45:26] PROBLEM - dump of matomo in eqiad on alert1001 is CRITICAL: Last dump for matomo at eqiad (db1108.eqiad.wmnet:3351) taken on 2021-02-23 02:29:51 is 0 GB, but previous one was 1 GB, a change of 35.4% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[03:07:46] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:15:18] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:26:06] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:28:34] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.064 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:53:06] (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.32 [core] (wmf/1.36.0-wmf.32) - https://gerrit.wikimedia.org/r/666243 (https://phabricator.wikimedia.org/T274936) (owner: TrainBranchBot)
[05:08:22] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:13:31] !log krinkle@deploy1001 Started deploy [integration/docroot@44d5685]: I307e8f4f6979
[05:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:38] !log krinkle@deploy1001 Finished deploy
[integration/docroot@44d5685]: I307e8f4f6979 (duration: 00m 06s)
[05:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:00] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state