[00:00:04] twentyafterfour: May I have your attention please! Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210415T0000) [00:00:31] (CR) Legoktm: [C: +1] Revert "depool esams" [dns] - https://gerrit.wikimedia.org/r/679526 (owner: RLazarus) [00:03:05] SRE, serviceops: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn) [00:03:10] SRE, serviceops: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn) [00:04:04] SRE, serviceops: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn) [00:06:00] (CR) RLazarus: [C: +2] Revert "depool esams" [dns] - https://gerrit.wikimedia.org/r/679526 (owner: RLazarus) [00:09:41] SRE, ops-eqiad, DC-Ops, serviceops, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (Dzahn) Hi all, I just made the (so far missing) decom ticket [[T280203]] for mw1261 through mw1301. From the procurement date and ticket in n... [00:18:18] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 44.58 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:19:17] (PS1) Dzahn: site/conftool/DHCP: remove old eqiad appservers in A5, mostly canaries [puppet] - https://gerrit.wikimedia.org/r/679527 (https://phabricator.wikimedia.org/T280203) [00:22:22] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [00:23:05] ^ I think they are a normal effect of the depool revert above. correct me if I'm wrong [00:25:09] yea, the actual grafana chart also looks like esams replaces what is gone from codfw, totals staying the same [00:28:02] (PS1) Dzahn: trafficserver: comment about a server that won't exist anymore [puppet] - https://gerrit.wikimedia.org/r/679530 (https://phabricator.wikimedia.org/T280203) [00:29:08] (PS2) Dzahn: trafficserver: comment about a server that won't exist anymore [puppet] - https://gerrit.wikimedia.org/r/679530 (https://phabricator.wikimedia.org/T280203) [00:29:34] SRE, MediaWiki-Vagrant, phan: It should be possible to install php-ast using apt-get on MediaWiki-Vagrant - https://phabricator.wikimedia.org/T234240 (Reedy) https://packages.debian.org/stretch/php-ast php-ast is packaged for stretch... What's the issue with using that? [00:29:45] SRE, MediaWiki-Vagrant, phan: Phan should work out of the box on MediaWiki-Vagrant - https://phabricator.wikimedia.org/T234240 (Reedy) [00:29:58] PROBLEM - Check systemd state on ms-be1060 is CRITICAL: CRITICAL - degraded: The following units failed: session-70496.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:33:38] ACKNOWLEDGEMENT - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 22.28 le 60 daniel_zahn esams repool https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:34:22] ACKNOWLEDGEMENT - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. 
daniel_zahn esams repool https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [00:46:56] PROBLEM - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:06] PROBLEM - Check systemd state on logstash2006 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:44] PROBLEM - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:48] PROBLEM - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:52] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:58] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:48:58] PROBLEM - Check systemd state on logstash2004 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:00] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:10] PROBLEM - Check systemd state on logstash2005 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:50:40] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [00:56:00] RECOVERY - Check systemd state on ms-be1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:19:00] !log mwscript extensions/Wikibase/repo/maintenance/changePropertyDataType.php wikidatawiki --property-id P8671 --new-data-type external-id (T278427) [01:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:08] T278427: Convert Deutsche Bahn station code (P8671) from String to External Identifier - https://phabricator.wikimedia.org/T278427 [01:22:58] PROBLEM - Check systemd state on ores2001 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:25:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:28:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:38:27] SRE, MediaWiki-Vagrant, phan: Phan should work out of the box on MediaWiki-Vagrant - https://phabricator.wikimedia.org/T234240 (Reedy) Hmm. So that wants to install PHP 7.0... I guess this is blocked by {T256822}... Though https://packages.debian.org/buster/php-ast might be a bit too old.. 
[01:38:59] SRE, MediaWiki-Vagrant, phan: Phan should work out of the box on MediaWiki-Vagrant - https://phabricator.wikimedia.org/T234240 (Reedy) [01:39:18] PROBLEM - ores_workers_running on ores2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [01:42:12] RECOVERY - Check systemd state on ores2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:44:00] RECOVERY - ores_workers_running on ores2001 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [01:44:05] SRE: Package php-ast in {stretch,buster}-wikimedia/component - https://phabricator.wikimedia.org/T280210 (Reedy) [01:47:00] SRE, Packaging: Package php-ast in {stretch,buster}-wikimedia/component - https://phabricator.wikimedia.org/T280210 (Reedy) [01:47:22] SRE, Packaging: Update php-xdebug to 2.7.2 in apt.wikimedia.org - https://phabricator.wikimedia.org/T263933 (Reedy) [02:00:34] RECOVERY - WDQS high update lag on wdqs2001 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.139e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:22:53] ACKNOWLEDGEMENT - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage.service cole_white known - T274394 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:22:53] ACKNOWLEDGEMENT - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage.service cole_white known - T274394 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:22:53] ACKNOWLEDGEMENT - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The following units failed: 
curator_actions_apifeatureusage.service cole_white known - T274394 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:22:53] ACKNOWLEDGEMENT - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service cole_white known - T274394 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:22:53] ACKNOWLEDGEMENT - Check systemd state on logstash2004 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage.service cole_white known - T274394 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:22:54] ACKNOWLEDGEMENT - Check systemd state on logstash2005 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage.service cole_white known - T274394 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:22:54] ACKNOWLEDGEMENT - Check systemd state on logstash2006 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage.service cole_white known - T274394 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:22:55] ACKNOWLEDGEMENT - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service cole_white known - T274394 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:25:47] (PS1) Cwhite: elasticsearch: curator remove stdout redirect [puppet] - https://gerrit.wikimedia.org/r/679553 (https://phabricator.wikimedia.org/T274394) [03:18:02] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:36:38] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.070 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:01:41] (CR) Ryan Kemper: [C: 
+2] WDQS: Wait for updater to catchup during data transfer. [cookbooks] - https://gerrit.wikimedia.org/r/679320 (https://phabricator.wikimedia.org/T280108) (owner: Gehel) [04:06:00] !log T280108 T267927 Merged https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/679320, will verify correct behavior of `data-transfer` cookbook [04:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:06:11] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [04:06:12] T280108: WDQS data-transfer cookbook needs to wait for updater to catchup on lag - https://phabricator.wikimedia.org/T280108 [04:14:14] !log T280108 T267927 `wdqs2008` (source) caught up on lag, xfering to `wdqs1004`: `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs1004.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927` [04:14:18] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [04:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:23] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [04:14:24] T280108: WDQS data-transfer cookbook needs to wait for updater to catchup on lag - https://phabricator.wikimedia.org/T280108 [04:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:21:10] (PS2) Razzi: clouddb: enable alerting for clouddb1021 [puppet] - https://gerrit.wikimedia.org/r/677977 (https://phabricator.wikimedia.org/T269211) [04:22:26] (CR) Razzi: "I raised the memory monitoring to be 95% to warn and 98% to alert, is this cutting it too close @elukey? 
If not we might have to lower the" [puppet] - https://gerrit.wikimedia.org/r/677977 (https://phabricator.wikimedia.org/T269211) (owner: Razzi) [04:22:58] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (backup1002), Fresh: 96 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:27:27] (CR) Razzi: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/677977 (https://phabricator.wikimedia.org/T269211) (owner: Razzi) [05:02:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166 to clone db1179 T275633', diff saved to https://phabricator.wikimedia.org/P15344 and previous config saved to /var/cache/conftool/dbconfig/20210415-050239-marostegui.json [05:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:49] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [05:05:25] (PS1) Razzi: netboot: WIP make flerovium reuse /srv directory [puppet] - https://gerrit.wikimedia.org/r/679607 (https://phabricator.wikimedia.org/T278421) [05:06:07] (PS2) Razzi: netboot: WIP make flerovium reuse /srv directory [puppet] - https://gerrit.wikimedia.org/r/679607 (https://phabricator.wikimedia.org/T278421) [05:11:02] (PS1) Marostegui: production.my.cnf: Add innodb_change_buffering = none [puppet] - https://gerrit.wikimedia.org/r/679609 (https://phabricator.wikimedia.org/T263443) [05:11:35] (CR) Marostegui: [C: +2] production.my.cnf: Add innodb_change_buffering = none [puppet] - https://gerrit.wikimedia.org/r/679609 (https://phabricator.wikimedia.org/T263443) (owner: Marostegui) [05:20:52] (PS1) Marostegui: mariadb: Productionize db1179 [puppet] - https://gerrit.wikimedia.org/r/679611 (https://phabricator.wikimedia.org/T275633) [05:29:49] (CR) Marostegui: [C: +2] mariadb: Productionize db1179 [puppet] - https://gerrit.wikimedia.org/r/679611 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui) [05:30:05] 
(PS1) Nray: Add mediawiki.pref_diff stream to wgEventLoggingStreamNames/wgEventStreams [mediawiki-config] - https://gerrit.wikimedia.org/r/679613 (https://phabricator.wikimedia.org/T277588) [05:30:07] (PS1) Nray: Add $wgWMEVectorPrefDiffSalt to private/readme [mediawiki-config] - https://gerrit.wikimedia.org/r/679614 (https://phabricator.wikimedia.org/T277588) [05:44:39] !log start deleting archive of wikidata-bugs T262773 [05:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:48] T262773: Stop archiving the wikidata-bugs mailinglist in pipermail - https://phabricator.wikimedia.org/T262773 [05:54:27] !log end of cleaning archive of pywikibot-bugs and wikidata-bugs T262773 [05:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:36] T262773: Stop archiving the wikidata-bugs mailinglist in pipermail - https://phabricator.wikimedia.org/T262773 [06:00:30] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [06:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:20] RECOVERY - Query Service HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [06:01:34] RECOVERY - WDQS high update lag on wdqs2004 is OK: (C)3600 ge (W)1200 ge 505.3 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [06:02:22] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.076 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:04:34] (CR) Giuseppe Lavagetto: "Are we really using this anywhere? I don't think it's used by any of our code, what about removing it?" 
[puppet] - https://gerrit.wikimedia.org/r/677620 (owner: Hnowlan) [06:06:44] (CR) Giuseppe Lavagetto: [C: +1] Update rewrite rule for https://www.mediawiki.org/FAQ [puppet] - https://gerrit.wikimedia.org/r/676508 (owner: DannyS712) [06:14:10] (CR) Elukey: "It is convenient for us (Analytics) to test AQS in wmcs without the need of a deploy server with scap (outside deployment-prep basically)." [puppet] - https://gerrit.wikimedia.org/r/677620 (owner: Hnowlan) [06:16:20] SRE, DC-Ops, SRE-tools, Sustainability (Incident Followup): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (Marostegui) @Volans thanks for the explanation! There's something I have been wonder... [06:16:45] (CR) Giuseppe Lavagetto: "> Side note: I know that aqs should probably be better in k8s but it is currently something difficult for us to do (stale restbase deps, w" [puppet] - https://gerrit.wikimedia.org/r/677620 (owner: Hnowlan) [06:17:13] (CR) Elukey: clouddb: enable alerting for clouddb1021 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/677977 (https://phabricator.wikimedia.org/T269211) (owner: Razzi) [06:31:48] (CR) Elukey: [C: +2] Set hue.wikimedia.org for an-tool1009 [puppet] - https://gerrit.wikimedia.org/r/678860 (https://phabricator.wikimedia.org/T264896) (owner: Elukey) [06:32:31] !log move hue.wikimedia.org to an-tool1009 (from analytics-tool1001) [06:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:24] (CR) Elukey: [C: +2] Move hue.wikimedia.org to the an-tool1009 backend [puppet] - https://gerrit.wikimedia.org/r/678861 (https://phabricator.wikimedia.org/T264896) (owner: Elukey) [06:33:53] !log !log T280108 T267927 `data-transfer` to `wdqs1004` was successful; cookbook failed due to a newly introduced minor type error that didn't affect the transfer itself 
[06:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:02] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [06:34:03] T280108: WDQS data-transfer cookbook needs to wait for updater to catchup on lag - https://phabricator.wikimedia.org/T280108 [06:34:20] * ryankemper wrote !log twice, will edit message manually [06:37:49] running puppet on A:cp-text nodes to pick up the hue changes [06:37:59] (batch of 4, sleep 30s) [06:42:01] SRE, ops-eqiad, DC-Ops: ps1-a7-eqiad power over threshold alerts - https://phabricator.wikimedia.org/T276743 (ayounsi) It keeps alerting, I disabled alerting for that device until then. Once fixed please re-enable it in https://librenms.wikimedia.org/device/43/edit [06:52:18] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (ayounsi) >>! In T272403#6984722, @Cmjohnson wrote: > @aborrero The 2nd interfaces are > cloudgw1001 cloudsw1-c8 xe-0/0/19 cabl... [06:57:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: Repool db1166 after cloning db1179', diff saved to https://phabricator.wikimedia.org/P15346 and previous config saved to /var/cache/conftool/dbconfig/20210415-065704-root.json [06:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:08] (PS1) Marostegui: install_server: Do not reimage db1180 [puppet] - https://gerrit.wikimedia.org/r/679693 (https://phabricator.wikimedia.org/T275633) [07:07:17] SRE, ops-codfw, DC-Ops, serviceops: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (Joe) Thanks @Papaul! We'll now work on service implementation. 
[07:08:26] (CR) Marostegui: [C: +2] install_server: Do not reimage db1180 [puppet] - https://gerrit.wikimedia.org/r/679693 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui) [07:12:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Repool db1166 after cloning db1179', diff saved to https://phabricator.wikimedia.org/P15347 and previous config saved to /var/cache/conftool/dbconfig/20210415-071207-root.json [07:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1146 (s2,s4) to upgrade kernel', diff saved to https://phabricator.wikimedia.org/P15348 and previous config saved to /var/cache/conftool/dbconfig/20210415-071600-marostegui.json [07:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 25%: Repool db1146:3314 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15350 and previous config saved to /var/cache/conftool/dbconfig/20210415-072436-root.json [07:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: Repool db1166 after cloning db1179', diff saved to https://phabricator.wikimedia.org/P15351 and previous config saved to /var/cache/conftool/dbconfig/20210415-072711-root.json [07:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:39] PROBLEM - Thanos compact is halted on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe2001 job=thanos-compact prometheus=ops site=codfw https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [07:34:18] (PS1) Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 
https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [07:37:32] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.05022 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [07:37:36] (CR) Ryan Kemper: elasticsearch: refactor various rolling operations (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper) [07:37:47] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [07:38:06] <_joe_> sigh [07:38:08] hi [07:38:13] mmm [07:38:18] looking [07:38:27] checking s4 master [07:38:29] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 1 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [07:38:31] <_joe_> appserver affected too [07:38:33] around [07:38:38] <_joe_> I think it's db related yes [07:39:00] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is CRITICAL: 0.2973 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [07:39:02] <_joe_> and yes related to s4 [07:39:04] yes [07:39:10] (CR) jerkins-bot: [V: 
-1] elasticsearch: refactor various rolling operations [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper) [07:39:15] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [07:39:22] marostegui: what should we do ? [07:39:24] * volans here [07:39:29] effie: I am checking [07:39:37] can somebody become IC please ? :) [07:39:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 50%: Repool db1146:3314 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15352 and previous config saved to /var/cache/conftool/dbconfig/20210415-073940-root.json [07:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:51] <_joe_> elukey: wait [07:39:51] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.7727 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [07:40:01] <_joe_> it's probably simple enough we don't need one [07:40:11] I am seeing lots of select on commons master like: SELECT /* LinksUpdate::acquirePageLock */ GET_LOCK('commonswiki:LinksUpdate:atomicity:pageid:10367 [07:40:21] here too [07:40:29] <_joe_> linksupdate? 
[07:40:41] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 258 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:41:03] https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=37&orgId=1&from=now-24h&to=now&var-server=db1138&var-port=9104 [07:41:06] that's s4 master [07:41:33] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [07:41:34] <_joe_> is it recovering? [07:41:36] <_joe_> yeah [07:41:47] _joe_: I still see them [07:41:55] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 351 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:42:10] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5777 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [07:42:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Repool db1166 after cloning db1179', diff saved to https://phabricator.wikimedia.org/P15353 and previous config saved to /var/cache/conftool/dbconfig/20210415-074214-root.json [07:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:25] RECOVERY - High average 
GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [07:42:41] [Exception RuntimeException] (/srv/mediawiki/php-1.37.0-wmf.1/includes/deferred/LinksUpdate.php:189) Could not acquire lock for page ID '51023886'. [07:42:50] they look gone now [07:42:52] a bunch of those, which is why it might be recovering? [07:43:07] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.381 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [07:43:29] legoktm: the master was processing lots of: SELECT /* LinksUpdate::acquirePageLock */ GET_LOCK('commonswiki:LinksUpdate:atomicity:pageid:10367 [07:43:33] no idea what that is [07:43:38] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5012 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [07:43:45] https://commons.wikimedia.org/?curid=10367 [07:43:46] I am seeing another spike of them [07:43:52] LinksUpdate updates the *_links tables used for backlink tracking after an edit [07:43:52] show processlist; [07:43:53] gah [07:44:04] just some random file?? 
[07:44:15] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 7 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:44:16] legoktm: I am seeing lots of different pageid [07:44:21] hmm [07:44:42] | 2189852520 | wikiuser | 10.64.16.168:50392 | commonswiki | Query | 7 | User lock | SELECT /* LinksUpdate::acquirePageLock */ GET_LOCK('commonswiki:LinksUpdate:atomicity:pageid:103679733' [07:44:48] or [07:44:51] | 2189853872 | wikiuser | 10.64.16.154:56092 | commonswiki | Query | 0 | User lock | SELECT /* LinksUpdate::acquirePageLock */ GET_LOCK('commonswiki:LinksUpdate:atomicity:pageid:4454357', 15) AS lockstatus | 0.000 | [07:45:19] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:46:27] see _security [07:46:50] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.08139 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [07:46:50] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.6212 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [07:47:05] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [07:48:31] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [07:48:33] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/description/{title} (Get description for test page) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was r [07:48:33] }/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [07:49:57] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 758 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:50:41] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [07:53:07] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [07:53:29] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 950 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:54:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 75%: Repool db1146:3314 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15354 and previous config saved to /var/cache/conftool/dbconfig/20210415-075444-root.json [07:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:47] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 8 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:56:04] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.1779 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [07:56:04] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.6061 gt 0.3 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [07:56:51] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 37 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:57:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Repool db1166 after cloning db1179', diff saved to https://phabricator.wikimedia.org/P15355 and previous config saved to /var/cache/conftool/dbconfig/20210415-075718-root.json [07:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:29] PROBLEM - Apache HTTP on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:57:31] PROBLEM - Apache HTTP on mw1374 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:57:31] PROBLEM - Apache HTTP on mw1379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:59:41] RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:59:41] RECOVERY - Apache HTTP on mw1379 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:59:41] RECOVERY - Apache HTTP on mw1374 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:59:52] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is CRITICAL: 0.1521 lt 0.3 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [08:00:41] SRE, LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (Lea_WMDE) I can confirm @Manuel is on my team and approve. [08:01:16] (PS1) Ryan Kemper: wdqs: result is list not dictionary [cookbooks] - https://gerrit.wikimedia.org/r/679702 (https://phabricator.wikimedia.org/T280108) [08:01:56] (PS2) Ryan Kemper: wdqs: result is list not dictionary [cookbooks] - https://gerrit.wikimedia.org/r/679702 (https://phabricator.wikimedia.org/T280108) [08:02:15] For all users - the SRE team is aware of the current issues to the wikis, work is in progress to fix it [08:02:47] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 861 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:03:02] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5015 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [08:03:47] (CR) Volans: "[optional] given the refactor effort it would be nice to migrate to the new class API (see https://doc.wikimedia.org/spicerack/master/intr" [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper) [08:04:30] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.6903 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [08:04:31] (CR) jerkins-bot: [V: -1] wdqs: result is list not dictionary [cookbooks] - https://gerrit.wikimedia.org/r/679702 (https://phabricator.wikimedia.org/T280108) (owner: Ryan Kemper) [08:05:16] (CR) Ryan Kemper: "See https://phabricator.wikimedia.org/T280108#7001720 for testing" [cookbooks] - https://gerrit.wikimedia.org/r/679702 (https://phabricator.wikimedia.org/T280108) (owner: Ryan Kemper) [08:05:21] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [08:05:37] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [08:07:25] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:09:10] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is CRITICAL: 0.1213 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [08:09:33] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most [08:09:33] January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [08:09:47] PROBLEM - PHP7 rendering on mw1379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:09:47] PROBLEM - PHP7 rendering on mw1297 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:09:47] PROBLEM - PHP7 rendering on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:09:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 100%: Repool db1146:3314 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15356 and previous config saved to /var/cache/conftool/dbconfig/20210415-080947-root.json [08:09:49] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [08:09:49] PROBLEM - PHP7 rendering on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:09:53] PROBLEM - PHP7 rendering on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:09:55] PROBLEM - PHP7 rendering on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:09:55] PROBLEM - Apache HTTP on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:09:55] PROBLEM - PHP7 rendering on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:09:57] PROBLEM - PHP7 rendering on mw1402 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:09:57] PROBLEM - Apache HTTP on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [08:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:59] PROBLEM - PHP7 rendering on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:10:01] PROBLEM - PHP7 rendering on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:10:02] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [08:10:03] PROBLEM - Apache HTTP on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:10:03] PROBLEM - PHP7 rendering on mw1359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:10:03] PROBLEM - PHP7 rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:10:05] PROBLEM - Apache HTTP on mw1356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:10:05] PROBLEM - Apache HTTP on mw1376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:10:05] PROBLEM - PHP7 rendering on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:10:05] PROBLEM - PHP7 rendering on mw1358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:10:07] PROBLEM - PHP7 rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:10:07] PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:10:11] PROBLEM - Apache HTTP on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:10:13] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [08:10:17] PROBLEM - PHP7 rendering on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:10:19] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [08:10:19] PROBLEM - Apache HTTP on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:10:31] PROBLEM - PHP7 rendering on mw1356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:10:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=G [08:11:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1146:3314', diff saved to https://phabricator.wikimedia.org/P15357 and previous config saved to /var/cache/conftool/dbconfig/20210415-081127-marostegui.json [08:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:49] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [08:11:51] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [08:12:01] RECOVERY - PHP7 rendering on mw1315 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:12:01] RECOVERY - PHP7 rendering on mw1379 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:12:01] RECOVERY - PHP7 rendering on mw1297 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.401 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:12:03] RECOVERY - PHP7 rendering on mw1290 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:12:07] 
RECOVERY - PHP7 rendering on mw1281 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:12:09] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:12:09] RECOVERY - PHP7 rendering on mw1287 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:12:09] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 1065 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:12:11] RECOVERY - PHP7 rendering on mw1288 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 2.107 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:12:11] RECOVERY - PHP7 rendering on mw1402 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:12:11] RECOVERY - Apache HTTP on mw1287 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:12:13] RECOVERY - PHP7 rendering on mw1377 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.831 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:12:15] RECOVERY - PHP7 rendering on mw1345 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:12:17] RECOVERY - PHP7 rendering on mw1289 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.035 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:12:17] RECOVERY - Apache HTTP on mw1380 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:12:19] RECOVERY - PHP7 rendering on mw1359 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.561 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:12:19] RECOVERY - Apache HTTP on mw1376 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:12:19] RECOVERY - PHP7 rendering on mw1358 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:12:19] RECOVERY - PHP7 rendering on mw1362 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.087 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:12:21] RECOVERY - Apache HTTP on mw1356 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.407 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:12:21] RECOVERY - PHP7 rendering on mw1285 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:12:21] RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 0.339 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:12:25] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.5303 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [08:12:25] RECOVERY - Apache HTTP on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.045 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [08:12:29] RECOVERY - PHP7 rendering on mw1283 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:12:31] RECOVERY - Apache HTTP on mw1297 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 0.241 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:12:33] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [08:12:43] RECOVERY - PHP7 rendering on mw1356 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.218 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:13:17] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:13:54] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.6156 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [08:14:11] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [08:14:29] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [08:14:48] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5816 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [08:14:48] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.0303 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [08:15:35] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 659 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:16:51] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 17 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:17:21] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [08:18:03] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09524 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [08:20:15] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:21:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 5%: Repool db1146:3314 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15358 and previous config saved to /var/cache/conftool/dbconfig/20210415-082115-root.json [08:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:31] (CR) Arturo Borrero Gonzalez: "> Patch Set 4:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) (owner: Arturo Borrero Gonzalez) [08:28:15] SRE, DC-Ops, SRE-tools, Sustainability (Incident Followup): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (Volans) @Marostegui yes, I didn't mention all the other efforts because a bit off to... 
[08:33:11] (PS1) Marostegui: mariadb: Productionzie db1182 [puppet] - https://gerrit.wikimedia.org/r/679707 (https://phabricator.wikimedia.org/T275633) [08:34:16] (CR) Marostegui: [C: +2] mariadb: Productionzie db1182 [puppet] - https://gerrit.wikimedia.org/r/679707 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui) [08:34:18] (PS2) Arturo Borrero Gonzalez: Update NAT exceptions for kraz -> irc1001/irc2001 [puppet] - https://gerrit.wikimedia.org/r/679278 (https://phabricator.wikimedia.org/T280225) (owner: Muehlenhoff) [08:36:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 10%: Repool db1146:3314 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15359 and previous config saved to /var/cache/conftool/dbconfig/20210415-083618-root.json [08:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:52] (PS1) Arturo Borrero Gonzalez: cloud: drop NAT exception for IRCD (kraz) [puppet] - https://gerrit.wikimedia.org/r/679709 (https://phabricator.wikimedia.org/T280225) [08:44:53] (PS1) Arturo Borrero Gonzalez: cr/firewall: dro IRCD exception (kraz) [homer/public] - https://gerrit.wikimedia.org/r/679716 (https://phabricator.wikimedia.org/T280225) [08:47:27] PROBLEM - Check systemd state on es1020 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:48:02] !log free space and bounce thanos-compact [08:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:37] RECOVERY - Thanos compact is halted on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [08:50:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 10%: Repool db1146:3312', diff saved to https://phabricator.wikimedia.org/P15360 and previous config saved to /var/cache/conftool/dbconfig/20210415-085017-root.json [08:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:57] (CR) Muehlenhoff: [C: +1] "Makes sense, irc1001 is serving irc.wikimedia.org since 25th of March without such an exception and AFAICT there were no issues/complaints" [puppet] - https://gerrit.wikimedia.org/r/679709 (https://phabricator.wikimedia.org/T280225) (owner: Arturo Borrero Gonzalez) [08:51:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 25%: Repool db1146:3314 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15361 and previous config saved to /var/cache/conftool/dbconfig/20210415-085122-root.json [08:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:45] (CR) Muehlenhoff: [C: +1] "Makes sense, irc1001 is serving irc.wikimedia.org since 25th of March without such an exception and AFAICT there were no issues/complaints" [homer/public] - https://gerrit.wikimedia.org/r/679716 (https://phabricator.wikimedia.org/T280225) (owner: Arturo Borrero Gonzalez) [09:00:00] RECOVERY - Check systemd state on es1020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:52] (CR) Ema: [C: +2] cache_upload: set nuke_limit to 1000 [puppet] - https://gerrit.wikimedia.org/r/679364 (https://phabricator.wikimedia.org/T275809) (owner: Ema) [09:04:10] (CR) Arturo Borrero Gonzalez: [C: +2] cloud: drop NAT exception for IRCD (kraz) [puppet] - https://gerrit.wikimedia.org/r/679709 
(https://phabricator.wikimedia.org/T280225) (owner: Arturo Borrero Gonzalez) [09:04:17] !log installing tomcat security updates [09:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:58] (CR) Jbond: [C: +2] P:debmonitor::server: switch to mod_proxy_uwsgi [puppet] - https://gerrit.wikimedia.org/r/679399 (owner: Jbond) [09:05:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 25%: Repool db1146:3312', diff saved to https://phabricator.wikimedia.org/P15362 and previous config saved to /var/cache/conftool/dbconfig/20210415-090520-root.json [09:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:30] SRE, Traffic, Patch-For-Review: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 (ema) I've added a dashboard called [[https://grafana.wikimedia.org/d/JTAWecXGk/varnish-anomalies?orgId=1 | Varnish Anomalies ]], currently plotting when `nuke_limit`... 
[09:06:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 50%: Repool db1146:3314 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15363 and previous config saved to /var/cache/conftool/dbconfig/20210415-090625-root.json [09:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:07] (03PS1) 10David Caro: wmcs.novafullstack: Point to a dedicated runbook [puppet] - 10https://gerrit.wikimedia.org/r/679721 [09:07:26] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: introduce eqiad1 service implementation [puppet] - 10https://gerrit.wikimedia.org/r/675556 (https://phabricator.wikimedia.org/T270704) [09:07:42] (03Abandoned) 10Arturo Borrero Gonzalez: Update NAT exceptions for kraz -> irc1001/irc2001 [puppet] - 10https://gerrit.wikimedia.org/r/679278 (https://phabricator.wikimedia.org/T280225) (owner: 10Muehlenhoff) [09:08:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] linkrecommendation: Use the main_app resources for loaddatasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/679347 (owner: 10Alexandros Kosiaris) [09:08:45] (03PS1) 10Jbond: Revert "P:debmonitor::server: switch to mod_proxy_uwsgi" [puppet] - 10https://gerrit.wikimedia.org/r/679457 [09:10:00] (03Merged) 10jenkins-bot: linkrecommendation: Use the main_app resources for loaddatasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/679347 (owner: 10Alexandros Kosiaris) [09:10:05] 10SRE, 10Wikimedia-Incident: Uncached wiki requests partially unavailable due to excesive request rates from a bot - https://phabricator.wikimedia.org/T280232 (10jcrespo) [09:10:12] (03CR) 10Jbond: [C: 03+2] Revert "P:debmonitor::server: switch to mod_proxy_uwsgi" [puppet] - 10https://gerrit.wikimedia.org/r/679457 (owner: 10Jbond) [09:12:32] 10SRE, 10Wikimedia-Incident: Uncached wiki requests partially unavailable due to excesive request rates from a bot - https://phabricator.wikimedia.org/T280232 (10jcrespo) [09:13:32] PROBLEM - Check 
systemd state on es1025 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:36] PROBLEM - Check systemd state on mw1345 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:40] PROBLEM - Check systemd state on ms-fe1008 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:44] PROBLEM - Check systemd state on mc2021 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:44] PROBLEM - Check systemd state on elastic2025 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:52] PROBLEM - Check systemd state on ores2008 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:55] 10SRE, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (10Manuel) Hi everyone and thank you for the welcome! 
@KFrancis My e-mail address is: manuel.merz@wikimedia.de [09:14:06] PROBLEM - Check systemd state on an-worker1105 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:16] PROBLEM - Check systemd state on ganeti1014 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:24] PROBLEM - Check systemd state on db2078 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:34] PROBLEM - Check systemd state on es1020 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:36] PROBLEM - Check systemd state on mw1379 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:39] * jbond42 looking [09:14:48] PROBLEM - Check systemd state on mw2322 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:48] PROBLEM - Check systemd state on ores1004 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:56] PROBLEM - Check systemd state on parse2020 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:15:04] PROBLEM - Check systemd state on search-loader1001 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:15:14] PROBLEM - Check 
systemd state on mw2298 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:15:48] RECOVERY - Check systemd state on elastic2025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:15:54] RECOVERY - Check systemd state on ores2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:05] !log cp-upload: varnishadm -n frontend param.set nuke_limit 1000 T275809 [09:16:08] RECOVERY - Check systemd state on an-worker1105 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:13] T275809: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 [09:16:18] RECOVERY - Check systemd state on ganeti1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:24] RECOVERY - Check systemd state on db2078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:36] RECOVERY - Check systemd state on es1020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:38] RECOVERY - Check systemd state on mw1379 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:48] RECOVERY - Check systemd state on mw2322 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:50] RECOVERY - Check systemd state on ores1004 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:58] RECOVERY - Check systemd state on parse2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:17:04] 10SRE, 10Wikimedia-Incident: Uncached wiki requests partially unavailable due to excesive request rates from a bot - https://phabricator.wikimedia.org/T280232 (10jcrespo) [09:17:04] RECOVERY - Check systemd state on search-loader1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:17:16] RECOVERY - Check systemd state on mw2298 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:17:34] RECOVERY - Check systemd state on es1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:17:42] RECOVERY - Check systemd state on mw1345 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:17:46] RECOVERY - Check systemd state on ms-fe1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:17:52] RECOVERY - Check systemd state on mc2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:20:12] (03PS1) 10Jbond: hiera - sso-debmon: drop tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/679724 [09:20:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 50%: Repool db1146:3312', diff saved to https://phabricator.wikimedia.org/P15364 and previous config saved to /var/cache/conftool/dbconfig/20210415-092024-root.json [09:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:39] (03CR) 10Jbond: [V: 03+2 C: 03+2] hiera - sso-debmon: drop tlsproxy 
[puppet] - 10https://gerrit.wikimedia.org/r/679724 (owner: 10Jbond) [09:21:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 75%: Repool db1146:3314 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15365 and previous config saved to /var/cache/conftool/dbconfig/20210415-092129-root.json [09:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:56] (03PS1) 10Jbond: hiera: sso-debmon add debmon-client [puppet] - 10https://gerrit.wikimedia.org/r/679726 [09:27:12] (03CR) 10Jbond: [V: 03+2 C: 03+2] hiera: sso-debmon add debmon-client [puppet] - 10https://gerrit.wikimedia.org/r/679726 (owner: 10Jbond) [09:28:40] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.496e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:30:11] 10SRE, 10Release-Engineering-Team, 10SRE-Access-Requests: Requesting deployment access for HMonroy - https://phabricator.wikimedia.org/T280177 (10fgiunchedi) [09:31:25] (03PS1) 10Jbond: P:debmonitor::server: switch to mod_proxy_uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/679458 [09:32:25] 10SRE, 10Release-Engineering-Team, 10SRE-Access-Requests: Requesting deployment access for HMonroy - https://phabricator.wikimedia.org/T280177 (10fgiunchedi) Thank you @Dzahn ! We're indeed seeking approval from #release-engineering-team (cc @thcipriani perhaps?) 
[09:35:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 75%: Repool db1146:3312', diff saved to https://phabricator.wikimedia.org/P15366 and previous config saved to /var/cache/conftool/dbconfig/20210415-093527-root.json [09:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:10] (03PS1) 10JMeybohm: Setup new etcd3 cluster on conf200[456] in codfw [puppet] - 10https://gerrit.wikimedia.org/r/679727 (https://phabricator.wikimedia.org/T271573) [09:36:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 100%: Repool db1146:3314 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15367 and previous config saved to /var/cache/conftool/dbconfig/20210415-093633-root.json [09:36:38] (03PS1) 10Muehlenhoff: Failover IDP to idp2001 [dns] - 10https://gerrit.wikimedia.org/r/679728 [09:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:36] (03CR) 10Filippo Giunchedi: "LGTM, though note that we'll also need a review similar to https://gerrit.wikimedia.org/r/c/operations/homer/public/+/679296 for this and " [puppet] - 10https://gerrit.wikimedia.org/r/679411 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [09:40:50] (03CR) 10Muehlenhoff: [C: 03+2] Failover IDP to idp2001 [dns] - 10https://gerrit.wikimedia.org/r/679728 (owner: 10Muehlenhoff) [09:43:37] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29054/console" [puppet] - 10https://gerrit.wikimedia.org/r/679399 (owner: 10Jbond) [09:44:44] (03PS1) 10JMeybohm: Add SRV records for new etcd3 cluster in codfw [dns] - 10https://gerrit.wikimedia.org/r/679731 (https://phabricator.wikimedia.org/T271573) [09:46:01] (03PS1) 10Jbond: P:debmonitor::server: switch to mod_proxy_uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/679732 [09:46:27] (03PS2) 10JMeybohm: Setup new etcd3 cluster on 
conf200[456] in codfw [puppet] - 10https://gerrit.wikimedia.org/r/679727 (https://phabricator.wikimedia.org/T271573) [09:46:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29055/console" [puppet] - 10https://gerrit.wikimedia.org/r/679732 (owner: 10Jbond) [09:47:01] (03PS2) 10JMeybohm: Add SRV records for new etcd3 cluster in codfw [dns] - 10https://gerrit.wikimedia.org/r/679731 (https://phabricator.wikimedia.org/T271573) [09:48:17] (03CR) 10Majavah: [C: 04-1] Setup new etcd3 cluster on conf200[456] in codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679727 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [09:50:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 100%: Repool db1146:3312', diff saved to https://phabricator.wikimedia.org/P15368 and previous config saved to /var/cache/conftool/dbconfig/20210415-095031-root.json [09:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:35] PROBLEM - Thanos compact is halted on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe2001 job=thanos-compact prometheus=ops site=codfw https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:53:32] (03PS1) 10JMeybohm: Add key for _etcd-server-ssl._tcp.v3.codfw.wmnet.key [labs/private] - 10https://gerrit.wikimedia.org/r/679734 (https://phabricator.wikimedia.org/T271573) [09:56:05] 10SRE, 10Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (10Aklapper) [09:57:07] 10SRE, 10Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (10Aklapper) Heads-up to @Schlurcher [09:59:42] (03PS1) 10Elukey: Add kafka-logging100{2,3} IPs to the kafka 
term of analytics filters [homer/public] - 10https://gerrit.wikimedia.org/r/679740 (https://phabricator.wikimedia.org/T279342) [10:00:04] mvolz: Your horoscope predicts another unfortunate [[mw:Services|Services]] – [[mw:Citoid|Citoid]] / Zotero deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210415T1000). [10:05:37] (03PS2) 10Jbond: P:debmonitor::server: switch to mod_proxy_uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/679732 [10:06:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29056/console" [puppet] - 10https://gerrit.wikimedia.org/r/679732 (owner: 10Jbond) [10:08:15] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [10:08:15] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [10:08:15] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . [10:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:43] 10SRE, 10Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (10Ladsgroup) Requests the bot made are not that bad but probably had a terrible regression in wmf.1 (action=purge format=json forcelinkupdate= pageids... 
[10:12:13] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitor::server: switch to mod_proxy_uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/679732 (owner: 10Jbond) [10:13:17] (03PS1) 10Jbond: Revert "P:debmonitor::server: switch to mod_proxy_uwsgi" [puppet] - 10https://gerrit.wikimedia.org/r/679460 [10:16:34] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/679740 (https://phabricator.wikimedia.org/T279342) (owner: 10Elukey) [10:17:15] (03CR) 10Jbond: [C: 03+2] Revert "P:debmonitor::server: switch to mod_proxy_uwsgi" [puppet] - 10https://gerrit.wikimedia.org/r/679460 (owner: 10Jbond) [10:18:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,swagger_check_mathoid_cluster_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:18:18] (03CR) 10Elukey: [C: 03+2] Add kafka-logging100{2,3} IPs to the kafka term of analytics filters [homer/public] - 10https://gerrit.wikimedia.org/r/679740 (https://phabricator.wikimedia.org/T279342) (owner: 10Elukey) [10:19:03] (03PS1) 10Jbond: P:debmonitor::server: switch to mod_proxy_uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/679461 [10:19:24] (03CR) 10Filippo Giunchedi: "LGTM overall, comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond) [10:20:19] PROBLEM - debmonitor.discovery.wmnet:443 internal on debmonitor2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 502 Proxy Error https://wikitech.wikimedia.org/wiki/Debmonitor [10:20:25] (03PS3) 10JMeybohm: Setup new etcd3 cluster on conf200[456] in codfw [puppet] - 10https://gerrit.wikimedia.org/r/679727 (https://phabricator.wikimedia.org/T271573) [10:20:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within 
thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:21:08] !Log Add kafka-logging100{2,3} to the kafka term in the analytics filters on cr1/cr2 eqiad - ref: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/679740 [10:21:14] !log Add kafka-logging100{2,3} to the kafka term in the analytics filters on cr1/cr2 eqiad - ref: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/679740 [10:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:05] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . [10:24:05] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [10:24:05] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [10:24:11] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, likely ok since AFAIK with systemd::timer::job there are no emails by default (?)" [puppet] - 10https://gerrit.wikimedia.org/r/679553 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [10:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:31] (03CR) 10JMeybohm: Setup new etcd3 cluster on conf200[456] in codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679727 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [10:26:38] (03PS2) 10Jbond: P:debmonitor::server: switch to mod_proxy_uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/679461 [10:27:17] jayme: nice! 
(new etcd cluster) [10:27:27] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add key for _etcd-server-ssl._tcp.v3.codfw.wmnet.key [labs/private] - 10https://gerrit.wikimedia.org/r/679734 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [10:27:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29058/console" [puppet] - 10https://gerrit.wikimedia.org/r/679461 (owner: 10Jbond) [10:28:27] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitor::server: switch to mod_proxy_uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/679461 (owner: 10Jbond) [10:31:49] (03PS1) 10Awight: admin: add awight to graphite-admins [puppet] - 10https://gerrit.wikimedia.org/r/679747 (https://phabricator.wikimedia.org/T280242) [10:34:30] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "The certificate is ok too" [puppet] - 10https://gerrit.wikimedia.org/r/679727 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [10:34:51] (03PS1) 10JMeybohm: htpasswd(): salt must be 8 characters [labs/private] - 10https://gerrit.wikimedia.org/r/679748 (https://phabricator.wikimedia.org/T271573) [10:35:19] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] htpasswd(): salt must be 8 characters [labs/private] - 10https://gerrit.wikimedia.org/r/679748 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [10:38:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/679716 (https://phabricator.wikimedia.org/T280225) (owner: 10Arturo Borrero Gonzalez) [10:39:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add SRV records for new etcd3 cluster in codfw [dns] - 10https://gerrit.wikimedia.org/r/679731 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [10:43:17] (03PS6) 10Jbond: base::firewall: add switch to use separate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414) [10:43:40] (03PS4) 10Jbond: hiera - sretest: test sending ulog to separate file 
[puppet] - 10https://gerrit.wikimedia.org/r/679392 (https://phabricator.wikimedia.org/T238414) [10:44:19] (03PS5) 10Jbond: hiera - sretest: test sending ulog to separate file [puppet] - 10https://gerrit.wikimedia.org/r/679392 (https://phabricator.wikimedia.org/T238414) [10:45:07] RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor2002 is OK: HTTP OK: Status line output matched HTTP/1.1 200 - 663 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [10:48:49] (03CR) 10Jbond: [C: 03+2] "thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond) [10:48:57] (03CR) 10Jbond: [C: 03+2] hiera - sretest: test sending ulog to separate file [puppet] - 10https://gerrit.wikimedia.org/r/679392 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond) [10:53:41] (03PS4) 10JMeybohm: Setup new etcd3 cluster on conf200[456] in codfw [puppet] - 10https://gerrit.wikimedia.org/r/679727 (https://phabricator.wikimedia.org/T271573) [10:53:46] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [10:53:46] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [10:53:46] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . 
[10:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:16] (03CR) 10jerkins-bot: [V: 04-1] Setup new etcd3 cluster on conf200[456] in codfw [puppet] - 10https://gerrit.wikimedia.org/r/679727 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [10:54:25] (03PS1) 10Jbond: P:base::firewall::log: set separate_file: true in production by default [puppet] - 10https://gerrit.wikimedia.org/r/679756 (https://phabricator.wikimedia.org/T238414) [10:54:43] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29061/console" [puppet] - 10https://gerrit.wikimedia.org/r/679727 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [10:55:06] (03CR) 10Ayounsi: [C: 03+1] cr/firewall: dro IRCD exception (kraz) [homer/public] - 10https://gerrit.wikimedia.org/r/679716 (https://phabricator.wikimedia.org/T280225) (owner: 10Arturo Borrero Gonzalez) [10:55:19] (03CR) 10Jbond: [C: 03+2] P:base::firewall::log: set separate_file: true in production by default [puppet] - 10https://gerrit.wikimedia.org/r/679756 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond) [10:56:10] (03PS5) 10JMeybohm: Setup new etcd3 cluster on conf200[456] in codfw [puppet] - 10https://gerrit.wikimedia.org/r/679727 (https://phabricator.wikimedia.org/T271573) [10:57:13] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29062/console" [puppet] - 10https://gerrit.wikimedia.org/r/679727 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [10:57:48] 10SRE, 10observability, 10Patch-For-Review, 10User-jbond: Write ulogd logs to a dedicated logfile - https://phabricator.wikimedia.org/T238414 (10jbond) 05Open→03Resolved a:03jbond 
I have updated puppet on production so that ulogd log entries are redirected to `/var/log/ulogd/syslog.log` please re-ope... [10:58:19] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [10:59:32] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: name=wtp103[7-9].eqiad.wmnet [10:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, Lucas_WMDE, apergos, and duesen: May I have your attention please! [[Backport windows|EU Backport and Config training]]
''''''. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210415T1100) [11:00:26] hm those tags are a bit odd [11:00:31] o/ [11:01:03] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [11:01:13] there are no patches in the window and no one in the google meet for the training, I'll give the google meet another 10 minutes and then close it up if no one shows [11:01:15] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1040.eqiad.wmnet', 'wtp1041.eqiad.wmnet', 'wtp1042.eqia... [11:01:41] apergos: https://phabricator.wikimedia.org/T279391 [11:02:09] ah the bot change from the deployment calendar change, gtk [11:02:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100) (owner: 10Dzahn) [11:03:57] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Setup new etcd3 cluster on conf200[456] in codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679727 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [11:03:59] (03CR) 10JMeybohm: [C: 03+2] Add SRV records for new etcd3 cluster in codfw [dns] - 10https://gerrit.wikimedia.org/r/679731 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [11:07:39] (03PS6) 10JMeybohm: Setup new etcd3 cluster on conf200[456] in codfw [puppet] - 10https://gerrit.wikimedia.org/r/679727 (https://phabricator.wikimedia.org/T271573) [11:08:52] (03CR) 10JMeybohm: Setup new etcd3 cluster on conf200[456] in codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679727 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [11:11:58] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cr/firewall: dro IRCD exception (kraz) 
[homer/public] - 10https://gerrit.wikimedia.org/r/679716 (https://phabricator.wikimedia.org/T280225) (owner: 10Arturo Borrero Gonzalez) [11:12:02] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Setup new etcd3 cluster on conf200[456] in codfw [puppet] - 10https://gerrit.wikimedia.org/r/679727 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [11:12:31] RECOVERY - mediawiki-installation DSH group on wtp1039 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:13:42] welp. looks like no changes and no one showing up, so I'm leaving the google meet and wandering off [11:14:25] !log merging homer changes for cr-eqiad (T280225) [11:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:35] T280225: Cloud: drop NAT exception for IRCD - https://phabricator.wikimedia.org/T280225 [11:14:43] apergos: ok for me to do some deploys? [11:14:57] !log merging homer changes for cr-codfw (T280225) [11:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:10] self-serve? 
be my guest, don't forget to add yourself to the window for the record [11:15:23] ok, thanks :) [11:16:23] (03PS1) 10Urbanecm: Add *.jfklibrary.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679761 (https://phabricator.wikimedia.org/T279506) [11:16:33] (03CR) 10Urbanecm: [C: 03+2] Add *.jfklibrary.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679761 (https://phabricator.wikimedia.org/T279506) (owner: 10Urbanecm) [11:17:20] (03Merged) 10jenkins-bot: Add *.jfklibrary.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679761 (https://phabricator.wikimedia.org/T279506) (owner: 10Urbanecm) [11:17:39] (03PS1) 10Jbond: P:debmonitor::server: Add ability to support multiple CA's [puppet] - 10https://gerrit.wikimedia.org/r/679763 [11:18:17] (03CR) 10JMeybohm: [C: 03+2] Setup new etcd3 cluster on conf200[456] in codfw [puppet] - 10https://gerrit.wikimedia.org/r/679727 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [11:18:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29063/console" [puppet] - 10https://gerrit.wikimedia.org/r/679763 (owner: 10Jbond) [11:19:39] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitor::server: Add ability to support multiple CA's [puppet] - 10https://gerrit.wikimedia.org/r/679763 (owner: 10Jbond) [11:19:57] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 6748a7f28ae9c71b3461b830f8323ded0a024a8e: Add *.jfklibrary.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T279506) (duration: 01m 51s) [11:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:07] T279506: Add https://www.jfklibrary.org/ to the wgCopyUploadsDomains allowlist of Wikimedia Commons - 
https://phabricator.wikimedia.org/T279506 [11:23:22] RECOVERY - mediawiki-installation DSH group on wtp1037 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:25:19] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on restbase-dev1004.eqiad.wmnet with reason: restarting for kernel update [11:25:20] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on restbase-dev1004.eqiad.wmnet with reason: restarting for kernel update [11:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:08] (03CR) 10Ema: trafficserver: comment about a server that won't exist anymore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679530 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [11:29:16] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1040.eqiad.wmnet with reason: REIMAGE [11:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:16] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1041.eqiad.wmnet with reason: REIMAGE [11:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1040.eqiad.wmnet with reason: REIMAGE [11:31:23] (03PS1) 10Muehlenhoff: Remove Python 2 packages on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/679768 (https://phabricator.wikimedia.org/T275873) [11:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:09] RECOVERY - mediawiki-installation DSH group on wtp1038 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:33:18] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1042.eqiad.wmnet with reason: REIMAGE [11:33:19] !log 
jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1041.eqiad.wmnet with reason: REIMAGE [11:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:20] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1042.eqiad.wmnet with reason: REIMAGE [11:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:10] 10SRE, 10netops: BGP: prioritize directly connected peers - https://phabricator.wikimedia.org/T280054 (10ayounsi) See all our local-pref in https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations#BGP_communities I took 300 as an example, even though we don't use PEER_INTERNAL. Instead I can use 280 for dir... [11:44:42] (03PS1) 10JMeybohm: Repackaging for buster [software/etcd-mirror] (debian) - 10https://gerrit.wikimedia.org/r/679770 (https://phabricator.wikimedia.org/T271573) [11:45:15] !log restarting restbase1016 for kernel update [11:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:26] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1016.eqiad.wmnet [11:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:02] 10SRE: Integrate Buster 10.9 point update - https://phabricator.wikimedia.org/T279054 (10MoritzMuehlenhoff) [11:46:18] (03CR) 10jerkins-bot: [V: 04-1] Repackaging for buster [software/etcd-mirror] (debian) - 10https://gerrit.wikimedia.org/r/679770 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [11:48:29] (03PS1) 10Jbond: P:debmonitor::client: add scafolding to be able to use cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/679772 [11:48:31] (03PS1) 10Jbond: P:debmonitor::client: enable cfssl certs on sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/679773 [11:49:09] PROBLEM - Prometheus jobs 
reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:51:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:51:25] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1016.eqiad.wmnet [11:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:53] (03PS2) 10Jbond: P:debmonitor::client: add scafolding to be able to use cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/679772 [11:52:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29065/console" [puppet] - 10https://gerrit.wikimedia.org/r/679772 (owner: 10Jbond) [11:55:38] (03PS3) 10Jbond: P:debmonitor::client: add scafolding to be able to use cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/679772 [11:56:08] (03PS2) 10Jbond: P:debmonitor::client: enable cfssl certs on sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/679773 [11:56:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29066/console" [puppet] - 10https://gerrit.wikimedia.org/r/679772 (owner: 10Jbond) [11:57:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29067/console" [puppet] - 10https://gerrit.wikimedia.org/r/679773 (owner: 10Jbond) [12:00:11] (03PS3) 10Jbond: P:debmonitor::client: enable cfssl certs on sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/679773 [12:00:15] (03CR) 10Jbond: [C: 03+1] "lgtm" (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/679768 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [12:01:47] 10SRE, 10SRE-Access-Requests, 10observability, 10Graphite, 10Patch-For-Review: Requesting access to graphite hosts for awight - https://phabricator.wikimedia.org/T280242 (10Tobi_WMDE_SW) Endorsing the request. [12:02:05] (03PS1) 10WMDE-Fisch: Add filtering for the suggested values combo box [extensions/VisualEditor] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679462 (https://phabricator.wikimedia.org/T271898) [12:07:32] (03CR) 10Elukey: [V: 03+1 C: 03+2] bigtop::mysql_jdbc: use component/libmysql-java for buster [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [12:09:11] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1017.eqiad.wmnet [12:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:53] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:12:04] !log redirect ns1 to authdns1001 [12:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:44] (03PS1) 10Filippo Giunchedi: thanos: strip internal prefix for /bucket/ explorer [puppet] - 10https://gerrit.wikimedia.org/r/679779 [12:15:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:17:10] (03PS4) 10Jbond: P:debmonitor::client: add scafolding to be able to use cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/679772 [12:17:16] (03PS1) 10Elukey: bigtop::mysql_jdbc: add general use case for jar_path [puppet] - 10https://gerrit.wikimedia.org/r/679780 [12:17:26] (03PS4) 10Jbond: 
P:debmonitor::client: enable cfssl certs on sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/679773 [12:17:41] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1017.eqiad.wmnet [12:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:10] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1040.eqiad.wmnet', 'wtp1041.eqiad.wmnet', 'wtp1042.eqiad.wmnet'] ` and were **ALL** successful. [12:18:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29070/console" [puppet] - 10https://gerrit.wikimedia.org/r/679773 (owner: 10Jbond) [12:18:32] 10SRE, 10netops: BGP: prioritize directly connected peers - https://phabricator.wikimedia.org/T280054 (10jbond) > Instead I can use 280 for directly connected, and 290 for PEER_INTERNAL. sounds good to me [12:18:57] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1075 is OK: HTTP OK: HTTP/1.0 200 OK - 23622 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:19:11] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29069/console" [puppet] - 10https://gerrit.wikimedia.org/r/679780 (owner: 10Elukey) [12:19:35] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: strip internal prefix for /bucket/ explorer [puppet] - 10https://gerrit.wikimedia.org/r/679779 (owner: 10Filippo Giunchedi) [12:19:51] (03CR) 10Elukey: [V: 03+1 C: 03+2] bigtop::mysql_jdbc: add general use case for jar_path [puppet] - 10https://gerrit.wikimedia.org/r/679780 (owner: 10Elukey) [12:20:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:22:07] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1018.eqiad.wmnet [12:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:39] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: name=wtp104[0-2].eqiad.wmnet [12:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:25:00] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1043.eqiad.wmnet', 'wtp1044.eqiad.wmnet', 'wtp1045.eqia... [12:25:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:27:34] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host authdns2001.wikimedia.org [12:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:14] (03CR) 10Reedy: [C: 04-1] "Needs adding to wmf-config/extension-list too" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: 10TTO) [12:28:41] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1018.eqiad.wmnet [12:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:46] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:30:50] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:31:18] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:32:08] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 93, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:32:12] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:32:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host authdns2001.wikimedia.org [12:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:48] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:36:12] RECOVERY - Thanos compact is halted on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [12:37:42] !log redirect ns0 to authdns2001 [12:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:38] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.02584 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [12:39:40] PROBLEM - AuthDNS-over-TLS Works on authdns1001 is CRITICAL: CRITICAL: ns[012] kdig DoTLS check failure https://wikitech.wikimedia.org/wiki/DNS [12:41:18] (03PS1) 10JMeybohm: Fix flake8 errors [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/679784 [12:43:10] RECOVERY - AuthDNS-over-TLS Works on authdns1001 is OK: OK: ns[012] kdig DoTLS check success https://wikitech.wikimedia.org/wiki/DNS [12:44:23] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host authdns1001.wikimedia.org [12:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:42] 10SRE, 10SRE-Access-Requests, 10observability, 10Graphite, 10Patch-For-Review: Requesting access to graphite hosts for awight - https://phabricator.wikimedia.org/T280242 (10lmata) Looks good to me, approved, thanks @MoritzMuehlenhoff ! 
[12:49:48] (03CR) 10Mforns: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/679390 (https://phabricator.wikimedia.org/T279046) (owner: 10Awight) [12:50:11] (03PS2) 10JMeybohm: Repackaging for buster [software/etcd-mirror] (debian) - 10https://gerrit.wikimedia.org/r/679770 (https://phabricator.wikimedia.org/T271573) [12:50:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host authdns1001.wikimedia.org [12:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:13] (03CR) 10jerkins-bot: [V: 04-1] Repackaging for buster [software/etcd-mirror] (debian) - 10https://gerrit.wikimedia.org/r/679770 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [12:52:32] (03PS1) 10Arturo Borrero Gonzalez: labstore: allow dumps NFS share in the wikicommunityhealth project [puppet] - 10https://gerrit.wikimedia.org/r/679790 (https://phabricator.wikimedia.org/T279558) [12:53:01] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1043.eqiad.wmnet with reason: REIMAGE [12:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labstore: allow dumps NFS share in the wikicommunityhealth project [puppet] - 10https://gerrit.wikimedia.org/r/679790 (https://phabricator.wikimedia.org/T279558) (owner: 10Arturo Borrero Gonzalez) [12:54:52] !log redirect ns2 to dns3001 [12:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:02] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1044.eqiad.wmnet with reason: REIMAGE [12:55:08] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1043.eqiad.wmnet with reason: REIMAGE [12:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:17] PROBLEM - 
cassandra-a SSL 10.64.0.101:7001 on restbase1019 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [12:56:17] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1019.eqiad.wmnet [12:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:02] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1045.eqiad.wmnet with reason: REIMAGE [12:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:11] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1044.eqiad.wmnet with reason: REIMAGE [12:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:09] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1045.eqiad.wmnet with reason: REIMAGE [12:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Cmjohnson) Dell has us on a wild goose hunt. Responded to their questions with the following: 1. Has this system ever had a PCI card in slot 1? If so, what card? No, the... 
[13:01:48] RECOVERY - cassandra-a SSL 10.64.0.101:7001 on restbase1019 is OK: SSL OK - Certificate restbase1019-a valid until 2023-04-14 11:20:29 +0000 (expires in 728 days) https://phabricator.wikimedia.org/T120662 [13:02:05] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1019.eqiad.wmnet [13:02:10] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host dns3002.wikimedia.org [13:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:30] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:02:55] (03PS1) 10David Caro: wmcs: Run black and isort [cookbooks] - 10https://gerrit.wikimedia.org/r/679791 [13:02:57] (03PS1) 10David Caro: wmcs: Refactor a bit the openstack commands [cookbooks] - 10https://gerrit.wikimedia.org/r/679792 [13:04:52] PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:05:56] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:07:12] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1020.eqiad.wmnet [13:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:20] RECOVERY - BFD status on cr3-esams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:08:20] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:11:05] (03CR) 10David Caro: "Tested locally by creating and deleting an etcd node from toolsbeta (created etcd-11, removed etcd-8)." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/679792 (owner: 10David Caro) [13:11:25] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns3002.wikimedia.org [13:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:45] (03PS4) 10David Caro: toolforge.etcdctl: Allow getting the cluster health [software/spicerack] - 10https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338) [13:12:33] (03CR) 10Volans: "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679768 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [13:12:52] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1100 - https://phabricator.wikimedia.org/T280132 (10Cmjohnson) ticket opened with Dell! You have successfully submitted request SR1057103007. [13:13:19] !log redirect ns2 to dns3002 [13:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:52] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1020.eqiad.wmnet [13:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:14] PROBLEM - AuthDNS-over-TLS Works on authdns1001 is CRITICAL: CRITICAL: ns[012] kdig DoTLS check failure https://wikitech.wikimedia.org/wiki/DNS [13:14:34] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10procurement: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10Cmjohnson) I need to be able to login to servers and run megacli commands as well as cat /proc/ [13:16:22] (03Abandoned) 10David Caro: ceph.common: pin any package from ceph repo to prio 1003 [puppet] - 10https://gerrit.wikimedia.org/r/677938 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [13:18:10] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1021.eqiad.wmnet [13:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:55] (03CR) 
10Muehlenhoff: Remove Python 2 packages on Bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679768 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [13:19:27] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host dns3001.wikimedia.org [13:19:28] (03CR) 10David Caro: toolforge.etcdctl: Allow getting the cluster health (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338) (owner: 10David Caro) [13:19:32] RECOVERY - AuthDNS-over-TLS Works on authdns1001 is OK: OK: ns[012] kdig DoTLS check success https://wikitech.wikimedia.org/wiki/DNS [13:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:56] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:23:47] (03CR) 10Volans: [C: 04-1] "LGTM, apart the spurious diff consider it a +1. Couple of optional things inline." 
(033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338) (owner: 10David Caro) [13:25:32] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 118 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:25:32] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Fix flake8 errors [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/679784 (owner: 10JMeybohm) [13:25:49] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1021.eqiad.wmnet [13:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:14] (03CR) 10JMeybohm: [C: 03+2] Fix flake8 errors [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/679784 (owner: 10JMeybohm) [13:27:42] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 12 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:32:36] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1022.eqiad.wmnet [13:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:02] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns3001.wikimedia.org [13:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:03] 10SRE, 10Release-Engineering-Team, 10SRE-Access-Requests: Requesting deployment access for HMonroy - https://phabricator.wikimedia.org/T280177 (10thcipriani) >>! In T280177#7002028, @fgiunchedi wrote: > Thank you @Dzahn ! We're indeed seeking approval from #release-engineering-team (cc @thcipriani perhaps?)... 
[13:35:18] (03CR) 10Andrew Bogott: [C: 03+1] wmcs.novafullstack: Point to a dedicated runbook [puppet] - 10https://gerrit.wikimedia.org/r/679721 (owner: 10David Caro) [13:36:13] (03PS5) 10Jbond: P:debmonitor::client: add scafolding to be able to use cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/679772 [13:36:15] (03PS5) 10Jbond: P:debmonitor::client: enable cfssl certs on sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/679773 [13:36:17] (03PS1) 10Jbond: P:pki::get_cert: Add new function to create certs and return there paths [puppet] - 10https://gerrit.wikimedia.org/r/679800 [13:37:55] (03CR) 10jerkins-bot: [V: 04-1] P:pki::get_cert: Add new function to create certs and return there paths [puppet] - 10https://gerrit.wikimedia.org/r/679800 (owner: 10Jbond) [13:38:55] (03CR) 10Filippo Giunchedi: "AFAICT the problem is having a list of hosts from two different ES clusters, not multiple hosts per-se. It looks like we should be able to" [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [13:40:56] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1022.eqiad.wmnet [13:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:34] (03PS6) 10Jbond: P:debmonitor::client: add scafolding to be able to use cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/679772 [13:41:59] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1043.eqiad.wmnet', 'wtp1044.eqiad.wmnet', 'wtp1045.eqiad.wmnet'] ` and were **ALL** successful. 
[13:42:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29071/console" [puppet] - 10https://gerrit.wikimedia.org/r/679772 (owner: 10Jbond) [13:42:43] (03PS2) 10Jbond: P:pki::get_cert: Add new function to create certs and return there paths [puppet] - 10https://gerrit.wikimedia.org/r/679800 [13:43:20] (03PS7) 10Jbond: P:debmonitor::client: add scafolding to be able to use cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/679772 [13:43:29] (03PS6) 10Jbond: P:debmonitor::client: enable cfssl certs on sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/679773 [13:45:22] (03CR) 10Jbond: [C: 03+1] Remove Python 2 packages on Bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679768 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [13:46:35] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host dns4002.wikimedia.org [13:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:40] (03PS8) 10Jbond: P:debmonitor::client: add scafolding to be able to use cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/679772 [13:47:56] (03PS7) 10Jbond: P:debmonitor::client: enable cfssl certs on sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/679773 [13:48:31] (03CR) 10Volans: "Question inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/679772 (owner: 10Jbond) [13:48:39] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1023.eqiad.wmnet [13:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:10] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:51:16] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:51:20] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - 
AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:52:10] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:52:16] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:52:20] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 98, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:52:35] (03PS8) 10Jbond: P:debmonitor::client: enable cfssl certs on sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/679773 [13:52:38] XioNoX: expected? ^^^ [13:52:50] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1236.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:53:51] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns4002.wikimedia.org [13:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:05] volans: yes Moritz is rebooting dns nodes [13:54:16] !log upgrading packages and mediawiki on wikitech-static [13:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29074/console" [puppet] - 10https://gerrit.wikimedia.org/r/679773 (owner: 10Jbond) [13:55:42] elukey: ack, thx [13:55:47] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1023.eqiad.wmnet [13:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:55] volans: yes [13:56:01] (03CR) 10Muehlenhoff: P:debmonitor::client: add scafolding to be able to use cfssl pki (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/679772 (owner: 10Jbond) [13:56:12] PROBLEM - Wikitech and wt-static content in sync on labweb1001 is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech-static https://wikitech.wikimedia.org/wiki/Wikitech-static [13:56:20] PROBLEM - Wikitech-static main page has content on cloudweb2001-dev is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2050 bytes in 0.128 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [13:56:27] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: name=wtp104[5-7].eqiad.wmnet [13:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:52] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2050 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [13:58:19] ACKNOWLEDGEMENT - Wikitech-static main page has content on cloudweb2001-dev is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2050 bytes in 0.139 second response time andrew bogott Im upgrading things https://wikitech.wikimedia.org/wiki/Wikitech-static [13:58:19] ACKNOWLEDGEMENT - Wikitech-static main page has content on labweb1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2050 bytes in 0.120 second response time andrew bogott Im upgrading things https://wikitech.wikimedia.org/wiki/Wikitech-static [13:59:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: Install pandoc [puppet] - 10https://gerrit.wikimedia.org/r/678259 (https://phabricator.wikimedia.org/T279787) (owner: 10Urbanecm) [14:00:40] !log ppchelko@deploy1002 Started deploy [restbase/deploy@4755f50]: T271983 [14:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:48] T271983: Add altwiki to RESTBase - https://phabricator.wikimedia.org/T271983 [14:03:06] 10SRE, 10serviceops, 10Parsoid (Tracking), 
10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1046.eqiad.wmnet', 'wtp1047.eqiad.wmnet', 'wtp1048.eqia... [14:03:55] (03CR) 10David Caro: [C: 03+2] wmcs.novafullstack: Point to a dedicated runbook [puppet] - 10https://gerrit.wikimedia.org/r/679721 (owner: 10David Caro) [14:05:42] RECOVERY - Wikitech-static main page has content on cloudweb2001-dev is OK: HTTP OK: HTTP/1.1 200 OK - 25481 bytes in 2.785 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [14:06:26] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 25479 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [14:07:14] 10SRE, 10Release-Engineering-Team, 10SRE-Access-Requests: Requesting deployment access for HMonroy - https://phabricator.wikimedia.org/T280177 (10marcella) Hiring manager approved! 
[14:08:42] (03PS5) 10David Caro: toolforge.etcdctl: Allow getting the cluster health [software/spicerack] - 10https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338) [14:08:44] (03CR) 10David Caro: toolforge.etcdctl: Allow getting the cluster health (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338) (owner: 10David Caro) [14:09:51] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1024.eqiad.wmnet [14:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:41] (03PS1) 10Filippo Giunchedi: admin: add hmonroy to deployment [puppet] - 10https://gerrit.wikimedia.org/r/679811 (https://phabricator.wikimedia.org/T280177) [14:11:55] !log ppchelko@deploy1002 Finished deploy [restbase/deploy@4755f50]: T271983 (duration: 11m 15s) [14:12:03] !log ppchelko@deploy1002 Started deploy [restbase/deploy@4755f50]: T271983, try again [14:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:09] T271983: Add altwiki to RESTBase - https://phabricator.wikimedia.org/T271983 [14:12:14] PROBLEM - Thanos compact is halted on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe2001 job=thanos-compact prometheus=ops site=codfw https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [14:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:36] thanos errors are known ^ [14:13:21] (03PS1) 10Giuseppe Lavagetto: Merge branch 'master' into debian [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/679812 [14:13:23] (03PS1) 10Giuseppe Lavagetto: New version for buster [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/679813 [14:13:25] (03CR) 10Volans: "It seems that the "pylint: no-member / Instance of 'list' has no 'difference_update' member (col 12)" is back :/" (031 comment) 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338) (owner: 10David Caro) [14:13:44] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:13:46] (03Abandoned) 10Giuseppe Lavagetto: Merge branch 'master' into debian [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/679812 (owner: 10Giuseppe Lavagetto) [14:14:05] (03Abandoned) 10Giuseppe Lavagetto: New version for buster [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/679813 (owner: 10Giuseppe Lavagetto) [14:14:36] (03CR) 10jerkins-bot: [V: 04-1] toolforge.etcdctl: Allow getting the cluster health [software/spicerack] - 10https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338) (owner: 10David Caro) [14:14:47] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338) (owner: 10David Caro) [14:14:56] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338) (owner: 10David Caro) [14:15:38] (03PS9) 10Jbond: P:debmonitor::client: add scafolding to be able to use cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/679772 [14:16:00] (03CR) 10Jbond: "Thanks see inline or ping me on IRC (may be easier)" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/679772 (owner: 10Jbond) [14:17:32] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1024.eqiad.wmnet [14:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:52] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host dns2001.wikimedia.org [14:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:48] !log ppchelko@deploy1002 Finished deploy 
[restbase/deploy@4755f50]: T271983, try again (duration: 07m 45s) [14:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:56] T271983: Add altwiki to RESTBase - https://phabricator.wikimedia.org/T271983 [14:22:19] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:22:41] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:23:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdnsrec site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:23:30] Pchelolo: sorry about that, I didn't realize I was stepping on your deploy window [14:23:37] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:23:55] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 93, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:23:56] apergos: I didn't have one, I've negotiated it with Hugh [14:24:08] ah! well still. hope it's sorted [14:24:09] so, no worries, all good. I'm done poking around [14:24:21] 👍 [14:24:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:24:35] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns2001.wikimedia.org [14:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:12] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host dns2002.wikimedia.org [14:27:17] (03CR) 10Volans: "reply inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679772 (owner: 10Jbond) [14:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:37] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:30:57] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:31:01] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:31:07] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1046.eqiad.wmnet with reason: REIMAGE [14:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:13] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:32:18] arturo: thanks for the +2, appreciated :) [14:32:22] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns2002.wikimedia.org [14:32:26] Urbanecm: <3 [14:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:39] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 66, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:32:43] RECOVERY - BGP status on cr2-codfw is OK: 
BGP OK - up: 93, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:33:09] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1047.eqiad.wmnet with reason: REIMAGE [14:33:11] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1046.eqiad.wmnet with reason: REIMAGE [14:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:50] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host dns5001.wikimedia.org [14:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:08] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1048.eqiad.wmnet with reason: REIMAGE [14:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:16] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1047.eqiad.wmnet with reason: REIMAGE [14:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:19] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1048.eqiad.wmnet with reason: REIMAGE [14:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:19] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:38:27] (03CR) 10Jdlrobson: "is this associated with the right phab ticket?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/679613 (https://phabricator.wikimedia.org/T277588) (owner: 10Nray) [14:40:13] (03Abandoned) 10JMeybohm: Repackaging for buster [software/etcd-mirror] (debian) - 10https://gerrit.wikimedia.org/r/679770 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [14:40:29] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 317, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:41:29] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-blazegraph.service,wdqs-updater.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:40] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns5001.wikimedia.org [14:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:17] (03CR) 10Muehlenhoff: P:debmonitor::client: add scafolding to be able to use cfssl pki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679772 (owner: 10Jbond) [14:43:16] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host dns5002.wikimedia.org [14:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs: Run black and isort [cookbooks] - 10https://gerrit.wikimedia.org/r/679791 (owner: 10David Caro) [14:46:43] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:46:53] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:47:18] !log imported etcd-mirror_0.0.5-1 to buster-wikimedia [14:47:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:25] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:47:47] RECOVERY - Thanos compact is halted on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [14:48:19] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 63, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:48:29] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:49:03] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 317, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:50:54] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns5002.wikimedia.org [14:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:21] (03CR) 10David Caro: [C: 03+2] toolforge.etcdctl: Allow getting the cluster health [software/spicerack] - 10https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338) (owner: 10David Caro) [14:51:31] (03CR) 10David Caro: [C: 03+2] wmcs: Run black and isort [cookbooks] - 10https://gerrit.wikimedia.org/r/679791 (owner: 10David Caro) [14:51:44] (03PS2) 10David Caro: wmcs: Run black and isort [cookbooks] - 10https://gerrit.wikimedia.org/r/679791 [14:53:59] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host dns1001.wikimedia.org [14:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:56] (03PS10) 10Jbond: P:debmonitor::client: add scafolding to be able to use cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/679772 [14:55:00] (03CR) 10Jbond: "thanks updated" (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/679772 (owner: 10Jbond) [14:55:12] (03PS9) 10Jbond: P:debmonitor::client: enable cfssl certs on sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/679773 [14:56:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29075/console" [puppet] - 10https://gerrit.wikimedia.org/r/679773 (owner: 10Jbond) [14:56:37] !log elukey@deploy1002 Started deploy [analytics/refinery@497f6a5]: Regular analytics weekly train [14:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge.etcdctl: Allow getting the cluster health [software/spicerack] - 10https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338) (owner: 10David Caro) [14:58:49] (03PS7) 10Cwhite: logstash: provision per-target apifeatureusage jobs [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394) [14:59:01] (03Merged) 10jenkins-bot: toolforge.etcdctl: Allow getting the cluster health [software/spicerack] - 10https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338) (owner: 10David Caro) [14:59:51] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns1001.wikimedia.org [14:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:32] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10procurement: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10RobH) [15:01:49] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10procurement: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10RobH) [15:01:51] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.496e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [15:02:05] (03CR) 
10Muehlenhoff: [C: 03+1] "Looks good, two typos inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/679772 (owner: 10Jbond) [15:02:16] (03PS11) 10Jbond: P:debmonitor::client: add scafolding to be able to use cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/679772 [15:02:22] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10procurement: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10RobH) [15:03:08] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host dns1002.wikimedia.org [15:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:11] (03CR) 10Jbond: "updated thanks" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/679772 (owner: 10Jbond) [15:04:31] (03PS10) 10Jbond: P:debmonitor::client: enable cfssl certs on sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/679773 [15:05:25] (03CR) 10Cwhite: [C: 03+1] Remove Python 2 packages on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/679768 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [15:05:58] 10SRE, 10Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (10Jdforrester-WMF) >>! In T280232#7002156, @Ladsgroup wrote: > Requests the bot made are not that bad but probably had a terrible regression in wmf.1... 
[15:06:20] (03CR) 10CDanis: [C: 03+1] Remove Python 2 packages on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/679768 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [15:07:10] (03CR) 10Muehlenhoff: [C: 03+1] P:debmonitor::client: add scafolding to be able to use cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/679772 (owner: 10Jbond) [15:07:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,pdnsrec,swagger_check_eventstreams_internal_cluster_codfw} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:09:34] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns1002.wikimedia.org [15:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:50] !log elukey@deploy1002 Finished deploy [analytics/refinery@497f6a5]: Regular analytics weekly train (duration: 13m 12s) [15:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:00] (03PS2) 10Ottomata: refine - lowercase eventlogging legeacy table names in include/exclude regexes [puppet] - 10https://gerrit.wikimedia.org/r/679376 (https://phabricator.wikimedia.org/T273789) [15:11:20] (03PS3) 10Ottomata: refine - lowercase eventlogging legeacy table names in include/exclude regexes [puppet] - 10https://gerrit.wikimedia.org/r/679376 (https://phabricator.wikimedia.org/T273789) [15:11:36] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/679376 [15:11:46] i'm going to test a couple of things before we merge that [15:11:52] (03CR) 10Volans: [C: 03+1] "LGTM, should be a noop on current hosts" [puppet] - 10https://gerrit.wikimedia.org/r/679772 (owner: 10Jbond) [15:11:53] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-blazegraph-exporter-wdqs-blazegraph.service,wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:12] (03PS4) 10Ottomata: refine - use 0.1.5 and lowercase eventlogging legeacy table names in include/exclude regexes [puppet] - 10https://gerrit.wikimedia.org/r/679376 (https://phabricator.wikimedia.org/T273789) [15:12:16] (03PS5) 10Bstorm: sonofgridengine: grid-configurator: introduce support for the new domain [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [15:12:56] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid-configurator: introduce support for the new domain [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [15:12:59] (03PS12) 10Jbond: P:debmonitor::client: add scafolding to be able to use cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/679772 [15:13:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:13:08] ottomata: maybe we could go in test first? [15:13:29] (03CR) 10jerkins-bot: [V: 04-1] refine - use 0.1.5 and lowercase eventlogging legeacy table names in include/exclude regexes [puppet] - 10https://gerrit.wikimedia.org/r/679376 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [15:13:46] elukey: good plan! [15:13:50] (03PS1) 10Ottomata: ProduceCanaryEvents - use refinery 0.1.5 [puppet] - 10https://gerrit.wikimedia.org/r/679825 [15:13:50] will make patch [15:13:52] (03CR) 10Bstorm: "Now that version passes locally. So I suspect it's failing from something else. Checking." 
[puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [15:14:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29076/console" [puppet] - 10https://gerrit.wikimedia.org/r/679772 (owner: 10Jbond) [15:15:14] (03PS1) 10Ottomata: test/refine - use 0.1.5 and lowercase table regexes [puppet] - 10https://gerrit.wikimedia.org/r/679846 (https://phabricator.wikimedia.org/T273789) [15:15:25] (03CR) 10Bstorm: "Yeah, this passes `tox -e sonofgridengine`. It fails flake8 😂" [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [15:16:05] (03PS1) 10Elukey: Decommission hue-next.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/679847 (https://phabricator.wikimedia.org/T280262) [15:16:34] (03CR) 10Ottomata: [C: 03+2] ProduceCanaryEvents - use refinery 0.1.5 [puppet] - 10https://gerrit.wikimedia.org/r/679825 (owner: 10Ottomata) [15:16:42] (03CR) 10Ottomata: [C: 03+2] test/refine - use 0.1.5 and lowercase table regexes [puppet] - 10https://gerrit.wikimedia.org/r/679846 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [15:17:00] (03CR) 10David Caro: wmcs: Run black and isort [cookbooks] - 10https://gerrit.wikimedia.org/r/679791 (owner: 10David Caro) [15:17:04] (03CR) 10David Caro: [C: 03+2] wmcs: Run black and isort [cookbooks] - 10https://gerrit.wikimedia.org/r/679791 (owner: 10David Caro) [15:18:36] (03CR) 10Majavah: [C: 04-1] "Please add this to wikimedia.cloud as well" [puppet] - 10https://gerrit.wikimedia.org/r/679522 (https://phabricator.wikimedia.org/T263872) (owner: 10Ahmon Dancy) [15:19:11] (03PS8) 10Cwhite: logstash: provision per-target apifeatureusage jobs [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394) [15:19:14] (03PS1) 10Ottomata: ttest/refine_sanitize - use refinery 0.1.5 
[puppet] - 10https://gerrit.wikimedia.org/r/679848 [15:19:38] (03Merged) 10jenkins-bot: wmcs: Run black and isort [cookbooks] - 10https://gerrit.wikimedia.org/r/679791 (owner: 10David Caro) [15:20:11] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1046.eqiad.wmnet', 'wtp1047.eqiad.wmnet', 'wtp1048.eqiad.wmnet'] ` and were **ALL** successful. [15:21:48] !log otto@deploy1002 Started deploy [analytics/refinery@497f6a5] (hadoop-test): (no justification provided) [15:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "Chatted on IRC, this certainly isn't ideal or long term but Good Enough™ for now" [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [15:24:33] (03CR) 10Cwhite: [C: 03+2] elasticsearch: curator remove stdout redirect [puppet] - 10https://gerrit.wikimedia.org/r/679553 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [15:26:19] (03CR) 10Cwhite: [C: 03+2] logstash: clean up apifeatureusage curator job [puppet] - 10https://gerrit.wikimedia.org/r/679525 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [15:26:32] !log otto@deploy1002 Finished deploy [analytics/refinery@497f6a5] (hadoop-test): (no justification provided) (duration: 04m 44s) [15:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:21] (03PS2) 10Ahmon Dancy: enable delay_messageblobstore_purge feature flag in beta scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/679522 (https://phabricator.wikimedia.org/T263872) [15:27:39] (03PS1) 10Muehlenhoff: Remove access for lexnasser [puppet] - 10https://gerrit.wikimedia.org/r/679854 [15:28:58] (03PS1) 10Ppchelko: Envoy: set per_try_timeout for eventgate-main. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/679855 (https://phabricator.wikimedia.org/T249745) [15:29:17] (03CR) 10Majavah: [C: 03+1] enable delay_messageblobstore_purge feature flag in beta scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/679522 (https://phabricator.wikimedia.org/T263872) (owner: 10Ahmon Dancy) [15:30:46] (03CR) 10Jbond: [C: 03+2] P:pki::get_cert: Add new function to create certs and return there paths [puppet] - 10https://gerrit.wikimedia.org/r/679800 (owner: 10Jbond) [15:30:51] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitor::client: add scafolding to be able to use cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/679772 (owner: 10Jbond) [15:30:59] (03PS11) 10Jbond: P:debmonitor::client: enable cfssl certs on sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/679773 [15:31:43] (03PS1) 10CDanis: Revert "prepend esams/knams" [homer/public] - 10https://gerrit.wikimedia.org/r/679833 [15:31:51] (03CR) 10Ottomata: [C: 03+1] "Interesting, +1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/679855 (https://phabricator.wikimedia.org/T249745) (owner: 10Ppchelko) [15:32:03] (03CR) 10Jbond: [C: 03+2] P:debmonitor::client: enable cfssl certs on sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/679773 (owner: 10Jbond) [15:32:13] 10SRE, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10crusnov) >>! In T271136#6999009, @crusnov wrote: >>>! In T271136#6999003, @elukey wrote: >> @crusnov if you have time let's do it this week or the next! > > Yes... 
[15:33:20] (03CR) 10Ppchelko: "> Patch Set 1: Code-Review+1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/679855 (https://phabricator.wikimedia.org/T249745) (owner: 10Ppchelko) [15:34:16] (03CR) 10Effie Mouzeli: [C: 03+2] enable delay_messageblobstore_purge feature flag in beta scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/679522 (https://phabricator.wikimedia.org/T263872) (owner: 10Ahmon Dancy) [15:37:13] (03PS2) 10Nray: Add mediawiki.pref_diff stream to wgEventLoggingStreamNames/wgEventStreams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679613 (https://phabricator.wikimedia.org/T261842) [15:37:15] (03PS1) 10Arturo Borrero Gonzalez: cr/firewall: add kafka-logging servers to labs-in filters [homer/public] - 10https://gerrit.wikimedia.org/r/679862 (https://phabricator.wikimedia.org/T279342) [15:37:18] (03PS2) 10Nray: Add $wgWMEVectorPrefDiffSalt to private/readme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679614 (https://phabricator.wikimedia.org/T261842) [15:37:55] (03CR) 10Ayounsi: [C: 03+1] Revert "prepend esams/knams" [homer/public] - 10https://gerrit.wikimedia.org/r/679833 (owner: 10CDanis) [15:39:05] (03PS1) 10Jcrespo: bacula: Exclude certain directories from mailman backups [puppet] - 10https://gerrit.wikimedia.org/r/679863 (https://phabricator.wikimedia.org/T279237) [15:39:07] (03CR) 10CDanis: [C: 03+2] Revert "prepend esams/knams" [homer/public] - 10https://gerrit.wikimedia.org/r/679833 (owner: 10CDanis) [15:39:20] (03PS2) 10Jcrespo: bacula: Exclude certain directories from mailman backups [puppet] - 10https://gerrit.wikimedia.org/r/679863 (https://phabricator.wikimedia.org/T279237) [15:39:42] (03PS3) 10Nray: Add mediawiki.pref_diff stream to wgEventLoggingStreamNames/wgEventStreams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679613 (https://phabricator.wikimedia.org/T261842) [15:39:44] (03PS3) 10Nray: Add $wgWMEVectorPrefDiffSalt to private/readme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679614 
(https://phabricator.wikimedia.org/T261842) [15:40:59] (03CR) 10Nray: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679613 (https://phabricator.wikimedia.org/T261842) (owner: 10Nray) [15:41:34] (03CR) 10RLazarus: [C: 03+1] bacula: Exclude certain directories from mailman backups [puppet] - 10https://gerrit.wikimedia.org/r/679863 (https://phabricator.wikimedia.org/T279237) (owner: 10Jcrespo) [15:41:46] (03CR) 10Jcrespo: [C: 03+2] bacula: Exclude certain directories from mailman backups [puppet] - 10https://gerrit.wikimedia.org/r/679863 (https://phabricator.wikimedia.org/T279237) (owner: 10Jcrespo) [15:42:27] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10jijiki) [15:42:31] (03PS8) 10Legoktm: Add shellbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 [15:43:09] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10jijiki) [15:43:28] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10jijiki) 05Open→03Resolved a:03jijiki [15:43:57] (03CR) 10jerkins-bot: [V: 04-1] Add shellbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 (owner: 10Legoktm) [15:45:05] (03PS4) 10Nray: Add mediawiki.pref_diff stream to wgEventLoggingStreamNames/wgEventStreams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679613 (https://phabricator.wikimedia.org/T261842) [15:45:07] (03PS4) 10Nray: Add $wgWMEVectorPrefDiffSalt to private/readme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679614 (https://phabricator.wikimedia.org/T261842) [15:47:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,netbox_device_statistics} site={codfw,eqiad} 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:47:53] ottomata: one question about the ad blocker issue: why exactly are requests to intake-analytics.wikimedia.org intercepted by ad blockers? because of different domain or because of different datacenter? [15:48:27] mforns: different domain iiuc [15:48:35] ad blockers just put that domain name in their list of things to block [15:49:06] ottomata: is that a deny-list or an allow-list? [15:49:14] i think we could solve this problem by adding some kind of internal routing from the wiki domain to intake-analytics aka eventgate-analytics-external [15:49:20] I thought *.wikipedia.org was allow-listed in ad blockers [15:49:27] but *.wikimedia.org was not [15:49:32] i think a deny list? [15:49:37] oh [15:49:40] i don't know, bearloga i think found it? [15:49:54] mforns: people who make ad blocker lists are incredibly suspicious of anything with 'analytics' in the name [15:49:56] bearloga: you had a link to some adblocker code that blocked intake-analytics, right? 
[15:50:06] cdanis: I see [15:51:36] (03PS1) 10Dave Pifke: varnish: add anti-FLoC header to responses [puppet] - 10https://gerrit.wikimedia.org/r/679866 (https://phabricator.wikimedia.org/T279804) [15:55:31] (03PS1) 10Jbond: P:debmonitor::server: add Wikimedia Root CA [puppet] - 10https://gerrit.wikimedia.org/r/679868 [15:56:23] RECOVERY - Wikitech and wt-static content in sync on labweb1001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (51186 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [15:56:45] (03PS5) 10Nray: Add $wgWMEVectorPrefDiffSalt to private/readme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679614 (https://phabricator.wikimedia.org/T261842) [15:56:57] (03CR) 10Jbond: [C: 03+2] P:debmonitor::server: add Wikimedia Root CA [puppet] - 10https://gerrit.wikimedia.org/r/679868 (owner: 10Jbond) [15:58:07] (03PS1) 10Jcrespo: bacula: Exclude the whole list of backups, as defined on profile::lists [puppet] - 10https://gerrit.wikimedia.org/r/679869 (https://phabricator.wikimedia.org/T279237) [15:58:23] (03PS2) 10Jcrespo: bacula: Exclude the whole list of backups, as defined on profile::lists [puppet] - 10https://gerrit.wikimedia.org/r/679869 (https://phabricator.wikimedia.org/T279237) [15:59:03] (03PS1) 10Arturo Borrero Gonzalez: cinderutils: ensure: fix path to helper script [puppet] - 10https://gerrit.wikimedia.org/r/679870 [15:59:38] (03CR) 10jerkins-bot: [V: 04-1] bacula: Exclude the whole list of backups, as defined on profile::lists [puppet] - 10https://gerrit.wikimedia.org/r/679869 (https://phabricator.wikimedia.org/T279237) (owner: 10Jcrespo) [15:59:50] (03CR) 10Bstorm: [C: 03+1] cinderutils: ensure: fix path to helper script [puppet] - 10https://gerrit.wikimedia.org/r/679870 (owner: 10Arturo Borrero Gonzalez) [16:00:02] 10SRE, 10Privacy Engineering, 10Traffic, 10Patch-For-Review, 10Privacy: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC) - 
https://phabricator.wikimedia.org/T279804 (10dpifke) [16:00:04] jbond42 and cdanis: My dear minions, it's time we take the moon! Just kidding. Time for [[Puppet request window]] deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210415T1600). [16:00:25] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:00:35] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:00:39] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1008.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [16:01:11] PROBLEM - WDQS SPARQL on wdqs1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:01:27] (03PS3) 10Jcrespo: bacula: Exclude the whole list of backups, as defined on profile::lists [puppet] - 10https://gerrit.wikimedia.org/r/679869 (https://phabricator.wikimedia.org/T279237) [16:01:35] (03PS4) 10Jcrespo: bacula: Exclude the whole list of backups, as defined on profile::lists [puppet] - 10https://gerrit.wikimedia.org/r/679869 (https://phabricator.wikimedia.org/T279237) [16:02:39] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1008.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [16:02:59] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1025.eqiad.wmnet [16:02:59] (03PS3) 10Ryan Kemper: wdqs: result is list not dictionary [cookbooks] - 10https://gerrit.wikimedia.org/r/679702 (https://phabricator.wikimedia.org/T280108) [16:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:29] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:03:41] RECOVERY - PyBal backends 
health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:04:13] RECOVERY - WDQS SPARQL on wdqs1008 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.061 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:04:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cinderutils: ensure: fix path to helper script [puppet] - 10https://gerrit.wikimedia.org/r/679870 (owner: 10Arturo Borrero Gonzalez) [16:04:49] (03CR) 10Jcrespo: "I am running puppet catalog compiler, as I don't trust my syntax." [puppet] - 10https://gerrit.wikimedia.org/r/679869 (https://phabricator.wikimedia.org/T279237) (owner: 10Jcrespo) [16:06:17] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:08:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] gridengine: set grid-configurator source files to use new domain name [puppet] - 10https://gerrit.wikimedia.org/r/678043 (https://phabricator.wikimedia.org/T277653) (owner: 10Bstorm) [16:08:18] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1025.eqiad.wmnet [16:08:23] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:52] (03CR) 10Elukey: "Arturo: I just realized that you probably also need the codfw servers too, kafka-logging200{1,2,3}, and deploy the changes to also cr1/2-c" [homer/public] - 10https://gerrit.wikimedia.org/r/679862 (https://phabricator.wikimedia.org/T279342) (owner: 10Arturo Borrero Gonzalez) [16:10:13] (03PS4) 10Ryan Kemper: wdqs: result is list not dictionary [cookbooks] - 10https://gerrit.wikimedia.org/r/679702 (https://phabricator.wikimedia.org/T280108) [16:11:23] (03CR) 10Ryan Kemper: [C: 03+1] wdqs: result is list not dictionary 
[cookbooks] - 10https://gerrit.wikimedia.org/r/679702 (https://phabricator.wikimedia.org/T280108) (owner: 10Ryan Kemper) [16:12:07] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1026.eqiad.wmnet [16:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:14] (03PS5) 10Jcrespo: bacula: Exclude the whole list of backups, as defined on profile::lists [puppet] - 10https://gerrit.wikimedia.org/r/679869 (https://phabricator.wikimedia.org/T279237) [16:13:14] 10SRE, 10DC-Ops, 10SRE-tools, 10Sustainability (Incident Followup): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10LSobanski) @Volans Both the short and long term actions make sense to me, thanks for... [16:13:20] (03PS6) 10Jcrespo: bacula: Exclude the whole list of backups, as defined on profile::lists [puppet] - 10https://gerrit.wikimedia.org/r/679869 (https://phabricator.wikimedia.org/T279237) [16:15:36] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: result is list not dictionary [cookbooks] - 10https://gerrit.wikimedia.org/r/679702 (https://phabricator.wikimedia.org/T280108) (owner: 10Ryan Kemper) [16:17:13] !log T280108 T267927 Merged https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/679702 and ran puppet-agent on `cumin2001` before next round of wdqs `data-transfer`s [16:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:23] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [16:17:23] T280108: WDQS data-transfer cookbook needs to wait for updater to catchup on lag - https://phabricator.wikimedia.org/T280108 [16:17:39] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [16:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:52] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for 
host restbase1026.eqiad.wmnet [16:17:53] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [16:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:42] (03CR) 10Jcrespo: [C: 03+1] "Puppet compiler:" [puppet] - 10https://gerrit.wikimedia.org/r/679869 (https://phabricator.wikimedia.org/T279237) (owner: 10Jcrespo) [16:18:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:20:17] (03PS7) 10Jcrespo: bacula: Exclude the whole list of backups, as defined on profile::lists [puppet] - 10https://gerrit.wikimedia.org/r/679869 (https://phabricator.wikimedia.org/T279237) [16:21:16] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1027.eqiad.wmnet [16:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:44] !log T280108 T267927 Current wdqs transfers in progress: `wqds1004`->`wdqs1005`, `wdqs2008`->`wdqs2001` [16:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:45] PROBLEM - Thanos compact is halted on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe2001 job=thanos-compact prometheus=ops site=codfw https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [16:23:58] (03PS2) 10Jbond: puppetdb::app Add GC parameters to ro-host [puppet] - 10https://gerrit.wikimedia.org/r/674004 [16:24:30] (03CR) 10jerkins-bot: [V: 04-1] puppetdb::app Add GC parameters to ro-host [puppet] - 10https://gerrit.wikimedia.org/r/674004 (owner: 10Jbond) [16:25:00] (03Abandoned) 10Jbond: puppetdb::app Add GC parameters to ro-host [puppet] - 10https://gerrit.wikimedia.org/r/674004 
(owner: 10Jbond) [16:26:42] (03PS6) 10Nray: Add $wgWMEVectorPrefDiffSalt to private/readme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679614 (https://phabricator.wikimedia.org/T261842) [16:26:52] (03CR) 10Ottomata: [C: 03+2] ttest/refine_sanitize - use refinery 0.1.5 [puppet] - 10https://gerrit.wikimedia.org/r/679848 (owner: 10Ottomata) [16:27:28] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1027.eqiad.wmnet [16:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:57] (03CR) 10RLazarus: [C: 03+1] bacula: Exclude the whole list of backups, as defined on profile::lists [puppet] - 10https://gerrit.wikimedia.org/r/679869 (https://phabricator.wikimedia.org/T279237) (owner: 10Jcrespo) [16:30:16] (03PS5) 10Ottomata: refine - use 0.1.5 and lowercase table names in include/exclude regexes [puppet] - 10https://gerrit.wikimedia.org/r/679376 (https://phabricator.wikimedia.org/T273789) [16:30:19] (03PS6) 10Ottomata: refine - use 0.1.5 and lowercase table names in include/exclude regexes [puppet] - 10https://gerrit.wikimedia.org/r/679376 (https://phabricator.wikimedia.org/T273789) [16:31:15] (03CR) 10Elukey: "elukey@flerovium:~$ sudo lsblk -i" [puppet] - 10https://gerrit.wikimedia.org/r/679607 (https://phabricator.wikimedia.org/T278421) (owner: 10Razzi) [16:31:19] (03PS1) 10Herron: kafka-logging: migrake broker logstash1012 to kafka-logging1003 [puppet] - 10https://gerrit.wikimedia.org/r/679875 (https://phabricator.wikimedia.org/T279342) [16:32:02] (03PS1) 10Herron: kafka-logging1003: disable notifications during setup [puppet] - 10https://gerrit.wikimedia.org/r/679876 [16:33:10] (03PS1) 10Herron: Revert "kafka-logging1002: disable notifications during setup" [puppet] - 10https://gerrit.wikimedia.org/r/679836 [16:33:19] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes 
https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:34:29] (03PS1) 10BryanDavis: toolforge: opt-out of Google FLoC tracking [puppet] - 10https://gerrit.wikimedia.org/r/679877 (https://phabricator.wikimedia.org/T279804) [16:34:31] (03PS1) 10BryanDavis: cloud-vps: opt-out of Google FLoC tracking [puppet] - 10https://gerrit.wikimedia.org/r/679878 (https://phabricator.wikimedia.org/T279804) [16:34:36] (03CR) 10Herron: [C: 03+2] Revert "kafka-logging1002: disable notifications during setup" [puppet] - 10https://gerrit.wikimedia.org/r/679836 (owner: 10Herron) [16:34:43] oh dear that's me [16:34:55] !log crusnov@cumin1001 START - Cookbook sre.dns.netbox [16:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:24] (03CR) 10Elukey: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/603961" [puppet] - 10https://gerrit.wikimedia.org/r/679607 (https://phabricator.wikimedia.org/T278421) (owner: 10Razzi) [16:35:35] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps: opt-out of Google FLoC tracking [puppet] - 10https://gerrit.wikimedia.org/r/679878 (https://phabricator.wikimedia.org/T279804) (owner: 10BryanDavis) [16:35:39] (03CR) 10Herron: [C: 03+2] kafka-logging1003: disable notifications during setup [puppet] - 10https://gerrit.wikimedia.org/r/679876 (owner: 10Herron) [16:35:47] (03PS2) 10Herron: kafka-logging1003: disable notifications during setup [puppet] - 10https://gerrit.wikimedia.org/r/679876 [16:35:50] (03CR) 10Jcrespo: [C: 03+2] bacula: Exclude the whole list of backups, as defined on profile::lists [puppet] - 10https://gerrit.wikimedia.org/r/679869 (https://phabricator.wikimedia.org/T279237) (owner: 10Jcrespo) [16:37:20] (03PS2) 10BryanDavis: cloud-vps: opt-out of Google FLoC tracking [puppet] - 10https://gerrit.wikimedia.org/r/679878 (https://phabricator.wikimedia.org/T279804) [16:37:57] (03CR) 10Elukey: [C: 03+1] refine - use 0.1.5 and lowercase table names in include/exclude regexes [puppet] - 
10https://gerrit.wikimedia.org/r/679376 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [16:38:11] (03CR) 10Ottomata: [C: 03+2] refine - use 0.1.5 and lowercase table names in include/exclude regexes [puppet] - 10https://gerrit.wikimedia.org/r/679376 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [16:40:31] 10SRE, 10Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (10Legoktm) >>! In T280232#7002156, @Ladsgroup wrote: > Requests the bot made are not that bad but probably had a terrible regression in wmf.1 (action=... [16:41:12] chaomodus: I don't see anything weird in kafka-main200x metrics and logs! [16:41:17] seems good [16:41:23] are all the AAAA records added? [16:41:56] elukey: I made a mistake about deploying them, so main2001 is getting deployed right now [16:42:26] !log crusnov@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:34] elukey: there we go :) [16:45:28] 10SRE, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (10KFrancis) @Manuel Thanks! I'll send the document for e-signatures as soon as it's approved. [16:46:39] 10SRE, 10ops-eqiad, 10DC-Ops: ps1-a7-eqiad power over threshold alerts - https://phabricator.wikimedia.org/T276743 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr I'm going to reassign this over to @Jclark-ctr, since he's working on refreshing some mw servers, which will the @Dzahn and the Service-Ops team the... [16:47:57] (03CR) 10Elukey: [C: 03+2] Decommission hue-next.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/679847 (https://phabricator.wikimedia.org/T280262) (owner: 10Elukey) [16:48:23] chaomodus: ah ack! 
will recheck [16:50:40] (03PS1) 10Jbond: P:pki::multirootca: add nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/679890 [16:56:16] (03PS2) 10Jbond: P:pki::multirootca: add nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/679890 [16:57:18] 10SRE, 10Analytics, 10netops: Audit analytics firewall filters - https://phabricator.wikimedia.org/T279429 (10fdans) a:03razzi [16:59:58] (03PS3) 10Jbond: P:pki::multirootca: add nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/679890 [17:00:04] chrisalbon and accraze: That opportune time is upon us again. Time for a [[mw:Services|Services]] – [[mw:Extension:Graph|Graphoid]] / [[ORES]] deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210415T1700). [17:00:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29081/console" [puppet] - 10https://gerrit.wikimedia.org/r/679890 (owner: 10Jbond) [17:01:17] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:30] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::multirootca: add nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/679890 (owner: 10Jbond) [17:04:21] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:49] (03PS1) 10Elukey: Decommission analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/679892 (https://phabricator.wikimedia.org/T280262) [17:09:13] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:02] (03PS1) 10Jbond: P:pki::multirootca: fix sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/679894 [17:13:22] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS 
(DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29082/console" [puppet] - 10https://gerrit.wikimedia.org/r/679892 (https://phabricator.wikimedia.org/T280262) (owner: 10Elukey) [17:14:41] (03CR) 10CRusnov: "> Patch Set 1: Code-Review+1" (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/679403 (owner: 10CRusnov) [17:17:23] (03PS2) 10Elukey: Decommission analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/679892 (https://phabricator.wikimedia.org/T280262) [17:17:28] (03PS2) 10Jbond: P:pki::multirootca: fix sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/679894 [17:17:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:18:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:44] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29084/console" [puppet] - 10https://gerrit.wikimedia.org/r/679892 (https://phabricator.wikimedia.org/T280262) (owner: 10Elukey) [17:20:35] (03CR) 10Jforrester: [C: 03+1] Fix error message if MWScript.php is run without arguments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679517 (owner: 10Ahmon Dancy) [17:22:27] (03CR) 10Bstorm: [C: 03+1] "If nobody objects in a bit, I'll merge it." 
[puppet] - 10https://gerrit.wikimedia.org/r/679878 (https://phabricator.wikimedia.org/T279804) (owner: 10BryanDavis) [17:23:46] (03CR) 10Jbond: [C: 03+2] P:pki::multirootca: fix sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/679894 (owner: 10Jbond) [17:27:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson) [17:28:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson) [17:29:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson) mc1041-50 netbox and network ports updated have been completed, need to go on-site and setup idrac [17:30:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:35:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:38:08] (03PS1) 10Jbond: P:pki::multirootca: fix sudo script [puppet] - 10https://gerrit.wikimedia.org/r/679897 [17:40:09] (03CR) 10Jbond: [C: 03+2] P:pki::multirootca: fix sudo script [puppet] - 10https://gerrit.wikimedia.org/r/679897 (owner: 10Jbond) [17:45:08] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:47:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics 
site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:49:38] (03CR) 10Volans: [C: 03+1] "> Patch Set 1:" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/679403 (owner: 10CRusnov) [17:51:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:53:36] (03CR) 10Elukey: [C: 03+1] "Let's ping people in #wikimedia-sre before merging" [puppet] - 10https://gerrit.wikimedia.org/r/670013 (https://phabricator.wikimedia.org/T276868) (owner: 10Razzi) [18:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy [[Backport windows|Morning backport window]]
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210415T1800). Please do the needful. [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:01:59] o/ I'm here for the backport window [18:03:06] 10SRE, 10Privacy Engineering, 10Traffic, 10Patch-For-Review, 10Privacy: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC) - https://phabricator.wikimedia.org/T279804 (10ori) Is the header needed at all? https://github.com/WICG/floc/issues/45#issuecomment-781042491 says: >... [18:07:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:08:33] is anyone available for the backport window? [18:15:45] nray: I can deploy the config changes [18:16:32] thanks jan_drewniak [18:17:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:19:31] nray: wait... sorry something is messed up with my ssh config right now [18:20:36] jan_drewniak its okay, I might be able to make the deploy window 5 hours from now [18:20:56] thank you [18:21:18] Oh I missed an email about updating bast1002... [18:23:41] nray: ok, I'm in :P [18:23:48] nice! 
[18:24:14] (03CR) 10Jdrewniak: [C: 03+2] Add mediawiki.pref_diff stream to wgEventLoggingStreamNames/wgEventStreams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679613 (https://phabricator.wikimedia.org/T261842) (owner: 10Nray) [18:25:15] (03Merged) 10jenkins-bot: Add mediawiki.pref_diff stream to wgEventLoggingStreamNames/wgEventStreams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679613 (https://phabricator.wikimedia.org/T261842) (owner: 10Nray) [18:26:48] nray: is this check-able on mwdebug? [18:26:54] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T275511 (10RobH) 05Open→03Resolved [18:27:19] jan_drewniak I would like to check if opt in opt outs still work [18:27:33] ok, it's up then [18:27:37] what server? [18:27:46] mwdebug1002 [18:27:50] thank you, checking now [18:29:42] btw I"m doing this one first https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/679613/ [18:29:43] jan_drewniak: looks good to me, you can proceed! [18:30:08] jan_drewniak: I understand [18:30:27] (03CR) 10Razzi: [C: 03+2] motd: Use heredoc to allow expanding description with apostrophe [puppet] - 10https://gerrit.wikimedia.org/r/670013 (https://phabricator.wikimedia.org/T276868) (owner: 10Razzi) [18:32:01] !log jdrewniak@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:679613|Add mediawiki.pref_diff stream to wgEventLoggingStreamNames/wgEventStreams (T261842)]] (duration: 01m 18s) [18:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:10] T261842: Create schema to track users opting in/out of desktop improvements - https://phabricator.wikimedia.org/T261842 [18:32:49] nray: oky doke, that's one down. [18:32:56] nice! 
[18:34:59] (03CR) 10Jdrewniak: [C: 03+2] Add $wgWMEVectorPrefDiffSalt to private/readme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679614 (https://phabricator.wikimedia.org/T261842) (owner: 10Nray) [18:35:37] I'll test the readme one the same way [18:35:48] (03Merged) 10jenkins-bot: Add $wgWMEVectorPrefDiffSalt to private/readme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679614 (https://phabricator.wikimedia.org/T261842) (owner: 10Nray) [18:36:49] nray: do you want the second one on mwdebug? [18:36:59] I mean, it's already there :P [18:37:07] yes please, just as a check [18:37:16] Okay, I will check now [18:38:25] jan_drewniak: that one looks good as well. You can proceed! [18:39:46] !log jdrewniak@deploy1002 Synchronized private/readme.php: Config: [[gerrit:679614|Add $wgWMEVectorPrefDiffSalt to private/readme (T261842)]] (duration: 01m 08s) [18:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:55] T261842: Create schema to track users opting in/out of desktop improvements - https://phabricator.wikimedia.org/T261842 [18:41:08] nray: alright that's both of them :) [18:41:30] \o/ Thanks so much jan_drewniak , I really appreciate it! 
:) [18:41:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:47:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:47:52] !log andrew@deploy1002 Started deploy [horizon/deploy@ec37c43]: test deploy of trove dashboard to codfw1dev [18:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:58] thanks jan_drewniak :) [18:49:50] !log andrew@deploy1002 Finished deploy [horizon/deploy@ec37c43]: test deploy of trove dashboard to codfw1dev (duration: 01m 58s) [18:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10Cmjohnson) a:05Cmjohnson→03RobH @robh the 2nd interface was added to these, can you try the install again please. [18:52:53] (03CR) 10CRusnov: "> Patch Set 1:" (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/679403 (owner: 10CRusnov) [18:58:23] 10SRE, 10Privacy Engineering, 10Traffic, 10Patch-For-Review, 10Privacy: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC) - https://phabricator.wikimedia.org/T279804 (10dpifke) Are the Google heuristics for "ads-related resources" public? Are changes to those heuristics (o... [19:00:04] longma and marxarelli: Time to snap out of that daydream and deploy Mediawiki train - American Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210415T1900). 
[19:07:36] 10SRE, 10Privacy Engineering, 10Traffic, 10Patch-For-Review, 10Privacy: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC) - https://phabricator.wikimedia.org/T279804 (10dpifke) I considered modifying the Varnish patch to not send the header to privacy-respecting user agents... [19:08:16] (03PS1) 10Andrew Bogott: Install trove services in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/679924 (https://phabricator.wikimedia.org/T212595) [19:08:57] (03CR) 10Andrew Bogott: [C: 03+2] Install trove services in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/679924 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [19:09:36] (03CR) 10MacFan4000: [C: 03+1] [wikitech] Update logo to mirror the new MediaWiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678342 (https://phabricator.wikimedia.org/T279087) (owner: 10Jforrester) [19:12:11] (03PS1) 10Jeena Huneidi: all wikis to 1.37.0-wmf.1 refs T278345 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679925 [19:12:13] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.37.0-wmf.1 refs T278345 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679925 (owner: 10Jeena Huneidi) [19:12:55] (03Merged) 10jenkins-bot: all wikis to 1.37.0-wmf.1 refs T278345 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679925 (owner: 10Jeena Huneidi) [19:14:16] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.1 refs T278345 [19:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:25] T278345: 1.37.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T278345 [19:14:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10RobH) 05Open→03Resolved >>! 
In T272403#7004396, @Cmjohnson wrote: > @robh the 2nd interface was added to these, can you try... [19:15:31] longma: o/ [19:15:34] (03CR) 10Krinkle: [C: 03+1] [wikitech] Update logo to mirror the new MediaWiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678342 (https://phabricator.wikimedia.org/T279087) (owner: 10Jforrester) [19:15:52] logs look good so far [19:16:04] * marxarelli crosses fingers and toes [19:18:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10RobH) a:05RobH→03None Dell forgot to send out the NIC for the seed server, this was updated two days ago and supposedly its goin... [19:18:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:23:27] (03PS1) 10Jdlrobson: Adjust floating override [skins/Vector] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679842 (https://phabricator.wikimedia.org/T280260) [19:24:38] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:50] (03CR) 10Gehel: "Minor comment inline. As volans suggested, it would make sense to move to the class API in this CR (or in a following CR if you prefer)." 
(032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [19:40:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr @Jclark-ctr netbox script ran for wcqs1001 and 1002. I'm not sure why 1003 is in C4, that's a 10G r... [19:42:09] !log reindexing commons and wikidata on elastic@eqiad (T274200) [19:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:17] T274200: Reindex English and Italian wikis to enable homoglyph plugin - https://phabricator.wikimedia.org/T274200 [19:43:02] (03PS9) 10Cwhite: logstash: provision per-target apifeatureusage jobs [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394) [19:43:04] !log reindexing wikidata on cloudelastic (T274200) [19:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:25] !log reindexing wikidata on cloudelastic finished/failed (T274200) [19:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:34] T274200: Reindex English and Italian wikis to enable homoglyph plugin - https://phabricator.wikimedia.org/T274200 [19:59:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:02:35] (03CR) 10Ottomata: [C: 03+2] hadoop::directory - Remove dependency on namenode [puppet] - 10https://gerrit.wikimedia.org/r/678870 (owner: 10Ottomata) [20:03:34] !log migrating kafka-logging broker logstash1012 to kafka-logging1003 T279342 [20:03:34] 10SRE, 10Privacy Engineering, 10Traffic, 10Patch-For-Review, 10Privacy: Visits to Wikimedia properties should not be used 
for Google ad targeting (FLoC) - https://phabricator.wikimedia.org/T279804 (10ori) >>! In T279804#7005424, @dpifke wrote: > Are the Google heuristics for "ads-related resources" publi... [20:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:44] T279342: Migrate colocated kafka-logging brokers to dedicated kafka-logging hosts - https://phabricator.wikimedia.org/T279342 [20:03:46] (03PS5) 10Ottomata: Refactor EventLoggingSanitization using RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) [20:04:06] (03CR) 10Cwhite: [C: 03+2] "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1002/29085/" [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [20:04:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) firmware updated on all hosts to newest revision for idrac, bios, raid, and nic [20:06:43] (03PS1) 10RobH: cloudcephosd10[16-20] setup info [puppet] - 10https://gerrit.wikimedia.org/r/679935 (https://phabricator.wikimedia.org/T274945) [20:07:23] (03CR) 10RobH: [C: 03+2] cloudcephosd10[16-20] setup info [puppet] - 10https://gerrit.wikimedia.org/r/679935 (https://phabricator.wikimedia.org/T274945) (owner: 10RobH) [20:07:53] (03PS2) 10Herron: kafka-logging: migrake broker logstash1012 to kafka-logging1003 [puppet] - 10https://gerrit.wikimedia.org/r/679875 (https://phabricator.wikimedia.org/T279342) [20:09:12] (03CR) 10Herron: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/29086/" [puppet] - 10https://gerrit.wikimedia.org/r/679875 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [20:11:15] (03PS6) 10Ottomata: Refactor EventLoggingSanitization using RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/678941 
(https://phabricator.wikimedia.org/T273789) [20:13:10] (03PS1) 10Gergő Tisza: flaggedrevs.php: Use MediaWikiServices, not an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679938 [20:15:45] (03PS1) 10Ottomata: test/refine_sanitize - make salt rotation more generic [puppet] - 10https://gerrit.wikimedia.org/r/679939 (https://phabricator.wikimedia.org/T273789) [20:16:43] (03CR) 10Ottomata: "Work was originally done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/678941, but this change affects test cluster only." [puppet] - 10https://gerrit.wikimedia.org/r/679939 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [20:17:04] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29087/console" [puppet] - 10https://gerrit.wikimedia.org/r/679939 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [20:17:38] (03PS1) 10Jforrester: [wikitech] Enable VE desktop section edit links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679940 [20:18:00] (03CR) 10Ottomata: [V: 03+1 C: 03+2] test/refine_sanitize - make salt rotation more generic [puppet] - 10https://gerrit.wikimedia.org/r/679939 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [20:18:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:19:22] (03PS2) 10Jforrester: [wikitech] Enable VE desktop section edit links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679940 (https://phabricator.wikimedia.org/T280291) [20:23:40] (03PS1) 10Ottomata: refine_sanitize_salt_rotate - use local_salts_prefix properly [puppet] - 10https://gerrit.wikimedia.org/r/679941 (https://phabricator.wikimedia.org/T273789) [20:25:01] (03CR) 10Ottomata: [C: 03+2] refine_sanitize_salt_rotate - 
use local_salts_prefix properly [puppet] - 10https://gerrit.wikimedia.org/r/679941 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [20:30:33] (03PS1) 10Zabe: Create Draft namespace on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679947 (https://phabricator.wikimedia.org/T280289) [20:32:50] (03PS7) 10Ottomata: Refactor EventLoggingSanitization using RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) [20:33:46] (03PS8) 10Ottomata: Refactor EventLoggingSanitization using RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) [20:35:01] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29088/console" [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [20:38:59] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:41:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1016.wikime... [20:47:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:54:50] (03CR) 10Multichill: [C: 04-1] "Please also update the comment. That's probably the reason this error got introduced in the first place. 
Wikidata is intentionally http, C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679327 (https://phabricator.wikimedia.org/T258590) (owner: 10Seddon) [20:55:26] (03PS1) 10Dzahn: DHCP: remove m1261 through mw1301 [puppet] - 10https://gerrit.wikimedia.org/r/679958 (https://phabricator.wikimedia.org/T280203) [20:55:46] (03PS2) 10Zabe: Create Draft namespace on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679947 (https://phabricator.wikimedia.org/T280289) [21:05:39] (03PS1) 10Ottomata: test/refine_sanitized - add a general purpose event_sanitized job [puppet] - 10https://gerrit.wikimedia.org/r/679961 [21:05:58] (03PS3) 10Dzahn: trafficserver: remove comment about a server that won't exist anymore [puppet] - 10https://gerrit.wikimedia.org/r/679530 (https://phabricator.wikimedia.org/T280203) [21:06:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:07:03] (03CR) 10jerkins-bot: [V: 04-1] test/refine_sanitized - add a general purpose event_sanitized job [puppet] - 10https://gerrit.wikimedia.org/r/679961 (owner: 10Ottomata) [21:09:00] (03CR) 10Dzahn: "yep, good like this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/679530 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [21:17:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:18:11] (03PS1) 10RobH: fixing cloudcephosd entries [puppet] - 10https://gerrit.wikimedia.org/r/679966 (https://phabricator.wikimedia.org/T274945) [21:18:23] (03PS2) 10RobH: fixing cloudcephosd entries [puppet] - 10https://gerrit.wikimedia.org/r/679966 (https://phabricator.wikimedia.org/T274945) [21:18:40] (03CR) 10RobH: [C: 03+2] fixing cloudcephosd entries [puppet] - 10https://gerrit.wikimedia.org/r/679966 (https://phabricator.wikimedia.org/T274945) (owner: 10RobH) [21:24:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:28:17] (03PS2) 10Ottomata: test/refine_sanitized - add a general purpose event_sanitized job [puppet] - 10https://gerrit.wikimedia.org/r/679961 [21:38:32] PROBLEM - Check systemd state on ms-be1032 is CRITICAL: CRITICAL - degraded: The following units failed: session-72622.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:30:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:31:00] RECOVERY - Check systemd 
state on ms-be1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:32:22] !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [22:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:28] !log T280108 T267927 Data transfers completed successfully; small issue with new `wait_for_updater` logic is preventing termination so I ctrl+c'd manually [22:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:36] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [22:33:37] T280108: WDQS data-transfer cookbook needs to wait for updater to catchup on lag - https://phabricator.wikimedia.org/T280108 [22:37:45] (03PS1) 10Razzi: superset: Enable superset http check [puppet] - 10https://gerrit.wikimedia.org/r/679985 (https://phabricator.wikimedia.org/T277729) [22:39:12] PROBLEM - puppet last run on wdqs2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:39:54] PROBLEM - puppet last run on wdqs2008 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:40:17] (03CR) 10Bstorm: [C: 03+1] toolforge: opt-out of Google FLoC tracking [puppet] - 10https://gerrit.wikimedia.org/r/679877 (https://phabricator.wikimedia.org/T279804) (owner: 10BryanDavis) [22:41:59] (03CR) 10Razzi: [C: 03+2] "How embarrasing! I forgot to enable this check!!" 
[puppet] - 10https://gerrit.wikimedia.org/r/679985 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [22:44:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_eventstreams_internal_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:46:40] !log T280108 T267927 Manually re-enabled and ran puppet on `wdqs1005` (had closed the tmux pane which terminated the cookbook without letting it do its final cleanup) [22:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:50] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [22:46:50] T280108: WDQS data-transfer cookbook needs to wait for updater to catchup on lag - https://phabricator.wikimedia.org/T280108 [22:48:47] !log T267927 pooled `wdqs1005` (all caught up on lag) [22:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:48] (03CR) 10Bstorm: [C: 03+2] toolforge: opt-out of Google FLoC tracking [puppet] - 10https://gerrit.wikimedia.org/r/679877 (https://phabricator.wikimedia.org/T279804) (owner: 10BryanDavis) [22:51:52] RECOVERY - puppet last run on wdqs2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:52:14] (03PS1) 10Ebernhardson: searchSatisfaction: Default userEditBucket back to 0 edits [extensions/WikimediaEvents] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679845 (https://phabricator.wikimedia.org/T280294) [22:55:45] jouncebot: refresh [22:55:51] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [22:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:00] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [22:56:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:08] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [22:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:19] !log T267927 WDQS kicked off next round of `data-transfer`s: `wdqs1004`->`wdqs1006`, `wdqs2001`->`wdqs2002`, `wdqs2008`->`wdqs1003` [22:56:21] hmm, no jouncebot? [22:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:27] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [22:56:44] in fairness its quit message did say he'd be back :) [22:57:01] I'll see if I can find out where jouncebot runs from [22:57:28] thanks! [22:57:42] looks like tools, https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [22:58:37] Not able to ssh in (`Connection closed by UNKNOWN port 65535`), presumably it's an ssh config thing on my end [22:58:52] RECOVERY - puppet last run on wdqs2008 is OK: OK: Puppet is currently disabled (transferring wikidata journal following reload from dumps - ryankemper@cumin2001 - T267927), not alerting. Last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:59:30] ebernhardson, ryankemper: jouncebot's k8s pod is in crashloopbackoff. I'll try to figure out why [23:00:04] "json.decoder.JSONDecodeError: Invalid control character at: line 1 column 526 (char 525)" -- bad json from the web page scrape [23:00:17] Did we change the page design again? [23:00:55] not it [23:01:00] * James_F grins. [23:01:09] :P [23:01:18] brennen|afk: You around for deploy training? [23:01:35] If not, I can sling the patches out and then run away. [23:01:43] James_F: nope, he's out today [23:01:48] Ah. [23:01:59] I'm sitting in the training meeting, but it's empty due to lack of advertising by me, likely [23:02:02] Shall I just comandeer it as a regular backport window? [23:02:18] please do [23:02:22] Ack. 
[23:02:26] ebernhardson: OK to deploy yours now? [23:02:52] im here [23:03:01] (03CR) 10Jforrester: [C: 03+2] Adjust floating override [skins/Vector] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679842 (https://phabricator.wikimedia.org/T280260) (owner: 10Jdlrobson) [23:03:04] Jdlrobson: Cool. [23:03:05] didn't get a ping.. [23:03:09] sorry [23:03:15] Jdlrobson: Yeah, the bot broke. [23:03:23] James_F: yup [23:03:29] (03CR) 10Jforrester: [C: 03+2] searchSatisfaction: Default userEditBucket back to 0 edits [extensions/WikimediaEvents] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679845 (https://phabricator.wikimedia.org/T280294) (owner: 10Ebernhardson) [23:03:31] (03CR) 10Jforrester: [C: 03+2] [wikitech] Update logo to mirror the new MediaWiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678342 (https://phabricator.wikimedia.org/T279087) (owner: 10Jforrester) [23:03:47] * James_F does the "don't try this at home" process of merging multiple things at once and remembering which is where. [23:04:19] so.... the new deployment page parser bits are failing very badly on the wikitech side [23:04:27] (03Merged) 10jenkins-bot: [wikitech] Update logo to mirror the new MediaWiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678342 (https://phabricator.wikimedia.org/T279087) (owner: 10Jforrester) [23:04:50] bd808: what do you mean parser bits? [23:04:59] thcipriani: https://phabricator.wikimedia.org/P15371 [23:05:09] the json being extracted is not json :) [23:05:36] Ouch. 
[23:05:58] oh good [23:06:18] !log jforrester@deploy1002 Synchronized static/images/project-logos/wikitech.png: Config: [[gerrit:678342|[wikitech] Update logo to mirror the new MediaWiki logo (T279087)]] (duration: 00m 57s) [23:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:27] T279087: Update the logo for wikitech.wikimedia.org to mirror the new MediaWiki logo - https://phabricator.wikimedia.org/T279087 [23:06:53] (03CR) 10Bstorm: [C: 03+2] cloud-vps: opt-out of Google FLoC tracking [puppet] - 10https://gerrit.wikimedia.org/r/679878 (https://phabricator.wikimedia.org/T279804) (owner: 10BryanDavis) [23:07:34] !log jforrester@deploy1002 Synchronized static/images/project-logos/wikitech-1.5x.png: Config: [[gerrit:678342|[wikitech] Update logo to mirror the new MediaWiki logo (T279087)]] (duration: 00m 57s) [23:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:08:33] !log jforrester@deploy1002 Synchronized static/images/project-logos/wikitech-2x.png: Config: [[gerrit:678342|[wikitech] Update logo to mirror the new MediaWiki logo (T279087)]] (duration: 00m 56s) [23:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:33] !log jforrester@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:678342|[wikitech] Update logo to mirror the new MediaWiki logo (T279087)]] (duration: 00m 56s) [23:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:32] I'm here too [23:12:44] (03PS3) 10Jforrester: Create Draft namespace on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679947 (https://phabricator.wikimedia.org/T280289) (owner: 10Zabe) [23:13:14] 10SRE, 
10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1016.wikime... [23:13:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) [23:14:22] (03CR) 10Jforrester: [C: 03+2] Create Draft namespace on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679947 (https://phabricator.wikimedia.org/T280289) (owner: 10Zabe) [23:14:27] (03Merged) 10jenkins-bot: Create Draft namespace on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679947 (https://phabricator.wikimedia.org/T280289) (owner: 10Zabe) [23:14:35] wikibugs is rather behind. [23:17:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:17:43] !log jforrester@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:679947|Create Draft namespace on itwiki (T280289)]] (duration: 00m 56s) [23:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:52] T280289: Create Draft namespace on itwiki - https://phabricator.wikimedia.org/T280289 [23:18:03] Zabe: All deployed. [23:18:17] James_F: thanks :) [23:18:27] Of course. [23:19:58] (03CR) 10Jforrester: [C: 04-1] "Waiting for team feedback." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679940 (https://phabricator.wikimedia.org/T280291) (owner: 10Jforrester) [23:21:22] * James_F twiddles thumbs waiting for Vector's endless selenium run. [23:22:26] James_F: hmm, was mine synced? 
[23:22:59] (i don't see it from mwmaint1002 in php-1.37.0-wmf.1) [23:23:40] oh, wow. [23:23:43] its still running [23:24:07] ebernhardson: Yeah, waiting behind Vector. :-( [23:26:41] :/ [23:27:15] (03Merged) 10jenkins-bot: Adjust floating override [skins/Vector] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679842 (https://phabricator.wikimedia.org/T280260) (owner: 10Jdlrobson) [23:27:16] (03Merged) 10jenkins-bot: searchSatisfaction: Default userEditBucket back to 0 edits [extensions/WikimediaEvents] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679845 (https://phabricator.wikimedia.org/T280294) (owner: 10Ebernhardson) [23:27:47] Finally. [23:27:55] I'll do ebernhardson first as that's simpler. [23:28:47] ebernhardson: Worth testing on mwdebug1002 or should we just go for it? [23:29:56] Jdlrobson: Yours is live on mwdebug1002 if you want to test? [23:30:11] James_F: just go for it [23:30:23] mostly i have to watch the validation failure graphs [23:30:28] Going right now. [23:31:16] !log jforrester@deploy1002 Synchronized php-1.37.0-wmf.1/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: Backport: [[gerrit:679845|searchSatisfaction: Default userEditBucket back to 0 edits (T280294)]] (duration: 00m 57s) [23:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:24] Done. [23:31:25] T280294: Big increase in eventlogging_SearchSatisfaction validation errors after this week's MW train - https://phabricator.wikimedia.org/T280294 [23:35:10] (03CR) 10BryanDavis: [C: 03+2] jouncebot_preview: Add basic preview mode for quick testing [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677329 (owner: 10Krinkle) [23:35:43] (03CR) 10BryanDavis: [C: 03+2] configloader: Fix YAMLLoadWarning, use SafeLoader. 
[wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677330 (owner: 10Krinkle) [23:35:52] !log jforrester@deploy1002 Synchronized php-1.37.0-wmf.1/skins/Vector/resources/skins.vector.styles.legacy/layouts/screen.less: Backport: [[gerrit:679842|Adjust floating override (T280260)]] (duration: 00m 56s) [23:35:59] (03Merged) 10jenkins-bot: jouncebot_preview: Add basic preview mode for quick testing [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677329 (owner: 10Krinkle) [23:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:01] T280260: mw-empty-elt element is not hidden in old Vector - https://phabricator.wikimedia.org/T280260 [23:36:33] (03Merged) 10jenkins-bot: configloader: Fix YAMLLoadWarning, use SafeLoader. [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677330 (owner: 10Krinkle) [23:36:35] (03CR) 10BryanDavis: [C: 03+2] Let MW parse from page instead of downloading/uploading wikitext [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677331 (owner: 10Krinkle) [23:37:13] (03Merged) 10jenkins-bot: Let MW parse from page instead of downloading/uploading wikitext [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677331 (owner: 10Krinkle) [23:37:31] !log jforrester@deploy1002 Synchronized php-1.37.0-wmf.1/skins/Vector/skin.json: Backport: [[gerrit:679842|Adjust floating override (T280260)]] (duration: 00m 56s) [23:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:43] Jdlrobson: All deployed, seems fixed to me. [23:37:56] * James_F stays around in case ebernhardson needs him to revert. 
[23:37:58] (03CR) 10BryanDavis: [C: 03+2] deploypage: Migrate to css selectors (fix window raw wikitext bug) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677332 (https://phabricator.wikimedia.org/T279391) (owner: 10Krinkle) [23:38:48] (03Merged) 10jenkins-bot: deploypage: Migrate to css selectors (fix window raw wikitext bug) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677332 (https://phabricator.wikimedia.org/T279391) (owner: 10Krinkle) [23:39:16] James_F: thx! [23:39:24] sorry again missed the ping.. [23:39:26] Happy to help. [23:39:28] so there is something wrong with my client [23:40:13] James_F: nope, metric for errors is tanking. Looks all sane [23:41:16] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring [23:41:51] Hurrah. [23:42:04] I'm declaring this window closed. Godspeed. [23:42:25] after falling down the rabbit hole a bit: I don't get why there are LFs in the json. Scribunto seems to call the core formatjson class which claims: PHP already escapes LF and CR according to RFC 4627 ... but somehow only 1 LF : x5cx6ex0a¯\_(ツ)_/¯ [23:43:06] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 88535 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring [23:43:09] thcipriani: Krinkle had made a patch that removes the json bits. I'm deploying that now [23:43:15] Ha. [23:43:39] jouncebot: now [23:43:39] For the next 1 hour(s) and 16 minute(s): US Backport and Config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210415T2300) [23:43:39] one step behind: story of my life :) [23:43:43] jouncebot: next [23:43:44] In 7 hour(s) and 16 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210416T0700) [23:44:08] neat [23:44:16] thcipriani: there are LFs there because the json encoder does not see the actual text, it sees a parser placeholder. We don't allow users' to output raw html or modify the raw html. So the Lua code that "calls" the template actually just gets given a placeholder token. [23:44:19] happiness restored. the "now" output was the record failing with the json mess [23:44:30] thcipriani: then after Lua is said and done, the parser sees the token in the JSON string and performs the replacement on that. [23:44:44] that's in part why I changed it :) [23:44:56] Hurrah [23:45:49] nice [23:46:15] there are a surprising amount of layers for jouncebot [23:46:37] it was less complicated 6 years ago :) [23:46:51] Yeah yeah. [23:47:57] Honestly if I was still actively doing deploy things (which may return soonish) I would replace the whole wiki page with a dedicated app [23:48:43] that sounds interesting [23:48:47] most of the ugly is in trying to make a readable wiki page that is easy to edit and also to scrape with a horrible python screen reader [23:48:49] * James_F would rather we spent the effort on on-wiki / DB-based config, instead. [23:49:00] So that we can have 10x fewer deploys. [23:49:31] But maybe that' [23:49:38] s a pipedream. [23:49:47] somebody keeps telling me that mediawiki on kubernetes with continuous deploy will happen "soon" [23:49:56] Also that. [23:50:09] Well, the first bit. The CD bit will take longer. [23:50:13] James_F: Hm.. shouldn't the +2 about wmf/* be mentioned here by gerrit bot? [23:50:22] I got the ping in -perf-bots instead [23:50:23] also, a flask app to track patch windows is a lot less work than replacing multiversion with a db [23:50:34] Krinkle: If IRC was working, yes. 
[23:50:41] > 00:03 <•wikibugs> (CR) Jforrester: [C: +2] searchSatisfaction: Default userEditBucket back to 0 edits [extensions/WikimediaEvents] (wmf/1.37.0-wmf.1) - https://gerrit.wikimedia.org/r/679845 (https://phabricator.wikimedia.org/T280294) (owner: Ebernhardson) [23:50:43] Krinkle: But half the updates didn't reach here. [23:51:07] or maybe it's configured incorrectly [23:51:32] I fiddled with it recently to get some things in -perf-bots that weren't getting there that I thought should [23:52:45] looks like it's configured correctly to me [23:53:10] Or it's just race-condition-ing at times. [23:53:47] (CR) Bstorm: [C: +2] gridengine: set grid-configurator source files to use new domain name [puppet] - https://gerrit.wikimedia.org/r/678043 (https://phabricator.wikimedia.org/T277653) (owner: Bstorm) [23:54:05] (CR) Krinkle: [C: +1] "Thanks, I should have caught this in CR." [extensions/WikimediaEvents] (wmf/1.37.0-wmf.1) - https://gerrit.wikimedia.org/r/679845 (https://phabricator.wikimedia.org/T280294) (owner: Ebernhardson) [23:54:15] yeah, this time it got it ^ [23:54:28] * Krinkle shrugs [23:54:31] computers! [23:54:32] Maybe it's just a rate limit [23:58:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:58:21] Or maybe it got stuck at the WMCS end; it was very behind/lossy for a bit. [23:58:44] I'd already pulled a patch into deploy and onto debug before it mentioned it had merged here, for example.
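Note on the "json.decoder.JSONDecodeError: Invalid control character" message quoted at 23:00:04: by default Python's json module rejects raw control characters (such as a literal LF) inside a JSON string value, which matches the symptom described in the channel. The sketch below reproduces that class of error with a made-up payload. The payload, the "window" key, and the try_parse helper are illustrative assumptions, not jouncebot's actual code or the real scrape output.

```python
import json

# Hypothetical payload: the "\n" here is a real line feed (0x0A) sitting
# inside a JSON string value, which strict JSON parsing does not allow.
bad_payload = '{"window": "US Backport\nand Config"}'


def try_parse(text, strict=True):
    """Return (parsed, None) on success or (None, error message) on failure."""
    try:
        return json.loads(text, strict=strict), None
    except json.JSONDecodeError as exc:
        return None, str(exc)


# Default (strict) parsing fails on the raw line feed.
parsed, error = try_parse(bad_payload)

# strict=False tells the decoder to accept control characters in strings.
lenient, lenient_error = try_parse(bad_payload, strict=False)

# A well-formed encoder escapes the line feed, so the round trip works.
escaped = json.dumps({"window": "US Backport\nand Config"})
round_trip, round_trip_error = try_parse(escaped)

print(error)
print(lenient)
print(escaped)
```

Passing strict=False would only paper over the problem on the consumer side. The fix described in the channel was different: the patch being deployed removed the JSON step from the page scrape altogether.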