[00:00:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service,systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:52] (PS2) Legoktm: mailman3: Configure ferm [puppet] - https://gerrit.wikimedia.org/r/674393 (https://phabricator.wikimedia.org/T277286) (owner: Ladsgroup) [00:00:54] (PS1) Legoktm: role::lists3: Include profile::standard and profile::base::firewall [puppet] - https://gerrit.wikimedia.org/r/674456 [00:01:39] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2377.codfw.wmnet ` The log can be found in `/... [00:04:36] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2378.codfw.wmnet ` The log can be found in `/... 
[00:05:11] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28731/console" [puppet] - https://gerrit.wikimedia.org/r/674456 (owner: Legoktm) [00:06:35] (CR) Legoktm: [V: +1 C: +2] role::lists3: Include profile::standard and profile::base::firewall [puppet] - https://gerrit.wikimedia.org/r/674456 (owner: Legoktm) [00:07:47] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28732/console" [puppet] - https://gerrit.wikimedia.org/r/674393 (https://phabricator.wikimedia.org/T277286) (owner: Ladsgroup) [00:08:02] (CR) Legoktm: [V: +1 C: +2] mailman3: Configure ferm [puppet] - https://gerrit.wikimedia.org/r/674393 (https://phabricator.wikimedia.org/T277286) (owner: Ladsgroup) [00:16:11] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2377.codfw.wmnet with reason: REIMAGE [00:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:52] (PS4) Dzahn: httpd: add parameter to allow listening on 80-only regardless of mod_ssl [puppet] - https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989) [00:17:21] (CR) jerkins-bot: [V: -1] httpd: add parameter to allow listening on 80-only regardless of mod_ssl [puppet] - https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989) (owner: Dzahn) [00:18:03] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2377.codfw.wmnet with reason: REIMAGE [00:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:52] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2378.codfw.wmnet with reason: REIMAGE [00:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:49] (PS4) Legoktm: mailman3: Parameterize MariaDB 
dbname and username [puppet] - https://gerrit.wikimedia.org/r/673632 (https://phabricator.wikimedia.org/T256536) [00:21:51] (PS3) Legoktm: mailman3: Use acme-chief, unify Apache configuration [puppet] - https://gerrit.wikimedia.org/r/673641 (https://phabricator.wikimedia.org/T256536) [00:21:53] (PS4) Legoktm: [WIP] Configure lists1002 [puppet] - https://gerrit.wikimedia.org/r/673636 [00:22:17] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2378.codfw.wmnet with reason: REIMAGE [00:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:23] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28733/console" [puppet] - https://gerrit.wikimedia.org/r/673632 (https://phabricator.wikimedia.org/T256536) (owner: Legoktm) [00:23:44] (CR) Legoktm: [V: +1 C: +2] mailman3: Parameterize MariaDB dbname and username [puppet] - https://gerrit.wikimedia.org/r/673632 (https://phabricator.wikimedia.org/T256536) (owner: Legoktm) [00:25:07] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2377.codfw.wmnet'] ` and were **ALL** successful. 
[00:26:09] Puppet, SRE, Wikimedia-Mailing-lists: Make puppet for mailman3 ready for production - https://phabricator.wikimedia.org/T277286 (Legoktm) [00:26:30] Puppet, SRE, Wikimedia-Mailing-lists: Make puppet for mailman3 ready for production - https://phabricator.wikimedia.org/T277286 (Legoktm) [00:28:07] (PS1) Dzahn: webperf: use new http_only parameter with httpd in processors_and_site [puppet] - https://gerrit.wikimedia.org/r/674458 [00:28:59] (PS5) Dzahn: httpd: add parameter to allow listening on 80-only regardless of mod_ssl [puppet] - https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989) [00:29:02] SRE, Wikimedia-Mailing-lists: Setup monitoring for mailman3 - https://phabricator.wikimedia.org/T278280 (Legoktm) [00:30:02] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2378.codfw.wmnet'] ` and were **ALL** successful. [00:32:36] SRE, Wikimedia-Mailing-lists, observability: Setup monitoring for mailman3 - https://phabricator.wikimedia.org/T278280 (Legoktm) [00:33:59] Puppet, SRE, Wikimedia-Mailing-lists: Make puppet for mailman3 ready for production - https://phabricator.wikimedia.org/T277286 (Legoktm) I filed {T278280} for monitoring, I don't think that's a blocker for this and can be done iteratively. [00:35:39] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (Papaul) @Dzahn mw2377 and mw2378 are ready for service [00:36:12] (PS1) Legoktm: mailman3: Disable django admin module [puppet] - https://gerrit.wikimedia.org/r/674459 [00:40:20] (CR) Legoktm: [C: -1] "It disables it by breaking it, sigh." 
[puppet] - https://gerrit.wikimedia.org/r/674459 (owner: Legoktm) [00:49:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,pdu_sentry4} site={codfw,eqsin} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:50:22] (CR) Dzahn: "see again now. much simpler" [puppet] - https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989) (owner: Dzahn) [00:51:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:55:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:56:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:15:56] (CR) Ryan Kemper: [C: +2] elasticsearch: combined plugin upgrade + reboot [cookbooks] - https://gerrit.wikimedia.org/r/673343 (https://phabricator.wikimedia.org/T277792) (owner: Ryan Kemper) [01:19:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:21:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:37:48] (PS1) Reedy: Use namespaced PoolCounter Client class [mediawiki-config] - https://gerrit.wikimedia.org/r/674465 [01:48:38] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-upgrade-reboot [01:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:49:53] !log T274204 `sudo -i cookbook sre.elasticsearch.rolling-upgrade-reboot relforge "relforge cluster restarts" --task-id T274204 --nodes-per-run 3 --start-datetime 2021-03-24T01:45:59+00:00` on `ryankemper@cumin1001` tmux session `elasticsearch_rolling_upgrade_reboots` [01:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:50:01] T274204: Deploy new version of Extra Plugin (with Khmer filter) to Elasticsearch cluster - https://phabricator.wikimedia.org/T274204 [01:58:07] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-upgrade-reboot (exit_code=97) [01:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:58] !log T274204 `ctrl+c`'d out of run; relforge is relying on outdated config that is trying to talk to `relforge1002` which no longer exists. 
Need to refactor so that config no longer lives in spicerack [01:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:59:06] T274204: Deploy new version of Extra Plugin (with Khmer filter) to Elasticsearch cluster - https://phabricator.wikimedia.org/T274204 [01:59:24] !log T274204 For now I'll proceed to the reboots of `codfw` [01:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:51] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-upgrade [02:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:46] !log T274204 `sudo -i cookbook sre.elasticsearch.rolling-upgrade search_codfw "codfw cluster reboot" --task-id T274204 --nodes-per-run 3 --start-datetime 2021-03-24T02:29:39` on `ryankemper@cumin1001` tmux session `elasticsearch_rolling_upgrade_reboots` [02:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:54] T274204: Deploy new version of Extra Plugin (with Khmer filter) to Elasticsearch cluster - https://phabricator.wikimedia.org/T274204 [03:16:27] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:18:39] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.068 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:38:21] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99) [03:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:45] !log T274204 Timed out waiting for write queues to empty: `[59/60, retrying in 60.00s] Attempt to run 'spicerack.elasticsearch_cluster.ElasticsearchClusters.wait_for_all_write_queues_empty' raised: Write queue not empty (had value of 241631) for partition 0 of topic 
codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.` [03:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:53] T274204: Deploy new version of Extra Plugin (with Khmer filter) to Elasticsearch cluster - https://phabricator.wikimedia.org/T274204 [03:40:11] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-upgrade [03:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:41:08] !log T274204 Restarting `codfw` restart; the timestamp argument should prevent it from wasting time on nodes that have been rebooted already [03:41:10] !log T274204 `sudo -i cookbook sre.elasticsearch.rolling-upgrade search_codfw "codfw cluster reboot" --task-id T274204 --nodes-per-run 3 --start-datetime 2021-03-24T02:29:39` on `ryankemper@cumin1001` tmux session `elasticsearch_rolling_upgrade_reboots` [03:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:44:27] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99) [04:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:15] (CR) CRusnov: "This change is ready for review." 
[software/netbox] - https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov) [04:57:26] (PS3) CRusnov: Add CAS authentication support [software/netbox] - https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) [04:58:11] (PS4) CRusnov: Add CAS authentication support [software/netbox] - https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) [05:14:42] (PS4) ArielGlenn: dumpwikibasejson: Make segment separation more robust [puppet] - https://gerrit.wikimedia.org/r/673679 (https://phabricator.wikimedia.org/T277300) (owner: Hoo man) [05:16:30] (CR) ArielGlenn: [C: +2] dumpwikibasejson: Make segment separation more robust [puppet] - https://gerrit.wikimedia.org/r/673679 (https://phabricator.wikimedia.org/T277300) (owner: Hoo man) [05:44:05] (PS2) Legoktm: mailman3: Disallow access to django admin interface [puppet] - https://gerrit.wikimedia.org/r/674459 [05:44:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:49:09] (CR) Legoktm: [C: +2] mailman3: Use acme-chief, unify Apache configuration [puppet] - https://gerrit.wikimedia.org/r/673641 (https://phabricator.wikimedia.org/T256536) (owner: Legoktm) [05:49:53] (PS3) Legoktm: mailman3: Disallow access to django admin interface [puppet] - https://gerrit.wikimedia.org/r/674459 [05:50:40] (CR) Legoktm: [C: +2] mailman3: Disallow access to django admin interface [puppet] - https://gerrit.wikimedia.org/r/674459 (owner: Legoktm) [05:52:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1141', diff saved to https://phabricator.wikimedia.org/P15056 and previous config saved to /var/cache/conftool/dbconfig/20210324-055246-marostegui.json [05:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:13] 
(PS1) Legoktm: mailman3: Enable mod_headers [puppet] - https://gerrit.wikimedia.org/r/674474 [05:55:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:56:30] (CR) Legoktm: [C: +2] mailman3: Enable mod_headers [puppet] - https://gerrit.wikimedia.org/r/674474 (owner: Legoktm) [05:56:32] SRE, ops-eqiad, DBA: Degraded RAID on db1086 - https://phabricator.wikimedia.org/T278226 (Marostegui) Thank you! The RAID is now back in optimal state ` logicaldrive 1 (3.6 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 800 GB, OK) physicaldrive 1I:... [05:59:15] Puppet, SRE, Wikimedia-Mailing-lists: Make puppet for mailman3 ready for production - https://phabricator.wikimedia.org/T277286 (Legoktm) [06:00:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:05:10] SRE, Wikimedia-Mailing-lists, observability: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (Legoktm) Here's what the logs to disk look like: ` root@mailman-mailman02:~# tree /var/log/mailman3/ /var/log/mailman3/ ├── bounce.log ├── debug.log ├── mailman.log ├── mai... 
[06:10:38] (PS1) Marostegui: mariadb: Decommission db1084 [puppet] - https://gerrit.wikimedia.org/r/674475 (https://phabricator.wikimedia.org/T276302) [06:12:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:14:20] !log root@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1084.eqiad.wmnet [06:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:15:58] (PS1) Ladsgroup: hyperkitty: Install whoosh [puppet] - https://gerrit.wikimedia.org/r/674476 (https://phabricator.wikimedia.org/T256536) [06:16:00] (PS1) Ladsgroup: mailman3: Clean up the web frontend [puppet] - https://gerrit.wikimedia.org/r/674477 (https://phabricator.wikimedia.org/T256536) [06:22:08] (PS2) Ladsgroup: mailman3: Clean up the web frontend [puppet] - https://gerrit.wikimedia.org/r/674477 (https://phabricator.wikimedia.org/T256536) [06:23:49] (CR) Marostegui: [C: +2] mariadb: Decommission db1084 [puppet] - https://gerrit.wikimedia.org/r/674475 (https://phabricator.wikimedia.org/T276302) (owner: Marostegui) [06:24:11] (CR) Legoktm: mailman3: Clean up the web frontend (2 comments) [puppet] - https://gerrit.wikimedia.org/r/674477 (https://phabricator.wikimedia.org/T256536) (owner: Ladsgroup) [06:24:18] (CR) Legoktm: [C: +2] hyperkitty: Install whoosh [puppet] - https://gerrit.wikimedia.org/r/674476 (https://phabricator.wikimedia.org/T256536) (owner: Ladsgroup) [06:24:34] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1084.eqiad.wmnet [06:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[06:26:02] ops-eqiad, DC-Ops, decommission-hardware: decommission db1084.eqiad.wmnet - https://phabricator.wikimedia.org/T276302 (Marostegui) [06:26:10] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) [06:26:42] (PS3) Ladsgroup: mailman3: Clean up the web frontend [puppet] - https://gerrit.wikimedia.org/r/674477 (https://phabricator.wikimedia.org/T256536) [06:28:11] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28734/console" [puppet] - https://gerrit.wikimedia.org/r/674477 (https://phabricator.wikimedia.org/T256536) (owner: Ladsgroup) [06:28:24] (CR) Ladsgroup: mailman3: Clean up the web frontend (2 comments) [puppet] - https://gerrit.wikimedia.org/r/674477 (https://phabricator.wikimedia.org/T256536) (owner: Ladsgroup) [06:28:31] (PS1) Marostegui: instances.yaml: Add db1181 to dbctl [puppet] - https://gerrit.wikimedia.org/r/674479 (https://phabricator.wikimedia.org/T275633) [06:29:04] (CR) Legoktm: [V: +1 C: +2] mailman3: Clean up the web frontend [puppet] - https://gerrit.wikimedia.org/r/674477 (https://phabricator.wikimedia.org/T256536) (owner: Ladsgroup) [06:32:35] (CR) Marostegui: [C: +2] instances.yaml: Add db1181 to dbctl [puppet] - https://gerrit.wikimedia.org/r/674479 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui) [06:33:56] (PS2) Muehlenhoff: Failover tendril to dbmonitor1002 [dns] - https://gerrit.wikimedia.org/r/674303 (https://phabricator.wikimedia.org/T224589) [06:34:35] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Puppetize mailman3 - https://phabricator.wikimedia.org/T256536 (Ladsgroup) Open→Resolved This is done. 
Some follow up stuff for production are being done in {T277286} [06:34:41] SRE, Security-Team, Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (Ladsgroup) [06:34:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1181 to dbctl, depooled T275633', diff saved to https://phabricator.wikimedia.org/P15057 and previous config saved to /var/cache/conftool/dbconfig/20210324-063459-marostegui.json [06:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:08] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [06:36:18] (CR) Muehlenhoff: [C: +2] Failover tendril to dbmonitor1002 [dns] - https://gerrit.wikimedia.org/r/674303 (https://phabricator.wikimedia.org/T224589) (owner: Muehlenhoff) [06:36:32] (PS1) Marostegui: install_server: Do not reimage db2150 [puppet] - https://gerrit.wikimedia.org/r/674480 (https://phabricator.wikimedia.org/T275633) [06:37:13] (CR) Marostegui: [C: +2] install_server: Do not reimage db2150 [puppet] - https://gerrit.wikimedia.org/r/674480 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui) [06:40:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:41:50] (PS1) Marostegui: install_server: Do not format db1161 [puppet] - https://gerrit.wikimedia.org/r/674481 (https://phabricator.wikimedia.org/T258361) [06:42:35] (PS2) Marostegui: install_server: Do not format db1159 [puppet] - https://gerrit.wikimedia.org/r/674481 (https://phabricator.wikimedia.org/T258361) [06:43:16] (CR) Marostegui: [C: +2] install_server: Do not format db1159 [puppet] - https://gerrit.wikimedia.org/r/674481 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui) [06:47:25] PROBLEM - Prometheus jobs 
reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:49:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:49:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:59:31] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (ayounsi) @Cmjohnson there are outstanding changes in Homer for cloudsw1-c8 Also something is not right in Netbox, https://netbox... [07:01:36] SRE, ops-eqiad, DC-Ops, decommission-hardware: decommission db1084.eqiad.wmnet - https://phabricator.wikimedia.org/T276302 (Marostegui) a:Marostegui→wiki_willy [07:02:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:09:09] !log installing squid security updates [07:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:39] SRE, Patch-For-Review: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (MoritzMuehlenhoff) [07:10:21] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission for hosts ml-etcd2002.codfw.wmnet [07:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:38] SRE, Patch-For-Review: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (MoritzMuehlenhoff) tendril.w.o and 
dbtree.w.o are now served from dbmonitor1002.wikimedia.org running Buster. If there are any issues, we can fallback to dbmonitor1001 by reverting https://gerrit.wikimed... [07:15:04] SRE, Patch-For-Review: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (jcrespo) Hey, @MoritzMuehlenhoff, I see no ongoing issues, but I see some things running 10x faster now! [07:16:29] SRE, Patch-For-Review: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (jcrespo) [07:19:49] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:20:40] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ml-etcd2002.codfw.wmnet [07:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 25%: Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P15058 and previous config saved to /var/cache/conftool/dbconfig/20210324-072214-root.json [07:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:45] (PS1) Elukey: Remove ml-etcd2002 SRV record for decom [dns] - https://gerrit.wikimedia.org/r/674486 (https://phabricator.wikimedia.org/T278238) [07:23:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1086 for schema change', diff saved to https://phabricator.wikimedia.org/P15059 and previous config saved to /var/cache/conftool/dbconfig/20210324-072319-marostegui.json [07:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:28] (CR) Elukey: [C: +2] Remove ml-etcd2002 SRV record for decom [dns] - https://gerrit.wikimedia.org/r/674486 (https://phabricator.wikimedia.org/T278238) (owner: Elukey) [07:23:40] (CR) Muehlenhoff: [C: +2] Remove deployment-logstash2 
[puppet] - https://gerrit.wikimedia.org/r/674392 (https://phabricator.wikimedia.org/T238707) (owner: Muehlenhoff) [07:27:37] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-etcd2002.codfw.wmnet [07:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:31:44] (CR) Jcrespo: "The general idea looks ok, I think a patch like this should be merged. However:" [software/transferpy] - https://gerrit.wikimedia.org/r/674319 (https://phabricator.wikimedia.org/T268258) (owner: Palak199) [07:32:31] (PS1) Filippo Giunchedi: alertmanager: repeat Performance alerts after 48h [puppet] - https://gerrit.wikimedia.org/r/674514 (https://phabricator.wikimedia.org/T278210) [07:33:19] (CR) Jcrespo: "Also see below." (1 comment) [software/transferpy] - https://gerrit.wikimedia.org/r/674319 (https://phabricator.wikimedia.org/T268258) (owner: Palak199) [07:33:26] (CR) Filippo Giunchedi: [C: +2] alertmanager: repeat Performance alerts after 48h [puppet] - https://gerrit.wikimedia.org/r/674514 (https://phabricator.wikimedia.org/T278210) (owner: Filippo Giunchedi) [07:37:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 50%: Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P15060 and previous config saved to /var/cache/conftool/dbconfig/20210324-073718-root.json [07:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:29] (CR) Filippo Giunchedi: [C: +2] prometheus: read alerts from /srv/alerts [puppet] - https://gerrit.wikimedia.org/r/674321 (https://phabricator.wikimedia.org/T272977) (owner: Filippo Giunchedi) [07:38:55] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:40:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 25%: Slowly repool db1086 after schema change', diff saved to https://phabricator.wikimedia.org/P15061 and previous config saved to /var/cache/conftool/dbconfig/20210324-074050-root.json [07:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:47] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-etcd2002.codfw.wmnet [07:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:05] (PS1) Elukey: Revert "Remove ml-etcd2002 SRV record for decom" [dns] - https://gerrit.wikimedia.org/r/674380 [07:43:55] (CR) Elukey: [C: +2] Revert "Remove ml-etcd2002 SRV record for decom" [dns] - https://gerrit.wikimedia.org/r/674380 (owner: Elukey) [07:45:48] (PS1) Marostegui: install_server: Do not format db1161 [puppet] - https://gerrit.wikimedia.org/r/674517 (https://phabricator.wikimedia.org/T258361) [07:48:20] (CR) Abijeet Patro: [C: +1] MassMessage: Unbreak remote content fetching [extensions/MassMessage] (wmf/1.36.0-wmf.35) - https://gerrit.wikimedia.org/r/674367 (https://phabricator.wikimedia.org/T276936) (owner: Nikerabbit) [07:48:24] (PS1) Muehlenhoff: cloud VPS/Toolforge: Remove jessie support [puppet] - https://gerrit.wikimedia.org/r/674518 (https://phabricator.wikimedia.org/T232677) [07:50:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:50:28] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=dnsdisc=zotero [07:50:29] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=dnsdisc=eventgate-logging-external [07:50:29] !log 
jayme@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=dnsdisc=eventgate-main [07:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 75%: Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P15062 and previous config saved to /var/cache/conftool/dbconfig/20210324-075221-root.json [07:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:52:36] (PS1) Muehlenhoff: openstack: Remove jessie support [puppet] - https://gerrit.wikimedia.org/r/674519 (https://phabricator.wikimedia.org/T232677) [07:55:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 50%: Slowly repool db1086 after schema change', diff saved to https://phabricator.wikimedia.org/P15063 and previous config saved to /var/cache/conftool/dbconfig/20210324-075553-root.json [07:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:06] (PS1) Muehlenhoff: toolforge:k8s:etcd: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/674522 [07:59:27] (CR) Marostegui: [C: +2] install_server: Do not format db1161 [puppet] - https://gerrit.wikimedia.org/r/674517 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui) [08:01:22] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=zotero [08:01:23] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventgate-logging-external [08:01:23] !log jayme@cumin1001 conftool action : 
set/pooled=true; selector: name=eqiad,dnsdisc=eventgate-main [08:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:56] (03PS1) 10Elukey: install_server: change mac address for ml-etcd2002 [puppet] - 10https://gerrit.wikimedia.org/r/674523 (https://phabricator.wikimedia.org/T278238) [08:02:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1149 for schema change', diff saved to https://phabricator.wikimedia.org/P15064 and previous config saved to /var/cache/conftool/dbconfig/20210324-080223-marostegui.json [08:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:46] (03CR) 10Elukey: [C: 03+2] install_server: change mac address for ml-etcd2002 [puppet] - 10https://gerrit.wikimedia.org/r/674523 (https://phabricator.wikimedia.org/T278238) (owner: 10Elukey) [08:02:50] (03CR) 10JMeybohm: [C: 03+2] proton: Remove unused nodePort, enable telemetry [deployment-charts] - 10https://gerrit.wikimedia.org/r/673932 (owner: 10JMeybohm) [08:04:10] (03Merged) 10jenkins-bot: proton: Remove unused nodePort, enable telemetry [deployment-charts] - 10https://gerrit.wikimedia.org/r/673932 (owner: 10JMeybohm) [08:05:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 100%: Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P15065 and previous config saved to /var/cache/conftool/dbconfig/20210324-080725-root.json [08:07:30] (03CR) 10DCausse: [C: 03+1] Revert "[wdqs] switch reporting topic to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/674118 (owner: 10DCausse) [08:07:32] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:17] (03PS2) 10Gehel: Revert "[wdqs] switch reporting topic to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/674118 (owner: 10DCausse) [08:10:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:10:25] (03CR) 10Gehel: [C: 03+2] Revert "[wdqs] switch reporting topic to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/674118 (owner: 10DCausse) [08:10:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 75%: Slowly repool db1086 after schema change', diff saved to https://phabricator.wikimedia.org/P15066 and previous config saved to /var/cache/conftool/dbconfig/20210324-081057-root.json [08:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:49] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventgate-analytics-external [08:14:49] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventgate-analytics [08:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:12] !log restarting wdqs updater on all nodes for config change [08:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:40] (03PS1) 10Marostegui: install_server: Do not format db1164 [puppet] - 10https://gerrit.wikimedia.org/r/674527 (https://phabricator.wikimedia.org/T258361) [08:19:30] (03CR) 10Marostegui: [C: 03+2] install_server: Do not format db1164 [puppet] - 10https://gerrit.wikimedia.org/r/674527 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [08:20:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:23:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:29:53] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=sessionstore [08:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:35] (03PS1) 10Kosta Harlan: [WIP] linkrecommendation: Vary gunicorn timeout by environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/674529 (https://phabricator.wikimedia.org/T277297) [08:42:28] 10SRE, 10netops: Move management routers ssh port - https://phabricator.wikimedia.org/T277438 (10ayounsi) This is not supported by 3/5 of our management routers. They will need to be replaced by more recent gear anyway. [08:52:43] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Same comment as john, but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/672773 (owner: 10Effie Mouzeli) [08:52:49] PROBLEM - Mediawiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [08:53:45] 10SRE, 10netops: Move management routers ssh port - https://phabricator.wikimedia.org/T277438 (10ayounsi) [08:58:14] (03CR) 10Giuseppe Lavagetto: "Add one more test, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [08:58:45] (03CR) 10Giuseppe Lavagetto: "Not sure we need a stretch host at this point, otherwise LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [08:58:56] (03CR) 10Giuseppe Lavagetto: [C: 03+1] hieradata: enable memcached socket mwdebug1003, mwdebug2001 [puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [08:59:48] (03CR) 10Giuseppe Lavagetto: [C: 03+1] hieradata: enable ipv6 on envoy services on all mw servers [puppet] - 10https://gerrit.wikimedia.org/r/673061 (https://phabricator.wikimedia.org/T255568) (owner: 10Effie Mouzeli) [09:00:20] (03Abandoned) 10David Caro: tests: Improve the mocking of logs [software/spicerack] - 10https://gerrit.wikimedia.org/r/659295 (owner: 10David Caro) [09:00:35] PROBLEM - Mediawiki CirrusSearch update rate - eqiad on alert1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [09:00:56] looking ^ [09:04:37] (03PS1) 10David Caro: wmcs: Fix callable type hint [cookbooks] - 10https://gerrit.wikimedia.org/r/674534 (https://phabricator.wikimedia.org/T278286) [09:06:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "The change LGTM. If you want to minimize disruption, you can do as follows:" [puppet] - 10https://gerrit.wikimedia.org/r/674282 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [09:07:57] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Same comment as the previous patch in terms of managing the change, otherwise LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/674290 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [09:08:39] RECOVERY - HTTPS-planet on en.planet.wikimedia.org is OK: SSL OK - Certificate *.wikipedia.org valid until 2021-06-15 08:42:19 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [09:09:49] RECOVERY - HTTPS-wmfusercontent on phab.wmfusercontent.org is OK: SSL OK - Certificate *.wikipedia.org valid until 2021-06-15 08:42:19 +0000 (expires in 82 days) https://phabricator.wikimedia.org/tag/phabricator/ [09:10:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 25%: Slowly repool db1149 after schema change', diff saved to https://phabricator.wikimedia.org/P15068 and previous config saved to /var/cache/conftool/dbconfig/20210324-091055-root.json [09:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:21] 10SRE, 10Prod-Kubernetes, 10SRE-tools, 10serviceops: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10JMeybohm) [09:25:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 50%: Slowly repool db1149 after schema change', diff saved to https://phabricator.wikimedia.org/P15069 and previous config saved to /var/cache/conftool/dbconfig/20210324-092558-root.json [09:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:45] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [09:28:45] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[09:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:09] (03PS1) 10Marostegui: install_server: Do not reimage db1165 [puppet] - 10https://gerrit.wikimedia.org/r/674540 (https://phabricator.wikimedia.org/T258361) [09:33:10] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1165 [puppet] - 10https://gerrit.wikimedia.org/r/674540 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [09:33:28] 10SRE, 10vm-requests: eqiad: 1 VMs requested for ircd - https://phabricator.wikimedia.org/T278292 (10MoritzMuehlenhoff) [09:33:40] 10SRE, 10vm-requests: eqiad: 1 VMs requested for ircd - https://phabricator.wikimedia.org/T278292 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [09:34:23] 10SRE, 10vm-requests: eqiad: 1 of VMs requested for tendril/buster - https://phabricator.wikimedia.org/T277657 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This has been created and is being used. 
[09:35:20] 10SRE, 10vm-requests: eqiad: 1 VMs requested for ircd - https://phabricator.wikimedia.org/T278292 (10Majavah) [09:41:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 75%: Slowly repool db1149 after schema change', diff saved to https://phabricator.wikimedia.org/P15070 and previous config saved to /var/cache/conftool/dbconfig/20210324-094102-root.json [09:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:28] (03PS7) 10ArielGlenn: [WIP] needs more tests [dumps] - 10https://gerrit.wikimedia.org/r/646998 (https://phabricator.wikimedia.org/T252396) [09:45:48] (03PS1) 10Hoo man: hoo bash_profile: Add proxy convenience function [puppet] - 10https://gerrit.wikimedia.org/r/674545 [09:46:48] (03PS1) 10Volans: tests: test the import of all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/674546 (https://phabricator.wikimedia.org/T278286) [09:47:30] (03CR) 10Volans: "This should currently fail CI and then pass CI once Ia5eb7b87897e001bca5f508d601d0297f2eb00eb is merged." [cookbooks] - 10https://gerrit.wikimedia.org/r/674546 (https://phabricator.wikimedia.org/T278286) (owner: 10Volans) [09:49:40] (03CR) 10jerkins-bot: [V: 04-1] tests: test the import of all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/674546 (https://phabricator.wikimedia.org/T278286) (owner: 10Volans) [09:50:19] (03CR) 10Volans: [C: 03+1] "LGTM, see Ia9a688a18d0921a8ca651e4f62de160e616863db for the addition of CI checks that should pass once this is merged." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/674534 (https://phabricator.wikimedia.org/T278286) (owner: 10David Caro) [09:51:31] !log restart db1116 T271913 [09:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:43] (03CR) 10Volans: [C: 03+2] wmcs: Fix callable type hint [cookbooks] - 10https://gerrit.wikimedia.org/r/674534 (https://phabricator.wikimedia.org/T278286) (owner: 10David Caro) [09:56:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 100%: Slowly repool db1149 after schema change', diff saved to https://phabricator.wikimedia.org/P15071 and previous config saved to /var/cache/conftool/dbconfig/20210324-095606-root.json [09:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1160 for schema change', diff saved to https://phabricator.wikimedia.org/P15072 and previous config saved to /var/cache/conftool/dbconfig/20210324-095655-marostegui.json [09:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:12] 10SRE, 10DBA, 10Epic, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10Kormat) a:05Kormat→03Marostegui Assigning to @Marostegui (but he might not get to it until he's back f... 
[09:58:15] (03Merged) 10jenkins-bot: wmcs: Fix callable type hint [cookbooks] - 10https://gerrit.wikimedia.org/r/674534 (https://phabricator.wikimedia.org/T278286) (owner: 10David Caro) [09:58:21] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/674546 (https://phabricator.wikimedia.org/T278286) (owner: 10Volans) [10:00:35] PROBLEM - configured eth on ml-etcd2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.48: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [10:01:02] elukey: FYI ^^^ [10:01:26] ack thanks [10:02:33] (03PS1) 10Alexandros Kosiaris: jobqueue: Bump memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/674547 [10:03:11] !log restart db1139 T271913 [10:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:44] (03CR) 10JMeybohm: [C: 03+1] jobqueue: Bump memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/674547 (owner: 10Alexandros Kosiaris) [10:04:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] jobqueue: Bump memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/674547 (owner: 10Alexandros Kosiaris) [10:05:44] (03Merged) 10jenkins-bot: jobqueue: Bump memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/674547 (owner: 10Alexandros Kosiaris) [10:06:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:28] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [10:06:28] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . 
[10:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:09:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, thanks for the cleanup!" [puppet] - 10https://gerrit.wikimedia.org/r/674518 (https://phabricator.wikimedia.org/T232677) (owner: 10Muehlenhoff) [10:14:11] !log restart db1145 T271913 [10:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:55] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [10:14:55] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [10:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:44] (03PS1) 10Hashar: gerrit: restoring a change adds #Patch-For-Review [puppet] - 10https://gerrit.wikimedia.org/r/674548 (https://phabricator.wikimedia.org/T277597) [10:16:03] (03PS1) 10Volans: icinga: add new IcingaHosts class [software/spicerack] - 10https://gerrit.wikimedia.org/r/674549 (https://phabricator.wikimedia.org/T277740) [10:24:00] (03CR) 10jerkins-bot: [V: 04-1] icinga: add new IcingaHosts class [software/spicerack] - 10https://gerrit.wikimedia.org/r/674549 (https://phabricator.wikimedia.org/T277740) (owner: 10Volans) [10:28:00] (03CR) 10Ammarpad: [C: 04-1] gerrit: restoring a change adds #Patch-For-Review (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674548 (https://phabricator.wikimedia.org/T277597) (owner: 10Hashar) [10:31:45] RECOVERY - configured eth on ml-etcd2002 is OK: OK - interfaces up 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [10:31:58] !log restart db1171 T271913 [10:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:16] (03PS2) 10Volans: icinga: add new IcingaHosts class [software/spicerack] - 10https://gerrit.wikimedia.org/r/674549 (https://phabricator.wikimedia.org/T277740) [10:36:44] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm for new host irc1001.wikimedia.org [10:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:08] (03PS40) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [10:39:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs: add new admin script to create a new base image based off of upstream [puppet] - 10https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051) (owner: 10Andrew Bogott) [10:41:03] (03PS6) 10Effie Mouzeli: profile::mcrouter_wancache: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/672773 [10:41:17] RECOVERY - Mediawiki CirrusSearch update rate - eqiad on alert1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:42:07] (03CR) 10jerkins-bot: [V: 04-1] profile::mcrouter_wancache: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/672773 (owner: 10Effie Mouzeli) [10:43:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:44:12] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' 
command on namespace 'wikifeeds' for release 'staging' . [10:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:44] (03CR) 10Jbond: P:netbase: parse the service catalogue and inject the service ports (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [10:45:29] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [10:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:03] RECOVERY - Mediawiki CirrusSearch update rate - codfw on alert1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:47:04] (03PS1) 10Phuedx: Inform anonymous A/B test by tracking time from navigationStart [skins/Vector] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674382 [10:47:36] (03PS1) 10Hashar: gerrit: reported owner is actually patchset author [puppet] - 10https://gerrit.wikimedia.org/r/674552 (https://phabricator.wikimedia.org/T224262) [10:48:08] !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[10:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:20] (03CR) 10Muehlenhoff: [C: 03+2] cloud VPS/Toolforge: Remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/674518 (https://phabricator.wikimedia.org/T232677) (owner: 10Muehlenhoff) [10:50:20] !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host irc1001.wikimedia.org [10:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:56] (03PS1) 10Muehlenhoff: Add irc1001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/674556 (https://phabricator.wikimedia.org/T278255) [10:58:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European mid-day backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210324T1100). [11:00:04] Andrew-WMDE, Nikerabbit, and phuedx: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:01:12] I'm going to do the deployment today [11:01:52] (03PS2) 10Andrew-WMDE: Enable bracket matching on group0 and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673312 (https://phabricator.wikimedia.org/T273591) [11:02:16] (03CR) 10Muehlenhoff: [C: 03+2] Add irc1001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/674556 (https://phabricator.wikimedia.org/T278255) (owner: 10Muehlenhoff) [11:02:17] o/ [11:02:40] hello [11:04:21] (03CR) 10Andrew-WMDE: [C: 03+2] Enable bracket matching on group0 and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673312 (https://phabricator.wikimedia.org/T273591) (owner: 10Andrew-WMDE) [11:05:11] (03PS1) 10Hashar: gerrit: escape remarkup for Phabricator comments [puppet] - 10https://gerrit.wikimedia.org/r/674558 (https://phabricator.wikimedia.org/T93331) [11:05:30] (03Merged) 10jenkins-bot: Enable bracket matching on group0 and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673312 (https://phabricator.wikimedia.org/T273591) (owner: 10Andrew-WMDE) [11:06:48] (03CR) 10Hashar: "I haven't tested it :( I have just made a bunch of assumptions based on Phabricator remarkup doc at https://secure.phabricator.com/book/p" [puppet] - 10https://gerrit.wikimedia.org/r/674558 (https://phabricator.wikimedia.org/T93331) (owner: 10Hashar) [11:08:19] (03PS1) 10Jcrespo: Revert "db2089,db2137,db2139: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/674385 [11:08:35] (03CR) 10jerkins-bot: [V: 04-1] Revert "db2089,db2137,db2139: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/674385 (owner: 10Jcrespo) [11:09:46] (03PS2) 10Jcrespo: Revert "db2089,db2137,db2139: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/674385 [11:09:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:55] 
(03PS3) 10Jcrespo: Revert "db2089,db2137,db2139: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/674385 [11:11:03] (03PS1) 10Hnowlan: api-gateway: make discovery service timeouts configurable per service [deployment-charts] - 10https://gerrit.wikimedia.org/r/674562 (https://phabricator.wikimedia.org/T277297) [11:12:28] (03PS4) 10Jcrespo: Revert "db2089,db2137,db2139: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/674385 (https://phabricator.wikimedia.org/T277632) [11:12:40] (03PS5) 10Jcrespo: Revert "db2089,db2137,db2139: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/674385 (https://phabricator.wikimedia.org/T277632) [11:13:11] (03CR) 10Kosta Harlan: [C: 03+1] api-gateway: make discovery service timeouts configurable per service [deployment-charts] - 10https://gerrit.wikimedia.org/r/674562 (https://phabricator.wikimedia.org/T277297) (owner: 10Hnowlan) [11:13:27] (03CR) 10Jcrespo: [C: 03+2] Revert "db2089,db2137,db2139: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/674385 (https://phabricator.wikimedia.org/T277632) (owner: 10Jcrespo) [11:14:08] (03PS2) 10Andrew-WMDE: Enable CodeMirror accessibility colors on initial wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673326 (https://phabricator.wikimedia.org/T276346) [11:14:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:14:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: Slowly repool db1160 after schema change', diff saved to https://phabricator.wikimedia.org/P15073 and previous config saved to /var/cache/conftool/dbconfig/20210324-111429-root.json [11:14:30] (03Abandoned) 10Hnowlan: api-gateway: make discovery service timeouts configurable per service [deployment-charts] - 10https://gerrit.wikimedia.org/r/674562 (https://phabricator.wikimedia.org/T277297) (owner: 
10Hnowlan) [11:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:58] !log andrew-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:673312|Enable bracket matching on group0 and wikitech (T273591)]] (duration: 01m 25s) [11:15:04] (03PS2) 10Volans: tests: test the import of all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/674546 (https://phabricator.wikimedia.org/T278286) [11:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:05] T273591: Enable bracket matching on more wikis - https://phabricator.wikimedia.org/T273591 [11:15:08] !log restart serially db2097 db2098 db2099 db2100 T271913 [11:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:00] (03PS1) 10Filippo Giunchedi: Add Debian packaging for 21.3.0 [software/librenms] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/674563 (https://phabricator.wikimedia.org/T278309) [11:16:18] (03CR) 10Andrew-WMDE: [C: 03+2] Enable CodeMirror accessibility colors on initial wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673326 (https://phabricator.wikimedia.org/T276346) (owner: 10Andrew-WMDE) [11:17:09] (03Merged) 10jenkins-bot: Enable CodeMirror accessibility colors on initial wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673326 (https://phabricator.wikimedia.org/T276346) (owner: 10Andrew-WMDE) [11:19:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:22:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:22:15] !log andrew-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:673326|Enable CodeMirror accessibility colors on initial wikis (T276346)]] (duration: 01m 08s) [11:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:24] T276346: Enable syntax highlighting colour scheme update on first wikis - https://phabricator.wikimedia.org/T276346 [11:24:05] Nikerabbit would you like to self deploy or should I deploy your changes? [11:25:30] Andrew-WMDE_: please deploy if you don't mind. I think both branches can be merged simultaneously. No testing on canary is possible because of job queue. [11:25:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 50%: Slowly repool db1160 after schema change', diff saved to https://phabricator.wikimedia.org/P15074 and previous config saved to /var/cache/conftool/dbconfig/20210324-112932-root.json [11:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:43] (03CR) 10Andrew-WMDE: [C: 03+2] MassMessage: Unbreak remote content fetching [extensions/MassMessage] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674366 (https://phabricator.wikimedia.org/T276936) (owner: 10Nikerabbit) [11:29:48] (03CR) 10Andrew-WMDE: [C: 03+2] MassMessage: Unbreak remote content fetching [extensions/MassMessage] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/674367 (https://phabricator.wikimedia.org/T276936) (owner: 10Nikerabbit) [11:30:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:04] (03CR) 10Andrew-WMDE: [C: 03+2] "Backport" [extensions/MassMessage] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674366 (https://phabricator.wikimedia.org/T276936) (owner: 10Nikerabbit) [11:31:11] (03CR) 10Andrew-WMDE: [C: 03+2] "Backport" [extensions/MassMessage] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/674367 (https://phabricator.wikimedia.org/T276936) (owner: 10Nikerabbit) [11:33:03] (03PS7) 10Effie Mouzeli: profile::mcrouter_wancache: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/672773 [11:33:52] (03CR) 10David Caro: [C: 03+2] "ping @Muehlenhoff" [puppet] - 10https://gerrit.wikimedia.org/r/674074 (https://phabricator.wikimedia.org/T274565) (owner: 10David Caro) [11:34:31] (03CR) 10Esanders: "This should be a no-op change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674300 (owner: 10Esanders) [11:36:54] (03Merged) 10jenkins-bot: MassMessage: Unbreak remote content fetching [extensions/MassMessage] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674366 (https://phabricator.wikimedia.org/T276936) (owner: 10Nikerabbit) [11:37:19] (03Merged) 10jenkins-bot: MassMessage: Unbreak remote content fetching [extensions/MassMessage] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/674367 (https://phabricator.wikimedia.org/T276936) (owner: 10Nikerabbit) [11:41:13] (03PS8) 10Jbond: profile::mcrouter_wancache: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/672773 (owner: 10Effie Mouzeli) [11:41:21] (03CR) 10Jbond: profile::mcrouter_wancache: add spec tests (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/672773 (owner: 10Effie Mouzeli) [11:41:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:39] effie fyi ^^^ [11:43:30] jbond42: I was about to upload a 
patch haha, thank you ! [11:43:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:45] I will merge your version though [11:44:19] effie: ahh sorry for stepping on toes was just going through queue and thought it was easy enough to just fix :) [11:44:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: Slowly repool db1160 after schema change', diff saved to https://phabricator.wikimedia.org/P15075 and previous config saved to /var/cache/conftool/dbconfig/20210324-114436-root.json [11:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:11] !log andrew-wmde@deploy1002 Synchronized php-1.36.0-wmf.36/extensions/MassMessage/: Backport: [[gerrit:674366|MassMessage: Unbreak remote content fetching (T276936)]] (duration: 01m 07s) [11:45:16] jbond42: it took me a little bit to understand why it was failing, I was about to ask for help :p [11:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:18] T276936: Improve wrapping of page contents delivered using MassMessage - https://phabricator.wikimedia.org/T276936 [11:45:24] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "First round of comments, will add more once I've ran the chart locally." (039 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 (owner: 10Legoktm) [11:45:45] (03PS10) 10Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) [11:46:09] effie: ahh cool, let me know if you need any more info? 
[11:46:23] :D
[11:46:37] (CR) Jbond: [C: +1] gitlab: add gitlab.wm.org service IP, with lookup from Hiera [puppet] - https://gerrit.wikimedia.org/r/674446 (https://phabricator.wikimedia.org/T276148) (owner: Dzahn)
[11:47:01] (CR) Effie Mouzeli: [C: +2] profile::mcrouter_wancache: add spec tests [puppet] - https://gerrit.wikimedia.org/r/672773 (owner: Effie Mouzeli)
[11:47:15] Andrew-WMDE_: is 1.36.0-wmf.35 still pending?
[11:47:35] Yes, 1.36.0-wmf.36 is done
[11:48:31] Andrew-WMDE_: gotcha, will test after both are deployed
[11:49:08] !log disable puppet on all hosts running mediawiki+memcached to merge 674282
[11:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:50] (CR) Palak199: "Thanks for the review," [software/transferpy] - https://gerrit.wikimedia.org/r/674319 (https://phabricator.wikimedia.org/T268258) (owner: Palak199)
[11:51:09] !log andrew-wmde@deploy1002 Synchronized php-1.36.0-wmf.35/extensions/MassMessage/: Backport: [[gerrit:674367|MassMessage: Unbreak remote content fetching (T276936)]] (duration: 01m 08s)
[11:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:17] T276936: Improve wrapping of page contents delivered using MassMessage - https://phabricator.wikimedia.org/T276936
[11:52:18] Nikerabbit both are now deployed, please check to see if they are working
[11:52:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:08] !log restart dbprov100[12] T271913
[11:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:16] Andrew-WMDE_: doing
[11:53:31] (PS1) Muehlenhoff: Extend access for Amrutha Varshini Chandra [puppet] - https://gerrit.wikimedia.org/r/674570
[11:54:45] Andrew-WMDE_: looks good! thank you
[11:55:15] Great!
[11:56:20] (CR) Effie Mouzeli: [C: +2] hiera/nodes: put mc1037 into our session redis and memcached cluster [puppet] - https://gerrit.wikimedia.org/r/674282 (https://phabricator.wikimedia.org/T278225) (owner: Effie Mouzeli)
[11:56:24] (CR) Muehlenhoff: [C: +2] Extend access for Amrutha Varshini Chandra [puppet] - https://gerrit.wikimedia.org/r/674570 (owner: Muehlenhoff)
[11:56:54] (PS3) Effie Mouzeli: hiera/nodes: put mc1037 into our session redis and memcached cluster [puppet] - https://gerrit.wikimedia.org/r/674282 (https://phabricator.wikimedia.org/T278225)
[11:57:01] !log EU deploys done
[11:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:17] Andrew-WMDE_: :(
[11:57:28] Not enough time left in the window?
[11:57:35] (PS1) David Caro: wmcs.backup: skip wcqs-beta-01 from backups [puppet] - https://gerrit.wikimedia.org/r/674571
[11:58:49] phuedx Sorry, there isn't enough time left
[11:58:55] (CR) Jcrespo: "> Patch Set 1:" [software/transferpy] - https://gerrit.wikimedia.org/r/674319 (https://phabricator.wikimedia.org/T268258) (owner: Palak199)
[11:59:15] Andrew-WMDE_: No worries.
I'll schedule it for the next window
[11:59:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: Slowly repool db1160 after schema change', diff saved to https://phabricator.wikimedia.org/P15076 and previous config saved to /var/cache/conftool/dbconfig/20210324-115940-root.json
[11:59:40] (CR) Bartosz Dziewoński: [C: +1] Remove redundant wgDiscussionToolsEnable overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/674300 (owner: Esanders)
[11:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210324T1200)
[12:00:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:53] (CR) QChris: [C: +1] gerrit: reported owner is actually patchset author [puppet] - https://gerrit.wikimedia.org/r/674552 (https://phabricator.wikimedia.org/T224262) (owner: Hashar)
[12:01:57] (CR) Jbond: [C: -1] gitlab: open firewall on 22,80,443.
use drange to limit to service IP (3 comments) [puppet] - https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144) (owner: Dzahn)
[12:03:08] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: wmde-toolkit-analyzer-build.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:04:56] (CR) David Caro: [C: +1] tests: test the import of all cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/674546 (https://phabricator.wikimedia.org/T278286) (owner: Volans)
[12:05:46] SRE, User-MoritzMuehlenhoff: Evaluate/integrate eatmydata in d-i - https://phabricator.wikimedia.org/T278312 (MoritzMuehlenhoff)
[12:05:55] SRE, User-MoritzMuehlenhoff: Evaluate/integrate eatmydata in d-i - https://phabricator.wikimedia.org/T278312 (MoritzMuehlenhoff) p:Triage→Medium
[12:10:53] !log restart dbprov200[12] T271913
[12:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:12:28] (CR) Giuseppe Lavagetto: [C: -1] Add shellbox chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/667047 (owner: Legoktm)
[12:13:18] (PS1) Ayounsi: Option 82: use-vlan-id [homer/public] - https://gerrit.wikimedia.org/r/674574 (https://phabricator.wikimedia.org/T221388)
[12:13:21] (CR) Muehlenhoff: [C: +2] hoo bash_profile: Add proxy convenience function [puppet] - https://gerrit.wikimedia.org/r/674545 (owner: Hoo man)
[12:13:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:13:36] (Restored) Hnowlan: api-gateway: make discovery service timeouts configurable per
service [deployment-charts] - https://gerrit.wikimedia.org/r/674562 (https://phabricator.wikimedia.org/T277297) (owner: Hnowlan)
[12:14:01] SRE, Phabricator, Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (mmodell) @epriestley: Thanks for the tips. I think we may have already enabled persistent connections, I will double check. As for caching, our phabricator s...
[12:14:50] (CR) Jbond: "follow up note" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/674545 (owner: Hoo man)
[12:15:29] (PS1) Arturo Borrero Gonzalez: toolforge: docker registry: deploy dhparams.pem [puppet] - https://gerrit.wikimedia.org/r/674575
[12:17:34] SRE, Phabricator, Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (mmodell) Although I thought we were using `cluster.databases` to configure mysql connections, it appears that we are not.
[12:19:27] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: docker registry: deploy dhparams.pem [puppet] - https://gerrit.wikimedia.org/r/674575 (owner: Arturo Borrero Gonzalez)
[12:22:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:23:03] (PS2) Hnowlan: api-gateway: make discovery service timeouts configurable per service [deployment-charts] - https://gerrit.wikimedia.org/r/674562 (https://phabricator.wikimedia.org/T277297)
[12:23:17] SRE, Phabricator, Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (jcrespo) >>! In T274228#6940987, @mmodell wrote: > Although I thought we were using `cluster.databases` to configure mysql connections, it appears that we are...
[12:23:19] (PS1) Arturo Borrero Gonzalez: toolforge: docker registry: don't use labs_lvm on buster [puppet] - https://gerrit.wikimedia.org/r/674576 (https://phabricator.wikimedia.org/T278303)
[12:23:38] (CR) jerkins-bot: [V: -1] api-gateway: make discovery service timeouts configurable per service [deployment-charts] - https://gerrit.wikimedia.org/r/674562 (https://phabricator.wikimedia.org/T277297) (owner: Hnowlan)
[12:24:03] PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1035 site=eqiad tunnel=mc2027_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[12:24:24] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: docker registry: don't use labs_lvm on buster [puppet] - https://gerrit.wikimedia.org/r/674576 (https://phabricator.wikimedia.org/T278303) (owner: Arturo Borrero Gonzalez)
[12:24:52] SRE, Phabricator, Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (mmodell) @jcrespo: I think we can use cluster.databases in phabricator's config even with a single mysql server name/password. Do you know if the proxy suppor...
[12:27:56] (PS3) Hnowlan: api-gateway: make discovery service timeouts configurable per service [deployment-charts] - https://gerrit.wikimedia.org/r/674562 (https://phabricator.wikimedia.org/T277297)
[12:28:28] (CR) jerkins-bot: [V: -1] api-gateway: make discovery service timeouts configurable per service [deployment-charts] - https://gerrit.wikimedia.org/r/674562 (https://phabricator.wikimedia.org/T277297) (owner: Hnowlan)
[12:28:42] !log enabling puppet on mediawiki and memcached servers
[12:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:29] (PS1) Palak199: Modify:: The parsing function in transfer.py [software/transferpy] - https://gerrit.wikimedia.org/r/674577
[12:29:50] (PS4) Hnowlan: api-gateway: make discovery service timeouts configurable per service [deployment-charts] - https://gerrit.wikimedia.org/r/674562 (https://phabricator.wikimedia.org/T277297)
[12:30:01] PROBLEM - Memcached on mc1037 is CRITICAL: connect to address 10.64.0.124 and port 11211: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[12:30:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:30:47] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
[12:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:01] PROBLEM - Host kubetcd2006 is DOWN: PING CRITICAL - Packet loss = 100%
[12:32:17] PROBLEM - Host ml-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100%
[12:32:47] kubetcd2006 and ml-etcd2001 are expected
[12:35:05] SRE, Phabricator, Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (jcrespo) There is HTTP-reuse functionality: https://cbonte.github.io/haproxy-dconv/1.8/configuration.html#4.2-http-reuse but I don't think
there is something l...
[12:35:15] <_joe_> effie: have you seen the alert on mc1037?
[12:36:16] <_joe_> I think the problem is the memcached configuration
[12:36:21] <_joe_> systemctl says -l 127.0.0.1
[12:37:23] yes ah !
[12:37:43] <_joe_> and indeed, /etc/memcached.conf seems untouched
[12:37:54] <_joe_> I see -m 64 -l 127.0.0.1
[12:38:18] <_joe_> and somehow the override for systemd we run wasn't loaded
[12:38:32] what we have done is that when we change stuff
[12:38:39] puppet does not restart memcached
[12:38:40] <_joe_> I think it's enough to just restart memcached, but there is a bug in the puppetization
[12:38:50] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet
[12:38:51] and this is something I remember when I need it
[12:38:51] <_joe_> ok then that's the issue
[12:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:03] <_joe_> you need to restart it now :)
[12:39:12] yes I did so :p
[12:39:27] RECOVERY - Memcached on mc1037 is OK: TCP OK - 0.000 second response time on 10.64.0.124 port 11211 https://wikitech.wikimedia.org/wiki/Memcached
[12:39:35] <_joe_> ok now it looks better :)
[12:40:39] I am always happy when I know memcached won't restart itself, but let down when it doesn't :p
[12:40:41] RECOVERY - Host kubetcd2006 is UP: PING OK - Packet loss = 0%, RTA = 31.84 ms
[12:40:43] RECOVERY - Host ml-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 32.00 ms
[12:46:42] (CR) Volans: [C: +1] "LGTM, thanks" [homer/public] - https://gerrit.wikimedia.org/r/674574 (https://phabricator.wikimedia.org/T221388) (owner: Ayounsi)
[12:47:51] (CR) Volans: [C: +2] tests: test the import of all cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/674546 (https://phabricator.wikimedia.org/T278286) (owner: Volans)
[12:49:24] (CR) Hoo man: "Thanks for having a look, but I doubt that's worth submitting yet another change," (2 comments) [puppet] -
https://gerrit.wikimedia.org/r/674545 (owner: Hoo man)
[12:50:38] (Merged) jenkins-bot: tests: test the import of all cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/674546 (https://phabricator.wikimedia.org/T278286) (owner: Volans)
[12:51:03] (CR) Alexandros Kosiaris: [C: +1] admin_ng: Switch to cluster internal DNS name for API [deployment-charts] - https://gerrit.wikimedia.org/r/674008 (owner: JMeybohm)
[12:51:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:52:16] etherpad seems borderline unusable (reloading every ten seconds or so), not sure if it's a problem with my connection or something server-side
[12:52:19] (PS2) Kosta Harlan: [WIP] linkrecommendation: Add environment variable for gunicorn timeout [deployment-charts] - https://gerrit.wikimedia.org/r/674529 (https://phabricator.wikimedia.org/T277297)
[12:53:13] (CR) jerkins-bot: [V: -1] [WIP] linkrecommendation: Add environment variable for gunicorn timeout [deployment-charts] - https://gerrit.wikimedia.org/r/674529 (https://phabricator.wikimedia.org/T277297) (owner: Kosta Harlan)
[12:56:11] RECOVERY - Aggregate IPsec Tunnel Status eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[13:00:04] hashar and dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210324T1300).
[13:02:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:15:23] !log drain ganeti2020
[13:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:18:51] Puppet, SRE, User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (jbond) p:Triage→Medium
[13:18:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:20:28] (CR) Jbond: "> Patch Set 2:" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/674545 (owner: Hoo man)
[13:21:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:25:30] SRE, Prod-Kubernetes, serviceops, Kubernetes: Archive/Remove deprecated calico gerrit repositories - https://phabricator.wikimedia.org/T267539 (JMeybohm) The first two repos were already read-only with an archived description. I've done so for the third one as well. @akosiaris do you have broad...
[13:28:17] SRE, Prod-Kubernetes, serviceops, Kubernetes: Archive/Remove deprecated calico gerrit repositories - https://phabricator.wikimedia.org/T267539 (akosiaris) Open→Resolved a:akosiaris >>! In T267539#6941218, @JMeybohm wrote: > The first two repos were already read-only with an archived...
[13:28:20] SRE, Prod-Kubernetes, serviceops, Kubernetes, User-fsero: Upgrade Calico - https://phabricator.wikimedia.org/T207804 (akosiaris)
[13:29:29] !log installing irc1001
[13:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:51] (PS2) Filippo Giunchedi: Add Debian packaging for 21.3.0 [software/librenms] (debian/buster-wikimedia) - https://gerrit.wikimedia.org/r/674563 (https://phabricator.wikimedia.org/T278309)
[13:34:08] SRE, Phabricator, Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (Aklapper)
[13:34:28] SRE, Phabricator, Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (Aklapper)
[13:39:04] (CR) Jbond: "minor nit but otherwise looks good to me" (1 comment) [software/netbox] - https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[13:40:24] Puppet, SRE, User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (Volans) Let me play the devil's advocate here, I hope to not have misunderstood your intentions, correct me if I'm wrong. What are the use cases for using the proxy manually on a production host? Isn't th...
[13:42:27] (CR) Andrew Bogott: [C: +1] openstack: Remove jessie support [puppet] - https://gerrit.wikimedia.org/r/674519 (https://phabricator.wikimedia.org/T232677) (owner: Muehlenhoff)
[13:42:38] (CR) Andrew Bogott: [C: +1] labs_bootstrapvz: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/668005 (owner: Muehlenhoff)
[13:43:23] (CR) Andrew Bogott: [C: +2] wmcs: add new admin script to create a new base image based off of upstream [puppet] - https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051) (owner: Andrew Bogott)
[13:43:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:46:23] (PS1) JMeybohm: Migrate two k8s users to groups syntax [labs/private] - https://gerrit.wikimedia.org/r/674585 (https://phabricator.wikimedia.org/T269461)
[13:47:41] (PS1) Ladsgroup: statistics: Clean up absented crons [puppet] - https://gerrit.wikimedia.org/r/674606 (https://phabricator.wikimedia.org/T273673)
[13:50:25] Puppet, SRE, User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (Kormat) I'm in favour of not having the env have internet access by default, for safety reasons. I'm one of the users that has a similar bash function defined for this purpose. The usecase is that sometim...
[13:50:42] (CR) JMeybohm: [V: +2 C: +2] Migrate two k8s users to groups syntax [labs/private] - https://gerrit.wikimedia.org/r/674585 (https://phabricator.wikimedia.org/T269461) (owner: JMeybohm)
[13:51:45] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
[13:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:22] (PS1) JMeybohm: k8s users: Remove special case for migration, use list of groups [puppet] - https://gerrit.wikimedia.org/r/674607 (https://phabricator.wikimedia.org/T269461)
[13:54:23] Puppet, SRE, User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (jbond)
[13:54:40] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:54:50] Puppet, SRE, User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (akosiaris) I have it as a shortcut function on my home dir and not set somewhere globally and easily accessible by everyone for more or less the same reason, that is to not encourage its ab(use). If every...
[13:55:45] Puppet, SRE, User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (Ottomata) FWIW, users of analytics would love this, but I understand the reasons not to enable it automatically.
[13:57:08] (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28735/console" [puppet] - https://gerrit.wikimedia.org/r/674607 (https://phabricator.wikimedia.org/T269461) (owner: JMeybohm)
[13:57:40] (PS1) Volans: docs: remove obsolete configuration [software/spicerack] - https://gerrit.wikimedia.org/r/674611
[13:57:42] (PS1) Volans: docs: fix documentation checker for sub-packages [software/spicerack] - https://gerrit.wikimedia.org/r/674612
[13:59:26] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
[13:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:13] SRE, Phabricator, Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (mmodell) @jcrespo: sure, no argument from me. I was just curious if we could easily implement epriestley's advice (or whether we already had)
[14:01:27] (PS2) Volans: docs: fix documentation checker for sub-packages [software/spicerack] - https://gerrit.wikimedia.org/r/674612
[14:01:47] (CR) Alexandros Kosiaris: [C: +1] k8s users: Remove special case for migration, use list of groups [puppet] - https://gerrit.wikimedia.org/r/674607 (https://phabricator.wikimedia.org/T269461) (owner: JMeybohm)
[14:01:55] (PS1) Volans: docs: add documentation for the toolforge package [software/spicerack] - https://gerrit.wikimedia.org/r/674614
[14:02:56] (CR) jerkins-bot: [V: -1] docs: fix documentation checker for sub-packages [software/spicerack] - https://gerrit.wikimedia.org/r/674612 (owner: Volans)
[14:03:09] (CR) jerkins-bot: [V: -1] docs: remove obsolete configuration [software/spicerack] - https://gerrit.wikimedia.org/r/674611 (owner: Volans)
[14:03:17] (CR) Volans: "This check was not detecting sub-packages.
It should fail until I468e052fee2b115ab8b196098b74e6ee46092308 is merged, and then if rebased s" [software/spicerack] - https://gerrit.wikimedia.org/r/674612 (owner: Volans)
[14:03:18] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[14:05:14] !log failover Ganeti master in codfw to ganeti2019
[14:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:25] (PS2) Volans: docs: remove obsolete configuration [software/spicerack] - https://gerrit.wikimedia.org/r/674611
[14:05:27] (PS3) Volans: docs: fix documentation checker for sub-packages [software/spicerack] - https://gerrit.wikimedia.org/r/674612
[14:05:28] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[14:07:10] (CR) jerkins-bot: [V: -1] docs: add documentation for the toolforge package [software/spicerack] - https://gerrit.wikimedia.org/r/674614 (owner: Volans)
[14:09:30] (CR) jerkins-bot: [V: -1] docs: fix documentation checker for sub-packages [software/spicerack] - https://gerrit.wikimedia.org/r/674612 (owner: Volans)
[14:10:08] PROBLEM - ganeti-wconfd running on ganeti2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[14:10:17] Puppet, SRE, User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (fgiunchedi) I also have a shell alias (`proxy-on` / `proxy-off`) for convenience and use it in a few cases when building packages require internet. +1 to have a shared alias available (since in practice we...
[14:11:12] (PS1) Muehlenhoff: Point irc.wikimedia.org to irc2001 [dns] - https://gerrit.wikimedia.org/r/674617 (https://phabricator.wikimedia.org/T224579)
[14:11:15] Puppet, SRE, User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (jbond) > What are the use cases for using the proxy manually on a production host? I have added this to the description > Isn't the whole purpose of having internet access blocked by default I'm unsure wh...
[14:11:48] (CR) jerkins-bot: [V: -1] docs: fix documentation checker for sub-packages [software/spicerack] - https://gerrit.wikimedia.org/r/674612 (owner: Volans)
[14:12:14] ^ ganeti-wconfd is expected, fallout from the master failover
[14:12:42] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:14:02] SRE, vm-requests: eqiad: 1 VMs requested for ircd - https://phabricator.wikimedia.org/T278292 (MoritzMuehlenhoff) Open→Resolved VM has been created/installed, further setup via T278255 once kraz is gone.
[14:15:18] PROBLEM - Check unit status of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:18:04] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Igor Lidin from Speed & Function - https://phabricator.wikimedia.org/T278327 (debt)
[14:18:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:19:02] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Igor Lidin from Speed & Function - https://phabricator.wikimedia.org/T278327 (debt)
[14:19:10] (CR) Volans: "This is failing CI because we're adding documentation for modules not recognized by the checker that is fixed in I14f6d1804a90efdedeccf4a2" [software/spicerack] - https://gerrit.wikimedia.org/r/674614 (owner: Volans)
[14:19:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:22:00] (CR) Muehlenhoff: [C: +2] labs_bootstrapvz: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/668005 (owner: Muehlenhoff)
[14:23:49] SRE, Prod-Kubernetes, SRE-tools, Patch-For-Review: Support downtiming services in our cookbooks - https://phabricator.wikimedia.org/T277740 (Volans) @JMeybohm with the above patch, once merged and deployed, you'll be able to use `icinga_hosts(["foo.bar.baz", ...], verbatim_hosts=True)` to get an...
[14:25:42] RECOVERY - Check unit status of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:28:27] (CR) Muehlenhoff: [C: +2] openstack: Remove jessie support [puppet] - https://gerrit.wikimedia.org/r/674519 (https://phabricator.wikimedia.org/T232677) (owner: Muehlenhoff)
[14:31:43] !log disable puppet on all mediawiki servers + memcached for 674290
[14:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:14] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:33:04] (PS1) Andrew Bogott: nova vendordata first boot script: avoid a race with puppet agents [puppet] - https://gerrit.wikimedia.org/r/674622 (https://phabricator.wikimedia.org/T278051)
[14:34:59] !log drain ganeti2021
[14:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:08] (CR) Andrew Bogott: [C: +2] nova vendordata first boot script: avoid a race with puppet agents [puppet] - https://gerrit.wikimedia.org/r/674622 (https://phabricator.wikimedia.org/T278051) (owner: Andrew Bogott)
[14:42:39] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephmon2004-dev - https://phabricator.wikimedia.org/T276509 (Papaul)
[14:45:06] (CR) Jbond: [C: +1] "lgtm" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/674549 (https://phabricator.wikimedia.org/T277740) (owner: Volans)
[14:48:46] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install cumin2002.codfw.wmnet - https://phabricator.wikimedia.org/T276587 (Papaul)
[14:50:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:50:16] (PS1) Andrew Bogott: nova vendordata first boot script: tidy up more firstboot races [puppet] - https://gerrit.wikimedia.org/r/674625 (https://phabricator.wikimedia.org/T278051)
[14:51:35] (PS2) Andrew Bogott: nova vendordata first boot script: tidy up more firstboot races [puppet] - https://gerrit.wikimedia.org/r/674625 (https://phabricator.wikimedia.org/T278051)
[14:52:02] (PS3) Andrew Bogott: nova vendordata first boot script: tidy up more firstboot races [puppet] - https://gerrit.wikimedia.org/r/674625 (https://phabricator.wikimedia.org/T278051)
[14:52:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:52:52] (CR) Andrew Bogott: [C: +2] nova vendordata first boot script: tidy up more firstboot races [puppet] - https://gerrit.wikimedia.org/r/674625 (https://phabricator.wikimedia.org/T278051) (owner: Andrew Bogott)
[14:54:09] (CR) Effie Mouzeli: [C: +2] hiera/nodes: put mc1038 into our session redis and memcached cluster [puppet] - https://gerrit.wikimedia.org/r/674290 (https://phabricator.wikimedia.org/T278225) (owner: Effie Mouzeli)
[14:54:18] (PS2) Effie Mouzeli: hiera/nodes: put mc1038 into our session redis and memcached cluster [puppet] - https://gerrit.wikimedia.org/r/674290 (https://phabricator.wikimedia.org/T278225)
[14:55:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:57:05] (CR) Jbond: [C: +2] "LGTM will merge" [puppet] - https://gerrit.wikimedia.org/r/674606 (https://phabricator.wikimedia.org/T273673)
(owner: 10Ladsgroup) [14:57:21] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe200[12].codfw.wmnet - https://phabricator.wikimedia.org/T275513 (10Papaul) [14:57:57] (03CR) 10Jbond: [C: 03+1] Option 82: use-vlan-id [homer/public] - 10https://gerrit.wikimedia.org/r/674574 (https://phabricator.wikimedia.org/T221388) (owner: 10Ayounsi) [15:00:33] (03CR) 10Muehlenhoff: "Looks good, two nits inline." (032 comments) [software/librenms] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/674563 (https://phabricator.wikimedia.org/T278309) (owner: 10Filippo Giunchedi) [15:01:54] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:56] (03PS1) 10Zabe: Add autopatrol to autoreviewers in en.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674626 (https://phabricator.wikimedia.org/T278300) [15:10:52] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2021.codfw.wmnet [15:10:57] (03PS1) 10JMeybohm: calico: Clean up code from calico 2.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/674628 (https://phabricator.wikimedia.org/T207804) [15:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:09] (03PS1) 10Arturo Borrero Gonzalez: cinderutils: refresh helper script [puppet] - 10https://gerrit.wikimedia.org/r/674629 [15:11:58] (03CR) 10jerkins-bot: [V: 04-1] cinderutils: refresh helper script [puppet] - 10https://gerrit.wikimedia.org/r/674629 (owner: 10Arturo Borrero Gonzalez) [15:12:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:18] (03PS2) 10Arturo Borrero Gonzalez: cinderutils: refresh helper script [puppet] - 10https://gerrit.wikimedia.org/r/674629 [15:13:32] 10SRE, 10ops-codfw, 10DC-Ops: 
(Need By: TBD) rack/setup/install moss-fe200[12].codfw.wmnet - https://phabricator.wikimedia.org/T275513 (10Papaul) [15:13:40] (03CR) 10Arturo Borrero Gonzalez: "this is only 90% me doing bike-shedding @andrew :-)" [puppet] - 10https://gerrit.wikimedia.org/r/674629 (owner: 10Arturo Borrero Gonzalez) [15:14:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:09] (03CR) 10DannyS712: [C: 03+1] Add autopatrol to autoreviewers in en.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674626 (https://phabricator.wikimedia.org/T278300) (owner: 10Zabe) [15:15:13] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Igor Lidin from Speed & Function - https://phabricator.wikimedia.org/T278327 (10Helga_sf) [15:16:13] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10jbond) @Robh <3 thanks [15:16:18] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [15:17:03] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Igor Lidin from Speed & Function - https://phabricator.wikimedia.org/T278327 (10Sfigor) [15:18:22] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:19:51] (03CR) 10Muehlenhoff: [C: 03+1] "Ack, let's give it a shot" [puppet] - 10https://gerrit.wikimedia.org/r/674074 (https://phabricator.wikimedia.org/T274565) (owner: 10David Caro) [15:20:01] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2021.codfw.wmnet [15:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:59] !log drain 
ganeti2022 [15:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:04] (03PS1) 10Effie Mouzeli: site: Fix typo for mc1038 [puppet] - 10https://gerrit.wikimedia.org/r/674632 [15:24:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:40] (03CR) 10Effie Mouzeli: [C: 03+2] site: Fix typo for mc1038 [puppet] - 10https://gerrit.wikimedia.org/r/674632 (owner: 10Effie Mouzeli) [15:25:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (10Papaul) @Gehel each server has 4x1.9TB disks. I want to make sure we are doing both HW RAID 10 and SW RAID 10 . Thanks [15:27:12] (03PS1) 10JMeybohm: Add kubemster dummy keys [labs/private] - 10https://gerrit.wikimedia.org/r/674633 [15:27:18] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (10Gehel) >>! In T276647#6941847, @Papaul wrote: > @Gehel each server has 4x1.9TB disks. I want to make sure we are doing both HW RAID 10 and S... 
[15:27:28] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add kubemster dummy keys [labs/private] - 10https://gerrit.wikimedia.org/r/674633 (owner: 10JMeybohm) [15:27:47] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Igor Lidin from Speed & Function - https://phabricator.wikimedia.org/T278327 (10Helga_sf) a:03Sfigor [15:30:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (10Papaul) thank you [15:31:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:45] (03CR) 10Jdlrobson: [C: 03+1] Inform anonymous A/B test by tracking time from navigationStart [skins/Vector] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674382 (owner: 10Phuedx) [15:31:56] (03PS6) 10Ahmon Dancy: Include patches in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) [15:34:51] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28742/console" [puppet] - 10https://gerrit.wikimedia.org/r/674628 (https://phabricator.wikimedia.org/T207804) (owner: 10JMeybohm) [15:35:28] !log enable puppet on all mediawiki + memcached hosts [15:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:47] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (10Papaul) [15:36:09] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (10Papaul) [15:38:27] (03PS1) 10Alexandros Kosiaris: mediawiki: Include profile::prometheus::cadvisor_exporter [puppet] - 
10https://gerrit.wikimedia.org/r/674634 (https://phabricator.wikimedia.org/T278220) [15:40:18] PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1036 site=eqiad tunnel=mc2024_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:40:22] (03CR) 10Alexandros Kosiaris: "Just adding ema as a thank you for having already worked on this so I can reuse it" [puppet] - 10https://gerrit.wikimedia.org/r/674634 (https://phabricator.wikimedia.org/T278220) (owner: 10Alexandros Kosiaris) [15:42:12] !log reduce RAM for irc2001 to 2G, was originally created with 8 G T224579 [15:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:20] T224579: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 [15:43:21] akosiaris: you're welcome! :) [15:45:36] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 1:00:00 on irc2001.wikimedia.org with reason: adapt RAM [15:45:37] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on irc2001.wikimedia.org with reason: adapt RAM [15:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:52] RECOVERY - Aggregate IPsec Tunnel Status eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:50:02] (03PS5) 10CRusnov: Add CAS authentication support [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) [15:50:38] (03CR) 10CRusnov: "FWIW this is deployed to netbox-next." 
(031 comment) [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [15:52:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:00] (03CR) 10Legoktm: [C: 03+2] logspam.pl: Update execution time limit regexp [puppet] - 10https://gerrit.wikimedia.org/r/674387 (owner: 10Ahmon Dancy) [16:00:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:50] (03PS6) 10CDanis: VCL: don't serve Set-Cookies for domains that aren't ours [puppet] - 10https://gerrit.wikimedia.org/r/630865 [16:04:33] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you for the heads-up, let's try it and see how many metrics this is." [puppet] - 10https://gerrit.wikimedia.org/r/674634 (https://phabricator.wikimedia.org/T278220) (owner: 10Alexandros Kosiaris) [16:06:03] (03PS3) 10Filippo Giunchedi: Add Debian packaging for 21.3.0 [software/librenms] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/674563 (https://phabricator.wikimedia.org/T278309) [16:06:41] (03CR) 10Filippo Giunchedi: Add Debian packaging for 21.3.0 (032 comments) [software/librenms] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/674563 (https://phabricator.wikimedia.org/T278309) (owner: 10Filippo Giunchedi) [16:07:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1084.eqiad.wmnet - https://phabricator.wikimedia.org/T276302 (10wiki_willy) a:05wiki_willy→03Cmjohnson [16:11:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:12:49] RECOVERY - Check systemd state on an-launcher1002 is 
OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:32] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/librenms] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/674563 (https://phabricator.wikimedia.org/T278309) (owner: 10Filippo Giunchedi) [16:14:18] could someone please lookup the stack trace and full error message for https://phabricator.wikimedia.org/T278350? thx [16:19:07] Majavah: added to task [16:21:27] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3175 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:22:46] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) Received the test PDU . I tested both colored cables, the one we bought first (red on image) and the one that was sent to us for testing( blue on image) all fits well with th... [16:24:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:22] well I have no idea what to do with that :/ [16:29:38] (03CR) 10Palak199: "I have added another patch to modify the parsing function. Please review:)" [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (owner: 10Palak199) [16:30:17] 10SRE, 10Phabricator, 10Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (10epriestley) It's perfectly fine to configure `cluster.databases` with one service, and Phabricator internally (in effect) builds a one-service `cluster.databas... [16:32:40] (03CR) 10Andrew Bogott: [C: 03+1] "Thank you for the symlink!" 
[puppet] - 10https://gerrit.wikimedia.org/r/674629 (owner: 10Arturo Borrero Gonzalez) [16:33:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "thanks for the cleanup!" [puppet] - 10https://gerrit.wikimedia.org/r/674522 (owner: 10Muehlenhoff) [16:33:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.backup: skip wcqs-beta-01 from backups [puppet] - 10https://gerrit.wikimedia.org/r/674571 (owner: 10David Caro) [16:34:03] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09524 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:34:31] 10SRE, 10Wikimedia-Mailing-lists: Figure out dkim for mailman3 - https://phabricator.wikimedia.org/T278352 (10Legoktm) [16:34:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cinderutils: refresh helper script [puppet] - 10https://gerrit.wikimedia.org/r/674629 (owner: 10Arturo Borrero Gonzalez) [16:39:07] (03CR) 10David Caro: [C: 03+2] wmcs.backup: skip wcqs-beta-01 from backups [puppet] - 10https://gerrit.wikimedia.org/r/674571 (owner: 10David Caro) [16:40:36] (03PS7) 10Ahmon Dancy: Include patches in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) [16:40:39] (03PS2) 10Hnowlan: aqs_next: add additional hiera config for new nodes [puppet] - 10https://gerrit.wikimedia.org/r/674291 (https://phabricator.wikimedia.org/T274119) [16:41:12] (03CR) 10Hnowlan: [C: 03+2] aqs_next: add additional hiera config for new nodes [puppet] - 10https://gerrit.wikimedia.org/r/674291 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan) [16:41:39] (03PS8) 10Dduvall: Include patches in restricted image [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy) [16:41:55] (03CR) 10David Caro: [C: 03+1] "LGTM, isn't it missing also the wmcs.vps module docs?" [software/spicerack] - 10https://gerrit.wikimedia.org/r/674614 (owner: 10Volans) [16:43:40] (03CR) 10Volans: "> Patch Set 1: Code-Review+1" [software/spicerack] - 10https://gerrit.wikimedia.org/r/674614 (owner: 10Volans) [16:44:46] (03CR) 10David Caro: [C: 03+1] "LGTM" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/674612 (owner: 10Volans) [16:45:48] (03CR) 10David Caro: [C: 03+1] "> Patch Set 1:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/674614 (owner: 10Volans) [16:50:25] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) First issue : - The mounting buttons don't align with the mounting bracket in the rack [16:51:04] (03PS9) 10Dduvall: Include patches in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy) [16:52:10] (03PS1) 10Elukey: Remove the dns_canonicalize=false option for Kerberos in Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/674643 (https://phabricator.wikimedia.org/T278353) [16:52:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:02] (03CR) 10Elukey: [C: 03+2] Remove the dns_canonicalize=false option for Kerberos in Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/674643 (https://phabricator.wikimedia.org/T278353) (owner: 10Elukey) [16:55:20] (03PS10) 10Dduvall: Include patches in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy) [16:56:21] (03PS1) 10Legoktm: mailman3: 
Configure DKIM [puppet] - 10https://gerrit.wikimedia.org/r/674645 (https://phabricator.wikimedia.org/T278352) [17:00:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:00] (03PS4) 10Volans: docs: fix documentation checker for sub-packages [software/spicerack] - 10https://gerrit.wikimedia.org/r/674612 [17:04:10] (03CR) 10Hnowlan: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/674529 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan) [17:05:39] (03CR) 10Volans: docs: fix documentation checker for sub-packages (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/674612 (owner: 10Volans) [17:05:50] (03PS1) 10Arturo Borrero Gonzalez: cinderutils: wmcs-prepare-cinder-volume: fix perms again [puppet] - 10https://gerrit.wikimedia.org/r/674646 [17:07:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cinderutils: wmcs-prepare-cinder-volume: fix perms again [puppet] - 10https://gerrit.wikimedia.org/r/674646 (owner: 10Arturo Borrero Gonzalez) [17:07:31] (03CR) 10Volans: [V: 03+2 C: 03+2] "Merging bypassing CI because the check is not able to detect correctly that the documentation has been added. 
The fix for that check is in" [software/spicerack] - 10https://gerrit.wikimedia.org/r/674614 (owner: 10Volans) [17:09:13] (03PS3) 10Volans: docs: remove obsolete configuration [software/spicerack] - 10https://gerrit.wikimedia.org/r/674611 [17:09:17] (03PS5) 10Volans: docs: fix documentation checker for sub-packages [software/spicerack] - 10https://gerrit.wikimedia.org/r/674612 [17:09:24] (03CR) 10jerkins-bot: [V: 04-1] docs: fix documentation checker for sub-packages [software/spicerack] - 10https://gerrit.wikimedia.org/r/674612 (owner: 10Volans) [17:14:15] (03PS2) 10Palak199: Modify:: The parsing function in transfer.py [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) [17:14:54] (03CR) 10jerkins-bot: [V: 04-1] docs: remove obsolete configuration [software/spicerack] - 10https://gerrit.wikimedia.org/r/674611 (owner: 10Volans) [17:19:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:19:20] (03PS6) 10Volans: docs: fix documentation checker for sub-packages [software/spicerack] - 10https://gerrit.wikimedia.org/r/674612 [17:19:22] (03PS4) 10Volans: docs: remove obsolete configuration [software/spicerack] - 10https://gerrit.wikimedia.org/r/674611 [17:25:54] !log upgrade memcached on mc-gp* hosts [17:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:20] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (10jijiki) [17:33:56] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (10jijiki) [17:34:02] 10SRE, 10Performance-Team, 10serviceops, 10Patch-For-Review, 10User-jijiki: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where 
safe - https://phabricator.wikimedia.org/T264604 (10jijiki) [17:34:05] 10SRE, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [17:34:54] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad total VRPs alert, total VRPs alert, valid ROAs alert, valid ROAs alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [17:35:01] (03PS11) 10Dduvall: Include patches in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy) [17:42:20] (03CR) 10David Caro: [C: 03+1] docs: fix documentation checker for sub-packages (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/674612 (owner: 10Volans) [17:44:27] (03CR) 10Volans: [C: 03+2] docs: fix documentation checker for sub-packages [software/spicerack] - 10https://gerrit.wikimedia.org/r/674612 (owner: 10Volans) [17:44:53] (03CR) 10Volans: [C: 03+2] "Trivial housekeeping, self-merging" [software/spicerack] - 10https://gerrit.wikimedia.org/r/674611 (owner: 10Volans) [17:47:34] (03PS3) 10Palak199: Modify:: The parsing function in transfer.py Add:: Whitespace between = and [ parentheses in parsing function The parsing function is modified to parse the host and path as the earlier method crashed the script. 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) [17:48:57] (03CR) 10Jcrespo: "You will want a blank line between title and commit body message :-)" [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [17:49:53] (03PS4) 10Palak199: Modify:: The parsing function in transfer.py Add:: Whitespace between = and [ parentheses in parsing function [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) [17:51:15] (03CR) 10Dzahn: [C: 03+2] gitlab: add gitlab.wm.org service IP, with lookup from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/674446 (https://phabricator.wikimedia.org/T276148) (owner: 10Dzahn) [17:52:01] (03CR) 10Jcrespo: "Some initial, -not super important- but I hope useful style comments. The style checker will likely complain otherwise." (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [17:52:04] (03Merged) 10jenkins-bot: docs: fix documentation checker for sub-packages [software/spicerack] - 10https://gerrit.wikimedia.org/r/674612 (owner: 10Volans) [17:52:13] (03CR) 10Palak199: "> Patch Set 3:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [17:53:26] (03Merged) 10jenkins-bot: docs: remove obsolete configuration [software/spicerack] - 10https://gerrit.wikimedia.org/r/674611 (owner: 10Volans) [17:54:17] (03PS2) 10Legoktm: Add lists-next.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/673638 [17:54:48] (03CR) 10Palak199: "> Patch Set 4:" (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [17:54:54] (03CR) 10Dzahn: gitlab: open firewall on 22,80,443. 
use drange to limit to service IP (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144) (owner: 10Dzahn) [17:57:10] (03CR) 10Jcrespo: "> Patch Set 4:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [17:57:53] (03CR) 10Jcrespo: "> Patch Set 4:" (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [17:58:35] (03PS12) 10Dduvall: Include patches in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy) [18:00:01] (03CR) 10Jcrespo: "BTW, you can remove the "Add:: Whitespace between = and [ parentheses in parsing function" from the title (there is a max lengh limitation" [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [18:00:04] hashar and dancy: How many deployers does it take to do Train log triage with CPT deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210324T1800). [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210324T1800). [18:00:04] Zabe: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:19] o/ [18:00:25] (03CR) 10Ahmon Dancy: [C: 03+1] Include patches in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy) [18:02:32] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [18:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:29] (03CR) 10Jcrespo: "I am also unsure about the test- it only checks that it returns 0. Do you think it would be possible to check something else, like logging" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [18:08:05] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:27] 10SRE, 10GitLab (Initialization), 10Patch-For-Review, 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Dzahn) gitlab100 has 2 IPs (2x v4, 2 x v6) now: ` @gitlab1001:~# ip a s | grep 208 i... 
[18:08:33] (03CR) 10H.krishna123: "> Patch Set 2:" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [18:09:53] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cumin2002.codfw.wmnet - https://phabricator.wikimedia.org/T276587 (10Papaul) [18:10:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:10:50] (03PS1) 10Hoo man: Remove dumpwikibasejson/dumpwikibaserdf -D/--dryrun [puppet] - 10https://gerrit.wikimedia.org/r/674651 [18:12:20] (03CR) 10Jcrespo: "> the current code base already has f-strings for the print statement" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [18:14:17] (03PS5) 10Palak199: Modify:: The parsing function in transfer.py [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) [18:15:57] Zabe: it doesn't sound anyone is deploying, right? [18:16:03] so...i can deploy today [18:16:16] yes [18:16:18] great [18:16:22] (03CR) 10Urbanecm: [C: 03+2] Add autopatrol to autoreviewers in en.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674626 (https://phabricator.wikimedia.org/T278300) (owner: 10Zabe) [18:16:27] (03CR) 10H.krishna123: "> Patch Set 2:" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [18:16:30] (03CR) 10Palak199: "> Please go ahead when you can, and so I can request a CI run after it." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [18:16:57] (03CR) 10Jcrespo: "recheck" [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [18:17:43] (03Merged) 10jenkins-bot: Add autopatrol to autoreviewers in en.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674626 (https://phabricator.wikimedia.org/T278300) (owner: 10Zabe) [18:17:54] (03CR) 10jerkins-bot: [V: 04-1] Modify:: The parsing function in transfer.py [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [18:18:16] Zabe: please test on mwdebug1001 [18:19:04] it works the supposed way [18:19:09] great, syncing [18:20:45] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 333393dfe59deb0ec4d7df6dd92372a705f65b85: Add autopatrol to autoreviewers in en.wikibooks (T278300) (duration: 01m 09s) [18:20:48] Zabe: should be live [18:20:50] anything else? 
[18:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:53] T278300: Add autopatrol to autoreviewers in en.wikibooks - https://phabricator.wikimedia.org/T278300 [18:21:06] no, thx for your help :) [18:21:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:21:45] any time :) [18:22:52] (03CR) 10Palak199: "> Patch Set 5:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [18:23:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:24:43] (03PS13) 10Ahmon Dancy: Include patches in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) [18:25:03] (03PS1) 10Urbanecm: Promote several Growth target wikis out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674652 (https://phabricator.wikimedia.org/T277491) [18:26:24] (03CR) 10Urbanecm: [C: 03+2] Promote several Growth target wikis out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674652 (https://phabricator.wikimedia.org/T277491) (owner: 10Urbanecm) [18:27:07] (03Merged) 10jenkins-bot: Promote several Growth target wikis out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674652 (https://phabricator.wikimedia.org/T277491) (owner: 10Urbanecm) [18:27:30] (03CR) 10Dduvall: [C: 03+1] "The build against this change succeeded with an image with patches applied being published to the restricted repo. Very nice!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy) [18:27:53] (03CR) 10Jcrespo: "> > > the current code base already has f-strings for the print statement" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [18:28:52] (03CR) 10H.krishna123: "> Patch Set 2:" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [18:30:54] (03CR) 10H.krishna123: "> Patch Set 2:" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [18:31:21] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5aa050602954a3cab0c7e0c4b10efb0f957efb59: Promote several Growth target wikis out of dark mode (T277491; T276830; T276123; T276816; T275550; T276450) (duration: 01m 08s) [18:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:38] T277491: Request to implement Growth experiments on Telugu Wikipedia (Tewiki) - https://phabricator.wikimedia.org/T277491 [18:31:38] T276123: Deploy Growth features on Esperanto Wikipedia - https://phabricator.wikimedia.org/T276123 [18:31:39] T276816: Deploy Growth features on Norwegian Bokmål Wikipedia - https://phabricator.wikimedia.org/T276816 [18:31:39] T276450: Deploy Growth features on Hindi Wikipedia - https://phabricator.wikimedia.org/T276450 [18:31:39] T276830: Deploy Growth features on Japanese Wikipedia - https://phabricator.wikimedia.org/T276830 [18:31:39] T275550: Deploy Growth features on Albanian Wikipedia - https://phabricator.wikimedia.org/T275550 [18:34:47] (03CR) 10Legoktm: [C: 03+1] Point irc.wikimedia.org to irc2001 [dns] - 10https://gerrit.wikimedia.org/r/674617 (https://phabricator.wikimedia.org/T224579) (owner: 10Muehlenhoff) [18:36:43] (03PS1) 10Urbanecm: Enable Growth features 
on eswiki in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674653 (https://phabricator.wikimedia.org/T278235) [18:39:55] (03CR) 10Jcrespo: "> Patch Set 5:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [18:40:06] !log legoktm@cumin1001 START - Cookbook sre.dns.netbox [18:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:33] (03PS1) 10Urbanecm: shwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674654 (https://phabricator.wikimedia.org/T278240) [18:42:37] (03CR) 10Urbanecm: [C: 03+2] Enable Growth features on eswiki in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674653 (https://phabricator.wikimedia.org/T278235) (owner: 10Urbanecm) [18:42:47] !log legoktm@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:55] (03CR) 10Jcrespo: "Allow me to rebase on top of HEAD, so we don't get here the errors from your previous patch, so it is not confusing." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [18:42:58] (03CR) 10Jeena Huneidi: [C: 03+1] Include patches in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy) [18:43:00] (03PS6) 10Jcrespo: Modify:: The parsing function in transfer.py [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [18:43:36] (03Merged) 10jenkins-bot: Enable Growth features on eswiki in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674653 (https://phabricator.wikimedia.org/T278235) (owner: 10Urbanecm) [18:43:51] (03CR) 10jerkins-bot: [V: 04-1] Modify:: The parsing function in transfer.py [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [18:45:00] (03PS1) 10Gergő Tisza: LinkRecommendation: Modify path args for calls to API [extensions/GrowthExperiments] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674605 (https://phabricator.wikimedia.org/T277865) [18:45:28] !log legoktm@cumin1001 START - Cookbook sre.dns.netbox [18:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:46:36] (03CR) 10Dzahn: "gitlab.wikimedia.org has address 208.80.154.14" [puppet] - 10https://gerrit.wikimedia.org/r/674446 (https://phabricator.wikimedia.org/T276148) (owner: 10Dzahn) [18:47:02] (03CR) 10Jcrespo: "So, after rebase, only 4 style issues:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [18:47:22] 
(03CR) 10Jcrespo: "> Patch Set 6:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [18:47:25] (03PS2) 10Urbanecm: shwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674654 (https://phabricator.wikimedia.org/T278240) [18:47:32] (03CR) 10Dzahn: [C: 03+2] gerrit: reported owner is actually patchset author [puppet] - 10https://gerrit.wikimedia.org/r/674552 (https://phabricator.wikimedia.org/T224262) (owner: 10Hashar) [18:48:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-import-page-current-dumps.service,refinery-import-page-history-dumps.service,refinery-import-siteinfo-dumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:49:38] ah nice! --^ [18:49:42] list of failed units [18:49:48] anyway it should clear soon [18:49:49] !log legoktm@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:49:53] yes, it's new and great :) [18:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:01] that you see it right on IRC [18:50:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:50:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:53] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ced092071a9638d1e1c04602bd5bbed5cc3812e3: Enable Growth features on eswiki in dark mode (T278235; 1/3) (duration: 01m 08s) [18:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:01] T278235: Deploy Growth features on Spanish Wikipedia - https://phabricator.wikimedia.org/T278235 [18:52:08] !log urbanecm@deploy1002 sync-file aborted: ced092071a9638d1e1c04602bd5bbed5cc3812e3: Enable Growth features on eswiki in dark mode (2/3) (duration: 00m 01s) [18:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:53] (03PS2) 10Legoktm: mailman3: Configure DKIM [puppet] - 10https://gerrit.wikimedia.org/r/674645 (https://phabricator.wikimedia.org/T278352) [18:52:55] (03PS5) 10Legoktm: [WIP] Configure lists1002 [puppet] - 10https://gerrit.wikimedia.org/r/673636 [18:53:00] (03CR) 10Jcrespo: "> Patch Set 2:" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [18:53:23] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: ced092071a9638d1e1c04602bd5bbed5cc3812e3: Enable Growth features on eswiki in dark mode (T278235; 2/3) (duration: 01m 07s) [18:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:30] (03CR) 10Legoktm: [C: 03+2] mailman3: Configure DKIM [puppet] - 10https://gerrit.wikimedia.org/r/674645 (https://phabricator.wikimedia.org/T278352) (owner: 10Legoktm) [18:54:36] (03CR) 10Palak199: "> A similar number of test errors, from its output, I can see the problem in a single 
character. I have to go, but please have a look and " [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [18:54:48] !log urbanecm@deploy1002 Synchronized wmf-config/config/eswiki.yaml: ced092071a9638d1e1c04602bd5bbed5cc3812e3: Enable Growth features on eswiki in dark mode (T278235; 3/3) (duration: 01m 06s) [18:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:17] (03CR) 10Dzahn: [C: 04-1] "-1 per Ammarpad's inline comment, which seems right" [puppet] - 10https://gerrit.wikimedia.org/r/674548 (https://phabricator.wikimedia.org/T277597) (owner: 10Hashar) [18:55:34] (03CR) 10Urbanecm: [C: 03+2] shwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674654 (https://phabricator.wikimedia.org/T278240) (owner: 10Urbanecm) [18:56:34] (03Merged) 10jenkins-bot: shwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674654 (https://phabricator.wikimedia.org/T278240) (owner: 10Urbanecm) [18:59:24] jouncebot: next [18:59:24] In 0 hour(s) and 0 minute(s): Mediawiki train - European+American Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210324T1900) [19:00:00] Urbanecm: hi, I am about to run the train ;) [19:00:05] hashar and dancy: I, the Bot under the Fountain, allow thee, The Deployer, to do Mediawiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210324T1900). 
[19:00:05] (03PS1) 10Hashar: group1 wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674657 [19:00:07] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674657 (owner: 10Hashar) [19:00:10] hashar: please wait for a second [19:00:17] (03CR) 10Urbanecm: [C: 04-2] group1 wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674657 (owner: 10Hashar) [19:00:25] ;D [19:00:26] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0f3aa7278d17c88f27b7d58ceede82730fd4ddcd: shwiki: Enable Growth features in dark mode (T278240; 1/3) (duration: 01m 07s) [19:00:28] hashar: I'm finishing sync of the last patch :D [19:00:33] just two scap sync files [19:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:36] T278240: Deploy Growth features on Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T278240 [19:00:39] oh sure take your time [19:00:50] I just assumed that the single patch scheduled for backport got already completed [19:00:56] should have asked [19:01:22] it did, but i merged few non-scheduled patches as well [19:01:32] ok no worries [19:01:59] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 0f3aa7278d17c88f27b7d58ceede82730fd4ddcd: shwiki: Enable Growth features in dark mode (T278240; 2/3) (duration: 01m 06s) [19:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:52] (03CR) 10Urbanecm: group1 wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674657 (owner: 10Hashar) [19:03:18] (03PS1) 10Legoktm: Add mailman3 secrets [labs/private] - 10https://gerrit.wikimedia.org/r/674659 [19:03:21] (03PS1) 10Legoktm: Add dkim private keys [labs/private] - 10https://gerrit.wikimedia.org/r/674660 [19:03:25] !log urbanecm@deploy1002 Synchronized wmf-config/config/shwiki.yaml: 0f3aa7278d17c88f27b7d58ceede82730fd4ddcd: shwiki: 
Enable Growth features in dark mode (T278240; 3/3) (duration: 01m 08s) [19:03:25] 10SRE, 10Analytics: wmf-auto-restart.py + lsof + /mnt/hdfs may need to tuned - https://phabricator.wikimedia.org/T278371 (10elukey) [19:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:34] hashar: the floor is yours :) [19:04:12] 10SRE, 10Analytics: wmf-auto-restart.py + lsof + /mnt/hdfs may need to be tuned - https://phabricator.wikimedia.org/T278371 (10elukey) [19:04:18] Urbanecm: awesome. Thank you! [19:04:46] (03PS5) 10Dzahn: gitlab: open firewall for 80,443. use drange to limit to service IP [puppet] - 10https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144) [19:05:47] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.36 [19:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:58] (03CR) 10jerkins-bot: [V: 04-1] gitlab: open firewall for 80,443. use drange to limit to service IP [puppet] - 10https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144) (owner: 10Dzahn) [19:07:09] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.36.0-wmf.36 (duration: 01m 21s) [19:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:00] (03PS6) 10Dzahn: gitlab: open firewall for 80,443. use drange to limit to service IP [puppet] - 10https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144) [19:10:39] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Add mailman3 secrets [labs/private] - 10https://gerrit.wikimedia.org/r/674659 (owner: 10Legoktm) [19:10:54] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Add dkim private keys [labs/private] - 10https://gerrit.wikimedia.org/r/674660 (owner: 10Legoktm) [19:11:16] and [19:11:21] it is rollback time :/ [19:12:10] domage [19:12:55] hashar: :/ [19:12:59] what happened this time? 
[19:13:40] [{reqId}] {exception_url} TypeError: Argument 1 passed to ProofreadPage\Index\IndexTemplateStyles::__construct() must be an instance of Title, null given, called in /srv/mediawiki/php-1.36.0-wmf.36/extensions/ProofreadPage/includes/Page/PageContent.php [19:13:51] which prevent pages from being submitted on wikisources wiki [19:13:51] fun [19:13:56] (03CR) 10Dzahn: [V: 03+1] "@Jbond How about now https://puppet-compiler.wmflabs.org/compiler1002/28756/gitlab1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144) (owner: 10Dzahn) [19:14:20] it is the usual type hint not supporting null as a Title [19:14:35] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: Revert group 1 to 1.36.0-wmf.35 [19:14:41] funny [19:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:30] (03PS1) 10Hashar: Revert "group1 wikis to 1.36.0-wmf.36" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674662 (https://phabricator.wikimedia.org/T274940) [19:15:32] (03CR) 10Hashar: [C: 03+2] Revert "group1 wikis to 1.36.0-wmf.36" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674662 (https://phabricator.wikimedia.org/T274940) (owner: 10Hashar) [19:16:20] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.36.0-wmf.36" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674662 (https://phabricator.wikimedia.org/T274940) (owner: 10Hashar) [19:17:00] ACKNOWLEDGEMENT - IPMI Sensor Status on elastic1042 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] Ryan Kemper https://phabricator.wikimedia.org/T278185 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:17:08] Error: Class 'GlobalUsageHooks' not found [19:17:09] (03PS6) 10Legoktm: Configure lists1002 [puppet] - 10https://gerrit.wikimedia.org/r/673636 [19:17:11] is a fun one as 
well [19:17:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:17:22] (03CR) 10Dzahn: "@legoktm This should do it https://puppet-compiler.wmflabs.org/compiler1003/28757/people1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn) [19:17:35] hashar: this week is fun [19:17:54] hashar: the Hooks class missing is probably Reedy's fault [19:17:59] I don't even understand how such an issue manages to pass all the tests / static analysis etc we have [19:19:10] (03CR) 10Legoktm: [C: 03+1] "Yep, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn) [19:19:18] I was really worried about flagged revs given its amazing test coverage but it seems I should not cut the line [19:19:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:19:53] (03CR) 10Legoktm: webperf: use new http_only parameter with httpd in processors_and_site (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674458 (owner: 10Dzahn) [19:20:43] me neither tbh [19:20:53] https://gerrit.wikimedia.org/g/mediawiki/extensions/GlobalUsage/+/6bc251c88ad523ebe6c36da7c40ccdd26462dab3/includes/Hooks.php#32 [19:21:03] sigh, we don't need to keep BC in this extension anyways [19:21:20] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-upgrade [19:21:20] * legoktm fixes [19:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:21] Constructing RevisionRecord for a page that can't exist: Special:MyLanguage/Main Page [19:22:22] hehe [19:22:33] cause well... of course a Special page does not exist ;] [19:22:35] this again? [19:22:51] that function is documented to return null :/ [19:23:18] (03PS1) 10Palak199: Modify:: Modify path variable in parse function of transfer.py [software/transferpy] - 10https://gerrit.wikimedia.org/r/674663 (https://phabricator.wikimedia.org/T268258) [19:25:14] hashar: 12:25:00 (PS2) Legoktm: Fix hook registration after class was namespaced [extensions/GlobalUsage] - https://gerrit.wikimedia.org/r/674664 (https://phabricator.wikimedia.org/T278375) [19:26:40] (03CR) 10Herron: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/673594 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [19:26:49] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1003/28755/mwlog1001.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/673594 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [19:29:51] (03CR) 10Palak199: "So out of 4 unit tests that were failing, 3 have been corrected. 
Regarding the last one" [software/transferpy] - 10https://gerrit.wikimedia.org/r/674663 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [19:31:44] so a few minutes and 3 tasks worth rolling back bah [19:31:53] I am sending the mail referring to the task [19:32:01] s/task/tasks/ [19:32:42] (03PS1) 10Papaul: DHCP: Add MAC address for cumin2002 [puppet] - 10https://gerrit.wikimedia.org/r/674687 (https://phabricator.wikimedia.org/T276587) [19:33:40] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/28758/" [puppet] - 10https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn) [19:34:00] (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address for cumin2002 [puppet] - 10https://gerrit.wikimedia.org/r/674687 (https://phabricator.wikimedia.org/T276587) (owner: 10Papaul) [19:34:03] (03PS1) 10Andrew Bogott: nova vendordata: change config block type to text/plain [puppet] - 10https://gerrit.wikimedia.org/r/674688 (https://phabricator.wikimedia.org/T278051) [19:34:20] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [19:34:23] Amir1: so yeah flagged revs did not have much chance to run unfortunately :/ [19:34:28] legoktm: thanks ;) [19:34:45] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: change config block type to text/plain [puppet] - 10https://gerrit.wikimedia.org/r/674688 (https://phabricator.wikimedia.org/T278051) (owner: 10Andrew Bogott) [19:36:03] any idea why PageContent::getParserOutput runs with Main Page? [19:36:06] as the title? [19:37:31] Urbanecm context? Do you know where it was called from? I know that there is logic that adds a link to the main page on every page view, so perhaps thats why [19:38:01] DannyS712: I have sth better than context. I have traceback! 
https://www.irccloud.com/pastebin/6QVqpkpM/ [19:38:37] (03Restored) 10Jeena Huneidi: Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [19:38:39] it's proofread's content handler for its own content type [19:38:47] it makes no sense to run it on main page [19:40:25] so ProofreadPage\Page\PageContent->getParserOutput(Title, NULL, ParserOptions, boolean) is called with the Title being the main page? [19:40:42] DannyS712: affirmative [19:41:11] because the code says to use the main page... [19:41:11] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/includes/content/ContentHandler.php#907 [19:41:22] wat? [19:41:56] that file has no recent change [19:42:09] !log T267927 Re-enabledpuppet on `wdqs2008` and ran puppet agent [19:42:14] (no...relevant change) [19:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:17] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [19:42:24] for some content, we want to find the type of change (eg adding or removing media) in order to add some change tags. 
We fetch the images for the content by using content->getParserOutput, and we use the context of the main page because it shouldn't matter what page the content is being shown on [19:42:38] its likely due to changes in the proofread page extension [19:45:03] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [19:45:03] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [19:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:24] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cumin2002.codfw.wmnet - https://phabricator.wikimedia.org/T276587 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cumin2002.codfw.wmnet ` The log can be found in `... [19:49:38] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [19:49:39] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [19:49:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,swagger_check_citoid_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:59] (03PS1) 10Krinkle: multiversion: Move '@' operator in env.php closer to relevant statement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674693 [19:50:47] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [19:50:52] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [19:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:54:38] (03PS1) 10Papaul: Add cumin2002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/674694 (https://phabricator.wikimedia.org/T276587) [19:55:24] (03CR) 10jerkins-bot: [V: 04-1] Add cumin2002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/674694 (https://phabricator.wikimedia.org/T276587) (owner: 10Papaul) [19:55:37] (03PS2) 10Dzahn: webperf: use new http_only parameter with httpd in processors_and_site [puppet] - 10https://gerrit.wikimedia.org/r/674458 [19:55:41] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [19:55:45] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [19:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:01] (03CR) 10Dzahn: webperf: use new http_only parameter with httpd in processors_and_site (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674458 (owner: 10Dzahn) [19:56:58] !log T267927 Host key is missing for `wdqs2008` leading to `data-transfer` cookbook failing, looking into resolving [19:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:07] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [19:57:34] (03PS1) 10Jforrester: Fix hook registration after class was namespaced [extensions/GlobalUsage] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674670 (https://phabricator.wikimedia.org/T278375) [19:58:05] hashar: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GlobalUsage/+/674670 should be deployable to unblock the 
train if you're keen (or I can later, after meetings). [19:58:44] (03PS3) 10Dzahn: webperf: use new http_only parameter with httpd in processors_and_site [puppet] - 10https://gerrit.wikimedia.org/r/674458 (https://phabricator.wikimedia.org/T277989) [19:58:49] (03PS2) 10Papaul: Add cumin2002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/674694 (https://phabricator.wikimedia.org/T276587) [19:58:49] will let stuff settle a bit and check again later [19:58:56] in an hour or so [19:59:21] feel free to get cherry picks / CR+2 [19:59:26] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [19:59:30] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [19:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:34] Cool, will do. [19:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:39] will sync as needed [20:00:04] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210324T2000). 
[20:00:36] (03CR) 10Papaul: [C: 03+2] Add cumin2002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/674694 (https://phabricator.wikimedia.org/T276587) (owner: 10Papaul) [20:05:02] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on cumin2002.codfw.wmnet with reason: REIMAGE [20:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:00] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cumin2002.codfw.wmnet with reason: REIMAGE [20:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:40] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [20:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:06] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144) (owner: 10Dzahn) [20:10:06] Ah, forgot `--without-lvs`, 7th time's the charm: [20:10:30] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [20:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:32] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1001/28760/webperf1001.eqiad.wmnet/change.webperf1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/674458 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn) [20:13:18] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [20:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:33] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [20:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:19] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cumin2002.codfw.wmnet - https://phabricator.wikimedia.org/T276587 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cumin2002.codfw.wmnet'] 
` and were **ALL** successful. [20:15:02] (03PS3) 10Jeena Huneidi: Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) [20:15:42] (03CR) 10Jeena Huneidi: "Based on Id4907d014d0a902cf215407e901593b8a029ce55 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/674132, it seems lik" [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [20:15:53] (03PS4) 10Dzahn: webperf: use new http_only parameter with httpd in processors_and_site [puppet] - 10https://gerrit.wikimedia.org/r/674458 (https://phabricator.wikimedia.org/T277989) [20:16:46] (03CR) 10jerkins-bot: [V: 04-1] Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [20:17:52] (03CR) 10Cwhite: [C: 03+1] "From the output of a cache node, cadvisor emits 87 unique metrics, and expands them with labels to 4951 metrics." [puppet] - 10https://gerrit.wikimedia.org/r/674634 (https://phabricator.wikimedia.org/T278220) (owner: 10Alexandros Kosiaris) [20:18:07] (03CR) 10Ahmon Dancy: [C: 03+2] "Thanks Krinkle!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674693 (owner: 10Krinkle) [20:18:16] (03PS4) 10Jeena Huneidi: Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) [20:19:23] (03PS8) 10ArielGlenn: [WIP] decide what to do about aborted batches [dumps] - 10https://gerrit.wikimedia.org/r/646998 (https://phabricator.wikimedia.org/T252396) [20:19:35] (03Merged) 10jenkins-bot: multiversion: Move '@' operator in env.php closer to relevant statement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674693 (owner: 10Krinkle) [20:20:03] (03CR) 10Dzahn: "much better now, but maybe the \n are literal ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/674458 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn) [20:20:05] (03PS3) 10Kosta Harlan: linkrecommendation: Add environment variable for gunicorn timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/674529 (https://phabricator.wikimedia.org/T277297) [20:22:26] (03PS7) 10Herron: initial grafana::grizzly module and profile [puppet] - 10https://gerrit.wikimedia.org/r/671283 [20:23:23] hasharAway: could we backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/674605 during this window? [20:23:38] (03CR) 10jerkins-bot: [V: 04-1] initial grafana::grizzly module and profile [puppet] - 10https://gerrit.wikimedia.org/r/671283 (owner: 10Herron) [20:23:46] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Add environment variable for gunicorn timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/674529 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan) [20:24:13] kostajh: yes feel free to do it nowish [20:24:40] (03PS8) 10Herron: initial grafana::grizzly module and profile [puppet] - 10https://gerrit.wikimedia.org/r/671283 [20:24:55] I had some errand at home with kid, now attending a meeting for the nxt half an hour or so [20:25:07] (03Merged) 10jenkins-bot: linkrecommendation: Add environment variable for gunicorn timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/674529 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan) [20:25:07] and will check again after [20:26:16] (03CR) 10Herron: initial grafana::grizzly module and profile (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/671283 (owner: 10Herron) [20:26:37] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . 
[20:26:42] (03CR) 10Hashar: [C: 03+1] "As the train runner this week, I am fine having this backported at anytime :)" [extensions/GrowthExperiments] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674605 (https://phabricator.wikimedia.org/T277865) (owner: 10Gergő Tisza) [20:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:50] (03CR) 10Gergő Tisza: [C: 03+2] LinkRecommendation: Modify path args for calls to API [extensions/GrowthExperiments] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674605 (https://phabricator.wikimedia.org/T277865) (owner: 10Gergő Tisza) [20:27:20] (03CR) 10Dave Pifke: webperf: use new http_only parameter with httpd in processors_and_site (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674458 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn) [20:27:28] (03PS9) 10Herron: initial grafana::grizzly module and profile [puppet] - 10https://gerrit.wikimedia.org/r/671283 [20:27:44] (03PS5) 10Dzahn: webperf: use new http_only parameter with httpd in processors_and_site [puppet] - 10https://gerrit.wikimedia.org/r/674458 (https://phabricator.wikimedia.org/T277989) [20:29:49] (03PS1) 10Jeena Huneidi: Include private folder in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674698 (https://phabricator.wikimedia.org/T276145) [20:30:00] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cumin2002.codfw.wmnet - https://phabricator.wikimedia.org/T276587 (10Papaul) [20:30:20] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [20:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:28] (03CR) 10Dzahn: [V: 03+1] "needed double quotes instead of single quotes. 
looks good to me now: https://puppet-compiler.wmflabs.org/compiler1003/28762/webperf1001.eq" [puppet] - 10https://gerrit.wikimedia.org/r/674458 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn) [20:30:32] (03PS2) 10Jeena Huneidi: Include private folder in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674698 (https://phabricator.wikimedia.org/T276145) [20:30:46] (03PS3) 10Legoktm: Add lists-next.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/673638 [20:30:50] (03PS7) 10Legoktm: Configure lists1002 [puppet] - 10https://gerrit.wikimedia.org/r/673636 [20:30:54] (03CR) 10Dzahn: [V: 03+1 C: 03+2] gitlab: open firewall for 80,443. use drange to limit to service IP [puppet] - 10https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144) (owner: 10Dzahn) [20:31:36] (03CR) 10Legoktm: "minor thing, but LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674458 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn) [20:31:39] (03CR) 10jerkins-bot: [V: 04-1] Add lists-next.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/673638 (owner: 10Legoktm) [20:31:51] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cumin2002.codfw.wmnet - https://phabricator.wikimedia.org/T276587 (10Papaul) 05Open→03Resolved @MoritzMuehlenhoff all yours [20:32:37] (03PS3) 10Jeena Huneidi: Include private folder in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674698 (https://phabricator.wikimedia.org/T276145) [20:32:49] (03CR) 10Dzahn: "missing comma between IP addresses: Mar 24 20:32:06 gitlab1001 ferm[1047]: "," expected" [puppet] - 10https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144) (owner: 10Dzahn) [20:34:18] (03PS4) 10Legoktm: Add lists-next.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/673638 [20:35:03] (03CR) 10jerkins-bot: [V: 04-1] Add lists-next.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/673638 (owner: 
10Legoktm) [20:37:03] (03PS5) 10Jeena Huneidi: Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) [20:37:49] (03PS1) 10Dzahn: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into review/dzahn/674169 [puppet] - 10https://gerrit.wikimedia.org/r/674702 [20:37:51] (03PS1) 10Dzahn: gitlab: add missing brackets in ferm rule drange [puppet] - 10https://gerrit.wikimedia.org/r/674703 (https://phabricator.wikimedia.org/T276144) [20:38:33] (03CR) 10jerkins-bot: [V: 04-1] Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into review/dzahn/674169 [puppet] - 10https://gerrit.wikimedia.org/r/674702 (owner: 10Dzahn) [20:39:19] (03PS4) 10Jeena Huneidi: Include private folder in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674698 (https://phabricator.wikimedia.org/T276145) [20:39:23] (03PS5) 10Legoktm: Add lists-next.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/673638 [20:40:23] PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:40:31] (03CR) 10Dzahn: [C: 03+2] gitlab: add missing brackets in ferm rule drange [puppet] - 10https://gerrit.wikimedia.org/r/674703 (https://phabricator.wikimedia.org/T276144) (owner: 10Dzahn) [20:40:36] (03PS2) 10Dzahn: gitlab: add missing brackets in ferm rule drange [puppet] - 10https://gerrit.wikimedia.org/r/674703 (https://phabricator.wikimedia.org/T276144) [20:40:41] (03PS6) 10Jeena Huneidi: Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) [20:40:47] (03CR) 10jerkins-bot: [V: 04-1] Include private folder in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674698 (https://phabricator.wikimedia.org/T276145) (owner: 
10Jeena Huneidi) [20:42:46] (03Merged) 10jenkins-bot: LinkRecommendation: Modify path args for calls to API [extensions/GrowthExperiments] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674605 (https://phabricator.wikimedia.org/T277865) (owner: 10Gergő Tisza) [20:45:57] RECOVERY - Check systemd state on gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:46:54] (03CR) 10Dzahn: "works fine after https://gerrit.wikimedia.org/r/c/operations/puppet/+/674703" [puppet] - 10https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144) (owner: 10Dzahn) [20:48:43] 10SRE, 10DNS, 10Traffic, 10serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10Dzahn) [20:48:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:49:24] 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10Dzahn) 05Stalled→03Resolved There are now firewall rules to open 80 and 443 but only to... 
[20:50:05] (03CR) 10Legoktm: [C: 03+2] Add lists-next.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/673638 (owner: 10Legoktm) [20:50:28] (03PS1) 10Hashar: Revert "Add default TemplateStyles for an Index" [extensions/ProofreadPage] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674672 (https://phabricator.wikimedia.org/T278379) [20:50:48] (03CR) 10Hashar: [C: 03+2] Revert "Add default TemplateStyles for an Index" [extensions/ProofreadPage] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674672 (https://phabricator.wikimedia.org/T278379) (owner: 10Hashar) [20:52:11] (03CR) 10Hashar: [C: 03+2] "Thank you both!" [extensions/GlobalUsage] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674670 (https://phabricator.wikimedia.org/T278375) (owner: 10Jforrester) [20:52:22] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/674708 [20:52:41] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/674708 (owner: 10Kosta Harlan) [20:52:48] nice [20:53:03] hashar: Can you deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/674693 while you're doing stuff? [20:53:05] all three blockers have code proposed [20:53:18] two are fixed (well, will be as soon as wmf patches get merged by ci) [20:54:10] who knows what it might break :] [20:54:22] dancy: yeah I guess. I am still in a meeting though [20:54:29] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/674708 (owner: 10Kosta Harlan) [20:54:35] Me too. :-) [20:56:07] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . 
[20:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:36] (03PS1) 10Bstorm: quarry: remove the querykiller [puppet] - 10https://gerrit.wikimedia.org/r/674710 (https://phabricator.wikimedia.org/T264254) [20:56:52] (03Merged) 10jenkins-bot: Revert "Add default TemplateStyles for an Index" [extensions/ProofreadPage] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674672 (https://phabricator.wikimedia.org/T278379) (owner: 10Hashar) [20:57:07] (03Merged) 10jenkins-bot: Fix hook registration after class was namespaced [extensions/GlobalUsage] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674670 (https://phabricator.wikimedia.org/T278375) (owner: 10Jforrester) [20:57:19] dancy: deploying that mediawiki-config change [20:57:26] Merci [20:57:30] 10SRE, 10Wikimedia-Mailing-lists: Figure out dkim for mailman3 - https://phabricator.wikimedia.org/T278352 (10Legoktm) 05Open→03Resolved a:03Legoktm in the private repo: ` A modules/secret/secrets/dkim/lists-next.wikimedia.org-wikimedia.key A modules/secret/secrets/dkim/lists-next.wikimedia.org-wikimedia... 
[20:57:38] 10SRE, 10Security-Team, 10Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (10Legoktm) [20:58:26] and will sync the two other changes that just got merged [20:58:31] for ProofreadPage and GlobalUsage [20:59:07] !log hashar@deploy1002 Synchronized wmf-config/env.php: multiversion: Move '@' operator in env.php closer to relevant statement (duration: 01m 07s) [20:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:09] I will do the train next [21:02:52] !log hashar@deploy1002 Synchronized php-1.36.0-wmf.36/extensions/GlobalUsage: Fix hook registration after class was namespaced - T278375 (duration: 01m 07s) [21:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:59] T278375: Class 'GlobalUsageHooks' not found - https://phabricator.wikimedia.org/T278375 [21:04:12] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [21:04:12] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [21:04:13] hashar: there is still an unsynced change in extensions/GrowthExperiments, I can sync it if it's in the way [21:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:31] tgr_: I will do it ;) [21:04:42] testing got complicated, but there is no real risk to deploying it anyway, it's testwiki only [21:04:45] syncing proofreadpage right now [21:04:50] ahh [21:04:52] nice ! 
[21:05:03] I will just do it so [21:05:20] !log hashar@deploy1002 Synchronized php-1.36.0-wmf.36/extensions/ProofreadPage: Revert "Add default TemplateStyles for an Index" - T278379 (duration: 01m 07s) [21:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:28] T278379: Argument 1 passed to ProofreadPage\Index\IndexTemplateStyles::__construct() must be an instance of Title, null given, called in /srv/mediawiki/php-1.36.0-wmf.36/extensions/ProofreadPage/includes/Page/PageContent.php on line 284 - https://phabricator.wikimedia.org/T278379 [21:06:40] going to roll 1.36.0-wmf.36 to group 1 [21:06:45] dancy: ^ :) [21:06:51] (03PS1) 10Cwhite: Remove ecs cleanup filter generator. [software/ecs] - 10https://gerrit.wikimedia.org/r/674712 [21:07:13] (03CR) 10Hashar: "Deployed!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674693 (owner: 10Krinkle) [21:07:28] !log hashar@deploy1002 Synchronized php-1.36.0-wmf.36/extensions/GrowthExperiments: LinkRecommendation: Modify path args for calls to API - T277865 (duration: 01m 07s) [21:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:37] T277865: Modify API URL structure - https://phabricator.wikimedia.org/T277865 [21:08:03] after that I will monitor the logs, close the tasks [21:08:06] and I guess get to bed [21:08:20] (03PS1) 10Hashar: group1 wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674713 [21:08:21] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [21:08:21] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
[21:08:23] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674713 (owner: 10Hashar) [21:08:25] (03CR) 10Ahmon Dancy: Rsync private mediawiki files to releases server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [21:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:12] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674713 (owner: 10Hashar) [21:09:52] Thanks hashar. [21:10:12] :D [21:10:36] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.36 [21:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:48] 10Puppet, 10SRE, 10Wikimedia-Mailing-lists: Make puppet for mailman3 ready for production - https://phabricator.wikimedia.org/T277286 (10Legoktm) [21:11:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Jclark-ctr) cloudcephosd1016. rack A4. U40 port 11,13 ID5344,5345 cloudcephosd1017. rack A4. U41. port 22,23 ID5346,5347 cl... 
[21:11:30] (03Abandoned) 10Dzahn: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into review/dzahn/674169 [puppet] - 10https://gerrit.wikimedia.org/r/674702 (owner: 10Dzahn) [21:11:32] (03PS6) 10Dzahn: webperf: use new http_only parameter with httpd in processors_and_site [puppet] - 10https://gerrit.wikimedia.org/r/674458 (https://phabricator.wikimedia.org/T277989) [21:11:39] hmm [21:11:44] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.36.0-wmf.36 (duration: 01m 07s) [21:11:51] ah yeah here it is [21:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Jclark-ctr) [21:12:08] (03CR) 10Dzahn: "> Patch Set 5:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674458 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn) [21:12:13] (03CR) 10Dave Pifke: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/674458 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn) [21:12:20] (03PS7) 10Jeena Huneidi: Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) [21:12:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [21:13:28] (03CR) 10Jeena Huneidi: Rsync private mediawiki files to releases server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [21:14:35] and I roll it exactly two hours after the first rolling [21:14:54] (03CR) 10Dduvall: [C: 03+1] Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [21:15:03] I got confused by the last error 19:14 vs currently 22:14 here [21:15:05] or something like that [21:17:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:17:13] (03CR) 10Dzahn: [C: 03+2] webperf: use new http_only parameter with httpd in processors_and_site [puppet] - 10https://gerrit.wikimedia.org/r/674458 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn) [21:17:25] (03CR) 10Ahmon Dancy: Rsync private mediawiki files to releases server (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [21:17:35] (03CR) 10Dduvall: [C: 03+1] "Looks good!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/674698 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [21:19:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:19:43] !log webperf2001 - restarted apache [21:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:12] (03CR) 10Dzahn: "webperf1001 - puppet change looks fine, untouched. webperf1002 - noop, webperf2001 - restarted apache, all is fine" [puppet] - 10https://gerrit.wikimedia.org/r/674458 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn) [21:21:28] 10Puppet, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Have puppet httpd class support enabling mod_ssl without having apache listen on port 443 - https://phabricator.wikimedia.org/T277989 (10Dzahn) 05Open→03Resolved a:03Dzahn The puppet httpd class now has the "`http_only` parameter to mak... [21:25:15] (03PS1) 10Andrew Bogott: nova cloud-init vendordata: further attempts at jinja rendering [puppet] - 10https://gerrit.wikimedia.org/r/674716 [21:27:01] Looks like 1.36.0-wmf.36 on group 1 wikis is all fine \o/ [21:27:35] dancy: I am off for now, will catch up tomorrow. The risky patches do not seem to surface anything, the new-errors dashboard is well... empty. Perf metrics look fine [21:27:42] so it is probably a success (so far) [21:27:49] Agreed. Have a good night hashar! [21:27:57] yeah night night! 
[21:28:05] will do group 2 tomorrow during the european window [21:29:17] (03PS8) 10Legoktm: Configure lists1002 [puppet] - 10https://gerrit.wikimedia.org/r/673636 [21:31:03] (03CR) 10Dzahn: "> Patch Set 7:" [puppet] - 10https://gerrit.wikimedia.org/r/673603 (owner: 10Legoktm) [21:32:18] (03CR) 10Andrew Bogott: [C: 03+2] nova cloud-init vendordata: further attempts at jinja rendering [puppet] - 10https://gerrit.wikimedia.org/r/674716 (owner: 10Andrew Bogott) [21:32:27] (03CR) 10Dzahn: "> It perhaps begs the question, do we actually need Envoy between the two? The Apache instance is further proxying to XHGui, Swift, and w" [puppet] - 10https://gerrit.wikimedia.org/r/673603 (owner: 10Legoktm) [21:32:53] (03PS1) 10Cwhite: logstash: replace ECS allow list with filter_on_template [puppet] - 10https://gerrit.wikimedia.org/r/674718 (https://phabricator.wikimedia.org/T234565) [21:32:58] (03CR) 10Legoktm: [C: 03+2] "Here we go!" [puppet] - 10https://gerrit.wikimedia.org/r/673636 (owner: 10Legoktm) [21:34:19] (03PS1) 10Andrew Bogott: cloud-init vendordata: use text/cloud-config rather than text/cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/674719 [21:35:09] (03CR) 10jerkins-bot: [V: 04-1] logstash: replace ECS allow list with filter_on_template [puppet] - 10https://gerrit.wikimedia.org/r/674718 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [21:35:30] (03CR) 10Andrew Bogott: [C: 03+2] cloud-init vendordata: use text/cloud-config rather than text/cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/674719 (owner: 10Andrew Bogott) [21:39:41] (03PS2) 10Cwhite: logstash: replace ECS allow list with filter_on_template [puppet] - 10https://gerrit.wikimedia.org/r/674718 (https://phabricator.wikimedia.org/T234565) [21:40:08] (03PS3) 10Cwhite: logstash: replace ECS allow list with filter_on_template [puppet] - 10https://gerrit.wikimedia.org/r/674718 (https://phabricator.wikimedia.org/T234565) [21:42:18] 10SRE, 10serviceops: Memcached, mcrouter, nutcracker's 
future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10jijiki) (sorry for not quoting) As far as connectivity goes, we can run both mcrouter and onhost memcached on a unix socket, if that is of any help. Generally speaking, we ha... [21:43:44] (03CR) 10Cwhite: "This change is ready for review." [software/ecs] - 10https://gerrit.wikimedia.org/r/672805 (owner: 10Cwhite) [21:46:41] (03CR) 10Cwhite: "Logstash-filter-verifier tests contingent on I73f0c6587aa601b03ae654e25dd92aa7a1076b96" [puppet] - 10https://gerrit.wikimedia.org/r/674718 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [21:47:31] mutante: only remaining issue is that m5-master (db1128.eqiad.wmnet) is firewalled to block our connections [21:47:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:48:03] I'm not seeing an obvious place where I can poke a hole in that [21:49:42] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [21:49:46] oh, maybe ferm_misc [21:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:01] PROBLEM - Check systemd state on wdqs2008 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:50:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:52:09] hm, that role isn't used on m5-master [21:53:21] (03PS1) 10Andrew Bogott: Revert "cloud-init vendordata: use text/cloud-config rather than text/cloud-init" [puppet] - 10https://gerrit.wikimedia.org/r/674722 [21:54:51] (03CR) 10Andrew Bogott: [C: 03+2] Revert "cloud-init vendordata: use text/cloud-config rather than text/cloud-init" [puppet] - 10https://gerrit.wikimedia.org/r/674722 (owner: 10Andrew Bogott) [21:56:19] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10Legoktm) >>! In T256538#6933316, @Marostegui wrote: > @Legoktm @Dzahn you might need to open firewall rules to be able to reach db1128 (m5 master) from lists1002. > ` > # telnet db1128.e... [21:58:23] (03PS1) 10Legoktm: mariadb::misc: Open firewall hole for lists1002 (mailman3) [puppet] - 10https://gerrit.wikimedia.org/r/674724 (https://phabricator.wikimedia.org/T256538) [21:59:30] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28763/console" [puppet] - 10https://gerrit.wikimedia.org/r/674724 (https://phabricator.wikimedia.org/T256538) (owner: 10Legoktm) [22:00:48] (03CR) 10Bstorm: "To add a finer point to this: The querykiller in quarry has been broken for ages since we started loadbalancing the replicas. 
I had theref" [puppet] - 10https://gerrit.wikimedia.org/r/674710 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [22:07:10] PROBLEM - Check systemd state on lists1002 is CRITICAL: CRITICAL - degraded: The following units failed: mailman3-web.service,mailman3.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:07:30] !log disabled puppet on lists1002 while mailman3-web is broken [22:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:44] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=0) [22:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:03] oh that's neat, it gives you the unit names now [22:08:59] ACKNOWLEDGEMENT - Check systemd state on lists1002 is CRITICAL: CRITICAL - degraded: The following units failed: mailman3-web.service,mailman3.service Legoktm working on it https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:09:09] ACKNOWLEDGEMENT - DPKG on lists1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages Legoktm working on it https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [22:20:18] PROBLEM - spamassassin on lists1002 is CRITICAL: PROCS CRITICAL: 0 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin [22:25:09] (03CR) 10Legoktm: [V: 03+1 C: 03+2] "Going ahead with this for now, can be improved later" [puppet] - 10https://gerrit.wikimedia.org/r/674724 (https://phabricator.wikimedia.org/T256538) (owner: 10Legoktm) [22:29:02] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10Legoktm) Can connect now: ` legoktm@lists1002:~$ telnet db1128.eqiad.wmnet 3306 Trying 10.64.0.98... Connected to db1128.eqiad.wmnet. Escape character is '^]'. ] 5.... 
[22:29:08] RECOVERY - spamassassin on lists1002 is OK: PROCS OK: 3 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin [22:42:40] things are still broken, but I know what to fix now [22:45:05] (03PS11) 10Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) [22:45:24] (03PS9) 10Effie Mouzeli: hieradata: enable memcached socket mwdebug1003, mwdebug2001 [puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115) [22:48:32] PROBLEM - MariaDB Replica Lag: m5 on db2135 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1296.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:50:06] PROBLEM - MariaDB Replica Lag: m5 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1391.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:54:44] PROBLEM - DPKG on lists1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [22:55:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:57:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210324T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:00:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10Jclark-ctr) name rack_name position Port Cable ID cloudvirt1040 D5 22 22,34 5358 ,5359 cloudvirt1041 D5 23 24,32 5360 ,5361 cloudvi... [23:01:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10Jclark-ctr) [23:02:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [23:02:20] (03PS12) 10Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) [23:05:51] 10SRE, 10ops-eqiad, 10DC-Ops: apply new hostname label for pki-root1001 (was auth1002) - https://phabricator.wikimedia.org/T278273 (10Jclark-ctr) a:03Jclark-ctr [23:05:57] (03PS13) 10Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) [23:06:15] 10SRE, 10ops-eqiad, 10DC-Ops: apply new hostname label for pki-root1001 (was auth1002) - https://phabricator.wikimedia.org/T278273 (10Jclark-ctr) 05Open→03Resolved applied new hostname label [23:06:19] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10Jclark-ctr) [23:06:20] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad total VRPs alert, total VRPs alert, valid ROAs alert, valid ROAs alert. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [23:10:37] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (10Papaul) [23:17:52] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephmon2004-dev - https://phabricator.wikimedia.org/T276509 (10Papaul) [23:19:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:22:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:28:52] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10Papaul) [23:30:01] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10Papaul) [23:30:26] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [23:31:16] (03PS1) 10Dzahn: site/conftool-data: turn new servers mw2377,mw2378 into jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/674727 (https://phabricator.wikimedia.org/T277780) [23:31:28] alert on alert ok - not alerting [23:31:36] 10Puppet, 10SRE, 10Wikimedia-Mailing-lists: Make puppet for mailman3 ready for production - https://phabricator.wikimedia.org/T277286 (10Legoktm) Running into https://github.com/sqlalchemy/alembic/issues/551 `lines=10 root@lists1002:/etc/mailman3# mailman-wrapper conf Traceback (most recent call last): Fi... [23:33:04] (03PS2) 10Dzahn: site/conftool-data: turn new servers mw2377,mw2378 into jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/674727 (https://phabricator.wikimedia.org/T277780) [23:47:42] 10Puppet, 10SRE, 10Wikimedia-Mailing-lists: Make puppet for mailman3 ready for production - https://phabricator.wikimedia.org/T277286 (10Legoktm) Cloud: ` MariaDB [mailman3]> show create table alembic_version; +-----------------+--------------------------------------------------------------------------------... 
[23:48:52] !log generating new mcrouter certs for mw2377, mw2378 [23:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:04] (03PS1) 10Dzahn: add fake mcrouter certs for mw2377,mw2378 [labs/private] - 10https://gerrit.wikimedia.org/r/674732 (https://phabricator.wikimedia.org/T278396) [23:53:41] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake mcrouter certs for mw2377,mw2378 [labs/private] - 10https://gerrit.wikimedia.org/r/674732 (https://phabricator.wikimedia.org/T278396) (owner: 10Dzahn) [23:55:41] (03CR) 10Dzahn: [C: 03+2] site/conftool-data: turn new servers mw2377,mw2378 into jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/674727 (https://phabricator.wikimedia.org/T277780) (owner: 10Dzahn) [23:56:33] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-upgrade [23:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:46] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2377.codfw.wmnet with reason: new_install [23:56:47] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2377.codfw.wmnet with reason: new_install [23:56:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2378.codfw.wmnet with reason: new_install [23:56:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2378.codfw.wmnet with reason: new_install [23:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log