[00:00:34] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:10] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_exclude_backups.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:56] (03PS1) 10Andrew Bogott: radosgw: remove "rgw dns name" setting [puppet] - 10https://gerrit.wikimedia.org/r/682780 [00:20:35] (03CR) 10Andrew Bogott: [C: 03+2] radosgw: remove "rgw dns name" setting [puppet] - 10https://gerrit.wikimedia.org/r/682780 (owner: 10Andrew Bogott) [00:20:45] (03CR) 10Krinkle: [C: 04-1] Move ExternalStore log group from debug to error [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682322 (https://phabricator.wikimedia.org/T281048) (owner: 10Reedy) [00:21:06] (03PS1) 10Krinkle: externalstore: convert some log messages to WARNING [core] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/682720 (https://phabricator.wikimedia.org/T281048) [00:23:34] (03CR) 10Reedy: Move ExternalStore log group from debug to error (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682322 (https://phabricator.wikimedia.org/T281048) (owner: 10Reedy) [00:27:54] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.073 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [00:50:42] (03CR) 10LMata: [C: 03+2] replace mwlog1001 with new mwlog[12]002 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [01:13:47] 10SRE, 10CommRel-Specialists-Support, 10Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (10Legoktm) [01:21:05] ACKNOWLEDGEMENT - WDQS SPARQL on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.063 second response time Ryan Kemper https://phabricator.wikimedia.org/T280382 https://wikitech.wikimedia.org/wiki/Wikidata_qu [01:21:05] ok [01:21:38] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [01:21:42] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1006.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage` [01:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:21:57] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [01:27:06] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [01:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:29] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [01:29:32] 
!log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1006.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph --task-id T280382` on `ryankemper@cumin1001` tmux session `reimage` [01:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:45] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [02:06:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-import-siteinfo-dumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.37.0-wmf.3 [core] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/682784 [02:07:47] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.37.0-wmf.3 [core] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/682784 (owner: 10TrainBranchBot) [02:13:04] (03PS1) 10Razzi: netboot: Add reuse recipe to preserve /srv on an-master [puppet] - 10https://gerrit.wikimedia.org/r/682785 (https://phabricator.wikimedia.org/T278423) [02:19:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:21:13] (03PS2) 10Razzi: netboot: Add reuse recipe to preserve /srv on an-master [puppet] - 10https://gerrit.wikimedia.org/r/682785 (https://phabricator.wikimedia.org/T278423) [02:21:30] (03CR) 10Razzi: "This is probably missing something, but I've been stuck on this for a while and could use some input. 
Here's what I know:" [puppet] - 10https://gerrit.wikimedia.org/r/682785 (https://phabricator.wikimedia.org/T278423) (owner: 10Razzi) [02:21:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:26:07] (03PS1) 10Razzi: Revert "sqoop: switch to single grouped_wikis.csv" [puppet] - 10https://gerrit.wikimedia.org/r/682790 (https://phabricator.wikimedia.org/T279564) [02:32:52] (03Merged) 10jenkins-bot: Branch commit for wmf/1.37.0-wmf.3 [core] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/682784 (owner: 10TrainBranchBot) [02:34:05] (03CR) 10Razzi: [C: 03+2] Revert "sqoop: switch to single grouped_wikis.csv" [puppet] - 10https://gerrit.wikimedia.org/r/682790 (https://phabricator.wikimedia.org/T279564) (owner: 10Razzi) [02:41:29] (03PS1) 10Razzi: Revert "Revert "sqoop: switch to single grouped_wikis.csv"" [puppet] - 10https://gerrit.wikimedia.org/r/682791 (https://phabricator.wikimedia.org/T279564) [02:53:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:54:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:56:45] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [02:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:58:01] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.075 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:17:15] !log T280382 `wdqs1006` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to raid0: `/dev/md2 2.6T 998G 1.5T 40% /srv` [03:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:17:25] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [03:25:41] PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:27:46] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.70`. 
Pre-deploy tests passing on canary `wdqs1003` [03:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:01] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@08ad17a]: 0.3.70 [03:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:45] !log [WDQS Deploy] Tests passing following deploy of `0.3.70` on canary `wdqs1003`; proceeding to rest of fleet [03:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:35:07] RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:36:20] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@08ad17a]: 0.3.70 (duration: 08m 18s) [03:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:37:01] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [03:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:37:09] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [03:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:37:27] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [03:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:30] I am going to put phabricator in read only for a couple of minutes to restart the db primary master [04:17:47] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (people1003), Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:19:18] !log Set 
phabricator on read only T279625 [04:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:19:27] T279625: Upgrade mysql on db1132 (phabricator db master) - https://phabricator.wikimedia.org/T279625 [04:20:50] "Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL)." guess that's intentional? [04:21:50] works now :) [04:22:39] legoktm: yep, see my !log above :) [04:24:28] (03CR) 10Legoktm: [V: 03+1 C: 03+2] mailman3: Use backported packages from component/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/678134 (https://phabricator.wikimedia.org/T278905) (owner: 10Legoktm) [04:25:51] !log upgrading lists-next.wikimedia.org to mailman3-from-bullseye (T280887) [04:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:26:01] T280887: Upgrade lists-next to bullseye mailman versions - https://phabricator.wikimedia.org/T280887 [04:28:34] marostegui: ok, running the updates now [04:29:22] should be done [04:30:15] that was fast! [04:31:18] I guess we don't have enough data in our database yet? ;) [04:31:29] hehe yeah [04:31:42] but I double checked that it actually ran the migrations and it did. 
tbh I didn't actually check how many migrations there were, just that some did exist [04:32:48] * legoktm tries sending some emails [04:37:15] ah, that's what the hiccup was, I got the "can't contact db server" error and wondered what was happening [04:37:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: Repool db1158', diff saved to https://phabricator.wikimedia.org/P15539 and previous config saved to /var/cache/conftool/dbconfig/20210427-043725-root.json [04:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:38:11] (03PS1) 10Marostegui: db1124: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/682794 (https://phabricator.wikimedia.org/T258361) [04:38:26] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [04:38:47] (03CR) 10Marostegui: [C: 03+2] db1124: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/682794 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [04:41:07] (03PS1) 10Marostegui: instances.yaml: Add db1124 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/682795 (https://phabricator.wikimedia.org/T258361) [04:43:29] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1124 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/682795 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [04:45:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1124 to dbctl, depooled, T258361', diff saved to https://phabricator.wikimedia.org/P15540 and previous config saved to /var/cache/conftool/dbconfig/20210427-044520-marostegui.json [04:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:45:29] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [04:46:10] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Add db1124 with minimal weight for the first time in s7 T258361', diff saved to https://phabricator.wikimedia.org/P15541 and previous config saved to /var/cache/conftool/dbconfig/20210427-044609-marostegui.json [04:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:46:35] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) Pooled db1124 with minimal weight for the first time in s7 [04:47:28] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Upgrade lists-next to bullseye mailman versions - https://phabricator.wikimedia.org/T280887 (10Legoktm) 05Open→03Resolved a:03Legoktm Upgraded, thanks to @Marostegui for supervising! [04:47:31] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Reconsider which mailman3 version we're running - https://phabricator.wikimedia.org/T278905 (10Legoktm) [04:48:39] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Reconsider which mailman3 version we're running - https://phabricator.wikimedia.org/T278905 (10Legoktm) 05Open→03Resolved We're going with bullseye packages, but it has introduced some regressions. 
[04:52:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 50%: Repool db1158', diff saved to https://phabricator.wikimedia.org/P15543 and previous config saved to /var/cache/conftool/dbconfig/20210427-045229-root.json [04:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1077.eqiad.wmnet [04:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:21] (03PS1) 10Marostegui: mariadb: Decommission db1077 [puppet] - 10https://gerrit.wikimedia.org/r/682796 (https://phabricator.wikimedia.org/T281075) [04:55:43] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 tries to load Google Fonts, but blocked by CSP - https://phabricator.wikimedia.org/T281213 (10Legoktm) https://lists-next.wikimedia.org/mailman3/static/CACHE/css/54a97321b5f1.css ` @font-face { font-family: 'Droid Sans'; font-style: normal; font-weight: 400;... [04:57:25] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 tries to load Google Fonts, but blocked by CSP - https://phabricator.wikimedia.org/T281213 (10Legoktm) This was already fixed upstream at https://gitlab.com/mailman/hyperkitty/-/commit/b35d20f45aafbd152e059abe3d4052485ffae305 [04:57:36] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1077 [puppet] - 10https://gerrit.wikimedia.org/r/682796 (https://phabricator.wikimedia.org/T281075) (owner: 10Marostegui) [04:59:52] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1077.eqiad.wmnet - https://phabricator.wikimedia.org/T281075 (10Marostegui) a:03wiki_willy [05:00:31] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [05:03:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1077.eqiad.wmnet [05:03:07] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: 
decommission db1077.eqiad.wmnet - https://phabricator.wikimedia.org/T281075 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1077.eqiad.wmnet` - db1077.eqiad.wmnet (**PASS**) - Downtimed host on Icinga -... [05:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:27] PROBLEM - Disk space on backup2002 is CRITICAL: DISK CRITICAL - free space: /srv 3002177 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2002&var-datasource=codfw+prometheus/ops [05:07:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: Repool db1158', diff saved to https://phabricator.wikimedia.org/P15544 and previous config saved to /var/cache/conftool/dbconfig/20210427-050732-root.json [05:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1124 with minimal weight for the first time in s7 T258361', diff saved to https://phabricator.wikimedia.org/P15545 and previous config saved to /var/cache/conftool/dbconfig/20210427-050826-marostegui.json [05:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:38] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [05:18:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 5%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15546 and previous config saved to /var/cache/conftool/dbconfig/20210427-051802-root.json [05:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:31] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 
(10Marostegui) I am automatically pooling db1124 into s7. [05:18:42] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [05:20:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1114 temporarily as db1087 will be depooled', diff saved to https://phabricator.wikimedia.org/P15547 and previous config saved to /var/cache/conftool/dbconfig/20210427-052026-marostegui.json [05:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:25] !log Stop mysql on db1087 to clone db1167 (lag will appear on wikidata on wikireplicas) T258361 [05:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:34] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [05:22:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: Repool db1158', diff saved to https://phabricator.wikimedia.org/P15549 and previous config saved to /var/cache/conftool/dbconfig/20210427-052236-root.json [05:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:28] PROBLEM - snapshot of s6 in eqiad on alert1001 is CRITICAL: snapshot for s6 at eqiad taken more than 3 days ago: Most recent backup 2021-04-24 05:13:41 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [05:27:25] !log imported hyperkitty_1.3.4-2~bpo10+2 to apt.wm.o (T281213) [05:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:35] T281213: Mailman3 tries to load Google Fonts, but blocked by CSP - https://phabricator.wikimedia.org/T281213 [05:30:33] !log push pfw fw policies - T281137 [05:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:09] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 tries to load Google Fonts, but blocked by 
CSP - https://phabricator.wikimedia.org/T281213 (10Legoktm) I installed the new package, but I guess there's some command I need to run to force it to regenerate the CSS? [05:33:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 10%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15550 and previous config saved to /var/cache/conftool/dbconfig/20210427-053306-root.json [05:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:54] PROBLEM - MariaDB Replica Lag: s8 on clouddb1020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 978.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:39:04] ^ known [05:40:36] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 tries to load Google Fonts, but blocked by CSP - https://phabricator.wikimedia.org/T281213 (10Legoktm) 05Open→03Resolved a:03Legoktm Also filed in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=987654 >>! In T281213#7036776, @Legoktm wrote: > I installed the... [05:48:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 15%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15551 and previous config saved to /var/cache/conftool/dbconfig/20210427-054809-root.json [05:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10ArielGlenn) Hey, this looks almost done, am I reading that right? 
:-) :-) [05:50:13] (03PS1) 10Marostegui: install_server: Reimage db1118 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/682798 (https://phabricator.wikimedia.org/T278214) [05:51:16] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db1118 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/682798 (https://phabricator.wikimedia.org/T278214) (owner: 10Marostegui) [06:03:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 20%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15552 and previous config saved to /var/cache/conftool/dbconfig/20210427-060313-root.json [06:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:07:03] (03PS3) 10Legoktm: hosts: assign puppet role for rdb2007,rdb2008 [puppet] - 10https://gerrit.wikimedia.org/r/614894 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [06:07:37] (03CR) 10jerkins-bot: [V: 04-1] hosts: assign puppet role for rdb2007,rdb2008 [puppet] - 10https://gerrit.wikimedia.org/r/614894 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [06:09:24] (03PS4) 10Legoktm: hosts: assign puppet role for rdb2007,rdb2008 [puppet] - 10https://gerrit.wikimedia.org/r/614894 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [06:09:51] (03CR) 10jerkins-bot: [V: 04-1] hosts: assign puppet role for rdb2007,rdb2008 [puppet] - 10https://gerrit.wikimedia.org/r/614894 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [06:11:24] (03PS5) 10Legoktm: site.pp: assign puppet role for rdb2007,rdb2008 [puppet] - 10https://gerrit.wikimedia.org/r/614894 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [06:11:29] !log powercycle elastic2043 - no ssh, no tty remote console available 
[06:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:01] (03CR) 10Legoktm: [C: 03+2] site.pp: assign puppet role for rdb2007,rdb2008 [puppet] - 10https://gerrit.wikimedia.org/r/614894 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [06:14:22] (03CR) 10Legoktm: site.pp: assign puppet role for rdb2007,rdb2008 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/614894 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [06:18:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 25%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15553 and previous config saved to /var/cache/conftool/dbconfig/20210427-061817-root.json [06:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:35] 10ops-codfw, 10Discovery: elastic2043 doesn't power up - https://phabricator.wikimedia.org/T281215 (10elukey) [06:27:05] ACKNOWLEDGEMENT - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T281215 [06:29:39] (03Abandoned) 10Legoktm: hiera: switch nutcracker shard from rdb2003 to rdb2007 [puppet] - 10https://gerrit.wikimedia.org/r/615163 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [06:31:01] (03CR) 10Elukey: [C: 03+2] install_server: kafka-main[12]00[1-5] use default release installer [puppet] - 10https://gerrit.wikimedia.org/r/682731 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [06:31:07] (03PS2) 10Elukey: install_server: kafka-main[12]00[1-5] use default release installer [puppet] - 10https://gerrit.wikimedia.org/r/682731 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [06:31:33] (03PS3) 10Legoktm: site.pp: make rdb2007, rdb2008 a redis cluster [puppet] - 10https://gerrit.wikimedia.org/r/614897 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [06:33:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 30%: 
Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15554 and previous config saved to /var/cache/conftool/dbconfig/20210427-063320-root.json [06:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:48] log version 1.37.0-wmf.3 was branched at 20ab303fd1d883592b4d2ec2468dfaccad7a9e10 for T278347 [06:33:49] T278347: 1.37.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T278347 [06:34:26] (03PS2) 10Alexandros Kosiaris: Remove site.pp scb entries [puppet] - 10https://gerrit.wikimedia.org/r/676906 (https://phabricator.wikimedia.org/T275759) [06:34:37] liw: missing the ! in your log [06:35:56] (03PS3) 10Alexandros Kosiaris: Remove site.pp scb entries [puppet] - 10https://gerrit.wikimedia.org/r/676906 (https://phabricator.wikimedia.org/T275759) [06:37:08] !log version 1.37.0-wmf.3 was branched at 20ab303fd1d883592b4d2ec2468dfaccad7a9e10 for T278347 [06:37:11] ryankemper, thanks [06:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:53] (03CR) 10Elukey: [C: 03+1] "very ignorant about this bit of code but LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/681758 (owner: 10Volans) [06:48:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 40%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15555 and previous config saved to /var/cache/conftool/dbconfig/20210427-064824-root.json [06:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:16] (03PS1) 10Marostegui: mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/682881 (https://phabricator.wikimedia.org/T278214) [06:50:47] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/682881 (https://phabricator.wikimedia.org/T278214) (owner: 10Marostegui) [06:51:24] (03PS2) 10Marostegui: mariadb: Promote db1163 to s1 master [puppet] - 
10https://gerrit.wikimedia.org/r/682881 (https://phabricator.wikimedia.org/T278214) [06:52:42] (03PS1) 10Marostegui: wmnet: Update s1-master to the right master [dns] - 10https://gerrit.wikimedia.org/r/682882 (https://phabricator.wikimedia.org/T278214) [06:52:50] (03PS1) 10Lars Wirzenius: testwikis wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682883 [06:52:52] (03CR) 10Lars Wirzenius: [C: 03+2] testwikis wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682883 (owner: 10Lars Wirzenius) [06:53:21] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [dns] - 10https://gerrit.wikimedia.org/r/682882 (https://phabricator.wikimedia.org/T278214) (owner: 10Marostegui) [06:53:35] (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682883 (owner: 10Lars Wirzenius) [06:54:40] !log liw@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.3 [06:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:28] !log upgrade mariadb to 10.4.18-1 + reboot on db1108 - T279281 [06:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:36] T279281: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 [06:56:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1179 for schema change', diff saved to https://phabricator.wikimedia.org/P15556 and previous config saved to /var/cache/conftool/dbconfig/20210427-065628-marostegui.json [06:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:50] (03PS1) 10Marostegui: mariadb: Productionize db1167 [puppet] - 10https://gerrit.wikimedia.org/r/682885 (https://phabricator.wikimedia.org/T258361) [07:02:22] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1167 [puppet] - 10https://gerrit.wikimedia.org/r/682885 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) 
[07:03:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 50%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15557 and previous config saved to /var/cache/conftool/dbconfig/20210427-070328-root.json [07:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:48] (03PS1) 10Marostegui: site.pp: Fix db1167 role [puppet] - 10https://gerrit.wikimedia.org/r/682887 [07:04:25] (03PS1) 10Majavah: beta: Switchover to deployment-sessionstore04 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682888 (https://phabricator.wikimedia.org/T263617) [07:04:58] (03PS2) 10Marostegui: site.pp: Fix db1167 role [puppet] - 10https://gerrit.wikimedia.org/r/682887 [07:05:38] (03CR) 10Marostegui: [C: 03+2] site.pp: Fix db1167 role [puppet] - 10https://gerrit.wikimedia.org/r/682887 (owner: 10Marostegui) [07:06:11] liw: hi, I have a beta-only config patch https://gerrit.wikimedia.org/r/682888 that I'd like to get merged as soon as possible to unbreak https://phabricator.wikimedia.org/T263617, could you ping me after that scap is done and that patch can be merged? 
[07:08:31] (03PS1) 10Elukey: Enable the Yarn Labels for Hadoop Analytics [puppet] - 10https://gerrit.wikimedia.org/r/682889 (https://phabricator.wikimedia.org/T277062) [07:09:08] (03CR) 10Elukey: [C: 03+2] Enable the Yarn Labels for Hadoop Analytics [puppet] - 10https://gerrit.wikimedia.org/r/682889 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [07:10:24] 10SRE, 10serviceops: Replace rdb2005, rdb2006 with rdb2009, rdb2010 - https://phabricator.wikimedia.org/T281216 (10Legoktm) p:05Triage→03High [07:11:34] (03PS12) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [07:11:47] (03PS3) 10Marostegui: mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/682881 (https://phabricator.wikimedia.org/T278214) [07:12:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: Repool db1179', diff saved to https://phabricator.wikimedia.org/P15558 and previous config saved to /var/cache/conftool/dbconfig/20210427-071227-root.json [07:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:54] (03PS1) 10Legoktm: site.pp: Setup rdb2009, rdb2010 [puppet] - 10https://gerrit.wikimedia.org/r/682890 (https://phabricator.wikimedia.org/T281216) [07:12:56] (03PS1) 10Legoktm: Have rdb2010 replicate from rdb2009 [puppet] - 10https://gerrit.wikimedia.org/r/682891 (https://phabricator.wikimedia.org/T281216) [07:13:06] (03PS3) 10Ladsgroup: lists: Send error logs of apache2/exim4 to logstash [puppet] - 10https://gerrit.wikimedia.org/r/682736 (https://phabricator.wikimedia.org/T276697) [07:13:38] (03PS4) 10Ladsgroup: mailman3: Increase the log level to WARNING and send them to logstash [puppet] - 10https://gerrit.wikimedia.org/r/682737 (https://phabricator.wikimedia.org/T276697) [07:14:26] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 
10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [07:17:02] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) Checking tables on db1167 [07:18:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 60%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15559 and previous config saved to /var/cache/conftool/dbconfig/20210427-071831-root.json [07:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:56] (03PS2) 10JMeybohm: Swap zookeeper from conf2002 to conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/682666 (https://phabricator.wikimedia.org/T271573) [07:19:47] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on conf[2002-2003].codfw.wmnet with reason: for zookeeper migration [07:19:49] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on conf[2002-2003].codfw.wmnet with reason: for zookeeper migration [07:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:56] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10ops-monitoring-bot) Icinga downtime set by jayme@cumin1001 for 1 day, 0:00:00 2 host(s) and their services with reason: for zookeeper migration ` conf[2002-2003].codfw.wmnet ` [07:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:36] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on conf[2004-2006].codfw.wmnet with reason: for zookeeper migration [07:21:38] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on conf[2004-2006].codfw.wmnet with reason: for zookeeper migration [07:21:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:45] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10ops-monitoring-bot) Icinga downtime set by jayme@cumin1001 for 2:00:00 3 host(s) and their services with reason: for zookeeper migration ` conf[2004-2006].codfw.wmnet ` [07:21:51] 10SRE, 10serviceops: Replace rdb1005, rdb1006 with rdb1011, rdb1012 - https://phabricator.wikimedia.org/T281217 (10Legoktm) p:05Triage→03High [07:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:06] 10SRE, 10serviceops: Replace rdb1005, rdb1006 with rdb1011, rdb1012 - https://phabricator.wikimedia.org/T281217 (10Legoktm) [07:24:28] (03CR) 10JMeybohm: [C: 03+2] Swap zookeeper from conf2002 to conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/682666 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [07:24:31] !log liw@deploy1002 Finished scap: testwikis wikis to 1.37.0-wmf.3 (duration: 30m 54s) [07:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:05] (03PS3) 10Jcrespo: mariadb: Reenable notifications on db1156 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/682668 (https://phabricator.wikimedia.org/T280492) [07:26:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_zookeeper site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:26:43] (03CR) 10Jcrespo: [C: 03+1] "compare from db1074 was successful." 
[puppet] - 10https://gerrit.wikimedia.org/r/682668 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [07:26:57] !log swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836 [07:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:06] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836 [07:27:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: Repool db1179', diff saved to https://phabricator.wikimedia.org/P15560 and previous config saved to /var/cache/conftool/dbconfig/20210427-072731-root.json [07:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1087 (re)pooling @ 25%: Repool db1087', diff saved to https://phabricator.wikimedia.org/P15561 and previous config saved to /var/cache/conftool/dbconfig/20210427-072814-root.json [07:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:34] (03PS4) 10Legoktm: site.pp: make rdb2007, rdb2008 a redis cluster [puppet] - 10https://gerrit.wikimedia.org/r/614897 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [07:31:36] (03PS2) 10Legoktm: site.pp: Setup rdb2009, rdb2010 [puppet] - 10https://gerrit.wikimedia.org/r/682890 (https://phabricator.wikimedia.org/T281216) [07:31:38] (03PS2) 10Legoktm: Have rdb2010 replicate from rdb2009 [puppet] - 10https://gerrit.wikimedia.org/r/682891 (https://phabricator.wikimedia.org/T281216) [07:31:40] (03PS1) 10Legoktm: site.pp: Setup rdb1011, rdb1012 [puppet] - 10https://gerrit.wikimedia.org/r/682892 (https://phabricator.wikimedia.org/T281217) [07:31:42] (03PS1) 10Legoktm: Have rdb1012 replicate from rdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/682893 (https://phabricator.wikimedia.org/T281217) [07:32:10] RECOVERY - MariaDB Replica Lag: s8 on clouddb1020 is OK: OK slave_sql_lag Replication lag: 
0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:33:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 75%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15562 and previous config saved to /var/cache/conftool/dbconfig/20210427-073335-root.json [07:33:37] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10fgiunchedi) Thank you @papaul, could you forward the attached mib? I'll take a look, though I think a call will be best [07:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:37] (03PS13) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [07:38:21] 10SRE, 10Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (10jcrespo) FYI: people1003 is failing to be backed up. https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-site=eqiad&var-job=people1003.eqiad.wmnet-Monthly-1st-Sun-production-home&from=1619492511586... 
[07:40:11] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [07:41:27] (03PS1) 10Legoktm: Reimage rdb2007, rdb2008 as bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682894 (https://phabricator.wikimedia.org/T255250) [07:42:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: Repool db1179', diff saved to https://phabricator.wikimedia.org/P15563 and previous config saved to /var/cache/conftool/dbconfig/20210427-074234-root.json [07:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1087 (re)pooling @ 50%: Repool db1087', diff saved to https://phabricator.wikimedia.org/P15564 and previous config saved to /var/cache/conftool/dbconfig/20210427-074318-root.json [07:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:44:23] (03CR) 10Legoktm: [C: 03+2] Reimage rdb2007, rdb2008 as bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682894 (https://phabricator.wikimedia.org/T255250) (owner: 10Legoktm) [07:48:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 80%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15565 and previous config saved to /var/cache/conftool/dbconfig/20210427-074839-root.json [07:48:40] PROBLEM - Too many messages in kafka logging-codfw #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-codfw group=cdanis-kafkacat instance=kafkamon2002 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=codfw 
topic=codfw.w3c.reportingapi.network_error https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var- [07:48:40] &var-cluster=logging-codfw&var-topic=All&var-consumer_group=All [07:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:16] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) Vanilla 6.0.1 was performing worse than 5.1.3 and similarly to 6.0.7 when we tested it in January: >>! In T264398#673141... [07:52:12] (03PS1) 10Legoktm: Reimage rdb1011, rdb1012, rdb2009, rdb2010 as bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682896 (https://phabricator.wikimedia.org/T281216) [07:52:50] !log liw@deploy1002 Pruned MediaWiki: 1.36.0-wmf.38 (duration: 03m 17s) [07:52:56] 10SRE, 10netops: mr1 port utilization alerts shouldn't mention hash page in their IRC logs - https://phabricator.wikimedia.org/T281055 (10fgiunchedi) >>! In T281055#7034863, @CDanis wrote: > Moving to AM sounds good to me. But if needed, in the interim we could change the magic string we use in `check_librenm... 
[07:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:34] (03PS4) 10Alexandros Kosiaris: Remove site.pp scb entries [puppet] - 10https://gerrit.wikimedia.org/r/676906 (https://phabricator.wikimedia.org/T275759) [07:53:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove site.pp scb entries [puppet] - 10https://gerrit.wikimedia.org/r/676906 (https://phabricator.wikimedia.org/T275759) (owner: 10Alexandros Kosiaris) [07:53:47] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Remove site.pp scb entries [puppet] - 10https://gerrit.wikimedia.org/r/676906 (https://phabricator.wikimedia.org/T275759) (owner: 10Alexandros Kosiaris) [07:55:29] (03CR) 10Jcrespo: "FYI" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682537 (https://phabricator.wikimedia.org/T281094) (owner: 10Jcrespo) [07:55:53] (03PS5) 10Jcrespo: Xtrabackup: Increase default open-files-limit to match production [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682537 (https://phabricator.wikimedia.org/T281094) [07:56:17] (03CR) 10David Caro: [C: 03+2] wmcs.openstack: Use Icinga directly [cookbooks] - 10https://gerrit.wikimedia.org/r/682663 (owner: 10David Caro) [07:57:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: Repool db1179', diff saved to https://phabricator.wikimedia.org/P15566 and previous config saved to /var/cache/conftool/dbconfig/20210427-075738-root.json [07:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166 for schema change', diff saved to https://phabricator.wikimedia.org/P15567 and previous config saved to /var/cache/conftool/dbconfig/20210427-075759-marostegui.json [07:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:10] mmhh the kafka lag alert is due to 'cdanis-kafkacat' consumer group for network errors, looking [07:58:22] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'db1087 (re)pooling @ 75%: Repool db1087', diff saved to https://phabricator.wikimedia.org/P15568 and previous config saved to /var/cache/conftool/dbconfig/20210427-075822-root.json [07:58:26] only in codfw though [07:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:45] jouncebot: now [07:58:46] No deployments scheduled for the next 3 hour(s) and 1 minute(s) [07:59:29] is someone around that could get a beta-only config patch (https://gerrit.wikimedia.org/r/682888) merged? I'd like to unbreak beta clusters session storage [07:59:57] (03Merged) 10jenkins-bot: wmcs.openstack: Use Icinga directly [cookbooks] - 10https://gerrit.wikimedia.org/r/682663 (owner: 10David Caro) [08:01:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Repool db1166', diff saved to https://phabricator.wikimedia.org/P15569 and previous config saved to /var/cache/conftool/dbconfig/20210427-080119-root.json [08:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:10] (03PS3) 10Jcrespo: Increase default memory usage of xtrabackup --prepare to 40GB [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682536 (https://phabricator.wikimedia.org/T281094) [08:03:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 90%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15570 and previous config saved to /var/cache/conftool/dbconfig/20210427-080342-root.json [08:03:49] (03CR) 10Jcrespo: [C: 03+2] Increase default memory usage of xtrabackup --prepare to 40GB [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682536 (https://phabricator.wikimedia.org/T281094) (owner: 10Jcrespo) [08:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:59] !log jayme@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers [08:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:06:03] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) It's great that we narrowed this down and confirmed it, excellent work! The change's claimed behaviour is definitely c... [08:06:50] (03PS6) 10Jcrespo: WIP [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/665383 [08:06:52] (03PS6) 10Jcrespo: Xtrabackup: Increase default open-files-limit to match production [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682537 (https://phabricator.wikimedia.org/T281094) [08:07:01] (03PS7) 10Jcrespo: Xtrabackup: Increase default open-files-limit to match production [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682537 (https://phabricator.wikimedia.org/T281094) [08:07:19] (03CR) 10jerkins-bot: [V: 04-1] WIP [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/665383 (owner: 10Jcrespo) [08:07:26] (03CR) 10jerkins-bot: [V: 04-1] Xtrabackup: Increase default open-files-limit to match production [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682537 (https://phabricator.wikimedia.org/T281094) (owner: 10Jcrespo) [08:07:42] (03CR) 10Jcrespo: [C: 03+2] Xtrabackup: Increase default open-files-limit to match production [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682537 (https://phabricator.wikimedia.org/T281094) (owner: 10Jcrespo) [08:08:24] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb2007.codfw.wmnet with reason: REIMAGE [08:08:29] (03CR) 10Muehlenhoff: "> Patch Set 11:" [puppet] - 10https://gerrit.wikimedia.org/r/675124 (owner: 10Jbond) [08:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:24] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb2008.codfw.wmnet with reason: REIMAGE [08:10:26] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
rdb2007.codfw.wmnet with reason: REIMAGE [08:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:12] !log jayme@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers [08:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:34] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb2008.codfw.wmnet with reason: REIMAGE [08:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1087 (re)pooling @ 100%: Repool db1087', diff saved to https://phabricator.wikimedia.org/P15571 and previous config saved to /var/cache/conftool/dbconfig/20210427-081325-root.json [08:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:08] 10SRE: Find a replacement for RSS aggregator for planet.wikimedia.org - https://phabricator.wikimedia.org/T281219 (10MoritzMuehlenhoff) [08:14:14] 10SRE: Find a replacement for RSS aggregator for planet.wikimedia.org - https://phabricator.wikimedia.org/T281219 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:15:26] 10SRE, 10Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (10MoritzMuehlenhoff) >>! In T280989#7035799, @gerritbot wrote: > Change 682739 **merged** by Dzahn: > %%%[operations/puppet@production] site/DHCP: remove planet1003%%% > https://gerrit.wikimedia.org/r/682739... 
[08:16:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: Repool db1166', diff saved to https://phabricator.wikimedia.org/P15572 and previous config saved to /var/cache/conftool/dbconfig/20210427-081623-root.json [08:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 100%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15573 and previous config saved to /var/cache/conftool/dbconfig/20210427-081846-root.json [08:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1114 into main and traffic', diff saved to https://phabricator.wikimedia.org/P15574 and previous config saved to /var/cache/conftool/dbconfig/20210427-081911-marostegui.json [08:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:20] (03PS1) 10Legoktm: debian: Fix typo in NaN error message [puppet] - 10https://gerrit.wikimedia.org/r/682898 [08:22:27] (03CR) 10Ayounsi: [C: 03+2] Homer: get Capirca definitions from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/681775 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [08:24:06] !log Restarting CI Jenkins for plugins upgrade [08:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:11] (03PS2) 10Legoktm: debian: Fix typo in NaN error message [puppet] - 10https://gerrit.wikimedia.org/r/682898 [08:31:13] (03PS1) 10Legoktm: redis: Add redis-bullseye.conf [puppet] - 10https://gerrit.wikimedia.org/r/682900 [08:31:15] (03PS1) 10Legoktm: redis: Get rid of distro-specific config [puppet] - 10https://gerrit.wikimedia.org/r/682901 [08:31:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Repool db1166', diff saved to https://phabricator.wikimedia.org/P15575 and previous config 
saved to /var/cache/conftool/dbconfig/20210427-083126-root.json [08:31:31] (03CR) 10Legoktm: [V: 03+2 C: 03+2] debian: Fix typo in NaN error message [puppet] - 10https://gerrit.wikimedia.org/r/682898 (owner: 10Legoktm) [08:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1114 into main and traffic', diff saved to https://phabricator.wikimedia.org/P15576 and previous config saved to /var/cache/conftool/dbconfig/20210427-083145-marostegui.json [08:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:38] (03CR) 10Legoktm: [C: 03+2] redis: Add redis-bullseye.conf [puppet] - 10https://gerrit.wikimedia.org/r/682900 (owner: 10Legoktm) [08:35:05] (03CR) 10Alexandros Kosiaris: [C: 04-1] redis: Get rid of distro-specific config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682901 (owner: 10Legoktm) [08:36:08] (03PS2) 10Legoktm: redis: Get rid of distro-specific config [puppet] - 10https://gerrit.wikimedia.org/r/682901 [08:36:10] (03CR) 10Legoktm: redis: Get rid of distro-specific config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682901 (owner: 10Legoktm) [08:36:43] !log standardize management routers ACLs with Capirca [08:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1114 into main and api', diff saved to https://phabricator.wikimedia.org/P15577 and previous config saved to /var/cache/conftool/dbconfig/20210427-083910-marostegui.json [08:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:53] PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [08:40:09] er that's me [08:40:25] (just monitoring) [08:41:01] PROBLEM - Host re0.cr4-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [08:41:01] PROBLEM - Host re0.cr3-ulsfo is DOWN: PING CRITICAL - 
Packet loss = 100% [08:41:04] ACKNOWLEDGEMENT - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% ayounsi ack [08:41:10] (03CR) 10Alexandros Kosiaris: [C: 03+1] Reimage rdb1011, rdb1012, rdb2009, rdb2010 as bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682896 (https://phabricator.wikimedia.org/T281216) (owner: 10Legoktm) [08:41:30] (rolling back) [08:41:37] RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 69.22 ms [08:41:38] (done) [08:41:46] (03CR) 10Alexandros Kosiaris: [C: 03+1] redis: Get rid of distro-specific config [puppet] - 10https://gerrit.wikimedia.org/r/682901 (owner: 10Legoktm) [08:42:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] site.pp: Setup rdb2009, rdb2010 [puppet] - 10https://gerrit.wikimedia.org/r/682890 (https://phabricator.wikimedia.org/T281216) (owner: 10Legoktm) [08:42:50] found the issue [08:43:00] (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM, but -1 until the applications have been switched over" [puppet] - 10https://gerrit.wikimedia.org/r/682891 (https://phabricator.wikimedia.org/T281216) (owner: 10Legoktm) [08:43:42] (03CR) 10Alexandros Kosiaris: [C: 03+1] site.pp: Setup rdb1011, rdb1012 [puppet] - 10https://gerrit.wikimedia.org/r/682892 (https://phabricator.wikimedia.org/T281217) (owner: 10Legoktm) [08:44:14] (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM, but -1 until the applications that talk to this have been switched over" [puppet] - 10https://gerrit.wikimedia.org/r/682893 (https://phabricator.wikimedia.org/T281217) (owner: 10Legoktm) [08:44:21] godog: are the icinga* hosts still running icinga or everything is on alert* now? [08:44:33] XioNoX: wow Capirca?? 
[08:45:28] (03CR) 10Legoktm: [C: 03+2] Reimage rdb1011, rdb1012, rdb2009, rdb2010 as bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682896 (https://phabricator.wikimedia.org/T281216) (owner: 10Legoktm) [08:46:13] elukey: yay :) [08:46:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Repool db1166', diff saved to https://phabricator.wikimedia.org/P15578 and previous config saved to /var/cache/conftool/dbconfig/20210427-084630-root.json [08:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:46] XioNoX: only alert* are active, icinga* are pending decom [08:46:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1175 for schema change', diff saved to https://phabricator.wikimedia.org/P15579 and previous config saved to /var/cache/conftool/dbconfig/20210427-084651-marostegui.json [08:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:10] godog: so I can remove everything related to icinga* from the management routers? [08:47:11] RECOVERY - Host re0.cr4-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.64 ms [08:47:11] RECOVERY - Host re0.cr3-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 70.26 ms [08:47:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:47:21] XioNoX: yes, definitely [08:47:24] cool1 [08:47:25] !
[08:48:56] 10SRE, 10serviceops: Put rdb20[09|10] into service - https://phabricator.wikimedia.org/T281225 (10akosiaris) [08:49:07] 10SRE, 10serviceops, 10Patch-For-Review: Replace rdb2005, rdb2006 with rdb2009, rdb2010 - https://phabricator.wikimedia.org/T281216 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eqiad.wmnet for hosts: ` ['rdb2009.codfw.wmnet', 'rdb2010.codfw.wmnet'] ` The log can be foun... [08:49:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P15580 and previous config saved to /var/cache/conftool/dbconfig/20210427-084950-root.json [08:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:47] 10SRE, 10serviceops, 10Patch-For-Review: Replace rdb1005, rdb1006 with rdb1011, rdb1012 - https://phabricator.wikimedia.org/T281217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eqiad.wmnet for hosts: ` ['rdb1011.eqiad.wmnet', 'rdb1012.eqiad.wmnet'] ` The log can be foun... 
[08:51:12] 2nd try [08:53:30] (03CR) 10Lars Wirzenius: [C: 03+2] beta: Switchover to deployment-sessionstore04 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682888 (https://phabricator.wikimedia.org/T263617) (owner: 10Majavah) [08:53:58] looks like it worked [08:54:28] (03Merged) 10jenkins-bot: beta: Switchover to deployment-sessionstore04 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682888 (https://phabricator.wikimedia.org/T263617) (owner: 10Majavah) [08:57:07] PROBLEM - Host mr1-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:57:09] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:58:32] (03PS2) 10Alexandros Kosiaris: changeprop/changeprop-jobqueue/api-gateway: Use the new rdbs [deployment-charts] - 10https://gerrit.wikimedia.org/r/614901 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [08:58:34] (03PS1) 10Alexandros Kosiaris: api-gateway: Move networkpolicy to shared values [deployment-charts] - 10https://gerrit.wikimedia.org/r/682905 [08:59:31] maybe not [09:00:32] (rolled back) [09:01:17] !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2007.codfw.wmnet [09:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:44] (03PS3) 10Ayounsi: Use Capirca to generate mgmt SRX security policies [homer/public] - 10https://gerrit.wikimedia.org/r/681708 [09:01:52] (03PS1) 10Alexandros Kosiaris: Switchover ORES and docker-registry to new redis servers [puppet] - 10https://gerrit.wikimedia.org/r/682906 (https://phabricator.wikimedia.org/T255250) [09:02:39] RECOVERY - Host mr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 70.77 ms [09:02:41] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 69.41 ms [09:03:46] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb1011.eqiad.wmnet with reason: REIMAGE [09:03:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:14] 10SRE, 10serviceops, 10Patch-For-Review: Replace rdb1005, rdb1006 with rdb1011, rdb1012 - https://phabricator.wikimedia.org/T281217 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb1011.eqiad.wmnet', 'rdb1012.eqiad.wmnet'] ` Of which those **FAILED**: ` ['rdb1011.eqiad.wmnet', 'rdb1012.eqiad.w... [09:04:47] !log jayme@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) [09:04:51] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb2009.codfw.wmnet with reason: REIMAGE [09:04:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P15581 and previous config saved to /var/cache/conftool/dbconfig/20210427-090454-root.json [09:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:22] 10SRE, 10serviceops, 10Patch-For-Review: Replace rdb2005, rdb2006 with rdb2009, rdb2010 - https://phabricator.wikimedia.org/T281216 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb2009.codfw.wmnet', 'rdb2010.codfw.wmnet'] ` Of which those **FAILED**: ` ['rdb2009.codfw.wmnet', 'rdb2010.codfw.w... 
[09:05:46] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb1012.eqiad.wmnet with reason: REIMAGE [09:05:47] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb1011.eqiad.wmnet with reason: REIMAGE [09:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:48] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb2010.codfw.wmnet with reason: REIMAGE [09:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:51] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb1012.eqiad.wmnet with reason: REIMAGE [09:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:52] !log legoktm@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on rdb2009.codfw.wmnet with reason: REIMAGE [09:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:48] !log legoktm@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on rdb2010.codfw.wmnet with reason: REIMAGE [09:11:49] !log jayme@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) [09:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:04] 3rd time will do it? 
[09:14:29] (03PS4) 10Ayounsi: Use Capirca to generate mgmt SRX security policies [homer/public] - 10https://gerrit.wikimedia.org/r/681708 [09:15:44] XioNoX: 3rd times the charm :-P [09:15:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:16:00] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2007.codfw.wmnet [09:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:19] !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2008.codfw.wmnet [09:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:33] !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2009.codfw.wmnet [09:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:48] !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2010.codfw.wmnet [09:16:49] !log legoktm@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host rdb2010.codfw.wmnet [09:16:51] at least I fixed all the alerting ones so far [09:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:41] !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2010.codfw.wmnet [09:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:08] (03CR) 10Marostegui: [C: 04-1] "I will take care of this - I am doing some last checks before repooling it." 
[puppet] - https://gerrit.wikimedia.org/r/682668 (https://phabricator.wikimedia.org/T280492) (owner: Jcrespo) [09:19:37] !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1011.eqiad.wmnet [09:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:52] !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1012.eqiad.wmnet [09:19:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P15582 and previous config saved to /var/cache/conftool/dbconfig/20210427-091957-root.json [09:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:21:55] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29 [09:23:08] (PS1) Alexandros Kosiaris: Remove rdb200{3,5} from netpols [deployment-charts] - https://gerrit.wikimedia.org/r/682912 (https://phabricator.wikimedia.org/T255250) [09:23:10] that's not me, I didn't push anything to eqsin ^ [09:26:45] (PS2) Tonina Zhelyazkova: wikidata: post edit constraint jobs on 70% of edits [mediawiki-config] - https://gerrit.wikimedia.org/r/682608 (https://phabricator.wikimedia.org/T204031) [09:28:35] (Abandoned) Alexandros Kosiaris: rdb: use buster on newer servers [puppet] - https://gerrit.wikimedia.org/r/670850 (owner: Giuseppe Lavagetto) [09:28:47] (Abandoned) Alexandros Kosiaris: redis: also configure the new rdb servers [puppet] -
https://gerrit.wikimedia.org/r/670846 (owner: Giuseppe Lavagetto) [09:29:20] (PS3) Alexandros Kosiaris: site.pp: Setup rdb2009, rdb2010 [puppet] - https://gerrit.wikimedia.org/r/682890 (https://phabricator.wikimedia.org/T281216) (owner: Legoktm) [09:29:35] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 383 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29 [09:30:09] alright, it looks all good, pushing the same to mr1-codfw [09:30:12] (CR) Alexandros Kosiaris: [C: +2] site.pp: Setup rdb2009, rdb2010 [puppet] - https://gerrit.wikimedia.org/r/682890 (https://phabricator.wikimedia.org/T281216) (owner: Legoktm) [09:31:03] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2008.codfw.wmnet [09:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:18] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2009.codfw.wmnet [09:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:49] RECOVERY - snapshot of s6 in eqiad on alert1001 is OK: Last snapshot for s6 at eqiad (db1139.eqiad.wmnet:3316) taken on 2021-04-27 08:33:51 (560 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [09:32:28] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2010.codfw.wmnet [09:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:42] !log rolling restart of elastic in relforge* to pick up Java updates [09:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:11] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1011.eqiad.wmnet [09:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:27] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1012.eqiad.wmnet [09:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:50] (PS3) Alexandros Kosiaris: changeprop/changeprop-jobqueue/api-gateway: Use the new rdbs [deployment-charts] - https://gerrit.wikimedia.org/r/614901 (https://phabricator.wikimedia.org/T255250) (owner: Effie Mouzeli) [09:34:52] (PS2) Alexandros Kosiaris: Remove rdb200{3,5} from netpols [deployment-charts] - https://gerrit.wikimedia.org/r/682912 (https://phabricator.wikimedia.org/T255250) [09:35:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P15583 and previous config saved to /var/cache/conftool/dbconfig/20210427-093501-root.json [09:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:22] !log standardize management routers ACLs with Capirca - mr1-eqsin [09:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1157 for schema change', diff saved to https://phabricator.wikimedia.org/P15584 and previous config saved to /var/cache/conftool/dbconfig/20210427-093536-marostegui.json [09:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:46] (PS5) Ayounsi: Use Capirca to generate mgmt SRX security policies [homer/public] - https://gerrit.wikimedia.org/r/681708 [09:37:51] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [09:39:55] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above
the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [09:41:16] (CR) Ayounsi: [C: +2] Use Capirca to generate mgmt SRX security policies [homer/public] - https://gerrit.wikimedia.org/r/681708 (owner: Ayounsi) [09:41:32] (CR) Alexandros Kosiaris: [C: +2] Switchover ORES and docker-registry to new redis servers [puppet] - https://gerrit.wikimedia.org/r/682906 (https://phabricator.wikimedia.org/T255250) (owner: Alexandros Kosiaris) [09:43:11] (Merged) jenkins-bot: Use Capirca to generate mgmt SRX security policies [homer/public] - https://gerrit.wikimedia.org/r/681708 (owner: Ayounsi) [09:43:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:52:32] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 1889 threshold =0.15 breach: active_shards_percent_as_number: 63.47641144624904, initializing_shards: 2, number_of_nodes: 2, status: yellow, cluster_name: relforge-eqiad, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, relocating_shards: 0, active_shards: 3283, active_primary_shards: 2586, task [09:52:32] ueue_millis: 0, timed_out: False, number_of_data_nodes: 2, number_of_pending_tasks: 0, unassigned_shards: 1887 https://wikitech.wikimedia.org/wiki/Search%23Administration [09:52:58] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 1813 threshold =0.15 breach: unassigned_shards: 1811, number_of_in_flight_fetch: 0, initializing_shards: 2, relocating_shards: 0, number_of_pending_tasks: 0, active_primary_shards: 2586, active_shards_percent_as_number: 64.94586233565352, timed_out: False,
delayed_unassigned_shards: 0, task_max_waiting_in_ [09:52:58] active_shards: 3359, cluster_name: relforge-eqiad, status: yellow, number_of_nodes: 2, number_of_data_nodes: 2 https://wikitech.wikimedia.org/wiki/Search%23Administration [09:56:18] !log ayounsi@deploy1002 Started deploy [homer/deploy@759f82c]: Homer release v0.2.7 [09:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:40] !log ayounsi@deploy1002 Finished deploy [homer/deploy@759f82c]: Homer release v0.2.7 (duration: 00m 22s) [09:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:54] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, delayed_unassigned_shards: 0, timed_out: False, relocating_shards: 0, number_of_pending_tasks: 0, unassigned_shards: 753, initializing_shards: 2, number_of_nodes: 2, active_shards_percent_as_number: 85.40216550657385, number_of_in_flight_fetch: 0, task_max_waiting_ [09:58:54] 0, number_of_data_nodes: 2, active_primary_shards: 2586, active_shards: 4417 https://wikitech.wikimedia.org/wiki/Search%23Administration [09:59:20] !log ayounsi@deploy1002 Started deploy [homer/deploy@759f82c]: Homer release v0.2.7 [09:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:44] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: active_shards_percent_as_number: 88.16705336426914, number_of_nodes: 2, active_shards: 4560, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, delayed_unassigned_shards: 0, number_of_data_nodes: 2, number_of_pending_tasks: 0, initializing_shards: 2, status: yellow, relocating_shards: 0, unassi [09:59:44] timed_out: False, number_of_in_flight_fetch: 0, active_primary_shards: 2586 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:00:24] 
(PS1) Jcrespo: Release new v0.5 version [software/wmfbackups] - https://gerrit.wikimedia.org/r/682916 (https://phabricator.wikimedia.org/T281094) [10:01:36] !log ayounsi@deploy1002 Finished deploy [homer/deploy@759f82c]: Homer release v0.2.7 (duration: 02m 16s) [10:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:01] !log standardize management routers ACLs with Capirca - mr1-eqiad (last one) [10:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:54] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [10:13:56] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting.
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [10:17:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:18:35] !log swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836 [10:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:44] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836 [10:22:47] (PS1) Filippo Giunchedi: Decom ms-be[1019-1026] [puppet] - https://gerrit.wikimedia.org/r/682920 (https://phabricator.wikimedia.org/T272836) [10:23:04] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [10:23:14] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [10:23:53] (PS1) Urbanecm: WikiPageConfigValidation: Mentor lists and help desk can be null [extensions/GrowthExperiments] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/682812 (https://phabricator.wikimedia.org/T281229) [10:30:14] jouncebot: now [10:30:14] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [10:30:21] (CR) Urbanecm: [C: +2] WikiPageConfigValidation: Mentor lists and help desk can be null [extensions/GrowthExperiments]
(wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/682812 (https://phabricator.wikimedia.org/T281229) (owner: Urbanecm) [10:31:06] sth is wrong in upload eqsin, seeing lots of 5xx [10:31:23] no please ignore me, wrong time on the dashboard :( [10:31:23] (PS1) Hnowlan: api-gateway: Create individual cluster definitions for read and write [deployment-charts] - https://gerrit.wikimedia.org/r/682921 (https://phabricator.wikimedia.org/T277585) [10:32:18] (PS1) Jbond: systemd::timer::job: quote command as it may contain arguments [puppet] - https://gerrit.wikimedia.org/r/682922 [10:33:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:33:22] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/682920 (https://phabricator.wikimedia.org/T272836) (owner: Filippo Giunchedi) [10:33:38] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.9524 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [10:33:43] PROBLEM - MariaDB Replica IO: s4 #page on db1143 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2013, Errmsg: error reconnecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Lost connection to MySQL server at waiting for initial communication packet, system error: 110 Connection timed out https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:33:48] what? [10:33:50] checking [10:34:00] marostegui: need help?
[10:34:07] <_joe_> in a meeting, but around if needed [10:34:09] the master looks unreachable [10:34:09] the write on s4 is now zero [10:34:09] also here [10:34:10] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [10:34:13] PROBLEM - MariaDB Replica IO: s4 #page on db1141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2013, Errmsg: error reconnecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Lost connection to MySQL server at waiting for initial communication packet, system error: 110 Connection timed out https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:34:17] around [10:34:18] they are all going to page [10:34:23] <_joe_> yeah [10:34:24] * volans acking the pages [10:34:25] PROBLEM - MariaDB Replica IO: s4 #page on db1146 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2013, Errmsg: error reconnecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Lost connection to MySQL server at waiting for initial communication packet, system error: 110 Connection timed out https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:34:26] PROBLEM - MariaDB Replica IO: s4 #page on db1148 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2013, Errmsg: error connecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Lost connection to MySQL server at waiting for initial communication packet, system error: 110 Connection timed out 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:34:27] the master is overloaded [10:34:30] <_joe_> also mw is in shambles [10:34:32] PROBLEM - Apache HTTP on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:34:41] yeah (test)commons is ro [10:34:51] my query is once every five seconds [10:34:54] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29213/console" [puppet] - https://gerrit.wikimedia.org/r/682922 (owner: Jbond) [10:34:59] uh oh [10:34:59] here too, checking [10:35:00] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [10:35:04] PROBLEM - PHP7 rendering on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:07] I cannot connect to the master [10:35:10] I am in [10:35:11] <_joe_> everything is down I'd say [10:35:14] thousands of SELECTs [10:35:18] PROBLEM - MariaDB Replica IO: s4 on db1150 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2013, Errmsg: error connecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Lost connection to MySQL server at waiting for initial communication packet, system error: 110 Connection timed out https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:35:20] PROBLEM - Apache HTTP on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:30] PROBLEM - PHP7 rendering on mw1383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:35:31] <_joe_> marostegui: can we kill em all? [10:35:32] PROBLEM - Apache HTTP on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:36] lots of | 2420632378 | wikiuser | 10.64.32.50:43834 | commonswiki | Query | 413 | statistics | SELECT /* MediaWiki\Extension\GlobalUsage\GlobalUsage::getLinksFromPage */ gil_to FROM `globalimagelinks` WHERE gil_wiki = 'ptwiki' AND gil_page = 396261 [10:35:36] PROBLEM - PHP7 rendering on mw1410 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:36] [10:35:36] | 0.000 | [10:35:40] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:35:44] PROBLEM - Apache HTTP on mw1390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:46] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:35:46] PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:48] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [10:35:52] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is 
CRITICAL: 0.8769 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [10:35:52] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:35:56] PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:56] PROBLEM - PHP7 rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:56] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:57] bad stuff on log [10:35:58] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [10:35:58] PROBLEM - Apache HTTP on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:58] PROBLEM - Apache HTTP on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:00] PROBLEM - PHP7 rendering on mw1390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:00] PROBLEM - PHP7 rendering on mw1317 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:02] PROBLEM - Apache HTTP on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:02] PROBLEM - PHP7 rendering on mw1363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:04] PROBLEM - PHP7 rendering on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:04] PROBLEM - Apache HTTP on mw1412 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:04] PROBLEM - Apache HTTP on mw1394 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:06] PROBLEM - PHP7 rendering on mw1404 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:08] PROBLEM - Apache HTTP on mw1388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:08] PROBLEM - Apache HTTP on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:08] PROBLEM - PHP7 rendering on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:12] PROBLEM - Apache HTTP on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:12] PROBLEM - Apache HTTP on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:12] PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:12] PROBLEM - Apache HTTP on mw1406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:14] PROBLEM - Apache HTTP on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:14] PROBLEM - Apache HTTP on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:18] PROBLEM - Apache HTTP on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:18] PROBLEM - PHP7 rendering on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:18] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:36:18] PROBLEM - PHP7 rendering on mw1394 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:20] PROBLEM - Apache HTTP on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:22] PROBLEM - PHP7 rendering on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:22] PROBLEM - PHP7 rendering on mw1398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:22] PROBLEM - PHP7 rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:22] PROBLEM - PHP7 rendering on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:22] PROBLEM - PHP7 rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:22] PROBLEM - Apache HTTP on mw1379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:23] PROBLEM - PHP7 rendering on mw1356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:23] PROBLEM - Apache HTTP on mw1382 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:24] PROBLEM - Apache HTTP on mw1360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:24] I am killing all the selects [10:36:26] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:36:26] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1348.eqiad.wmnet, mw1386.eqiad.wmnet, mw1378.eqiad.wmnet, mw1390.eqiad.wmnet, mw1388.eqiad.wmnet, mw1345.eqiad.wmnet, mw1282.eqiad.wmnet, mw1408.eqiad.wmnet, mw1398.eqiad.wmnet, mw1357.eqiad.wmnet, mw1317.eqiad.wmnet, mw1290.eqiad.wmnet, mw1316.eqiad.wmnet, mw1342.eqiad.wmnet, mw1382. 
[10:36:26] 89.eqiad.wmnet, mw1341.eqiad.wmnet, mw1360.eqiad.wmnet, mw1313.eqiad.wmnet, mw1346.eqiad.wmnet, mw1361.eqiad.wmnet, mw1392.eqiad.wmnet, mw1287.eqiad.wmnet, mw1288.eqiad.wmnet, mw1281.eqiad.wmnet, mw1314.eqiad.wmnet, mw1412.eqiad.wmnet, mw1396.eqiad.wmnet, mw1404.eqiad.wmnet, mw1283.eqiad.wmnet, mw1381.eqiad.wmnet, mw1362.eqiad.wmnet, mw1340.eqiad.wmnet, mw1343.eqiad.wmnet, mw1347.eqiad.wmnet, mw1377.eqiad.wmnet, mw1286.eqiad.wmne [10:36:26] mnet, mw1363.eqiad.wmnet, mw1359.eqiad.wmnet, mw1400.eqiad.wmnet, mw1383.eqiad.wmnet, mw1297.eqiad.wmnet, mw1375.eqiad.wmnet, mw1315.eqiad.wmnet, mw1285.eqiad.wmnet, mw1402.eqiad.wmnet, https://wikitech.wikimedia.org/wiki/PyBal [10:36:28] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [10:36:28] PROBLEM - Apache HTTP on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:28] PROBLEM - PHP7 rendering on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:28] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [10:36:28] PROBLEM - PHP7 rendering on mw1379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:29] PROBLEM - Apache HTTP on mw1374 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:32] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title 
from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:36:40] PROBLEM - PHP7 rendering on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:40] PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:40] PROBLEM - Apache HTTP on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:40] PROBLEM - PHP7 rendering on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:40] PROBLEM - PHP7 rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:42] PROBLEM - PHP7 rendering on mw1358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:42] PROBLEM - PHP7 rendering on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:48] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [10:36:50] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:36:50] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata [10:36:50] e on English Wikipedia returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML fo [10:36:50] ned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [10:36:50] PROBLEM - Apache HTTP on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:50] PROBLEM - PHP7 rendering on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:51] PROBLEM - PHP7 rendering on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:51] PROBLEM - Apache HTTP on mw1376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:52] PROBLEM - Apache HTTP on mw1400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:37:08] PROBLEM - Apache HTTP on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:37:08] PROBLEM - PHP7 rendering on mw1378 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:37:08] PROBLEM - Apache HTTP on mw1363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:37:10] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1284.eqiad.wmnet, mw1346.eqiad.wmnet, mw1380.eqiad.wmnet, mw1390.eqiad.wmnet, mw1357.eqiad.wmnet, mw1362.eqiad.wmnet, mw1361.eqiad.wmnet, mw1392.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1287.eqiad.wmnet, mw1348.eqiad.wmnet, mw1288.eqiad.wmnet, mw1281.eqiad.wmnet, mw1314.eqiad.wmnet, mw1386.eqiad.wmnet, mw1410 [10:37:10] 402.eqiad.wmnet, mw1404.eqiad.wmnet, mw1283.eqiad.wmnet, mw1381.eqiad.wmnet, mw1388.eqiad.wmnet, mw1340.eqiad.wmnet, mw1343.eqiad.wmnet, mw1347.eqiad.wmnet, mw1377.eqiad.wmnet, mw1345.eqiad.wmnet, mw1339.eqiad.wmnet, mw1286.eqiad.wmnet, mw1282.eqiad.wmnet, mw1412.eqiad.wmnet, mw1398.eqiad.wmnet, mw1408.eqiad.wmnet, mw1384.eqiad.wmnet, mw1376.eqiad.wmnet, mw1363.eqiad.wmnet, mw1315.eqiad.wmnet, mw1317.eqiad.wmnet, mw1290.eqiad.wmn [10:37:10] wmnet, mw1379.eqiad.wmnet, mw1396.eqiad.wmnet, mw1312.eqiad.wmnet, mw1394.eqiad.wmnet, mw1 https://wikitech.wikimedia.org/wiki/PyBal [10:37:12] PROBLEM - Apache HTTP on mw1287 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:37:20] RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 9.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:37:22] PROBLEM - Apache HTTP on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:37:22] PROBLEM - PHP7 rendering on mw1376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:37:30] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:37:34] PROBLEM - PHP7 rendering on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:37:34] PROBLEM - PHP7 rendering on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:37:34] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [10:37:34] PROBLEM - Apache HTTP on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:37:35] PROBLEM - PHP7 rendering on mw1357 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:37:36] PROBLEM - PHP7 rendering on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:37:36] PROBLEM - PHP7 rendering on mw1382 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:37:36] PROBLEM - Apache HTTP on mw1386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:37:36] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:37:38] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:37:38] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [10:37:38] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [10:37:40] PROBLEM - Apache HTTP on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:37:44] PROBLEM - PHP7 rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:37:46] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:37:48] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:37:48] PROBLEM - Apache HTTP on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:37:50] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:37:50] PROBLEM - Apache HTTP on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:37:50] RECOVERY - Apache HTTP on mw1345 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.563 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:37:52] RECOVERY - Apache HTTP on mw1394 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.449 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:37:54] RECOVERY - PHP7 rendering on mw1404 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 7.699 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:37:56] RECOVERY - Apache HTTP on mw1388 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.570 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:37:58] RECOVERY - PHP7 rendering on mw1346 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 9.903 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:37:58] PROBLEM - PHP7 rendering on mw1381 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:37:58] RECOVERY - Apache HTTP on mw1406 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 6.989 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:38:04] RECOVERY - Apache HTTP on mw1360 is OK: HTTP OK: HTTP/1.1 
302 Found - 640 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:38:04] RECOVERY - Apache HTTP on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 7.793 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:38:04] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw1284.eqiad.wmnet, mw1346.eqiad.wmnet, mw1404.eqiad.wmnet, mw1357.eqiad.wmnet, mw1361.eqiad.wmnet, mw1392.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1287.eqiad.wmnet, mw1388.eqiad.wmnet, mw1288.eqiad.wmnet, mw1281.eqiad.wmnet, mw1314.eqiad.wmnet, mw1386.eqiad.wmnet, mw1348.eqiad.wmnet, mw1402.eqiad.wmnet, mw1390. [10:38:04] 83.eqiad.wmnet, mw1381.eqiad.wmnet, mw1362.eqiad.wmnet, mw1340.eqiad.wmnet, mw1343.eqiad.wmnet, mw1313.eqiad.wmnet, mw1347.eqiad.wmnet, mw1377.eqiad.wmnet, mw1345.eqiad.wmnet, mw1286.eqiad.wmnet, mw1282.eqiad.wmnet, mw1412.eqiad.wmnet, mw1398.eqiad.wmnet, mw1408.eqiad.wmnet, mw1376.eqiad.wmnet, mw1363.eqiad.wmnet, mw1359.eqiad.wmnet, mw1317.eqiad.wmnet, mw1290.eqiad.wmnet, mw1316.eqiad.wmnet, mw1379.eqiad.wmnet, mw1400.eqiad.wmne [10:38:04] mnet, mw1406.eqiad.wmnet, mw1375.eqiad.wmnet, mw1342.eqiad.wmnet, mw1378.eqiad.wmnet, mw1382.eqiad.wmnet, mw1289.eqiad.wmnet, mw1315.eqiad.wmnet, mw1341.eqiad.wmnet, mw1285.eqiad.wmnet, https://wikitech.wikimedia.org/wiki/PyBal [10:38:05] PROBLEM - Apache HTTP on mw1398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:38:06] RECOVERY - Apache HTTP on mw1382 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 3.992 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:38:06] RECOVERY - PHP7 rendering on mw1356 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 5.250 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:38:06] PROBLEM - PHP7 rendering on mw1348 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:38:06] RECOVERY - PHP7 rendering on mw1394 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 9.584 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:38:08] RECOVERY - PHP7 rendering on mw1289 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 8.350 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:38:10] RECOVERY - PHP7 rendering on mw1342 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 9.075 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:38:10] RECOVERY - PHP7 rendering on mw1398 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 9.398 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:38:14] RECOVERY - PHP7 rendering on mw1340 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 9.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:38:15] RECOVERY - Apache HTTP on mw1342 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 9.244 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:38:15] RECOVERY - Apache HTTP on mw1374 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.456 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:38:22] RECOVERY - Apache HTTP on mw1315 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 5.198 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:38:24] RECOVERY - PHP7 rendering on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 4.205 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:38:24] RECOVERY - PHP7 rendering on mw1345 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 7.040 
second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:38:24] RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 7.319 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:38:26] RECOVERY - PHP7 rendering on mw1312 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 9.204 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:38:28] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:38:30] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most [10:38:30] January 1, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [10:38:32] RECOVERY - PHP7 rendering on mw1347 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 2.708 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:38:38] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:38:38] RECOVERY - Apache HTTP on mw1377 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.990 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:38:40] RECOVERY - Apache HTTP on mw1400 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.762 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [10:38:40] RECOVERY - PHP7 rendering on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 7.513 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:38:52] RECOVERY - Apache HTTP on mw1363 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:38:54] RECOVERY - Apache HTTP on mw1341 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.193 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:38:54] RECOVERY - PHP7 rendering on mw1378 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 2.163 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:39:00] RECOVERY - Apache HTTP on mw1380 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:00] RECOVERY - Apache HTTP on mw1287 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 3.071 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:10] RECOVERY - PHP7 rendering on mw1383 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:39:14] RECOVERY - PHP7 rendering on mw1376 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.298 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:39:14] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 990 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:39:14] RECOVERY - Apache HTTP on mw1339 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 2.725 second response 
time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:18] RECOVERY - PHP7 rendering on mw1410 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.749 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:39:26] RECOVERY - PHP7 rendering on mw1288 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.422 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:39:28] RECOVERY - PHP7 rendering on mw1357 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:39:28] RECOVERY - Apache HTTP on mw1343 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 0.265 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:28] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:39:28] RECOVERY - PHP7 rendering on mw1283 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 2.188 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:39:28] RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:28] RECOVERY - PHP7 rendering on mw1380 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:39:29] RECOVERY - PHP7 rendering on mw1382 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:39:30] RECOVERY - Apache HTTP on mw1386 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 0.960 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:32] 
RECOVERY - Apache HTTP on mw1390 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 4.598 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:32] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:39:34] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:39:34] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [10:39:34] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [10:39:38] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:39:38] RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:38] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:39:40] RECOVERY - Apache HTTP on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 6.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:40] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:39:40] RECOVERY - PHP7 rendering on mw1285 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.989 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:39:40] RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.050 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [10:39:40] RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 2.167 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:40] RECOVERY - PHP7 rendering on mw1344 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 2.510 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:39:42] RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:42] RECOVERY - PHP7 rendering on mw1390 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:39:42] RECOVERY - PHP7 rendering on mw1317 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:39:44] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:39:44] RECOVERY - Apache HTTP on mw1340 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 3.343 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:44] RECOVERY - PHP7 rendering on mw1363 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:39:44] RECOVERY - Apache HTTP on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.497 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:45] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:39:46] RECOVERY - PHP7 rendering on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 654 
bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:39:46] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:39:46] RECOVERY - Apache HTTP on mw1412 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:52] RECOVERY - PHP7 rendering on mw1381 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 2.838 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:39:54] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:54] RECOVERY - Apache HTTP on mw1297 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 4.662 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:56] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:56] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.747 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:58] RECOVERY - Apache HTTP on mw1398 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 0.347 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:58] RECOVERY - Apache HTTP on mw1362 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.460 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:40:00] RECOVERY - Apache HTTP on mw1344 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.919 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:40:00] RECOVERY - Apache HTTP on mw1288 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.053 
second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:40:00] RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.312 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:40:02] RECOVERY - PHP7 rendering on mw1377 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.491 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:40:04] RECOVERY - Apache HTTP on mw1379 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.148 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:40:05] RECOVERY - PHP7 rendering on mw1313 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.239 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:40:05] RECOVERY - Apache HTTP on mw1313 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 2.554 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:40:06] RECOVERY - PHP7 rendering on mw1286 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.451 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:40:06] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:40:08] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:40:12] RECOVERY - PHP7 rendering on mw1379 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.363 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:40:18] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:40:18] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [10:40:18] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:40:20] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:40:25] PROBLEM - MariaDB Replica IO: s4 #page on db1142 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1040, Errmsg: error reconnecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Too many connections https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:40:25] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:40:26] RECOVERY - PHP7 rendering on mw1358 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.765 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:40:26] RECOVERY - PHP7 rendering on mw1339 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 3.296 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:40:30] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:40:38] RECOVERY - Apache HTTP on mw1376 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 2.965 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:40:38] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [10:40:38] RECOVERY - PHP7 rendering on mw1362 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 3.726 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:40:40] 
RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [10:40:57] !log volans@cumin1001 dbctl commit (dc=all): 'S4 RO, outage', diff saved to https://phabricator.wikimedia.org/P15585 and previous config saved to /var/cache/conftool/dbconfig/20210427-104057-volans.json [10:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:50] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [10:42:30] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [10:42:45] PROBLEM - MariaDB Replica IO: s4 #page on db1144 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1040, Errmsg: error connecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Too many connections https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:42:46] PROBLEM - MariaDB read only s4 #page on db1138 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:42:53] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2548 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [10:43:06] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between 
hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:43:34] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 1.746e+04 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:43:36] (Merged) jenkins-bot: WikiPageConfigValidation: Mentor lists and help desk can be null [extensions/GrowthExperiments] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/682812 (https://phabricator.wikimedia.org/T281229) (owner: Urbanecm) [10:43:41] PROBLEM - MariaDB Replica IO: s4 #page on db1160 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1053, Errmsg: error connecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Server shutdown in progress https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:44:04] PROBLEM - MariaDB Replica IO: s4 on db2090 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1053, Errmsg: error connecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Server shutdown in progress https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:44:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:44:28] PROBLEM - MariaDB Replica IO: s4 on db1145 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error connecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1138.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:44:53] 
RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5643 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [10:45:00] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [10:45:13] PROBLEM - MariaDB Replica IO: s4 #page on db1121 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error connecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1138.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:45:58] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.01538 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [10:46:44] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [10:46:48] NOTE: all deployments are on hold until a further announcement is made [10:47:12] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [10:47:30] apergos: I put that into the topic, this will get flooded by icinga [10:47:34] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:47:43] good call [10:47:58] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:48:18] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:48:25] Urbanecm: can you put s4/commons RO being known there too? 
/me is surprised that no-one has been asking about that yet [10:48:34] good point [10:48:43] and maybe -tech if you have access there [10:49:28] Majavah: done both chans [10:49:32] ty [10:49:35] np [10:51:54] 10SRE, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (10Elitre) [10:52:40] 10SRE, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (10Elitre) @sgrabarczuk @Trizek-WMF ^^^ [10:55:12] (03CR) 10Michael Große: [C: 03+1] "seems reasonable to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682608 (https://phabricator.wikimedia.org/T204031) (owner: 10Tonina Zhelyazkova) [10:55:28] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 128 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:56:34] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:57:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:59:15] RECOVERY - MariaDB Replica IO: s4 #page on db1143 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:59:43] Problems when deleting in eswiki [10:59:48] RECOVERY - MariaDB Replica IO: s4 on db1145 is OK: OK slave_io_state 
Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:59:58] jem: we are having issues with commons, so might be related [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T1100). Please do the needful. [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:00:05] Ok [11:00:11] RECOVERY - MariaDB Replica IO: s4 #page on db1160 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:00:19] Just to note that it's not just Commons :) [11:00:19] nope, no deploy rn [11:00:28] no deploys [11:00:28] PROBLEM - Device not healthy -SMART- on labstore1007 is CRITICAL: cluster=wmcs device=1I:1:5 instance=labstore1007 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1007&var-datasource=eqiad+prometheus/ops [11:00:48] jem: do you have a concrete problem? [11:00:52] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 554 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:01:10] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 140 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:01:32] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [11:02:01] RECOVERY - MariaDB Replica IO: s4 #page on db1142 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:03:43] Amir1: maybe image global usage? some reason deletion is not being done in a job, just directly in the hook https://github.com/wikimedia/mediawiki-extensions-GlobalUsage/blob/fd85afae25cab78bc40991fab79a61b5b42c1ed4/includes/Hooks.php#L125 [11:04:02] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [11:04:19] RECOVERY - MariaDB Replica IO: s4 #page on db1148 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:05:08] Amir1: Error when deleting a page, let me try again [11:05:20] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 42 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:05:27] jem: in which wiki? [11:05:31] "[f72940ef-e506-4cf5-89cc-703d37572f9c] 2021-04-27 11:05:16: Excepción grave de tipo "Wikimedia\Rdbms\DBReadOnlyError" [11:05:34] Amir1: eswiki [11:06:18] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. 
https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [11:06:31] Amir1: yeah confirmed on testwiki, globalusage blocks deletion because commons is ro [11:06:32] Recentchanges has activity [11:06:35] RECOVERY - MariaDB Replica IO: s4 #page on db1141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:06:58] jem: yup, it's known, wait for a while until we fix the commons issue, and then it should work again :) [11:07:19] Ok, thanks, Urbanecm :) [11:07:27] I'll keep an eye here [11:07:51] (03Abandoned) 10Hnowlan: site: set role for eventlog1003 to eventlog [puppet] - 10https://gerrit.wikimedia.org/r/681652 (https://phabricator.wikimedia.org/T280679) (owner: 10Hnowlan) [11:09:37] !log ladsgroup@deploy1002 Synchronized wmf-config/CommonSettings.php: Disable GlobalUsage (duration: 01m 08s) [11:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:49] marostegui: synced [11:09:56] ok, going to remove RW [11:10:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove RW from commonswiki', diff saved to https://phabricator.wikimedia.org/P15588 and previous config saved to /var/cache/conftool/dbconfig/20210427-111016-marostegui.json [11:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:31] Amir1: nO DePLoYs :D [11:10:49] (sorry) [11:11:04] haha [11:11:09] RECOVERY - MariaDB Replica IO: s4 #page on db1144 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:11:54] Deletion worked :) [11:12:03] \o/ [11:12:37] commons recent changes is now flowing [11:12:47] <_joe_> Urbanecm: update topic? commons is not ro anymore [11:13:06] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:13:19] RECOVERY - MariaDB Replica IO: s4 #page on db1121 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:13:20] RECOVERY - MariaDB read only s4 #page on db1138 is OK: Version 10.1.43-MariaDB, Uptime 935s, read_only: False, event_scheduler: True, 1768.29 QPS, connection latency: 0.001838s, query latency: 0.000246s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:13:20] RECOVERY - MariaDB Replica IO: s4 on db1150 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:14:09] <_joe_> we should remember to kill changeprop when we go read-only maybe [11:14:22] <_joe_> although tbf it will retry [11:15:47] _joe_: sure, I'll update it [11:16:09] <_joe_> just because I saw you had op already :) [11:17:09] sure :) [11:17:12] also updated in -tech [11:17:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:17:50] RECOVERY - MariaDB Replica IO: s4 on db2090 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:18:34] _joe_: should i add that as a action item ( kill changeprop when we go read-only) [11:18:53] <_joe_> jbond42: yeah on second thoughts, it's not needed probably [11:19:29] _joe_: ack added thanks [11:22:17] RECOVERY - MariaDB Replica IO: s4 #page on db1146 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:30:43] !log 
elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker [11:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:55] jayme: --^ [11:31:08] elukey: ack, thx! [11:31:48] RECOVERY - Device not healthy -SMART- on labstore1007 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1007&var-datasource=eqiad+prometheus/ops [11:33:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:36:49] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) [11:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: Repool db1157', diff saved to https://phabricator.wikimedia.org/P15589 and previous config saved to /var/cache/conftool/dbconfig/20210427-114108-root.json [11:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:53] 10SRE, 10serviceops, 10Sustainability (Incident Followup): Ensure Changeprop is disabled when the databases are in read only mode - https://phabricator.wikimedia.org/T281240 (10jbond) [11:42:55] (03PS1) 10QChris: Add .gitreview [software/pipermail-redirector] - 10https://gerrit.wikimedia.org/r/682932 [11:42:57] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [software/pipermail-redirector] - 10https://gerrit.wikimedia.org/r/682932 (owner: 10QChris) [11:47:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:50:45] 
10SRE, 10ChangeProp, 10serviceops, 10Sustainability (Incident Followup): Ensure Changeprop is disabled when the databases are in read only mode - https://phabricator.wikimedia.org/T281240 (10hnowlan) [11:51:03] (03PS1) 10Ladsgroup: Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/682813 (https://phabricator.wikimedia.org/T281238) [11:51:18] (03PS1) 10Ladsgroup: Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/682814 (https://phabricator.wikimedia.org/T281238) [11:54:47] (03CR) 10Ladsgroup: [C: 03+2] Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/682814 (https://phabricator.wikimedia.org/T281238) (owner: 10Ladsgroup) [11:54:52] (03CR) 10Ladsgroup: [C: 03+2] Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/682813 (https://phabricator.wikimedia.org/T281238) (owner: 10Ladsgroup) [11:56:06] (03CR) 10jerkins-bot: [V: 04-1] Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/682814 (https://phabricator.wikimedia.org/T281238) (owner: 10Ladsgroup) [11:56:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: Repool db1157', diff saved to https://phabricator.wikimedia.org/P15590 and previous config saved to /var/cache/conftool/dbconfig/20210427-115612-root.json [11:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:52] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 85 probes of 637 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:57:13] 
(03PS1) 10Ladsgroup: URGENT: Disable GlobalUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682935 [11:57:55] (03CR) 10Ladsgroup: [C: 03+2] URGENT: Disable GlobalUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682935 (owner: 10Ladsgroup) [11:58:46] (03PS2) 10Urbanecm: Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/682814 (https://phabricator.wikimedia.org/T281238) (owner: 10Ladsgroup) [11:59:00] (03Merged) 10jenkins-bot: URGENT: Disable GlobalUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682935 (owner: 10Ladsgroup) [12:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T1200) [12:00:33] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on labstore1007.wikimedia.org with reason: T281045 [12:00:33] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on labstore1007.wikimedia.org with reason: T281045 [12:00:39] (03CR) 10Ladsgroup: [C: 03+2] Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/682814 (https://phabricator.wikimedia.org/T281238) (owner: 10Ladsgroup) [12:00:41] 10SRE, 10ChangeProp, 10serviceops, 10Sustainability (Incident Followup): Ensure Changeprop is disabled when the databases are in read only mode - https://phabricator.wikimedia.org/T281240 (10Joe) To be clear, the idea came out of the fact that during read-only time we had a lot of jobs failing, but given w... 
[12:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:43] T281045: labstore1007 crashed after storage controller errors - https://phabricator.wikimedia.org/T281045 [12:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:32] (03Merged) 10jenkins-bot: Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/682813 (https://phabricator.wikimedia.org/T281238) (owner: 10Ladsgroup) [12:03:10] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 54 probes of 637 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:06:00] (03Merged) 10jenkins-bot: Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/682814 (https://phabricator.wikimedia.org/T281238) (owner: 10Ladsgroup) [12:10:32] (03CR) 10ZPapierski: "> Patch Set 12:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [12:11:00] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [12:11:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: Repool db1157', diff saved to https://phabricator.wikimedia.org/P15591 and previous config saved to /var/cache/conftool/dbconfig/20210427-121115-root.json [12:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:04] PROBLEM - mediawiki originals uploads -hourly- for eqiad on 
alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [12:12:38] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.3/extensions/GlobalUsage: Backport: [[gerrit:682813|Avoid reading primary unless absolutely necessary (T281238)]] (duration: 01m 09s) [12:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:49] T281238: GlobalUsage does selects on the master database - https://phabricator.wikimedia.org/T281238 [12:13:47] (03CR) 10ArielGlenn: "Will this work for timers like this one: https://github.com/wikimedia/puppet/blob/production/modules/snapshot/manifests/cron/pagetitles.pp" [puppet] - 10https://gerrit.wikimedia.org/r/682922 (owner: 10Jbond) [12:15:10] RECOVERY - snapshot of s3 in codfw on alert1001 is OK: Last snapshot for s3 at codfw (db2139.codfw.wmnet:3313) taken on 2021-04-27 08:43:32 (1037 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [12:20:12] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.1/extensions/GlobalUsage: Backport: [[gerrit:682814|Avoid reading primary unless absolutely necessary (T281238)]] (duration: 01m 09s) [12:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:21] T281238: GlobalUsage does selects on the master database - https://phabricator.wikimedia.org/T281238 [12:23:24] RECOVERY - Disk space on mwlog1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops [12:24:12] fine fine. 
in two-three days you will be much happier, mwlog1001 [12:24:53] 10SRE, 10Sustainability (Incident Followup): ReEnable GlobalUsage - https://phabricator.wikimedia.org/T281242 (10jbond) [12:25:12] 10SRE, 10Sustainability (Incident Followup): ReEnable GlobalUsage - https://phabricator.wikimedia.org/T281242 (10jbond) [12:25:38] 10SRE, 10Sustainability (Incident Followup): ReEnable GlobalUsage - https://phabricator.wikimedia.org/T281242 (10jbond) [12:26:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: Repool db1157', diff saved to https://phabricator.wikimedia.org/P15592 and previous config saved to /var/cache/conftool/dbconfig/20210427-122619-root.json [12:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:33] apergos: you mean, replaced with mwlog1002? :) [12:26:45] I'll be much happier if mwlog1001 doesn't exist anymore in 2-3 days... [12:27:06] 10SRE, 10Sustainability (Incident Followup), 10User-Ladsgroup: ReEnable GlobalUsage - https://phabricator.wikimedia.org/T281242 (10Ladsgroup) a:03Ladsgroup [12:27:31] (03PS1) 10Ladsgroup: Revert "URGENT: Disable GlobalUsage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682815 [12:27:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:28:04] (03PS2) 10Ladsgroup: Revert "URGENT: Disable GlobalUsage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682815 (https://phabricator.wikimedia.org/T281242) [12:28:07] no, I mean that wmf.3 is this week's train, right? 
and it has the 'quit logging every cache miss for externalstore kthxbai" patch [12:28:20] ugh mismatched ' and " and not correctable, the worst [12:28:38] anyways that will save a few hundred gb right there [12:29:15] (03CR) 10Ladsgroup: [C: 03+2] Revert "URGENT: Disable GlobalUsage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682815 (https://phabricator.wikimedia.org/T281242) (owner: 10Ladsgroup) [12:29:59] (03Merged) 10jenkins-bot: Revert "URGENT: Disable GlobalUsage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682815 (https://phabricator.wikimedia.org/T281242) (owner: 10Ladsgroup) [12:38:30] (03CR) 10Filippo Giunchedi: [C: 03+2] Decom ms-be[1019-1026] [puppet] - 10https://gerrit.wikimedia.org/r/682920 (https://phabricator.wikimedia.org/T272836) (owner: 10Filippo Giunchedi) [12:38:36] (03PS2) 10Filippo Giunchedi: Decom ms-be[1019-1026] [puppet] - 10https://gerrit.wikimedia.org/r/682920 (https://phabricator.wikimedia.org/T272836) [12:39:07] 10SRE, 10MediaWiki-General, 10Traffic, 10Browser-Support-Apple-Safari: File:Chessboard480.svg WEBP thumbnail version not visible on safari when size is fixed at 208px - https://phabricator.wikimedia.org/T280439 (10Daimona) 05Open→03Resolved a:03ema Working as expected now, thank you! 
[12:44:05] !log Restarted CI Jenkins for plugins upgrade [12:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:27] 10SRE, 10observability, 10MW-1.37-notes (1.37.0-wmf.3; 2021-04-27), 10Patch-For-Review, and 2 others: mwlog1001 is running out of free space on /srv/mw-log - https://phabricator.wikimedia.org/T281048 (10AMooney) [12:45:18] (03PS2) 10Jbond: systemd::timer::job: quote command as it may contain arguments [puppet] - 10https://gerrit.wikimedia.org/r/682922 [12:46:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29214/console" [puppet] - 10https://gerrit.wikimedia.org/r/682922 (owner: 10Jbond) [12:46:27] !log ladsgroup@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:682815|Revert "URGENT: Disable GlobalUsage" (T281242)]] (duration: 01m 08s) [12:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:36] T281242: ReEnable GlobalUsage - https://phabricator.wikimedia.org/T281242 [12:48:05] (03CR) 10Effie Mouzeli: changeprop/changeprop-jobqueue/api-gateway: Use the new rdbs [deployment-charts] - 10https://gerrit.wikimedia.org/r/614901 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [12:50:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks! 
I 'll merge and deploy 1 by 1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/614901 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [12:52:40] 10SRE, 10DBA, 10Sustainability (Incident Followup): Collect metricts for Exec_Master_Log_Pos - https://phabricator.wikimedia.org/T281251 (10jbond) [12:54:27] 10SRE, 10DBA, 10Sustainability (Incident Followup): Collect metricts for Exec_Master_Log_Pos - https://phabricator.wikimedia.org/T281251 (10jcrespo) [12:55:08] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts ms-be1019.eqiad.wmnet [12:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:46] PROBLEM - Check systemd state on ms-be1020 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:29] (03PS1) 10Filippo Giunchedi: hieradata: remove ms-be2016.yml, host long gone [puppet] - 10https://gerrit.wikimedia.org/r/682936 [12:57:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] api-gateway: Move networkpolicy to shared values [deployment-charts] - 10https://gerrit.wikimedia.org/r/682905 (owner: 10Alexandros Kosiaris) [12:59:03] (03PS1) 10Arturo Borrero Gonzalez: nftables: basefirewall: play with the order of the last counter rule [puppet] - 10https://gerrit.wikimedia.org/r/682937 [12:59:05] (03PS1) 10Arturo Borrero Gonzalez: nftables: basefirewall: introduce prometheus facility [puppet] - 10https://gerrit.wikimedia.org/r/682938 (https://phabricator.wikimedia.org/T281124) [12:59:07] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: enable prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/682939 (https://phabricator.wikimedia.org/T281124) [12:59:19] (03Merged) 10jenkins-bot: api-gateway: Move networkpolicy to shared values [deployment-charts] - 10https://gerrit.wikimedia.org/r/682905 (owner: 10Alexandros Kosiaris) [12:59:21] (03Merged) 10jenkins-bot: changeprop/changeprop-jobqueue/api-gateway: 
Use the new rdbs [deployment-charts] - 10https://gerrit.wikimedia.org/r/614901 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [13:00:04] liw and longma: That opportune time is upon us again. Time for a MediaWiki train - European+American Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T1300). [13:01:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:01:26] liw: longma: please not yet [13:01:55] (03PS1) 10Lars Wirzenius: group0 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682940 [13:01:57] (03CR) 10Lars Wirzenius: [C: 03+2] group0 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682940 (owner: 10Lars Wirzenius) [13:02:02] (03PS2) 10JMeybohm: Swap zookeeper from conf2003 to conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/682667 (https://phabricator.wikimedia.org/T271573) [13:02:19] 10SRE, 10Patch-For-Review, 10Sustainability (Incident Followup), 10User-Ladsgroup: ReEnable GlobalUsage - https://phabricator.wikimedia.org/T281242 (10Ladsgroup) 05Open→03Resolved [13:02:20] (03CR) 10Elukey: "Good start! The way that I approach it is by layers:" [puppet] - 10https://gerrit.wikimedia.org/r/682785 (https://phabricator.wikimedia.org/T278423) (owner: 10Razzi) [13:02:27] Urbanecm, I killed deploy-promote, what's up? [13:02:58] liw: mediawiki-staging is now in a weird state. I merged a patch there, then an incident came, and now I need to either sync or revert :) [13:03:19] Urbanecm, which do you prefer? 
[13:03:34] (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682940 (owner: 10Lars Wirzenius) [13:03:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/29215/" [puppet] - 10https://gerrit.wikimedia.org/r/682937 (owner: 10Arturo Borrero Gonzalez) [13:04:17] hrmph, can't abandon https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/682940 - apparently change is merged already [13:04:32] liw: sync, if possible. It fixes a bug you filed earlier today :) [13:04:56] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: remove ms-be2016.yml, host long gone [puppet] - 10https://gerrit.wikimedia.org/r/682936 (owner: 10Filippo Giunchedi) [13:04:59] Urbanecm, sync it is, what needs to be done? [13:05:18] I'll sync it and ping you, if that's ok :) [13:05:38] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be1019.eqiad.wmnet [13:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:46] Urbanecm, absolutely - I'll go make a pot of tea meanwhile [13:06:23] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts ms-be[1020-1026].eqiad.wmnet [13:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:00] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on conf[2004-2006].codfw.wmnet with reason: for zookeeper migration [13:07:02] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on conf[2004-2006].codfw.wmnet with reason: for zookeeper migration [13:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:08] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10ops-monitoring-bot) Icinga downtime set by jayme@cumin1001 for 2:00:00 3 host(s) and their services with reason: for 
zookeeper migration ` conf[2004-2006].codfw.wmnet ` [13:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:04] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.3/extensions/GrowthExperiments/includes/Config/WikiPageConfigValidation.php: fe2a0420fd884df7046c0c283bcb2e961e74e8e9: WikiPageConfigValidation: Mentor lists and help desk can be null (T281229) (duration: 01m 06s) [13:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:13] T281229: InvalidArgumentException: GrowthExperiments\Config\WikiPageConfigWriter::getCurrentWikiConfig failed to load config - https://phabricator.wikimedia.org/T281229 [13:09:18] (03CR) 10JMeybohm: [C: 03+2] Swap zookeeper from conf2003 to conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/682667 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [13:10:02] RECOVERY - Device not healthy -SMART- on ms-be1022 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1022&var-datasource=eqiad+prometheus/ops [13:10:33] liw: i'm done, thanks. floor is yours [13:10:56] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [13:11:02] (03PS3) 10Jbond: systemd::timer::job: quote command as it may contain arguments [puppet] - 10https://gerrit.wikimedia.org/r/682922 [13:11:04] (03CR) 10Ppchelko: "gosh... cmon envoy. This is so horrible it's almost doing a full circle to beautiful..." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/682921 (https://phabricator.wikimedia.org/T277585) (owner: 10Hnowlan) [13:11:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_zookeeper site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:11:39] (03PS1) 10Lars Wirzenius: group0 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682941 [13:11:41] (03CR) 10Lars Wirzenius: [C: 03+2] group0 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682941 (owner: 10Lars Wirzenius) [13:11:57] (03CR) 10Jbond: [V: 03+1] "I tested this on snapshot1008 pagetitles-ns0.service and the following command ended up getting issued" [puppet] - 10https://gerrit.wikimedia.org/r/682922 (owner: 10Jbond) [13:12:04] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [13:12:29] (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682941 (owner: 10Lars Wirzenius) [13:13:43] !log liw@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.3 [13:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:54] train at group0 [13:19:46] !log jayme@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers [13:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:24] 10SRE, 10ChangeProp, 10serviceops, 10Sustainability (Incident Followup): Ensure Changeprop is disabled when the databases are in read only mode - https://phabricator.wikimedia.org/T281240 (10Pchelolo) yeah, that's correct. We can increase the additional delay if needed. 
Also, this particular additional del... [13:21:30] 10SRE, 10serviceops: Ubtade grafana link for mediawiki-error-rate-$cluster check - https://phabricator.wikimedia.org/T281261 (10jbond) [13:21:55] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be[1020-1026].eqiad.wmnet [13:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:28] 10SRE, 10serviceops: Ubtade grafana link for mediawiki-error-rate-$cluster check - https://phabricator.wikimedia.org/T281261 (10jbond) @jijiki perhaps? [13:23:04] 10SRE, 10serviceops: Update grafana link for mediawiki-error-rate-$cluster in icinga check - https://phabricator.wikimedia.org/T281261 (10jbond) [13:23:18] !log jayme@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers [13:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:28] 10ops-eqiad, 10SRE-swift-storage, 10User-fgiunchedi: Decom ms-be[1019-1026] - https://phabricator.wikimedia.org/T272836 (10fgiunchedi) @Cmjohnson or @Jclark-ctr all yours, hosts ready for decom [13:23:44] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T280961 (10fgiunchedi) 05Open→03Declined Hosts is decom [13:23:46] 10ops-eqiad, 10SRE-swift-storage, 10User-fgiunchedi: Decom ms-be[1019-1026] - https://phabricator.wikimedia.org/T272836 (10fgiunchedi) [13:24:50] 10SRE, 10SRE-swift-storage: Some object-replicator log lines not making it to centrallog - https://phabricator.wikimedia.org/T264998 (10fgiunchedi) [13:24:52] 10SRE, 10SRE-swift-storage, 10Patch-For-Review, 10User-fgiunchedi: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 (10fgiunchedi) [13:26:32] (03PS2) 10JMeybohm: configcluster: No longer include zookeeper in old configcluster role [puppet] - 10https://gerrit.wikimedia.org/r/682669 (https://phabricator.wikimedia.org/T271573) [13:30:46] !log Upgrading CI Jenkins from 2.263.3 to 2.277.2 [13:30:50] 
(03PS1) 10Arturo Borrero Gonzalez: nftables: basefirewall: drop double-quote scaping [puppet] - 10https://gerrit.wikimedia.org/r/682942 [13:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] nftables: basefirewall: drop double-quote scaping [puppet] - 10https://gerrit.wikimedia.org/r/682942 (owner: 10Arturo Borrero Gonzalez) [13:33:01] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [13:33:01] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [13:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:10] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [13:34:10] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [13:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:26] kostajh, I see a fix was merged for T281226; would you be able to do a backport of it for train? [13:40:27] T281226: PHP Notice: Only variables should be assigned by reference - https://phabricator.wikimedia.org/T281226 [13:42:23] (03PS1) 10Alexandros Kosiaris: api-gateway: Clear-up the nutcracker configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/682945 [13:43:06] (03PS2) 10Alexandros Kosiaris: api-gateway: Clear-up the nutcracker configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/682945 [13:44:15] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[13:44:15] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' . [13:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:13] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' . [13:45:13] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [13:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:30] !log switchover api-gateway, changeprop, cpjobqueue to use the new redis cluster servers (rdb2007-rdb2010) [13:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:59] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [13:45:59] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' . [13:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:50] (03CR) 10Alexandros Kosiaris: [C: 03+2] api-gateway: Clear-up the nutcracker configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/682945 (owner: 10Alexandros Kosiaris) [13:48:39] !log uploaded openjdk-8 8u292-b10-0~deb10u1 (buster forward port of latest Java 8 security release) [13:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:08] or anyone else doing backports: would you be able to do a backport of it for train? 
[13:50:41] (03PS3) 10Alexandros Kosiaris: api-gateway: Clear-up the nutcracker configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/682945 [13:50:43] (03PS3) 10Alexandros Kosiaris: Remove rdb200{3,5} from netpols [deployment-charts] - 10https://gerrit.wikimedia.org/r/682912 (https://phabricator.wikimedia.org/T255250) [13:50:45] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [13:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:24] (03PS1) 10Andrew Bogott: Horizon: put into maintenance mode for Victoria upgrade [puppet] - 10https://gerrit.wikimedia.org/r/682948 (https://phabricator.wikimedia.org/T261137) [13:54:26] (03PS1) 10Andrew Bogott: cloud-vps eqiad1 -> version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/682949 (https://phabricator.wikimedia.org/T261137) [13:54:28] (03PS1) 10Andrew Bogott: Revert "Horizon: put into maintenance mode for Victoria upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/682950 (https://phabricator.wikimedia.org/T261137) [13:55:19] !log imported jenkins 2.277.3 to thirdparty/ci [13:55:24] RECOVERY - Too many messages in kafka logging-codfw #o11y on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-codfw&var-topic=All&var-consumer_group=All [13:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:42] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) @fgiunchedi please see the document you requested {F34429804} [13:56:12] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1001.eqiad.wmnet [13:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:40] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: put into maintenance mode for Victoria upgrade [puppet] - 10https://gerrit.wikimedia.org/r/682948 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott) [13:58:54] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1002.eqiad.wmnet [13:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:29] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 9 hosts with reason: upgrading openstack [14:00:32] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 9 hosts with reason: upgrading openstack [14:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:15] (03CR) 10Jbond: [C: 03+2] "merging after talking to Antoine" [puppet] - 10https://gerrit.wikimedia.org/r/670990 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [14:01:18] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 105 hosts with reason: upgrading openstack [14:01:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:56] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 105 hosts with reason: upgrading openstack [14:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:21] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1002.eqiad.wmnet [14:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:35] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [14:08:35] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [14:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:13] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1003.eqiad.wmnet [14:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:10] 10SRE, 10observability: Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10MoritzMuehlenhoff) [14:10:21] 10SRE, 10observability: Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10MoritzMuehlenhoff) p:05Triage→03High [14:11:05] (03PS1) 10Hashar: zuul-gearman.py: response must be decoded [puppet] - 10https://gerrit.wikimedia.org/r/682953 [14:11:08] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps eqiad1 -> version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/682949 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott) [14:11:22] (03CR) 10Hashar: "Follow up: https://gerrit.wikimedia.org/r/c/operations/puppet/+/682953 zuul-gearman.py: response must be decoded" [puppet] - 10https://gerrit.wikimedia.org/r/670990 (https://phabricator.wikimedia.org/T247364) (owner: 
10CRusnov) [14:11:39] (03CR) 10Jbond: [C: 03+2] systemd::timer::job: quote command as it may contain arguments [puppet] - 10https://gerrit.wikimedia.org/r/682922 (owner: 10Jbond) [14:13:48] (03CR) 10Jbond: [C: 03+2] zuul-gearman.py: response must be decoded [puppet] - 10https://gerrit.wikimedia.org/r/682953 (owner: 10Hashar) [14:14:34] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1003.eqiad.wmnet [14:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:46] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [14:15:46] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [14:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:17] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [14:16:17] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [14:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:06] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [14:17:06] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . 
[14:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:44] !log installing xen security updates [14:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:24] !log jayme@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) [14:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:46] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore2001.codfw.wmnet [14:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:57] !log jayme@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) [14:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:07] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2001.codfw.wmnet [14:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:22] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore2002.codfw.wmnet [14:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10RobH) These are failing to partition correctly during the initial imaging. I ran out of bandwidth troubleshooting this yesterday evening, and will retu... 
[14:31:10] !log installing imagemagick security updates [14:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:34] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-me [14:32:42] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [14:32:50] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2002.codfw.wmnet [14:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:28] !log dns2001 - depooling for T279457 (disable puppet + stop bird) [14:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:38] T279457: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 [14:33:46] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore2003.codfw.wmnet [14:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:16] PROBLEM - Sessionstore codfw on sessionstore.svc.codfw.wmnet is CRITICAL: /sessions/v1/{key} (Get value for key) is CRITICAL: Test Get value for key returned the unexpected status 500 (expecting: 200): /sessions/v1/{key} (Store value for key) is CRITICAL: Test Store value for 
key returned the unexpected status 500 (expecting: 201) https://www.mediawiki.org/wiki/Kask [14:34:58] hnowlan: --^ [14:36:39] !log cp203[56] - depool all etcd services via confctl - T279457 [14:36:44] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp203[56].codfw.wmnet [14:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:53] (03PS13) 10ZPapierski: rdf-streaming-updater: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [14:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:24] elukey: ack, thanks [14:37:30] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: add newline at the end of file [puppet] - 10https://gerrit.wikimedia.org/r/682955 [14:38:42] (03PS1) 10Ayounsi: cloudsw: manage OSPF [homer/public] - 10https://gerrit.wikimedia.org/r/682956 [14:38:44] (03PS1) 10Ayounsi: cloudsw: policy-options [homer/public] - 10https://gerrit.wikimedia.org/r/682957 [14:38:58] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:39:10] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:39:12] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2003.codfw.wmnet [14:39:19] BFD is from the dns2001 depool earlier, will ack [14:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:27] (03CR) 10jerkins-bot: [V: 04-1] cloudsw: manage OSPF [homer/public] - 10https://gerrit.wikimedia.org/r/682956 (owner: 10Ayounsi) [14:39:29] (03CR) 10jerkins-bot: [V: 04-1] cloudsw: policy-options [homer/public] - 10https://gerrit.wikimedia.org/r/682957 (owner: 10Ayounsi) [14:39:32] PROBLEM - Bird Internet Routing Daemon on dns2001 is CRITICAL: PROCS CRITICAL: 0 
processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:40:24] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:40:40] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:41:24] ACKNOWLEDGEMENT - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast Brandon Black T279457 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:41:24] ACKNOWLEDGEMENT - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast Brandon Black T279457 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:41:24] ACKNOWLEDGEMENT - Bird Internet Routing Daemon on dns2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird Brandon Black T279457 https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:41:24] ACKNOWLEDGEMENT - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 Brandon Black T279457 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:41:24] ACKNOWLEDGEMENT - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 Brandon Black T279457 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:41:30] RECOVERY - Sessionstore codfw on sessionstore.svc.codfw.wmnet is OK: All endpoints are healthy https://www.mediawiki.org/wiki/Kask [14:42:37] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: add newline at the end of file [puppet] - 10https://gerrit.wikimedia.org/r/682955 [14:43:59] (03PS1) 10Giuseppe Lavagetto: k8s::deployment_server: add ip addresses to discovery data [puppet] - 10https://gerrit.wikimedia.org/r/682959 [14:45:12] (03CR) 10jerkins-bot: [V: 04-1] k8s::deployment_server: add ip addresses to discovery data [puppet] - 10https://gerrit.wikimedia.org/r/682959 
(owner: 10Giuseppe Lavagetto) [14:46:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] Have rdb2010 replicate from rdb2009 [puppet] - 10https://gerrit.wikimedia.org/r/682891 (https://phabricator.wikimedia.org/T281216) (owner: 10Legoktm) [14:47:20] !log lvs2009 - disable puppet + stop pybal (internal services will move to lvs2010, please avoid LVS service definition changes for now!) - T279457 [14:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:29] T279457: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 [14:47:34] (03PS5) 10Alexandros Kosiaris: site.pp: make rdb2007, rdb2008 a redis cluster [puppet] - 10https://gerrit.wikimedia.org/r/614897 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [14:47:58] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Services have been migrated successfully, merging" [puppet] - 10https://gerrit.wikimedia.org/r/614897 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [14:48:08] (03PS2) 10Giuseppe Lavagetto: k8s::deployment_server: add ip addresses to discovery data [puppet] - 10https://gerrit.wikimedia.org/r/682959 [14:48:15] (03PS3) 10Alexandros Kosiaris: Have rdb2010 replicate from rdb2009 [puppet] - 10https://gerrit.wikimedia.org/r/682891 (https://phabricator.wikimedia.org/T281216) (owner: 10Legoktm) [14:48:17] (03CR) 10Hashar: [C: 03+1] "I have cherry picked this change on the integration puppet master, ran puppet on integration-agent-pkgbuilder-1002 and then ran the servic" [puppet] - 10https://gerrit.wikimedia.org/r/676133 (owner: 10Jbond) [14:48:35] !log installing file/libmagic updates from buster point release [14:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:54] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Have rdb2010 replicate from rdb2009 [puppet] - 10https://gerrit.wikimedia.org/r/682891 (https://phabricator.wikimedia.org/T281216) (owner: 10Legoktm) [14:49:14] 10SRE, 
10Dumps-Generation, 10observability: various weekly and daily dumps run from systemd timers are broken - https://phabricator.wikimedia.org/T281267 (10jbond) [14:49:32] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:49:42] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [14:49:55] (03PS3) 10Hashar: R:pbuilder_base: add extra packages to updates as well [puppet] - 10https://gerrit.wikimedia.org/r/676133 (https://phabricator.wikimedia.org/T240430) (owner: 10Jbond) [14:51:01] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: add newline at the end of file [puppet] - 10https://gerrit.wikimedia.org/r/682955 [14:51:10] (03PS3) 10Giuseppe Lavagetto: k8s::deployment_server: add ip addresses to discovery data [puppet] - 10https://gerrit.wikimedia.org/r/682959 [14:51:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10ArielGlenn) Ah ok! 
I didn't mean to be hasty, just saw the reimaging script runs and got excited :-) [14:52:31] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29219/console" [puppet] - 10https://gerrit.wikimedia.org/r/682959 (owner: 10Giuseppe Lavagetto) [14:52:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: add newline at the end of file [puppet] - 10https://gerrit.wikimedia.org/r/682955 (owner: 10Arturo Borrero Gonzalez) [14:52:48] (03PS1) 10Hashar: cloud - hieradata: add eatmydata to sid/bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682962 (https://phabricator.wikimedia.org/T240430) [14:53:55] (03PS1) 10Ahmon Dancy: rcfeed: Remove reference assignment [core] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/682818 (https://phabricator.wikimedia.org/T281226) [14:54:10] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10BBlack) Traffic stuff (lvs/cp/dns) is depooled, downtimed, and ready for the network fixups. [14:54:40] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] k8s::deployment_server: add ip addresses to discovery data [puppet] - 10https://gerrit.wikimedia.org/r/682959 (owner: 10Giuseppe Lavagetto) [14:56:11] (03CR) 10Hashar: [C: 03+1] "I have cherry picked it on the integration puppet master and ran puppet on integration-agent-pkgbuilder-1002 . 
That has properly updated c" [puppet] - 10https://gerrit.wikimedia.org/r/682962 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [14:56:52] 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Doing), 10Technical-Debt: debian-glue jobs ignored error messages about libeatmydata.so in LD_PRELOAD - https://phabricator.wikimedia.org/T240430 (10hashar) [14:57:50] (03PS1) 10Andrew Bogott: validatelabsfqdn.py: update to python3 and run through black [puppet] - 10https://gerrit.wikimedia.org/r/682965 [14:58:21] (03PS1) 10Arturo Borrero Gonzalez: nftables: basefirewall: reverse quotation [puppet] - 10https://gerrit.wikimedia.org/r/682966 [15:01:45] (03CR) 10Jbond: [C: 03+2] R:pbuilder_base: add extra packages to updates as well [puppet] - 10https://gerrit.wikimedia.org/r/676133 (https://phabricator.wikimedia.org/T240430) (owner: 10Jbond) [15:01:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] nftables: basefirewall: reverse quotation [puppet] - 10https://gerrit.wikimedia.org/r/682966 (owner: 10Arturo Borrero Gonzalez) [15:01:55] (03CR) 10Jbond: [C: 03+2] cloud - hieradata: add eatmydata to sid/bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682962 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [15:02:03] (03CR) 10Kosta Harlan: [C: 03+1] rcfeed: Remove reference assignment [core] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/682818 (https://phabricator.wikimedia.org/T281226) (owner: 10Ahmon Dancy) [15:02:51] arturo: I have merged your change as well, seemed pretty harmless [15:03:01] thanks [15:03:05] jbond42: 👍 [15:05:41] (03PS1) 10Hashar: Do not merge: dummy change to test CI [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/682967 [15:06:38] PROBLEM - configured eth on sretest1002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.139: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:06:40] PROBLEM - Check systemd state on
sretest1002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.139: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:09] 10SRE, 10ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10herron) [15:10:03] 10SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (10MoritzMuehlenhoff) [15:10:25] (03PS1) 10Arturo Borrero Gonzalez: nftables: basefirewall: nftables doesn't like strings with single quotes [puppet] - 10https://gerrit.wikimedia.org/r/682969 [15:10:59] 10SRE, 10ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10jcrespo) `ms-backup2002` and `ms-backup2001` are not yet fully into production -they will be soon (T276442), so they can be shutdown at any time. I got confused with backup* hosts, which can be shutdown, b... [15:11:04] 10SRE, 10ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10jcrespo) [15:11:08] ^ sretest1002 is expected, fixing [15:11:26] RECOVERY - Check systemd state on sretest1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:12] (03PS1) 10Giuseppe Lavagetto: kubernetes::deployment_server: only pass ipv4 addresses to egress rules [puppet] - 10https://gerrit.wikimedia.org/r/682971 [15:12:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/29221/" [puppet] - 10https://gerrit.wikimedia.org/r/682969 (owner: 10Arturo Borrero Gonzalez) [15:12:56] (03PS11) 10Volans: clustershell: allow to choose different reporters [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [15:12:58] (03PS4) 10Volans: CLI/clustershell: allow to disable progress bars [software/cumin] - 10https://gerrit.wikimedia.org/r/681692 
(https://phabricator.wikimedia.org/T212783) [15:13:00] (03PS4) 10Volans: setup.py: support more recent PyParsing versions [software/cumin] - 10https://gerrit.wikimedia.org/r/681758 [15:13:02] (03PS3) 10Volans: clustershell: instantiate progress bar earlier [software/cumin] - 10https://gerrit.wikimedia.org/r/682588 [15:18:56] !log Upgraded all Jenkins to 2.277.3 (latest LTS) # T279033 [15:23:03] !log cr1-codfw# set interfaces ae3 disable (to asw-c2-codfw) - T279457 [15:24:03] papaul: ^ [15:25:19] XioNoX: ok [15:25:27] just waiting on bblack [15:28:01] !log jayme@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker [15:28:14] elukey: ^^ [15:28:17] !log asw-c-codfw> request system power-off member 2 - T279457 [15:30:04] jayme: ack! [15:30:26] jayme: now you are an owner of Kafka and Mirror Maker, this task gets better and better for you :D [15:30:51] ouch [15:31:04] PROBLEM - Host elastic2045 is DOWN: PING CRITICAL - Packet loss = 100% [15:31:38] PROBLEM - Host ms-be2035 is DOWN: PING CRITICAL - Packet loss = 100% [15:31:52] PROBLEM - Host elastic2046 is DOWN: PING CRITICAL - Packet loss = 100% [15:32:12] PROBLEM - Host elastic2047 is DOWN: PING CRITICAL - Packet loss = 100% [15:32:40] PROBLEM - Host ms-be2034 is DOWN: PING CRITICAL - Packet loss = 100% [15:32:40] PROBLEM - Host ms-be2042 is DOWN: PING CRITICAL - Packet loss = 100% [15:32:46] (Emergency syslog message) firing: Emergency syslog message - https://alerts.wikimedia.org [15:32:52] PROBLEM - Host ms-be2048 is DOWN: PING CRITICAL - Packet loss = 100% [15:32:52] PROBLEM - Host ms-be2055 is DOWN: PING CRITICAL - Packet loss = 100% [15:32:56] PROBLEM - Host ms-fe2007 is DOWN: PING CRITICAL - Packet loss = 100% [15:33:02] these are all expected I think [15:33:20] not the emergency syslog message perhaps [15:33:39] is eqiad back into pool, BTW? [15:33:47] eqiad ms? 
[15:34:02] emergency syslog is from librenms, most likely the switch saying that one node went down [15:34:06] yeah we repooled yesterday [15:34:07] !log jayme@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) [15:34:27] XioNoX: ack, thanks [15:35:22] PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [15:37:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:05] (03PS2) 10Jbond: P:tlsproxy::envoy: refactor ssl configuertion [puppet] - 10https://gerrit.wikimedia.org/r/682982 [16:14:09] (03CR) 10jerkins-bot: [V: 04-1] cloudsw: loopback firewall filter [homer/public] - 10https://gerrit.wikimedia.org/r/682972 (owner: 10Ayounsi) [16:18:25] !log uploading cap_3.17.1-1 [16:18:30] !log uploading scap_3.17.1-1 [16:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:39] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/29227/" [puppet] - 10https://gerrit.wikimedia.org/r/682982 (owner: 10Jbond) [16:19:56] RECOVERY - configured eth on lvs2008 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [16:20:43] jouncebot is missing :/ [16:21:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 35): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29227/console" [puppet] - 10https://gerrit.wikimedia.org/r/682982 (owner: 10Jbond) [16:21:27] ah no! 
[16:21:30] jouncebot now [16:21:48] jouncebot now [16:21:48] For the next 1 hour(s) and 38 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T1600) [16:21:52] jouncebot next [16:21:52] In 0 hour(s) and 38 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T1700) [16:22:24] PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:22:50] !log upgrading scap 3.17.1-1 on mediawiki canaries - T279695 [16:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:02] T279695: Deploy Scap version 3.17.1-1 - https://phabricator.wikimedia.org/T279695 [16:23:12] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 39 hosts with reason: upgrading openstack [16:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:26] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 39 hosts with reason: upgrading openstack [16:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:32] (03CR) 10Ayounsi: "recheck" [homer/public] - 10https://gerrit.wikimedia.org/r/682956 (owner: 10Ayounsi) [16:25:15] (03CR) 10jerkins-bot: [V: 04-1] cloudsw: loopback firewall filter [homer/public] - 10https://gerrit.wikimedia.org/r/682972 (owner: 10Ayounsi) [16:25:20] (03CR) 10Ayounsi: "recheck" [homer/public] - 10https://gerrit.wikimedia.org/r/682957 (owner: 10Ayounsi) [16:25:22] (03CR) 10jerkins-bot: [V: 04-1] cloudsw: policy-options [homer/public] - 10https://gerrit.wikimedia.org/r/682957 (owner: 10Ayounsi) [16:26:20] (03PS1) 10WMDE-Fisch: Separate reference preview settings in beta & non-beta [extensions/Popups] (wmf/1.37.0-wmf.3) - 
10https://gerrit.wikimedia.org/r/682819 (https://phabricator.wikimedia.org/T281235) [16:27:41] (03PS3) 10Ayounsi: cloudsw: policy-options [homer/public] - 10https://gerrit.wikimedia.org/r/682957 [16:27:44] (03PS3) 10Ayounsi: cloudsw: loopback firewall filter [homer/public] - 10https://gerrit.wikimedia.org/r/682972 [16:28:16] 10SRE, 10Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (10Dzahn) >>! In T280989#7036967, @jcrespo wrote: > If this is temporary, no problem, if it is long term, it should be added to the list of ignoring monitoring for backups It's definitely temporary and a fres... [16:28:59] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Separate reference preview settings in beta & non-beta [extensions/Popups] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/682819 (https://phabricator.wikimedia.org/T281235) (owner: 10WMDE-Fisch) [16:29:09] (03PS2) 10Arturo Borrero Gonzalez: nftables: basefirewall: introduce prometheus facility [puppet] - 10https://gerrit.wikimedia.org/r/682938 (https://phabricator.wikimedia.org/T281124) [16:29:11] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: enable prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/682939 (https://phabricator.wikimedia.org/T281124) [16:29:42] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [16:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:52] (03CR) 10Ayounsi: "recheck" [homer/public] - 10https://gerrit.wikimedia.org/r/682957 (owner: 10Ayounsi) [16:30:49] (03CR) 10jerkins-bot: [V: 04-1] cloudsw: policy-options [homer/public] - 10https://gerrit.wikimedia.org/r/682957 (owner: 10Ayounsi) [16:30:51] (03CR) 10jerkins-bot: [V: 04-1] cloudsw: loopback firewall filter [homer/public] - 10https://gerrit.wikimedia.org/r/682972 (owner: 10Ayounsi) [16:32:34] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10BBlack) Note to our future selves: we forgot 
to consider the cross-row LVS connections in this downtime: lvs2008 and lvs2010 do not live in row C at all, but had cross-row connections via C2 to... [16:34:34] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:34:37] (03PS1) 10David Caro: ceph: pull all the packages except dbg [puppet] - 10https://gerrit.wikimedia.org/r/682988 [16:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ceph: pull all the packages except dbg [puppet] - 10https://gerrit.wikimedia.org/r/682988 (owner: 10David Caro) [16:36:37] 10SRE, 10Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (10jcrespo) No problem. Sadly, it is my job to bother people from time to time, making sure backups are working 0:-). [16:38:06] (03CR) 10David Caro: [C: 03+2] ceph: pull all the packages except dbg [puppet] - 10https://gerrit.wikimedia.org/r/682988 (owner: 10David Caro) [16:38:17] (03CR) 10Andrew Bogott: [C: 03+1] ceph: pull all the packages except dbg [puppet] - 10https://gerrit.wikimedia.org/r/682988 (owner: 10David Caro) [16:39:36] !log reprepro updating packages on thirdparty/ceph-nautilus-buster [16:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:04] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [16:41:24] (03PS3) 10Arturo Borrero Gonzalez: nftables: basefirewall: introduce prometheus facility [puppet] - 10https://gerrit.wikimedia.org/r/682938 (https://phabricator.wikimedia.org/T281124) [16:41:24] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media 
class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [16:44:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/29229/" [puppet] - 10https://gerrit.wikimedia.org/r/682938 (https://phabricator.wikimedia.org/T281124) (owner: 10Arturo Borrero Gonzalez) [16:49:02] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: enable prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/682939 (https://phabricator.wikimedia.org/T281124) [16:49:52] !log powerdown ms-be2042 for maintenance [16:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:20] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [16:52:23] !log powerdown elastic2045 for maintenance [16:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:15] (03PS1) 10MSantos: wikifeeds: bump to 2021-04-24-180651-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/682990 [16:55:16] (03PS1) 10Herron: remove all references to icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/682992 (https://phabricator.wikimedia.org/T279601) [16:55:35] (03PS1) 10MSantos: proton: bump to 2021-04-19-114221-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/682993 [16:57:19] (03CR) 10MSantos: [C: 03+2] wikifeeds: bump to 2021-04-24-180651-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/682990 (owner: 10MSantos) [16:59:16] (03Merged) 10jenkins-bot: wikifeeds: bump to 2021-04-24-180651-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/682990 (owner: 10MSantos) [16:59:28] PROBLEM - Host ms-be2042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:00:04] chrisalbon and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T1700). [17:00:07] (03PS1) 10Urbanecm: Add vrt-wiki.wikimedia.org and vrt-wiki.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/682996 (https://phabricator.wikimedia.org/T280400) [17:01:20] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [17:03:04] RECOVERY - Host ms-be2042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.45 ms [17:03:06] (03PS1) 10Herron: remove all references to icinga2001 [puppet] - 10https://gerrit.wikimedia.org/r/682999 (https://phabricator.wikimedia.org/T279602) [17:03:46] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [17:03:56] RECOVERY - Host ms-be2042 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms [17:04:14] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [17:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:13] (03CR) 10MSantos: [C: 03+2] proton: bump to 2021-04-19-114221-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/682993 (owner: 10MSantos) [17:05:27] (03PS1) 10Urbanecm: Add vrt-wiki.wikimedia.org to mediawiki.yaml [puppet] - 10https://gerrit.wikimedia.org/r/683000 (https://phabricator.wikimedia.org/T280400) [17:06:52] (03Merged) 10jenkins-bot: proton: bump to 2021-04-19-114221-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/682993 (owner: 10MSantos) [17:07:39] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [17:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:33] (03CR) 10Volans: [C: 03+2] clustershell: allow to choose different reporters [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [17:09:12] !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[17:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:27] (03CR) 10Volans: [C: 03+2] "No diff since last +1, just rebase with conflict resolution. self-merging." [software/cumin] - 10https://gerrit.wikimedia.org/r/681692 (https://phabricator.wikimedia.org/T212783) (owner: 10Volans) [17:09:36] (03CR) 10Volans: [C: 03+2] setup.py: support more recent PyParsing versions [software/cumin] - 10https://gerrit.wikimedia.org/r/681758 (owner: 10Volans) [17:09:36] PROBLEM - Host elastic2045.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:10:28] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmne... [17:10:35] (03CR) 10Volans: [C: 03+2] "No changes since last +1, just rebase conflict resolution, self-merging." [software/cumin] - 10https://gerrit.wikimedia.org/r/682588 (owner: 10Volans) [17:10:58] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . 
[17:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:16] (03PS1) 10Andrew Bogott: cloud-vps: set cloudvirt nodes to OpenStack U [puppet] - 10https://gerrit.wikimedia.org/r/683002 (https://phabricator.wikimedia.org/T261137) [17:11:44] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps: set cloudvirt nodes to OpenStack U [puppet] - 10https://gerrit.wikimedia.org/r/683002 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott) [17:12:28] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: enable prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/682939 (https://phabricator.wikimedia.org/T281124) [17:13:53] (03PS2) 10Andrew Bogott: cloud-vps: set cloudvirt nodes to OpenStack U [puppet] - 10https://gerrit.wikimedia.org/r/683002 (https://phabricator.wikimedia.org/T261137) [17:14:07] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . [17:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:30] RECOVERY - Host elastic2045 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [17:14:39] !log powerdown kafka-logging2003 for maintenance [17:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:30] (03PS5) 10Arturo Borrero Gonzalez: cloudgw: enable prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/682939 (https://phabricator.wikimedia.org/T281124) [17:16:04] RECOVERY - Host elastic2045.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.13 ms [17:16:22] (03PS1) 10MSantos: mobileapps: bump to 2021-04-27-171008-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/683003 [17:16:33] !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . 
[17:16:37] (03Merged) 10jenkins-bot: clustershell: allow to choose different reporters [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [17:16:39] (03Merged) 10jenkins-bot: CLI/clustershell: allow to disable progress bars [software/cumin] - 10https://gerrit.wikimedia.org/r/681692 (https://phabricator.wikimedia.org/T212783) (owner: 10Volans) [17:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:12] (03Merged) 10jenkins-bot: setup.py: support more recent PyParsing versions [software/cumin] - 10https://gerrit.wikimedia.org/r/681758 (owner: 10Volans) [17:17:14] (03Merged) 10jenkins-bot: clustershell: instantiate progress bar earlier [software/cumin] - 10https://gerrit.wikimedia.org/r/682588 (owner: 10Volans) [17:17:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/29234/" [puppet] - 10https://gerrit.wikimedia.org/r/682939 (https://phabricator.wikimedia.org/T281124) (owner: 10Arturo Borrero Gonzalez) [17:17:44] (03PS3) 10Andrew Bogott: cloud-vps: set cloudvirt nodes to OpenStack U [puppet] - 10https://gerrit.wikimedia.org/r/683002 (https://phabricator.wikimedia.org/T261137) [17:18:38] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2021-04-27-171008-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/683003 (owner: 10MSantos) [17:19:34] !log T281215 Banned `elastic2043` from codfw cirrussearch cluster [17:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:43] T281215: elastic2043 doesn't power up - https://phabricator.wikimedia.org/T281215 [17:20:20] (03Merged) 10jenkins-bot: mobileapps: bump to 2021-04-27-171008-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/683003 (owner: 10MSantos) [17:20:25] 10SRE, 10ops-codfw, 10Discovery: elastic2043 doesn't power up - https://phabricator.wikimedia.org/T281215 (10RKemper) `ryankemper@elastic2044:~$ 
curl -s localhost:9600/_cluster/health {"cluster_name":"production-search-psi-codfw","status":"green","timed_out":false,"number_of_nodes":17,"number_of_data_nodes":... [17:20:44] PROBLEM - Host kafka-logging2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:21:14] robh: https://netbox.wikimedia.org/ipam/prefixes/132/ip-addresses/ [17:21:24] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [17:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:29] (03Abandoned) 10Hashar: Do not merge: dummy change to test CI [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/682967 (owner: 10Hashar) [17:23:01] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [17:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:41] (03PS4) 10Andrew Bogott: cloud-vps: set cloudvirt nodes to OpenStack U [puppet] - 10https://gerrit.wikimedia.org/r/683002 (https://phabricator.wikimedia.org/T261137) [17:24:21] 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Doing), 10Technical-Debt: debian-glue jobs ignored error messages about libeatmydata.so in LD_PRELOAD - https://phabricator.wikimedia.org/T240430 (10hashar) 05Open→03Resolved I have confirmed that eatmydat... [17:24:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "xD" [puppet] - 10https://gerrit.wikimedia.org/r/683002 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott) [17:25:13] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: set cloudvirt nodes to OpenStack U [puppet] - 10https://gerrit.wikimedia.org/r/683002 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott) [17:25:22] !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[17:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:25] (03CR) 10Jcrespo: "Solution worked nicely, prepare now is much faster (probably due to parallelism)." [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682916 (https://phabricator.wikimedia.org/T281094) (owner: 10Jcrespo) [17:29:29] (03CR) 10Jcrespo: [C: 03+2] Release new v0.5 version [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682916 (https://phabricator.wikimedia.org/T281094) (owner: 10Jcrespo) [17:29:51] !log robh@cumin1001 START - Cookbook sre.dns.netbox [17:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:17] (03PS1) 10Arturo Borrero Gonzalez: nftables: indicate that service has restart [puppet] - 10https://gerrit.wikimedia.org/r/683011 [17:30:59] (03CR) 10jerkins-bot: [V: 04-1] nftables: indicate that service has restart [puppet] - 10https://gerrit.wikimedia.org/r/683011 (owner: 10Arturo Borrero Gonzalez) [17:31:46] (03PS1) 10Herron: kafka-logging: migrate logstash2001 broker to kafka-logging2001 [puppet] - 10https://gerrit.wikimedia.org/r/683012 (https://phabricator.wikimedia.org/T279342) [17:31:48] (03PS1) 10Herron: kafka-logging: migrate logstash2002 broker to kafka-logging2002 [puppet] - 10https://gerrit.wikimedia.org/r/683013 (https://phabricator.wikimedia.org/T279342) [17:31:50] (03PS1) 10Herron: kafka-logging: migrate logstash2003 broker to kafka-logging2003 [puppet] - 10https://gerrit.wikimedia.org/r/683014 (https://phabricator.wikimedia.org/T279342) [17:32:07] (03PS2) 10Arturo Borrero Gonzalez: nftables: indicate that service has restart [puppet] - 10https://gerrit.wikimedia.org/r/683011 [17:32:20] !log robh@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:44] RECOVERY - Host kafka-logging2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [17:32:57] (03CR) 10Dzahn: [C: 03+2] Add 
vrt-wiki.wikimedia.org and vrt-wiki.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/682996 (https://phabricator.wikimedia.org/T280400) (owner: 10Urbanecm) [17:34:01] thanks mutante :) [17:34:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/29237/" [puppet] - 10https://gerrit.wikimedia.org/r/683011 (owner: 10Arturo Borrero Gonzalez) [17:34:55] !log powerdown moss-fe2001 for maintenance [17:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:08] PROBLEM - swift codfw container availability low on alert1001 is CRITICAL: cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [17:42:32] PROBLEM - Host moss-fe2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:56] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:06] RECOVERY - Host moss-fe2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.63 ms [17:45:32] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:47:02] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:52:04] (03PS2) 10Jdlrobson: Enable language in header for office and testwiki users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682757 (https://phabricator.wikimedia.org/T280526) [17:54:01] (03PS3) 10Jdlrobson: Enable language in header for office and testwiki users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682757 
(https://phabricator.wikimedia.org/T280526) [17:55:18] (03PS1) 10Aaron Schulz: Add "mcrouter-master-dc" to $wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392) [17:55:20] (03PS1) 10Aaron Schulz: Set $wgChronologyProtectorStash to "mcrouter-master-dc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683023 [17:55:33] (03PS3) 10Aaron Schulz: Use $region for default mcrouter routes [puppet] - 10https://gerrit.wikimedia.org/r/654330 [17:55:46] 10SRE, 10Traffic, 10Patch-For-Review: Implement machine-local forwarding DNS caches - https://phabricator.wikimedia.org/T171498 (10mmodell) [17:55:56] jouncebot: refresh [17:55:57] I refreshed my knowledge about deployments. [17:56:11] (03CR) 10Aaron Schulz: "Blocked on https://gerrit.wikimedia.org/r/c/operations/puppet/+/654330" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683023 (owner: 10Aaron Schulz) [17:56:45] elukey: can you CR https://gerrit.wikimedia.org/r/c/operations/puppet/+/654330 ? [17:57:48] jouncebot: refresh [17:57:49] I refreshed my knowledge about deployments. [17:58:21] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1004.eqiad.wmnet with reason: REIMAGE [17:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:10] (03PS2) 10Jdlrobson: Rename RelatedArticlesFooterWhitelistedSkins to RelatedArticlesFooterAllowedSkins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681598 (https://phabricator.wikimedia.org/T277958) (owner: 10Phuedx) [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T1800). [18:00:04] Jdlrobson: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:26] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1004.eqiad.wmnet with reason: REIMAGE [18:00:29] o/ present [18:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:06] RECOVERY - Host ms-fe2007 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [18:01:21] robh: logmsgbot: [18:02:32] (03PS1) 10BBlack: [noop] remove eqiad upload storage override [puppet] - 10https://gerrit.wikimedia.org/r/683025 [18:02:40] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [18:02:46] (03PS1) 10BBlack: Puppetize cp501[3456] [puppet] - 10https://gerrit.wikimedia.org/r/683026 [18:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:12] (03PS2) 10BBlack: Puppetize cp501[3456] [puppet] - 10https://gerrit.wikimedia.org/r/683026 (https://phabricator.wikimedia.org/T278182) [18:03:45] Is anybody able to run the backport window? [18:03:49] Urbanecm: are you around? [18:03:58] Jdlrobson: yes [18:04:04] (03CR) 10jerkins-bot: [V: 04-1] Puppetize cp501[3456] [puppet] - 10https://gerrit.wikimedia.org/r/683026 (https://phabricator.wikimedia.org/T278182) (owner: 10BBlack) [18:04:05] let's get the wheel out :) [18:04:09] I can deploy today [18:04:15] (also is the list of deployers accurate? 
I'm pretty sure Niharika doesn't do backports any more) [18:04:38] (03CR) 10Urbanecm: [C: 03+2] Enable language in header for office and testwiki users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682757 (https://phabricator.wikimedia.org/T280526) (owner: 10Jdlrobson) [18:04:59] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:27] Jdlrobson: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/681598 has a -2 by Phuedx [18:06:03] from a technical point of view, who _can_ deploy it is accurate in the public repo [18:06:08] and she is still in it [18:06:33] (03Merged) 10jenkins-bot: Enable language in header for office and testwiki users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682757 (https://phabricator.wikimedia.org/T280526) (owner: 10Jdlrobson) [18:06:35] if people do not actually use their deployment access though, it is a good idea to ask for it to be removed [18:06:48] Urbanecm: ill ping him [18:06:52] there is little "offboarding" when it comes to that [18:07:44] Jdlrobson: thanks, I'm reluctant to override an explicit -2. [18:07:50] Urbanecm: yeh we can skip that one if necessary [18:08:11] I think it's valid because of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/RelatedArticles/+/680812 [18:08:15] i'll move this backport to thursday [18:08:28] okay [18:08:35] thanks for noticing that :) [18:08:39] i clearly need more coffee [18:08:41] Jdlrobson: the first patch is pulled onto mwdebug1001, please test :) [18:08:45] on it [18:09:51] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-main1004.eqiad.wmnet'] ` a... [18:10:24] LGTM! [18:10:29] syncing it [18:10:34] oh wait... 
[18:10:37] wait wait wait [18:10:40] okay, waiting [18:10:44] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmne... [18:10:54] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [18:10:55] something unexpected [18:11:04] !log dns2001 - restarting bird to repool, then re-enabling puppet - T279457 [18:11:06] take your time [18:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:12] `'default` doesn't seem to be applying correctly [18:11:13] T279457: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 [18:11:24] Could you check the value of `wgVectorLanguageInHeader` on English Wikipedia? 
[18:11:30] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [18:11:34] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:11:53] give me a sec [18:12:00] When I visit https://en.wikipedia.org/wiki/Peter_D%C3%B6ring on debug1001 for some strange reason I'm seeing a language button in the top right and that's not expected [18:12:13] this is what i see https://www.irccloud.com/pastebin/4G4lfCWX/ [18:12:32] https://en.wikipedia.org/wiki/Peter_Döring?useskinversion=2 sorry [18:12:40] RECOVERY - Bird Internet Routing Daemon on dns2001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:12:46] !log robh@cumin1001 START - Cookbook sre.dns.netbox [18:12:47] hmmmm very odd [18:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:53] and that seems to match the default from your patch [18:12:59] when you visit https://en.wikipedia.org/wiki/Peter_Döring?useskinversion=2 in debug1001 do you see a button in the top right? [18:13:12] https://usercontent.irccloud-cdn.com/file/YYFMhbrf/Screen%20Shot%202021-04-27%20at%2011.13.07%20AM.png [18:13:41] this is what i see https://usercontent.irccloud-cdn.com/file/sTSB3v0c/image.png [18:13:53] I see that in the top right [18:14:18] when i disable mwdebug, I don't see the "languages" thing [18:14:24] not sure if that's what your patch is supposed to touch [18:14:30] ohhhhh I think i see what's happening [18:14:51] I think the config value changed. It needs to be a boolean on group 1+2 wikis. 
[18:14:59] this is $wgVectorLanguageInHeader at mwdebug1002 https://www.irccloud.com/pastebin/6JpQHMGQ/ [18:15:01] ah rats [18:15:10] Can I remove the default line? [18:15:11] ...and you just noticed it as well :) [18:15:17] Or will that create more problems? [18:15:38] Jdlrobson: I _think_ it should work. Let me livehack it on mwdebug1002, one second. [18:15:38] not entirely sure if the configuration is smart enough to not have a default but have officewiki and testwiki overrides [18:16:25] !log robh@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:14] Jdlrobson: I applied this change on mwdebug1002, can you test if it works as you would expect? https://www.irccloud.com/pastebin/RIcqpOpo/ [18:17:21] !log robh@cumin1001 START - Cookbook sre.dns.netbox [18:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:38] !log robh@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [18:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:46] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 1.047 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:18:05] perfect Urbanecm [18:18:14] (03PS1) 10Jdlrobson: Drop default value for wgVectorLanguageInHeader [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683036 (https://phabricator.wikimedia.org/T280526) [18:18:24] sorry, does that mean "it works"? 
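Editorial sketch of the question raised above (can a setting carry officewiki and testwiki overrides without a `default` line?). This is a simplified, hypothetical model in Python, not MediaWiki's actual configuration code; `resolve`, `UNSET` and the `True` values are illustrative assumptions. The idea is that a wiki with no override and no `default` gets no configured value, so the software's own built-in default would apply.

```python
# Simplified, hypothetical model of per-wiki settings with an optional 'default'.
# This is NOT MediaWiki's real configuration code; all names are illustrative.
UNSET = object()  # marker: nothing configured, so the software's own default applies


def resolve(setting, wiki):
    """Return the override for `wiki`, else the 'default' entry, else UNSET."""
    if wiki in setting:
        return setting[wiki]
    return setting.get("default", UNSET)


# Overrides for the two wikis named in the log, with the 'default' line removed.
# The True values are assumed for illustration.
language_in_header = {"officewiki": True, "testwiki": True}

print(resolve(language_in_header, "testwiki"))         # override is used
print(resolve(language_in_header, "enwiki") is UNSET)  # no override, no default
```

Under this model, dropping `default` is safe only if every code version in production has a sensible built-in value for the setting.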
[18:18:25] ^ so here's the patch to do that [18:18:33] yep it works great on debug1002 and as expected [18:19:16] cool [18:19:44] !log robh@cumin1001 START - Cookbook sre.dns.netbox [18:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:55] Jdlrobson: before I merge it: what will happen if train is undeployed? Will it cause more errors? [18:20:32] !log cp203[56] - repooling in etcd - T279457 [18:20:38] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp203[56].codfw.wmnet [18:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:41] T279457: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 [18:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:11] Urbanecm: when the train rolls the default will change [18:21:40] officewiki and testwiki will be unaffected as train has already rolled for them [18:22:02] if we roll back the train, presumably office and test wiki will throw errors and we'd need to revert the change we already merged [18:22:54] hmm. I'm not sure if it is wise to deploy a change that makes train rollback to generate more errors. [18:23:04] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:15] Urbanecm: if you prefer we can use the boolean value for all of them [18:23:26] I'm just verifying but we should have backwards compatibility [18:23:46] if it will work, I'd prefer that, as it will guarantee clean rollbacks. 
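Editorial sketch of the rollback argument above. The log says only that the value "needs to be a boolean" for the older branch and that the newer branch should stay backwards compatible; the structured form below (a dict with `logged_in` / `logged_out` keys) and the function name are assumptions made purely for illustration, written in Python rather than the PHP the real config uses.

```python
# Hypothetical sketch: a reader that accepts both an older boolean form and an
# assumed newer structured form of a feature flag. The dict keys are invented.
def language_in_header_enabled(value, logged_in):
    """Return whether the feature is on for this kind of user."""
    if isinstance(value, bool):
        # Older form: one switch for everybody.
        return value
    if isinstance(value, dict):
        # Assumed newer form: separate switches per user state.
        key = "logged_in" if logged_in else "logged_out"
        return bool(value.get(key, False))
    raise TypeError(f"unsupported config value: {value!r}")


print(language_in_header_enabled(True, logged_in=False))
print(language_in_header_enabled({"logged_in": True, "logged_out": False}, logged_in=False))
```

If the newer code reads booleans this way, keeping plain booleans in the config for every wiki means the older code still understands the value after a train rollback, which is the "clean rollback" property asked for above.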
[18:24:18] yeh let's do that [18:24:19] 1s [18:24:23] sure [18:26:15] (03PS2) 10Jdlrobson: Use boolean values for wgVectorLanguageInHeader [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683036 (https://phabricator.wikimedia.org/T280526) [18:26:22] ^ that should do it [18:26:36] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:26:56] (03CR) 10Urbanecm: [C: 03+2] Use boolean values for wgVectorLanguageInHeader [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683036 (https://phabricator.wikimedia.org/T280526) (owner: 10Jdlrobson) [18:26:58] (03CR) 10Jdlrobson: [C: 04-1] "Blocked until 1.37.0-wmf.4 train." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682758 (https://phabricator.wikimedia.org/T280526) (owner: 10Jdlrobson) [18:26:59] looks good, merging [18:27:41] (03Merged) 10jenkins-bot: Use boolean values for wgVectorLanguageInHeader [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683036 (https://phabricator.wikimedia.org/T280526) (owner: 10Jdlrobson) [18:28:47] Jdlrobson: pulled onto mwdebug1001, can you test, please? [18:28:52] Urbanecm: on it [18:29:52] please sync [18:29:57] syncing [18:30:00] RECOVERY - Host ms-be2035 is UP: PING OK - Packet loss = 0%, RTA = 33.03 ms [18:30:49] and sorry this didn't go as smoothly as thought. I really appreciate your scrutiny and advice on this one. 
[18:31:33] no problem, this is the reason why we do testing before deploying a patch :) [18:32:19] !log lvs2009 - restart pybal + re-run puppet agent - T279457 [18:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:28] T279457: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 [18:32:29] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 91a85f2: ac770bf: Enable language in header for office and testwiki users (T280526) (duration: 01m 19s) [18:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:38] T280526: Deploy new language switching functionality to logged-in users - https://phabricator.wikimedia.org/T280526 [18:32:39] Jdlrobson: should be live. Anything else (besides the -2'ed patch)? [18:32:52] hurray! [18:32:57] nope that's great. Thanks for all your help here! [18:33:05] Any time :) [18:33:06] PROBLEM - Host ms-fe2007 is DOWN: PING CRITICAL - Packet loss = 100% [18:33:10] !log Morning B&C window done [18:33:14] !log people1003 - rebooting, trying to get new VM to work [18:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:36] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 66, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:33:56] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 93, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:34:22] RECOVERY - Host ms-fe2007 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms [18:35:32] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1005.eqiad.wmnet with reason: REIMAGE [18:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:26] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox 
has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:37:37] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1005.eqiad.wmnet with reason: REIMAGE [18:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:55] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10BBlack) Traffic lvs/cp/dns are all repooled, un-downtimed, and green. Waiting until the other C2 hosts are fully reconfigured (network ports) before re-pooling codfw at the public traffic level. [18:39:56] RECOVERY - Host ms-be2034 is UP: PING OK - Packet loss = 0%, RTA = 33.08 ms [18:40:52] RECOVERY - Host ms-be2048 is UP: PING WARNING - Packet loss = 33%, RTA = 33.06 ms [18:41:12] RECOVERY - Host elastic2046 is UP: PING OK - Packet loss = 0%, RTA = 33.14 ms [18:45:58] RECOVERY - Host elastic2047 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [18:46:30] RECOVERY - Host ms-be2055 is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms [18:46:40] PROBLEM - SSH on phab2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:46:57] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-main1005.eqiad.wmnet'] ` a... [18:47:52] RECOVERY - swift codfw container availability low on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [18:48:12] PROBLEM - Host elastic2047 is DOWN: PING CRITICAL - Packet loss = 100% [18:48:16] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [18:50:19] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts people1003.eqiad.wmnet [18:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:48] !log people1003 - destroying VM and recreating again from scratch to test if issue of no console and no access is repeatable [18:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1077.eqiad.wmnet - https://phabricator.wikimedia.org/T281075 (10wiki_willy) a:05wiki_willy→03Cmjohnson [18:58:52] (03CR) 10Andrew Bogott: [C: 03+2] validatelabsfqdn.py: update to python3 and run through black [puppet] - 10https://gerrit.wikimedia.org/r/682965 (owner: 10Andrew Bogott) [19:00:04] liw and longma: That opportune time is upon us again. Time for a MediaWiki train - European+American Version (secondary timeslot) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T1900). [19:00:07] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts people1003.eqiad.wmnet [19:00:13] 10SRE, 10Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `people1003.eqiad.wmnet` - people1003.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Found Ganeti VM - V... 
[19:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:13] I will be deploying a backport during the train window [19:03:31] RECOVERY - Host elastic2047 is UP: PING OK - Packet loss = 0%, RTA = 33.06 ms [19:03:36] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host people1003.eqiad.wmnet [19:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:11] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [19:07:08] !log powerdown logstash2035 for maintenance [19:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:04] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmne... [19:14:11] (03PS1) 10BBlack: Revert "Depool codfw traffic" [dns] - 10https://gerrit.wikimedia.org/r/683041 (https://phabricator.wikimedia.org/T279457) [19:17:47] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [19:18:11] Is that a real page? [19:19:05] PROBLEM - Host logstash2035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:20:30] (03PS1) 10Herron: kafka-main: deploy kafka::main role to kafka-main[12]00[45] [puppet] - 10https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) [19:21:11] looking [19:22:14] XioNoX: https://librenms.wikimedia.org/graphs/to=1619551200/id=8766/type=port_bits/from=1619529600/ ?? 
[19:22:47] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [19:23:45] monitoring for the other side of that connection doesn't show the massive spike: https://librenms.wikimedia.org/graphs/to=1619551200/id=8328/type=port_bits/from=1619529600/ [19:24:02] if I had to guess, I'd say an organic traffic spike on the newly-replaced C2 switch, from some cluster or other resyncing something after all the C2 hosts rejoined? [19:24:15] et/ is juniper's prefix for a 40G interface, so that number on the switch side is physically possible... [19:24:34] bblack: I thought that work was long done, though? [19:24:50] the last elastic host just came online for the last time ~20 minutes ago [19:25:11] still, it's hard to imagine one host joining a cluster driving more than the 10G of its own interface [19:26:31] it is a mystery [19:27:25] (03CR) 10Jeena Huneidi: [C: 03+2] rcfeed: Remove reference assignment [core] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/682818 (https://phabricator.wikimedia.org/T281226) (owner: 10Ahmon Dancy) [19:29:10] elastic2047 port: [19:29:11] https://librenms.wikimedia.org/graphs/to=1619551500/id=21523/type=port_bits/from=1619529900/ [19:29:29] ~5.4Gbps and rising? [19:29:40] RECOVERY - Host logstash2035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [19:30:26] assuming those are bytes, now I don't remember [19:30:39] no it's bits but with a capital B [19:30:50] ms-fe had some bigger spikes: [19:30:52] https://librenms.wikimedia.org/graphs/to=1619551800/id=21527/type=port_bits/from=1619530200/ [19:32:27] maybe some spike in cp2* -> ms-fe? 
I'm really at a loss, but it seems to have been transient in any case [19:33:39] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1003/29238/" [puppet] - 10https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [19:35:03] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2004.codfw.wmnet with reason: REIMAGE [19:35:07] will wait a bit longer before re-pooling codfw public traffic JIC [19:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:17] !log powerdown ms-backup2001 for maintenance [19:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:28] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [19:37:14] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2004.codfw.wmnet with reason: REIMAGE [19:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:01] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [19:40:05] PROBLEM - Host ms-backup2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:40:47] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [19:42:30] (03PS1) 10Herron: eventgate-logging-external: add new codfw kafka-logging hosts to network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/683047 (https://phabricator.wikimedia.org/T279342) [19:44:38] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host people1003.eqiad.wmnet [19:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:35] RECOVERY - Host ms-backup2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms [19:47:09] RECOVERY - SSH on phab2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:47:36] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-main2004.codfw.wmnet'] ` a... [19:47:50] (03PS1) 10Dzahn: DHCP: update MAC address of people1003 [puppet] - 10https://gerrit.wikimedia.org/r/683049 [19:48:06] (03CR) 10jerkins-bot: [V: 04-1] DHCP: update MAC address of people1003 [puppet] - 10https://gerrit.wikimedia.org/r/683049 (owner: 10Dzahn) [19:48:13] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmne... 
[19:48:39] leaving DC [19:48:54] (03PS2) 10Dzahn: DHCP: update MAC address of people1003 [puppet] - 10https://gerrit.wikimedia.org/r/683049 [19:50:42] (03PS1) 10Herron: add kafka-logging200[123] to kafka term [homer/public] - 10https://gerrit.wikimedia.org/r/683050 (https://phabricator.wikimedia.org/T279342) [19:55:53] (03Merged) 10jenkins-bot: rcfeed: Remove reference assignment [core] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/682818 (https://phabricator.wikimedia.org/T281226) (owner: 10Ahmon Dancy) [19:56:34] cdanis, bblack, I'd put that in monitoring glitch [19:56:35] (03CR) 10Dzahn: [C: 03+2] DHCP: update MAC address of people1003 [puppet] - 10https://gerrit.wikimedia.org/r/683049 (owner: 10Dzahn) [19:56:56] yeah, agreed, librenms has done it before [20:06:07] (03PS1) 10Ottomata: test/data_purge - add drop_event job [puppet] - 10https://gerrit.wikimedia.org/r/683053 (https://phabricator.wikimedia.org/T273789) [20:06:10] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2005.codfw.wmnet with reason: REIMAGE [20:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:21] (03CR) 10jerkins-bot: [V: 04-1] test/data_purge - add drop_event job [puppet] - 10https://gerrit.wikimedia.org/r/683053 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [20:08:20] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2005.codfw.wmnet with reason: REIMAGE [20:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:36] !log jhuneidi@deploy1002 Synchronized php-1.37.0-wmf.3/includes/rcfeed/IRCColourfulRCFeedFormatter.php: Backport rcfeed: Remove reference assignment (T281226) to 1.37.0-wmf.3 (duration: 01m 12s) [20:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:45] T281226: PHP Notice: Only variables should be assigned by reference - 
https://phabricator.wikimedia.org/T281226 [20:17:30] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-main2005.codfw.wmnet'] ` a... [20:24:12] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:27:18] 10SRE, 10Wikimedia-Planet: Find a replacement for RSS aggregator for planet.wikimedia.org - https://phabricator.wikimedia.org/T281219 (10Legoktm) [20:31:54] (03CR) 10BBlack: [C: 03+2] Revert "Depool codfw traffic" [dns] - 10https://gerrit.wikimedia.org/r/683041 (https://phabricator.wikimedia.org/T279457) (owner: 10BBlack) [20:32:46] !log re-pooling codfw public traffic - T279457 [20:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:55] T279457: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 [20:39:21] (03PS1) 10Bartosz Dziewoński: realm.pp: Add discussiontools_subscription to private tables [puppet] - 10https://gerrit.wikimedia.org/r/683070 (https://phabricator.wikimedia.org/T263817) [20:40:32] (03PS1) 10Legoktm: site.pp: Decomission rdb200[3456].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/683074 (https://phabricator.wikimedia.org/T273140) [20:40:32] PROBLEM - Disk space on mwlog1001 is CRITICAL: DISK CRITICAL - free space: /srv 274385 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops [20:42:56] !log legoktm@cumin1001 START - Cookbook sre.hosts.decommission for hosts rdb[2003-2004].codfw.wmnet [20:43:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:30] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:40] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts rdb[2003-2004].codfw.wmnet [20:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:16] !log legoktm@cumin1001 START - Cookbook sre.hosts.decommission for hosts rdb[2005-2006].codfw.wmnet [20:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:54] (03CR) 10Ottomata: "Great, if these are active, they will also need to be added to the metadata.broker.list in values-codfw.wmnet." [deployment-charts] - 10https://gerrit.wikimedia.org/r/683047 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [20:57:52] (03PS2) 10Ottomata: test/data_purge - add drop_event job [puppet] - 10https://gerrit.wikimedia.org/r/683053 (https://phabricator.wikimedia.org/T273789) [21:03:24] (03PS3) 10Ottomata: test/data_purge - add drop_event job [puppet] - 10https://gerrit.wikimedia.org/r/683053 (https://phabricator.wikimedia.org/T273789) [21:04:59] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29240/console" [puppet] - 10https://gerrit.wikimedia.org/r/683053 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [21:06:17] (03CR) 10Ottomata: [V: 03+1] "Elukey it looks like you created this test/data_purge.pp class..but it was never applied! Ok to apply it?" 
[puppet] - 10https://gerrit.wikimedia.org/r/683053 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [21:07:02] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts rdb[2005-2006].codfw.wmnet [21:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:25] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [21:19:09] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [21:21:32] (03PS1) 10Tchanders: Enable partial action blocks on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683088 (https://phabricator.wikimedia.org/T280528) [21:21:34] (03PS1) 10Tchanders: Enable partial action blocks on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683089 (https://phabricator.wikimedia.org/T280528) [21:26:40] 10SRE, 10ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [21:32:46] 10SRE, 10Traffic, 10decommission-hardware: decommission cescout1001.eqiad.wmnet - https://phabricator.wikimedia.org/T275696 (10ssingh) p:05Medium→03High [21:40:50] (03CR) 10Legoktm: [C: 03+2] site.pp: Decomission rdb200[3456].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/683074 (https://phabricator.wikimedia.org/T273140) (owner: 10Legoktm) [21:48:43] (03PS1) 10Andrew Bogott: Trove: set low default quotas per project. 
[puppet] - 10https://gerrit.wikimedia.org/r/683092 (https://phabricator.wikimedia.org/T212595) [21:52:03] (03PS2) 10Andrew Bogott: Trove: set low default quotas per project but big potential DB size [puppet] - 10https://gerrit.wikimedia.org/r/683092 (https://phabricator.wikimedia.org/T212595) [21:59:28] 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10serviceops: decommission rdb200[3456].codfw.wmnet - https://phabricator.wikimedia.org/T273140 (10Legoktm) This is ready for #DC-ops now. [22:08:16] 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10serviceops: decommission rdb200[3456].codfw.wmnet - https://phabricator.wikimedia.org/T273140 (10Papaul) p:05Medium→03Low a:03Papaul [22:13:22] PROBLEM - HTTPS-peopleweb on people1003 is CRITICAL: connect to address 10.64.0.8 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/People.wikimedia.org [22:16:00] (03CR) 10Dzahn: "Ran into this when trying bullseye on a host with envoy. Profile::Tlsproxy::Envoy/Sslcert::Certificate will fail because it uses this and " [puppet] - 10https://gerrit.wikimedia.org/r/670978 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:22:23] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10Papaul) switch replace, onsite work complete and Netbox updated. Will be shipping the faulty switch tomorrow. 
[22:22:41] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10Papaul) p:05High→03Low [22:22:48] PROBLEM - Check no envoy runtime configuration is left persistent on people1003 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [22:25:32] PROBLEM - Check that envoy is running on people1003 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [22:25:59] (03PS2) 10Legoktm: site.pp: Setup rdb1011, rdb1012 [puppet] - 10https://gerrit.wikimedia.org/r/682892 (https://phabricator.wikimedia.org/T281217) [22:26:01] (03PS2) 10Legoktm: Have rdb1012 replicate from rdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/682893 (https://phabricator.wikimedia.org/T281217) [22:27:56] PROBLEM - people.wikimedia.org requires authentication on people1003 is CRITICAL: connect to address 10.64.0.8 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:28:18] ACK [22:28:51] (03CR) 10Legoktm: [C: 03+2] site.pp: Setup rdb1011, rdb1012 [puppet] - 10https://gerrit.wikimedia.org/r/682892 (https://phabricator.wikimedia.org/T281217) (owner: 10Legoktm) [22:29:15] ACKNOWLEDGEMENT - Check no envoy runtime configuration is left persistent on people1003 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused daniel_zahn bullseye - needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/670978 https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [22:29:15] ACKNOWLEDGEMENT - Check that envoy is running on people1003 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive daniel_zahn bullseye - needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/670978 https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [22:29:15] 
ACKNOWLEDGEMENT - HTTPS-peopleweb on people1003 is CRITICAL: connect to address 10.64.0.8 and port 443: Connection refused daniel_zahn bullseye - needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/670978 https://wikitech.wikimedia.org/wiki/People.wikimedia.org [22:29:15] ACKNOWLEDGEMENT - people.wikimedia.org requires authentication on people1003 is CRITICAL: connect to address 10.64.0.8 and port 443: Connection refused daniel_zahn bullseye - needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/670978 https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:38:47] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10Papaul) [22:42:56] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (people1003), Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [22:50:03] (03PS2) 10Legoktm: mariadb: Allow lists1001.wikimedia.org to talk to m5 [puppet] - 10https://gerrit.wikimedia.org/r/681753 (https://phabricator.wikimedia.org/T278614) [23:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T2300) [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:00:41] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [23:02:05] (03PS1) 10Legoktm: [WIP] Initial commit [software/pipermail-redirector] - 10https://gerrit.wikimedia.org/r/683108 (https://phabricator.wikimedia.org/T280731) [23:04:01] 10SRE, 10ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [23:06:26] (03PS2) 10Legoktm: [WIP] Initial commit [software/pipermail-redirector] - 10https://gerrit.wikimedia.org/r/683108 (https://phabricator.wikimedia.org/T280731) [23:10:01] (03CR) 10Dzahn: "I manually made the same changes this makes to x509-bundle on people1003 and then manually ran the command that puppet would run:" [puppet] - 10https://gerrit.wikimedia.org/r/670978 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [23:12:52] (03CR) 10Dzahn: "> TypeError: a bytes-like object is required, not 'str'" [puppet] - 10https://gerrit.wikimedia.org/r/670978 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [23:17:18] (03CR) 10Dzahn: x509-bundle.py: Port to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670978 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [23:18:19] (03CR) 10Dzahn: "It works when opening the file with "w" instead of "wb". in: with open(args.output, "wb") as outfile:" [puppet] - 10https://gerrit.wikimedia.org/r/670978 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [23:20:16] legoktm: I applied that change manually on people1003, then also manually ran the command puppet would run. found one more issue ^. 
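Editorial sketch of the Python 3 issue quoted above (`TypeError: a bytes-like object is required, not 'str'` when the output file is opened with `"wb"`). This is a minimal standalone reproduction, not the code of `x509-bundle.py`; the file name and the placeholder PEM text are assumptions made for illustration.

```python
import os
import tempfile

# Placeholder text standing in for certificate data held as str.
pem_text = "-----BEGIN CERTIFICATE-----\n(placeholder)\n-----END CERTIFICATE-----\n"
path = os.path.join(tempfile.mkdtemp(), "bundle.pem")

# Binary mode rejects str, which reproduces the TypeError quoted in the log.
try:
    with open(path, "wb") as outfile:
        outfile.write(pem_text)
    binary_error = None
except TypeError as exc:
    binary_error = str(exc)
print(binary_error)

# Fix 1 (the one reported in the log): open in text mode and write str.
with open(path, "w") as outfile:
    outfile.write(pem_text)

# Fix 2 (an alternative): keep binary mode and encode the text first.
with open(path, "wb") as outfile:
    outfile.write(pem_text.encode("ascii"))
```

Which fix suits the real script depends on whether the data it writes is `str` or `bytes` at that point; text mode matches what was observed to work on people1003.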
But also the fix, i think [23:20:48] (03PS2) 10Jdlrobson: Enable new language button for all logged in users outside test projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682758 (https://phabricator.wikimedia.org/T280526) [23:20:55] (03CR) 10Jdlrobson: [C: 04-1] "Probably blocked until Tues 4th." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682758 (https://phabricator.wikimedia.org/T280526) (owner: 10Jdlrobson) [23:24:18] the "unless" part of the puppet exec is too smart though to easily fool it and make puppet happy [23:24:38] it tests not only if chained cert exists but also which files are older than others [23:33:07] (03CR) 10STran: Enable partial action blocks on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683089 (https://phabricator.wikimedia.org/T280528) (owner: 10Tchanders) [23:38:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['snapshot1011.eqiad.wmnet', 'snapshot1012.eqiad.... [23:47:18] (03PS1) 10Dzahn: Revert "site: add peopleweb role to people1003" [puppet] - 10https://gerrit.wikimedia.org/r/683126 [23:47:37] (03CR) 10Tchanders: "> Enable or disable?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/683089 (https://phabricator.wikimedia.org/T280528) (owner: 10Tchanders) [23:51:32] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1011.eqiad.wmnet with reason: REIMAGE [23:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:07] (03PS2) 10Dzahn: Revert "site: add peopleweb role to people1003" [puppet] - 10https://gerrit.wikimedia.org/r/683126 [23:52:26] (03CR) 10Dzahn: [C: 03+2] Revert "site: add peopleweb role to people1003" [puppet] - 10https://gerrit.wikimedia.org/r/683126 (owner: 10Dzahn) [23:52:36] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1012.eqiad.wmnet with reason: REIMAGE [23:52:39] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [23:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:49] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [23:53:35] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1011.eqiad.wmnet with reason: REIMAGE [23:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:37] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1013.eqiad.wmnet with reason: REIMAGE [23:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:41] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1012.eqiad.wmnet with reason: REIMAGE [23:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:33] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1014.eqiad.wmnet with reason: REIMAGE [23:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:50] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1013.eqiad.wmnet with reason: REIMAGE [23:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:37] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1015.eqiad.wmnet with reason: REIMAGE [23:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
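
Note on the x509-bundle.py review comments from 23:12 to 23:18: the reported TypeError is what Python 3 raises when a str is written to a file opened in binary mode. The sketch below only illustrates that behaviour and the two usual ways around it. It is not the actual x509-bundle.py code, which is not shown in this log; the function names and the ASCII encoding choice are assumptions made for illustration.

```python
import os
import tempfile


def reproduce_error(path, pem_text):
    """Write str to a file opened with "wb" and return the resulting error text."""
    try:
        with open(path, "wb") as outfile:
            outfile.write(pem_text)  # str into a binary file: TypeError in Python 3
    except TypeError as exc:
        return str(exc)
    return None


def write_text_mode(path, pem_text):
    """The fix suggested in the review: open with "w" so str can be written."""
    with open(path, "w") as outfile:
        outfile.write(pem_text)


def write_binary_mode(path, pem_text):
    """Alternative (assumption): keep "wb" and encode the str explicitly."""
    with open(path, "wb") as outfile:
        outfile.write(pem_text.encode("ascii"))


if __name__ == "__main__":
    # Placeholder PEM-like text, not a real certificate.
    sample = "-----BEGIN CERTIFICATE-----\nplaceholder\n-----END CERTIFICATE-----\n"
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "bundle.pem")
        print(reproduce_error(target, sample))
        write_text_mode(target, sample)
        write_binary_mode(target, sample)
```

Either variant avoids the error. Which one fits depends on whether the rest of the script handles the certificate data as str or as bytes.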