[00:00:34] <icinga-wm>	 RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:10] <icinga-wm>	 PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_exclude_backups.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:56] <wikibugs>	 (PS1) Andrew Bogott: radosgw: remove "rgw dns name" setting [puppet] - https://gerrit.wikimedia.org/r/682780
[00:20:35] <wikibugs>	 (CR) Andrew Bogott: [C: +2] radosgw: remove "rgw dns name" setting [puppet] - https://gerrit.wikimedia.org/r/682780 (owner: Andrew Bogott)
[00:20:45] <wikibugs>	 (CR) Krinkle: [C: -1] Move ExternalStore log group from debug to error [mediawiki-config] - https://gerrit.wikimedia.org/r/682322 (https://phabricator.wikimedia.org/T281048) (owner: Reedy)
[00:21:06] <wikibugs>	 (PS1) Krinkle: externalstore: convert some log messages to WARNING [core] (wmf/1.37.0-wmf.1) - https://gerrit.wikimedia.org/r/682720 (https://phabricator.wikimedia.org/T281048)
[00:23:34] <wikibugs>	 (CR) Reedy: Move ExternalStore log group from debug to error (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/682322 (https://phabricator.wikimedia.org/T281048) (owner: Reedy)
[00:27:54] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.073 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[00:50:42] <wikibugs>	 (CR) LMata: [C: +2] replace mwlog1001 with new mwlog[12]002 hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: Herron)
[01:13:47] <wikibugs>	 SRE, CommRel-Specialists-Support, Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (Legoktm)
[01:21:05] <icinga-wm>	 ACKNOWLEDGEMENT - WDQS SPARQL on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.063 second response time Ryan Kemper https://phabricator.wikimedia.org/T280382 https://wikitech.wikimedia.org/wiki/Wikidata_qu
[01:21:05] <icinga-wm>	 ok
[01:21:38] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:21:42] <ryankemper>	 !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1006.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage`
[01:21:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:57] <stashbot>	 T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[01:27:06] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[01:27:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:29:29] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:29:32] <ryankemper>	 !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1006.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph --task-id T280382` on `ryankemper@cumin1001` tmux session `reimage`
[01:29:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:29:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:29:45] <stashbot>	 T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[02:06:02] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-import-siteinfo-dumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:45] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.37.0-wmf.3 [core] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/682784
[02:07:47] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.37.0-wmf.3 [core] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/682784 (owner: TrainBranchBot)
[02:13:04] <wikibugs>	 (PS1) Razzi: netboot: Add reuse recipe to preserve /srv on an-master [puppet] - https://gerrit.wikimedia.org/r/682785 (https://phabricator.wikimedia.org/T278423)
[02:19:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:21:13] <wikibugs>	 (PS2) Razzi: netboot: Add reuse recipe to preserve /srv on an-master [puppet] - https://gerrit.wikimedia.org/r/682785 (https://phabricator.wikimedia.org/T278423)
[02:21:30] <wikibugs>	 (CR) Razzi: "This is probably missing something, but I've been stuck on this for a while and could use some input. Here's what I know:" [puppet] - https://gerrit.wikimedia.org/r/682785 (https://phabricator.wikimedia.org/T278423) (owner: Razzi)
[02:21:40] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:26:07] <wikibugs>	 (PS1) Razzi: Revert "sqoop: switch to single grouped_wikis.csv" [puppet] - https://gerrit.wikimedia.org/r/682790 (https://phabricator.wikimedia.org/T279564)
[02:32:52] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.37.0-wmf.3 [core] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/682784 (owner: TrainBranchBot)
[02:34:05] <wikibugs>	 (CR) Razzi: [C: +2] Revert "sqoop: switch to single grouped_wikis.csv" [puppet] - https://gerrit.wikimedia.org/r/682790 (https://phabricator.wikimedia.org/T279564) (owner: Razzi)
[02:41:29] <wikibugs>	 (PS1) Razzi: Revert "Revert "sqoop: switch to single grouped_wikis.csv"" [puppet] - https://gerrit.wikimedia.org/r/682791 (https://phabricator.wikimedia.org/T279564)
[02:53:25] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:54:31] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:56:45] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[02:56:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:58:01] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.075 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:17:15] <ryankemper>	 !log T280382 `wdqs1006` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to raid0: `/dev/md2        2.6T  998G  1.5T  40% /srv`
[03:17:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:17:25] <stashbot>	 T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[03:25:41] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:27:46] <ryankemper>	 !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.70`. Pre-deploy tests passing on canary `wdqs1003`
[03:27:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:28:01] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@08ad17a]: 0.3.70
[03:28:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:28:45] <ryankemper>	 !log [WDQS Deploy] Tests passing following deploy of `0.3.70` on canary `wdqs1003`; proceeding to rest of fleet
[03:28:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:35:07] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:36:20] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@08ad17a]: 0.3.70 (duration: 08m 18s)
[03:36:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:37:01] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[03:37:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:37:09] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[03:37:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:37:27] <ryankemper>	 !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[03:37:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:17:30] <marostegui>	 I am going to put phabricator in read only for a couple of minutes to restart the db primary master
[04:17:47] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (people1003), Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:19:18] <marostegui>	 !log Set phabricator on read only T279625
[04:19:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:19:27] <stashbot>	 T279625: Upgrade mysql on db1132 (phabricator db master) - https://phabricator.wikimedia.org/T279625
[04:20:50] <legoktm>	 "Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL)." guess that's intentional?
[04:21:50] <legoktm>	 works now :)
[04:22:39] <marostegui>	 legoktm: yep, see my !log above :)
[04:24:28] <wikibugs>	 (CR) Legoktm: [V: +1 C: +2] mailman3: Use backported packages from component/mailman3 [puppet] - https://gerrit.wikimedia.org/r/678134 (https://phabricator.wikimedia.org/T278905) (owner: Legoktm)
[04:25:51] <legoktm>	 !log upgrading lists-next.wikimedia.org to mailman3-from-bullseye (T280887)
[04:26:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:26:01] <stashbot>	 T280887: Upgrade lists-next to bullseye mailman versions - https://phabricator.wikimedia.org/T280887
[04:28:34] <legoktm>	 marostegui: ok, running the updates now
[04:29:22] <legoktm>	 should be done
[04:30:15] <marostegui>	 that was fast!
[04:31:18] <legoktm>	 I guess we don't have enough data in our database yet? ;)
[04:31:29] <marostegui>	 hehe yeah
[04:31:42] <legoktm>	 but I double checked that it actually ran the migrations and it did. tbh I didn't actually check how many migrations there were, just that some did exist
[04:32:48] * legoktm tries sending some emails
[04:37:15] <apergos>	 ah, that's what the hiccup was, I got the "can't contact db server" error and wondered what was happening
[04:37:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: Repool db1158', diff saved to https://phabricator.wikimedia.org/P15539 and previous config saved to /var/cache/conftool/dbconfig/20210427-043725-root.json
[04:37:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:38:11] <wikibugs>	 (PS1) Marostegui: db1124: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/682794 (https://phabricator.wikimedia.org/T258361)
[04:38:26] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[04:38:47] <wikibugs>	 (CR) Marostegui: [C: +2] db1124: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/682794 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[04:41:07] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1124 to dbctl [puppet] - https://gerrit.wikimedia.org/r/682795 (https://phabricator.wikimedia.org/T258361)
[04:43:29] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1124 to dbctl [puppet] - https://gerrit.wikimedia.org/r/682795 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[04:45:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1124 to dbctl, depooled, T258361', diff saved to https://phabricator.wikimedia.org/P15540 and previous config saved to /var/cache/conftool/dbconfig/20210427-044520-marostegui.json
[04:45:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:29] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[04:46:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1124 with minimal weight for the first time in s7 T258361', diff saved to https://phabricator.wikimedia.org/P15541 and previous config saved to /var/cache/conftool/dbconfig/20210427-044609-marostegui.json
[04:46:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:35] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) Pooled db1124 with minimal weight for the first time in s7
[04:47:28] <wikibugs>	 SRE, DBA, Wikimedia-Mailing-lists: Upgrade lists-next to bullseye mailman versions - https://phabricator.wikimedia.org/T280887 (Legoktm) Open→Resolved a:Legoktm Upgraded, thanks to @Marostegui for supervising!
[04:47:31] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Patch-For-Review: Reconsider which mailman3 version we're running - https://phabricator.wikimedia.org/T278905 (Legoktm)
[04:48:39] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Patch-For-Review: Reconsider which mailman3 version we're running - https://phabricator.wikimedia.org/T278905 (Legoktm) Open→Resolved We're going with bullseye packages, but it has introduced some regressions.
[04:52:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 50%: Repool db1158', diff saved to https://phabricator.wikimedia.org/P15543 and previous config saved to /var/cache/conftool/dbconfig/20210427-045229-root.json
[04:52:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1077.eqiad.wmnet
[04:53:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:21] <wikibugs>	 (PS1) Marostegui: mariadb: Decommission db1077 [puppet] - https://gerrit.wikimedia.org/r/682796 (https://phabricator.wikimedia.org/T281075)
[04:55:43] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Mailman3 tries to load Google Fonts, but blocked by CSP - https://phabricator.wikimedia.org/T281213 (Legoktm) https://lists-next.wikimedia.org/mailman3/static/CACHE/css/54a97321b5f1.css  ` @font-face {   font-family: 'Droid Sans';   font-style: normal;   font-weight: 400;...
[04:57:25] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Mailman3 tries to load Google Fonts, but blocked by CSP - https://phabricator.wikimedia.org/T281213 (Legoktm) This was already fixed upstream at https://gitlab.com/mailman/hyperkitty/-/commit/b35d20f45aafbd152e059abe3d4052485ffae305
[04:57:36] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Decommission db1077 [puppet] - https://gerrit.wikimedia.org/r/682796 (https://phabricator.wikimedia.org/T281075) (owner: Marostegui)
[04:59:52] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware: decommission db1077.eqiad.wmnet - https://phabricator.wikimedia.org/T281075 (Marostegui) a:wiki_willy
[05:00:31] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[05:03:04] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1077.eqiad.wmnet
[05:03:07] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware: decommission db1077.eqiad.wmnet - https://phabricator.wikimedia.org/T281075 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1077.eqiad.wmnet` - db1077.eqiad.wmnet (**PASS**)   - Downtimed host on Icinga   -...
[05:03:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:27] <icinga-wm>	 PROBLEM - Disk space on backup2002 is CRITICAL: DISK CRITICAL - free space: /srv 3002177 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2002&var-datasource=codfw+prometheus/ops
[05:07:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: Repool db1158', diff saved to https://phabricator.wikimedia.org/P15544 and previous config saved to /var/cache/conftool/dbconfig/20210427-050732-root.json
[05:07:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1124 with minimal weight for the first time in s7 T258361', diff saved to https://phabricator.wikimedia.org/P15545 and previous config saved to /var/cache/conftool/dbconfig/20210427-050826-marostegui.json
[05:08:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:38] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[05:18:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 5%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15546 and previous config saved to /var/cache/conftool/dbconfig/20210427-051802-root.json
[05:18:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:31] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) I am automatically pooling db1124 into s7.
[05:18:42] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[05:20:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1114 temporarily as db1087 will be depooled', diff saved to https://phabricator.wikimedia.org/P15547 and previous config saved to /var/cache/conftool/dbconfig/20210427-052026-marostegui.json
[05:20:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:25] <marostegui>	 !log Stop mysql on db1087 to clone db1167 (lag will appear on wikidata on wikireplicas) T258361
[05:21:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:34] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[05:22:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: Repool db1158', diff saved to https://phabricator.wikimedia.org/P15549 and previous config saved to /var/cache/conftool/dbconfig/20210427-052236-root.json
[05:22:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:28] <icinga-wm>	 PROBLEM - snapshot of s6 in eqiad on alert1001 is CRITICAL: snapshot for s6 at eqiad taken more than 3 days ago: Most recent backup 2021-04-24 05:13:41 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[05:27:25] <legoktm>	 !log imported hyperkitty_1.3.4-2~bpo10+2 to apt.wm.o (T281213)
[05:27:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:35] <stashbot>	 T281213: Mailman3 tries to load Google Fonts, but blocked by CSP - https://phabricator.wikimedia.org/T281213
[05:30:33] <XioNoX>	 !log push pfw fw policies - T281137
[05:30:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:09] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Mailman3 tries to load Google Fonts, but blocked by CSP - https://phabricator.wikimedia.org/T281213 (Legoktm) I installed the new package, but I guess there's some command I need to run to force it to regenerate the CSS?
[05:33:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 10%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15550 and previous config saved to /var/cache/conftool/dbconfig/20210427-053306-root.json
[05:33:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:54] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 on clouddb1020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 978.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:39:04] <marostegui>	 ^ known
[05:40:36] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Mailman3 tries to load Google Fonts, but blocked by CSP - https://phabricator.wikimedia.org/T281213 (Legoktm) Open→Resolved a:Legoktm Also filed in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=987654  >>! In T281213#7036776, @Legoktm wrote: > I installed the...
[05:48:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 15%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15551 and previous config saved to /var/cache/conftool/dbconfig/20210427-054809-root.json
[05:48:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:43] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (ArielGlenn) Hey, this looks almost done, am I reading that right? :-) :-)
[05:50:13] <wikibugs>	 (PS1) Marostegui: install_server: Reimage db1118 as stretch [puppet] - https://gerrit.wikimedia.org/r/682798 (https://phabricator.wikimedia.org/T278214)
[05:51:16] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Reimage db1118 as stretch [puppet] - https://gerrit.wikimedia.org/r/682798 (https://phabricator.wikimedia.org/T278214) (owner: Marostegui)
[06:03:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 20%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15552 and previous config saved to /var/cache/conftool/dbconfig/20210427-060313-root.json
[06:03:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:00] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:07:03] <wikibugs>	 (PS3) Legoktm: hosts: assign puppet role for rdb2007,rdb2008 [puppet] - https://gerrit.wikimedia.org/r/614894 (https://phabricator.wikimedia.org/T255250) (owner: Effie Mouzeli)
[06:07:37] <wikibugs>	 (CR) jerkins-bot: [V: -1] hosts: assign puppet role for rdb2007,rdb2008 [puppet] - https://gerrit.wikimedia.org/r/614894 (https://phabricator.wikimedia.org/T255250) (owner: Effie Mouzeli)
[06:09:24] <wikibugs>	 (PS4) Legoktm: hosts: assign puppet role for rdb2007,rdb2008 [puppet] - https://gerrit.wikimedia.org/r/614894 (https://phabricator.wikimedia.org/T255250) (owner: Effie Mouzeli)
[06:09:51] <wikibugs>	 (CR) jerkins-bot: [V: -1] hosts: assign puppet role for rdb2007,rdb2008 [puppet] - https://gerrit.wikimedia.org/r/614894 (https://phabricator.wikimedia.org/T255250) (owner: Effie Mouzeli)
[06:11:24] <wikibugs>	 (PS5) Legoktm: site.pp: assign puppet role for rdb2007,rdb2008 [puppet] - https://gerrit.wikimedia.org/r/614894 (https://phabricator.wikimedia.org/T255250) (owner: Effie Mouzeli)
[06:11:29] <elukey>	 !log powercycle elastic2043 - no ssh, no tty remote console available
[06:11:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:01] <wikibugs>	 (CR) Legoktm: [C: +2] site.pp: assign puppet role for rdb2007,rdb2008 [puppet] - https://gerrit.wikimedia.org/r/614894 (https://phabricator.wikimedia.org/T255250) (owner: Effie Mouzeli)
[06:14:22] <wikibugs>	 (CR) Legoktm: site.pp: assign puppet role for rdb2007,rdb2008 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/614894 (https://phabricator.wikimedia.org/T255250) (owner: Effie Mouzeli)
[06:18:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 25%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15553 and previous config saved to /var/cache/conftool/dbconfig/20210427-061817-root.json
[06:18:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:35] <wikibugs>	 ops-codfw, Discovery: elastic2043 doesn't power up - https://phabricator.wikimedia.org/T281215 (elukey)
[06:27:05] <icinga-wm>	 ACKNOWLEDGEMENT - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T281215
[06:29:39] <wikibugs>	 (Abandoned) Legoktm: hiera: switch nutcracker shard from rdb2003 to rdb2007 [puppet] - https://gerrit.wikimedia.org/r/615163 (https://phabricator.wikimedia.org/T255250) (owner: Effie Mouzeli)
[06:31:01] <wikibugs>	 (CR) Elukey: [C: +2] install_server: kafka-main[12]00[1-5] use default release installer [puppet] - https://gerrit.wikimedia.org/r/682731 (https://phabricator.wikimedia.org/T225005) (owner: Herron)
[06:31:07] <wikibugs>	 (PS2) Elukey: install_server: kafka-main[12]00[1-5] use default release installer [puppet] - https://gerrit.wikimedia.org/r/682731 (https://phabricator.wikimedia.org/T225005) (owner: Herron)
[06:31:33] <wikibugs>	 (PS3) Legoktm: site.pp: make rdb2007, rdb2008 a redis cluster [puppet] - https://gerrit.wikimedia.org/r/614897 (https://phabricator.wikimedia.org/T255250) (owner: Effie Mouzeli)
[06:33:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 30%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15554 and previous config saved to /var/cache/conftool/dbconfig/20210427-063320-root.json
[06:33:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:48] <liw>	 log version 1.37.0-wmf.3 was branched at 20ab303fd1d883592b4d2ec2468dfaccad7a9e10 for T278347
[06:33:49] <stashbot>	 T278347: 1.37.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T278347
[06:34:26] <wikibugs>	 (PS2) Alexandros Kosiaris: Remove site.pp scb entries [puppet] - https://gerrit.wikimedia.org/r/676906 (https://phabricator.wikimedia.org/T275759)
[06:34:37] <ryankemper>	 liw: missing the ! in your log
[06:35:56] <wikibugs>	 (PS3) Alexandros Kosiaris: Remove site.pp scb entries [puppet] - https://gerrit.wikimedia.org/r/676906 (https://phabricator.wikimedia.org/T275759)
[06:37:08] <liw>	 !log version 1.37.0-wmf.3 was branched at 20ab303fd1d883592b4d2ec2468dfaccad7a9e10 for T278347
[06:37:11] <liw>	 ryankemper, thanks
[06:37:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:53] <wikibugs>	 (CR) Elukey: [C: +1] "very ignorant about this bit of code but LGTM" [software/cumin] - https://gerrit.wikimedia.org/r/681758 (owner: Volans)
[06:48:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 40%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15555 and previous config saved to /var/cache/conftool/dbconfig/20210427-064824-root.json
[06:48:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:16] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/682881 (https://phabricator.wikimedia.org/T278214)
[06:50:47] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/682881 (https://phabricator.wikimedia.org/T278214) (owner: Marostegui)
[06:51:24] <wikibugs>	 (PS2) Marostegui: mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/682881 (https://phabricator.wikimedia.org/T278214)
[06:52:42] <wikibugs>	 (PS1) Marostegui: wmnet: Update s1-master to the right master [dns] - https://gerrit.wikimedia.org/r/682882 (https://phabricator.wikimedia.org/T278214)
[06:52:50] <wikibugs>	 (PS1) Lars Wirzenius: testwikis wikis to 1.37.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/682883
[06:52:52] <wikibugs>	 (CR) Lars Wirzenius: [C: +2] testwikis wikis to 1.37.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/682883 (owner: Lars Wirzenius)
[06:53:21] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for the failover day" [dns] - https://gerrit.wikimedia.org/r/682882 (https://phabricator.wikimedia.org/T278214) (owner: Marostegui)
[06:53:35] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.37.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/682883 (owner: Lars Wirzenius)
[06:54:40] <logmsgbot>	 !log liw@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.3
[06:54:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:28] <elukey>	 !log upgrade mariadb to 10.4.18-1 + reboot on db1108 - T279281
[06:55:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:36] <stashbot>	 T279281: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281
[06:56:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1179 for schema change', diff saved to https://phabricator.wikimedia.org/P15556 and previous config saved to /var/cache/conftool/dbconfig/20210427-065628-marostegui.json
[06:56:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:50] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Productionize db1167 [puppet] - 10https://gerrit.wikimedia.org/r/682885 (https://phabricator.wikimedia.org/T258361)
[07:02:22] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1167 [puppet] - 10https://gerrit.wikimedia.org/r/682885 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui)
[07:03:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 50%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15557 and previous config saved to /var/cache/conftool/dbconfig/20210427-070328-root.json
[07:03:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:48] <wikibugs>	 (03PS1) 10Marostegui: site.pp: Fix db1167 role [puppet] - 10https://gerrit.wikimedia.org/r/682887
[07:04:25] <wikibugs>	 (03PS1) 10Majavah: beta: Switchover to deployment-sessionstore04 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682888 (https://phabricator.wikimedia.org/T263617)
[07:04:58] <wikibugs>	 (03PS2) 10Marostegui: site.pp: Fix db1167 role [puppet] - 10https://gerrit.wikimedia.org/r/682887
[07:05:38] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] site.pp: Fix db1167 role [puppet] - 10https://gerrit.wikimedia.org/r/682887 (owner: 10Marostegui)
[07:06:11] <Majavah>	 liw: hi, I have a beta-only config patch https://gerrit.wikimedia.org/r/682888 that I'd like to get merged as soon as possible to unbreak https://phabricator.wikimedia.org/T263617, could you ping me after that scap is done and that patch can be merged?
[07:08:31] <wikibugs>	 (03PS1) 10Elukey: Enable the Yarn Labels for Hadoop Analytics [puppet] - 10https://gerrit.wikimedia.org/r/682889 (https://phabricator.wikimedia.org/T277062)
[07:09:08] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Enable the Yarn Labels for Hadoop Analytics [puppet] - 10https://gerrit.wikimedia.org/r/682889 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey)
[07:10:24] <wikibugs>	 10SRE, 10serviceops: Replace rdb2005, rdb2006 with rdb2009, rdb2010 - https://phabricator.wikimedia.org/T281216 (10Legoktm) p:05Triage→03High
[07:11:34] <wikibugs>	 (03PS12) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221)
[07:11:47] <wikibugs>	 (03PS3) 10Marostegui: mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/682881 (https://phabricator.wikimedia.org/T278214)
[07:12:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: Repool db1179', diff saved to https://phabricator.wikimedia.org/P15558 and previous config saved to /var/cache/conftool/dbconfig/20210427-071227-root.json
[07:12:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:54] <wikibugs>	 (03PS1) 10Legoktm: site.pp: Setup rdb2009, rdb2010 [puppet] - 10https://gerrit.wikimedia.org/r/682890 (https://phabricator.wikimedia.org/T281216)
[07:12:56] <wikibugs>	 (03PS1) 10Legoktm: Have rdb2010 replicate from rdb2009 [puppet] - 10https://gerrit.wikimedia.org/r/682891 (https://phabricator.wikimedia.org/T281216)
[07:13:06] <wikibugs>	 (03PS3) 10Ladsgroup: lists: Send error logs of apache2/exim4 to logstash [puppet] - 10https://gerrit.wikimedia.org/r/682736 (https://phabricator.wikimedia.org/T276697)
[07:13:38] <wikibugs>	 (03PS4) 10Ladsgroup: mailman3: Increase the log level to WARNING and send them to logstash [puppet] - 10https://gerrit.wikimedia.org/r/682737 (https://phabricator.wikimedia.org/T276697)
[07:14:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper)
[07:17:02] <wikibugs>	 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) Checking tables on db1167
[07:18:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 60%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15559 and previous config saved to /var/cache/conftool/dbconfig/20210427-071831-root.json
[07:18:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:56] <wikibugs>	 (03PS2) 10JMeybohm: Swap zookeeper from conf2002 to conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/682666 (https://phabricator.wikimedia.org/T271573)
[07:19:47] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on conf[2002-2003].codfw.wmnet with reason: for zookeeper migration
[07:19:49] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on conf[2002-2003].codfw.wmnet with reason: for zookeeper migration
[07:19:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:56] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10ops-monitoring-bot) Icinga downtime set by jayme@cumin1001 for 1 day, 0:00:00 2 host(s) and their services with reason: for zookeeper migration ` conf[2002-2003].codfw.wmnet `
[07:20:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:36] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on conf[2004-2006].codfw.wmnet with reason: for zookeeper migration
[07:21:38] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on conf[2004-2006].codfw.wmnet with reason: for zookeeper migration
[07:21:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:45] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10ops-monitoring-bot) Icinga downtime set by jayme@cumin1001 for 2:00:00 3 host(s) and their services with reason: for zookeeper migration ` conf[2004-2006].codfw.wmnet `
[07:21:51] <wikibugs>	 10SRE, 10serviceops: Replace rdb1005, rdb1006 with rdb1011, rdb1012 - https://phabricator.wikimedia.org/T281217 (10Legoktm) p:05Triage→03High
[07:21:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:06] <wikibugs>	 10SRE, 10serviceops: Replace rdb1005, rdb1006 with rdb1011, rdb1012 - https://phabricator.wikimedia.org/T281217 (10Legoktm)
[07:24:28] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Swap zookeeper from conf2002 to conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/682666 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm)
[07:24:31] <logmsgbot>	 !log liw@deploy1002 Finished scap: testwikis wikis to 1.37.0-wmf.3 (duration: 30m 54s)
[07:24:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:05] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Reenable notifications on db1156 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/682668 (https://phabricator.wikimedia.org/T280492)
[07:26:10] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_zookeeper site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:26:43] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "compare from db1074 was successful." [puppet] - 10https://gerrit.wikimedia.org/r/682668 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo)
[07:26:57] <godog>	 !log swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836
[07:27:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:06] <stashbot>	 T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836
[07:27:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: Repool db1179', diff saved to https://phabricator.wikimedia.org/P15560 and previous config saved to /var/cache/conftool/dbconfig/20210427-072731-root.json
[07:27:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1087 (re)pooling @ 25%: Repool db1087', diff saved to https://phabricator.wikimedia.org/P15561 and previous config saved to /var/cache/conftool/dbconfig/20210427-072814-root.json
[07:28:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:34] <wikibugs>	 (03PS4) 10Legoktm: site.pp: make rdb2007, rdb2008 a redis cluster [puppet] - 10https://gerrit.wikimedia.org/r/614897 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli)
[07:31:36] <wikibugs>	 (03PS2) 10Legoktm: site.pp: Setup rdb2009, rdb2010 [puppet] - 10https://gerrit.wikimedia.org/r/682890 (https://phabricator.wikimedia.org/T281216)
[07:31:38] <wikibugs>	 (03PS2) 10Legoktm: Have rdb2010 replicate from rdb2009 [puppet] - 10https://gerrit.wikimedia.org/r/682891 (https://phabricator.wikimedia.org/T281216)
[07:31:40] <wikibugs>	 (03PS1) 10Legoktm: site.pp: Setup rdb1011, rdb1012 [puppet] - 10https://gerrit.wikimedia.org/r/682892 (https://phabricator.wikimedia.org/T281217)
[07:31:42] <wikibugs>	 (03PS1) 10Legoktm: Have rdb1012 replicate from rdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/682893 (https://phabricator.wikimedia.org/T281217)
[07:32:10] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 on clouddb1020 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:33:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 75%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15562 and previous config saved to /var/cache/conftool/dbconfig/20210427-073335-root.json
[07:33:37] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10observability: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10fgiunchedi) Thank you @papaul, could you forward the attached mib? I'll take a look, though I think a call will be best
[07:33:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:37] <wikibugs>	 (03PS13) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221)
[07:38:21] <wikibugs>	 10SRE, 10Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (10jcrespo) FYI: people1003 is failing to be backed up.  https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-site=eqiad&var-job=people1003.eqiad.wmnet-Monthly-1st-Sun-production-home&from=1619492511586...
[07:40:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper)
[07:41:27] <wikibugs>	 (03PS1) 10Legoktm: Reimage rdb2007, rdb2008 as bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682894 (https://phabricator.wikimedia.org/T255250)
[07:42:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: Repool db1179', diff saved to https://phabricator.wikimedia.org/P15563 and previous config saved to /var/cache/conftool/dbconfig/20210427-074234-root.json
[07:42:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1087 (re)pooling @ 50%: Repool db1087', diff saved to https://phabricator.wikimedia.org/P15564 and previous config saved to /var/cache/conftool/dbconfig/20210427-074318-root.json
[07:43:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:22] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:44:23] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Reimage rdb2007, rdb2008 as bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682894 (https://phabricator.wikimedia.org/T255250) (owner: 10Legoktm)
[07:48:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 80%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15565 and previous config saved to /var/cache/conftool/dbconfig/20210427-074839-root.json
[07:48:40] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-codfw #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-codfw group=cdanis-kafkacat instance=kafkamon2002 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=codfw topic=codfw.w3c.reportingapi.network_error https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-
[07:48:40] <icinga-wm>	 &var-cluster=logging-codfw&var-topic=All&var-consumer_group=All
[07:48:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:16] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) Vanilla 6.0.1 was performing worse than 5.1.3 and similarly to 6.0.7 when we tested it in January:  >>! In T264398#673141...
[07:52:12] <wikibugs>	 (03PS1) 10Legoktm: Reimage rdb1011, rdb1012, rdb2009, rdb2010 as bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682896 (https://phabricator.wikimedia.org/T281216)
[07:52:50] <logmsgbot>	 !log liw@deploy1002 Pruned MediaWiki: 1.36.0-wmf.38 (duration: 03m 17s)
[07:52:56] <wikibugs>	 10SRE, 10netops: mr1 port utilization alerts shouldn't mention hash page in their IRC logs - https://phabricator.wikimedia.org/T281055 (10fgiunchedi) >>! In T281055#7034863, @CDanis wrote: > Moving to AM sounds good to me.  But if needed, in the interim we could change the magic string we use in `check_librenm...
[07:52:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:34] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: Remove site.pp scb entries [puppet] - 10https://gerrit.wikimedia.org/r/676906 (https://phabricator.wikimedia.org/T275759)
[07:53:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove site.pp scb entries [puppet] - 10https://gerrit.wikimedia.org/r/676906 (https://phabricator.wikimedia.org/T275759) (owner: 10Alexandros Kosiaris)
[07:53:47] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Remove site.pp scb entries [puppet] - 10https://gerrit.wikimedia.org/r/676906 (https://phabricator.wikimedia.org/T275759) (owner: 10Alexandros Kosiaris)
[07:55:29] <wikibugs>	 (03CR) 10Jcrespo: "FYI" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682537 (https://phabricator.wikimedia.org/T281094) (owner: 10Jcrespo)
[07:55:53] <wikibugs>	 (03PS5) 10Jcrespo: Xtrabackup: Increase default open-files-limit to match production [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682537 (https://phabricator.wikimedia.org/T281094)
[07:56:17] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs.openstack: Use Icinga directly [cookbooks] - 10https://gerrit.wikimedia.org/r/682663 (owner: 10David Caro)
[07:57:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: Repool db1179', diff saved to https://phabricator.wikimedia.org/P15566 and previous config saved to /var/cache/conftool/dbconfig/20210427-075738-root.json
[07:57:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166 for schema change', diff saved to https://phabricator.wikimedia.org/P15567 and previous config saved to /var/cache/conftool/dbconfig/20210427-075759-marostegui.json
[07:58:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:10] <godog>	 mmhh the kafka lag alert is due to 'cdanis-kafkacat' consumer group for network errors, looking
[07:58:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1087 (re)pooling @ 75%: Repool db1087', diff saved to https://phabricator.wikimedia.org/P15568 and previous config saved to /var/cache/conftool/dbconfig/20210427-075822-root.json
[07:58:26] <godog>	 only in codfw though
[07:58:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:45] <Majavah>	 jouncebot: now
[07:58:46] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 1 minute(s)
[07:59:29] <Majavah>	 is someone around who could get a beta-only config patch (https://gerrit.wikimedia.org/r/682888) merged? I'd like to unbreak the beta cluster's session storage
[07:59:57] <wikibugs>	 (03Merged) 10jenkins-bot: wmcs.openstack: Use Icinga directly [cookbooks] - 10https://gerrit.wikimedia.org/r/682663 (owner: 10David Caro)
[08:01:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Repool db1166', diff saved to https://phabricator.wikimedia.org/P15569 and previous config saved to /var/cache/conftool/dbconfig/20210427-080119-root.json
[08:01:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:10] <wikibugs>	 (03PS3) 10Jcrespo: Increase default memory usage of xtrabackup --prepare to 40GB [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682536 (https://phabricator.wikimedia.org/T281094)
[08:03:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 90%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15570 and previous config saved to /var/cache/conftool/dbconfig/20210427-080342-root.json
[08:03:49] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Increase default memory usage of xtrabackup --prepare to 40GB [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682536 (https://phabricator.wikimedia.org/T281094) (owner: 10Jcrespo)
[08:03:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:59] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers
[08:04:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:03] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) It's great that we narrowed this down and confirmed it, excellent work! The change's claimed behaviour is definitely c...
[08:06:50] <wikibugs>	 (03PS6) 10Jcrespo: WIP [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/665383
[08:06:52] <wikibugs>	 (03PS6) 10Jcrespo: Xtrabackup: Increase default open-files-limit to match production [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682537 (https://phabricator.wikimedia.org/T281094)
[08:07:01] <wikibugs>	 (03PS7) 10Jcrespo: Xtrabackup: Increase default open-files-limit to match production [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682537 (https://phabricator.wikimedia.org/T281094)
[08:07:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/665383 (owner: 10Jcrespo)
[08:07:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Xtrabackup: Increase default open-files-limit to match production [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682537 (https://phabricator.wikimedia.org/T281094) (owner: 10Jcrespo)
[08:07:42] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Xtrabackup: Increase default open-files-limit to match production [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682537 (https://phabricator.wikimedia.org/T281094) (owner: 10Jcrespo)
[08:08:24] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb2007.codfw.wmnet with reason: REIMAGE
[08:08:29] <wikibugs>	 (03CR) 10Muehlenhoff: "> Patch Set 11:" [puppet] - 10https://gerrit.wikimedia.org/r/675124 (owner: 10Jbond)
[08:08:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:24] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb2008.codfw.wmnet with reason: REIMAGE
[08:10:26] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb2007.codfw.wmnet with reason: REIMAGE
[08:10:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:12] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers
[08:11:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:34] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb2008.codfw.wmnet with reason: REIMAGE
[08:12:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1087 (re)pooling @ 100%: Repool db1087', diff saved to https://phabricator.wikimedia.org/P15571 and previous config saved to /var/cache/conftool/dbconfig/20210427-081325-root.json
[08:13:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:08] <wikibugs>	 10SRE: Find a replacement for RSS aggregator for planet.wikimedia.org - https://phabricator.wikimedia.org/T281219 (10MoritzMuehlenhoff)
[08:14:14] <wikibugs>	 10SRE: Find a replacement for RSS aggregator for planet.wikimedia.org - https://phabricator.wikimedia.org/T281219 (10MoritzMuehlenhoff) p:05Triage→03Medium
[08:15:26] <wikibugs>	 10SRE, 10Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (10MoritzMuehlenhoff) >>! In T280989#7035799, @gerritbot wrote: > Change 682739 **merged** by Dzahn: > %%%[operations/puppet@production] site/DHCP: remove planet1003%%% > https://gerrit.wikimedia.org/r/682739...
[08:16:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: Repool db1166', diff saved to https://phabricator.wikimedia.org/P15572 and previous config saved to /var/cache/conftool/dbconfig/20210427-081623-root.json
[08:16:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1124 (re)pooling @ 100%: Slowly pool into s7 db1124', diff saved to https://phabricator.wikimedia.org/P15573 and previous config saved to /var/cache/conftool/dbconfig/20210427-081846-root.json
[08:18:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1114 into main and traffic', diff saved to https://phabricator.wikimedia.org/P15574 and previous config saved to /var/cache/conftool/dbconfig/20210427-081911-marostegui.json
[08:19:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:20] <wikibugs>	 (03PS1) 10Legoktm: debian: Fix typo in NaN error message [puppet] - 10https://gerrit.wikimedia.org/r/682898
[08:22:27] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Homer: get Capirca definitions from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/681775 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi)
[08:24:06] <hashar>	 !log Restarting CI Jenkins for plugins upgrade
[08:24:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:11] <wikibugs>	 (03PS2) 10Legoktm: debian: Fix typo in NaN error message [puppet] - 10https://gerrit.wikimedia.org/r/682898
[08:31:13] <wikibugs>	 (03PS1) 10Legoktm: redis: Add redis-bullseye.conf [puppet] - 10https://gerrit.wikimedia.org/r/682900
[08:31:15] <wikibugs>	 (03PS1) 10Legoktm: redis: Get rid of distro-specific config [puppet] - 10https://gerrit.wikimedia.org/r/682901
[08:31:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Repool db1166', diff saved to https://phabricator.wikimedia.org/P15575 and previous config saved to /var/cache/conftool/dbconfig/20210427-083126-root.json
[08:31:31] <wikibugs>	 (03CR) 10Legoktm: [V: 03+2 C: 03+2] debian: Fix typo in NaN error message [puppet] - 10https://gerrit.wikimedia.org/r/682898 (owner: 10Legoktm)
[08:31:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1114 into main and traffic', diff saved to https://phabricator.wikimedia.org/P15576 and previous config saved to /var/cache/conftool/dbconfig/20210427-083145-marostegui.json
[08:31:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:38] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] redis: Add redis-bullseye.conf [puppet] - 10https://gerrit.wikimedia.org/r/682900 (owner: 10Legoktm)
[08:35:05] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] redis: Get rid of distro-specific config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682901 (owner: 10Legoktm)
[08:36:08] <wikibugs>	 (03PS2) 10Legoktm: redis: Get rid of distro-specific config [puppet] - 10https://gerrit.wikimedia.org/r/682901
[08:36:10] <wikibugs>	 (03CR) 10Legoktm: redis: Get rid of distro-specific config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682901 (owner: 10Legoktm)
[08:36:43] <XioNoX>	 !log standardize management routers ACLs with Capirca
[08:36:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1114 into main and api', diff saved to https://phabricator.wikimedia.org/P15577 and previous config saved to /var/cache/conftool/dbconfig/20210427-083910-marostegui.json
[08:39:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:53] <icinga-wm>	 PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[08:40:09] <XioNoX>	 er that's me
[08:40:25] <XioNoX>	 (just monitoring)
[08:41:01] <icinga-wm>	 PROBLEM - Host re0.cr4-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[08:41:01] <icinga-wm>	 PROBLEM - Host re0.cr3-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[08:41:04] <icinga-wm>	 ACKNOWLEDGEMENT - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% ayounsi ack
[08:41:10] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Reimage rdb1011, rdb1012, rdb2009, rdb2010 as bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682896 (https://phabricator.wikimedia.org/T281216) (owner: 10Legoktm)
[08:41:30] <XioNoX>	 (rolling back)
[08:41:37] <icinga-wm>	 RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 69.22 ms
[08:41:38] <XioNoX>	 (done)
[08:41:46] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] redis: Get rid of distro-specific config [puppet] - 10https://gerrit.wikimedia.org/r/682901 (owner: 10Legoktm)
[08:42:16] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] site.pp: Setup rdb2009, rdb2010 [puppet] - 10https://gerrit.wikimedia.org/r/682890 (https://phabricator.wikimedia.org/T281216) (owner: 10Legoktm)
[08:42:50] <XioNoX>	 found the issue
[08:43:00] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM, but -1 until the applications have been switched over" [puppet] - 10https://gerrit.wikimedia.org/r/682891 (https://phabricator.wikimedia.org/T281216) (owner: 10Legoktm)
[08:43:42] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] site.pp: Setup rdb1011, rdb1012 [puppet] - 10https://gerrit.wikimedia.org/r/682892 (https://phabricator.wikimedia.org/T281217) (owner: 10Legoktm)
[08:44:14] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM, but -1 until the applications that talk to this have been switched over" [puppet] - 10https://gerrit.wikimedia.org/r/682893 (https://phabricator.wikimedia.org/T281217) (owner: 10Legoktm)
[08:44:21] <XioNoX>	 godog: are the icinga* hosts still running icinga or is everything on alert* now?
[08:44:33] <elukey>	 XioNoX: wow Capirca??
[08:45:28] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Reimage rdb1011, rdb1012, rdb2009, rdb2010 as bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682896 (https://phabricator.wikimedia.org/T281216) (owner: 10Legoktm)
[08:46:13] <XioNoX>	 elukey: yay :)
[08:46:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Repool db1166', diff saved to https://phabricator.wikimedia.org/P15578 and previous config saved to /var/cache/conftool/dbconfig/20210427-084630-root.json
[08:46:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:46] <godog>	 XioNoX: only alert* are active, icinga* are pending decom
[08:46:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1175 for schema change', diff saved to https://phabricator.wikimedia.org/P15579 and previous config saved to /var/cache/conftool/dbconfig/20210427-084651-marostegui.json
[08:46:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:10] <XioNoX>	 godog: so I can remove everything related to icinga* from the management routers?
[08:47:11] <icinga-wm>	 RECOVERY - Host re0.cr4-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.64 ms
[08:47:11] <icinga-wm>	 RECOVERY - Host re0.cr3-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 70.26 ms
[08:47:15] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:47:21] <godog>	 XioNoX: yes, definitely
[08:47:24] <XioNoX>	 cool1
[08:47:25] <XioNoX>	 !
[08:48:56] <wikibugs>	 10SRE, 10serviceops: Put rdb20[09|10] into service - https://phabricator.wikimedia.org/T281225 (10akosiaris)
[08:49:07] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: Replace rdb2005, rdb2006 with rdb2009, rdb2010 - https://phabricator.wikimedia.org/T281216 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eqiad.wmnet for hosts: ` ['rdb2009.codfw.wmnet', 'rdb2010.codfw.wmnet'] ` The log can be foun...
[08:49:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P15580 and previous config saved to /var/cache/conftool/dbconfig/20210427-084950-root.json
[08:49:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:47] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: Replace rdb1005, rdb1006 with rdb1011, rdb1012 - https://phabricator.wikimedia.org/T281217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eqiad.wmnet for hosts: ` ['rdb1011.eqiad.wmnet', 'rdb1012.eqiad.wmnet'] ` The log can be foun...
[08:51:12] <XioNoX>	 2nd try
[08:53:30] <wikibugs>	 (CR) Lars Wirzenius: [C: +2] beta: Switchover to deployment-sessionstore04 [mediawiki-config] - https://gerrit.wikimedia.org/r/682888 (https://phabricator.wikimedia.org/T263617) (owner: Majavah)
[08:53:58] <XioNoX>	 looks like it worked
[08:54:28] <wikibugs>	 (Merged) jenkins-bot: beta: Switchover to deployment-sessionstore04 [mediawiki-config] - https://gerrit.wikimedia.org/r/682888 (https://phabricator.wikimedia.org/T263617) (owner: Majavah)
[08:57:07] <icinga-wm>	 PROBLEM - Host mr1-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[08:57:09] <icinga-wm>	 PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[08:58:32] <wikibugs>	 (PS2) Alexandros Kosiaris: changeprop/changeprop-jobqueue/api-gateway: Use the new rdbs [deployment-charts] - https://gerrit.wikimedia.org/r/614901 (https://phabricator.wikimedia.org/T255250) (owner: Effie Mouzeli)
[08:58:34] <wikibugs>	 (PS1) Alexandros Kosiaris: api-gateway: Move networkpolicy to shared values [deployment-charts] - https://gerrit.wikimedia.org/r/682905
[08:59:31] <XioNoX>	 maybe not
[09:00:32] <XioNoX>	 (rolled back)
[09:01:17] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2007.codfw.wmnet
[09:01:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:44] <wikibugs>	 (PS3) Ayounsi: Use Capirca to generate mgmt SRX security policies [homer/public] - https://gerrit.wikimedia.org/r/681708
[09:01:52] <wikibugs>	 (PS1) Alexandros Kosiaris: Switchover ORES and docker-registry to new redis servers [puppet] - https://gerrit.wikimedia.org/r/682906 (https://phabricator.wikimedia.org/T255250)
[09:02:39] <icinga-wm>	 RECOVERY - Host mr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 70.77 ms
[09:02:41] <icinga-wm>	 RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 69.41 ms
[09:03:46] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb1011.eqiad.wmnet with reason: REIMAGE
[09:03:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:14] <wikibugs>	 SRE, serviceops, Patch-For-Review: Replace rdb1005, rdb1006 with rdb1011, rdb1012 - https://phabricator.wikimedia.org/T281217 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb1011.eqiad.wmnet', 'rdb1012.eqiad.wmnet'] `  Of which those **FAILED**: ` ['rdb1011.eqiad.wmnet', 'rdb1012.eqiad.w...
[09:04:47] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0)
[09:04:51] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb2009.codfw.wmnet with reason: REIMAGE
[09:04:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P15581 and previous config saved to /var/cache/conftool/dbconfig/20210427-090454-root.json
[09:04:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:22] <wikibugs>	 SRE, serviceops, Patch-For-Review: Replace rdb2005, rdb2006 with rdb2009, rdb2010 - https://phabricator.wikimedia.org/T281216 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb2009.codfw.wmnet', 'rdb2010.codfw.wmnet'] `  Of which those **FAILED**: ` ['rdb2009.codfw.wmnet', 'rdb2010.codfw.w...
[09:05:46] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb1012.eqiad.wmnet with reason: REIMAGE
[09:05:47] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb1011.eqiad.wmnet with reason: REIMAGE
[09:05:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:48] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb2010.codfw.wmnet with reason: REIMAGE
[09:06:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:51] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb1012.eqiad.wmnet with reason: REIMAGE
[09:07:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:52] <logmsgbot>	 !log legoktm@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on rdb2009.codfw.wmnet with reason: REIMAGE
[09:10:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:48] <logmsgbot>	 !log legoktm@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on rdb2010.codfw.wmnet with reason: REIMAGE
[09:11:49] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0)
[09:11:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:04] <XioNoX>	 3rd time will do it?
[09:14:29] <wikibugs>	 (PS4) Ayounsi: Use Capirca to generate mgmt SRX security policies [homer/public] - https://gerrit.wikimedia.org/r/681708
[09:15:44] <volans>	 XioNoX: 3rd time's the charm :-P
[09:15:52] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:16:00] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2007.codfw.wmnet
[09:16:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:19] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2008.codfw.wmnet
[09:16:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:33] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2009.codfw.wmnet
[09:16:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:48] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2010.codfw.wmnet
[09:16:49] <logmsgbot>	 !log legoktm@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host rdb2010.codfw.wmnet
[09:16:51] <XioNoX>	 at least I fixed all the alerting ones so far
[09:16:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:41] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2010.codfw.wmnet
[09:17:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:08] <wikibugs>	 (CR) Marostegui: [C: -1] "I will take care of this - I am doing some last checks before repooling it." [puppet] - https://gerrit.wikimedia.org/r/682668 (https://phabricator.wikimedia.org/T280492) (owner: Jcrespo)
[09:19:37] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1011.eqiad.wmnet
[09:19:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:52] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1012.eqiad.wmnet
[09:19:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P15582 and previous config saved to /var/cache/conftool/dbconfig/20210427-091957-root.json
[09:19:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:07] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:21:55] <icinga-wm>	 PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29
[09:23:08] <wikibugs>	 (PS1) Alexandros Kosiaris: Remove rdb200{3,5} from netpols [deployment-charts] - https://gerrit.wikimedia.org/r/682912 (https://phabricator.wikimedia.org/T255250)
[09:23:10] <XioNoX>	 that's not me, I didn't push anything to eqsin ^
[09:26:45] <wikibugs>	 (PS2) Tonina Zhelyazkova: wikidata: post edit constraint jobs on 70% of edits [mediawiki-config] - https://gerrit.wikimedia.org/r/682608 (https://phabricator.wikimedia.org/T204031)
[09:28:35] <wikibugs>	 (Abandoned) Alexandros Kosiaris: rdb: use buster on newer servers [puppet] - https://gerrit.wikimedia.org/r/670850 (owner: Giuseppe Lavagetto)
[09:28:47] <wikibugs>	 (Abandoned) Alexandros Kosiaris: redis: also configure the new rdb servers [puppet] - https://gerrit.wikimedia.org/r/670846 (owner: Giuseppe Lavagetto)
[09:29:20] <wikibugs>	 (PS3) Alexandros Kosiaris: site.pp: Setup rdb2009, rdb2010 [puppet] - https://gerrit.wikimedia.org/r/682890 (https://phabricator.wikimedia.org/T281216) (owner: Legoktm)
[09:29:35] <icinga-wm>	 RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 383 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29
[09:30:09] <XioNoX>	 alright, it looks all good, pushing the same to mr1-codfw
[09:30:12] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] site.pp: Setup rdb2009, rdb2010 [puppet] - https://gerrit.wikimedia.org/r/682890 (https://phabricator.wikimedia.org/T281216) (owner: Legoktm)
[09:31:03] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2008.codfw.wmnet
[09:31:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:18] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2009.codfw.wmnet
[09:31:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:49] <icinga-wm>	 RECOVERY - snapshot of s6 in eqiad on alert1001 is OK: Last snapshot for s6 at eqiad (db1139.eqiad.wmnet:3316) taken on 2021-04-27 08:33:51 (560 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[09:32:28] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2010.codfw.wmnet
[09:32:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:42] <moritzm>	 !log rolling restart of elastic in relforge* to pick up Java updates
[09:33:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:11] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1011.eqiad.wmnet
[09:34:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:27] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1012.eqiad.wmnet
[09:34:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:50] <wikibugs>	 (PS3) Alexandros Kosiaris: changeprop/changeprop-jobqueue/api-gateway: Use the new rdbs [deployment-charts] - https://gerrit.wikimedia.org/r/614901 (https://phabricator.wikimedia.org/T255250) (owner: Effie Mouzeli)
[09:34:52] <wikibugs>	 (PS2) Alexandros Kosiaris: Remove rdb200{3,5} from netpols [deployment-charts] - https://gerrit.wikimedia.org/r/682912 (https://phabricator.wikimedia.org/T255250)
[09:35:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P15583 and previous config saved to /var/cache/conftool/dbconfig/20210427-093501-root.json
[09:35:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:22] <XioNoX>	 !log standardize management routers ACLs with Capirca - mr1-eqsin
[09:35:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1157 for schema change', diff saved to https://phabricator.wikimedia.org/P15584 and previous config saved to /var/cache/conftool/dbconfig/20210427-093536-marostegui.json
[09:35:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:46] <wikibugs>	 (PS5) Ayounsi: Use Capirca to generate mgmt SRX security policies [homer/public] - https://gerrit.wikimedia.org/r/681708
[09:37:51] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[09:39:55] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[09:41:16] <wikibugs>	 (CR) Ayounsi: [C: +2] Use Capirca to generate mgmt SRX security policies [homer/public] - https://gerrit.wikimedia.org/r/681708 (owner: Ayounsi)
[09:41:32] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] Switchover ORES and docker-registry to new redis servers [puppet] - https://gerrit.wikimedia.org/r/682906 (https://phabricator.wikimedia.org/T255250) (owner: Alexandros Kosiaris)
[09:43:11] <wikibugs>	 (Merged) jenkins-bot: Use Capirca to generate mgmt SRX security policies [homer/public] - https://gerrit.wikimedia.org/r/681708 (owner: Ayounsi)
[09:43:55] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:52:32] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 1889 threshold =0.15 breach: active_shards_percent_as_number: 63.47641144624904, initializing_shards: 2, number_of_nodes: 2, status: yellow, cluster_name: relforge-eqiad, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, relocating_shards: 0, active_shards: 3283, active_primary_shards: 2586, task
[09:52:32] <icinga-wm>	 ueue_millis: 0, timed_out: False, number_of_data_nodes: 2, number_of_pending_tasks: 0, unassigned_shards: 1887 https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:52:58] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 1813 threshold =0.15 breach: unassigned_shards: 1811, number_of_in_flight_fetch: 0, initializing_shards: 2, relocating_shards: 0, number_of_pending_tasks: 0, active_primary_shards: 2586, active_shards_percent_as_number: 64.94586233565352, timed_out: False, delayed_unassigned_shards: 0, task_max_waiting_in_
[09:52:58] <icinga-wm>	 active_shards: 3359, cluster_name: relforge-eqiad, status: yellow, number_of_nodes: 2, number_of_data_nodes: 2 https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:56:18] <logmsgbot>	 !log ayounsi@deploy1002 Started deploy [homer/deploy@759f82c]: Homer release v0.2.7
[09:56:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:40] <logmsgbot>	 !log ayounsi@deploy1002 Finished deploy [homer/deploy@759f82c]: Homer release v0.2.7 (duration: 00m 22s)
[09:56:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:54] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, delayed_unassigned_shards: 0, timed_out: False, relocating_shards: 0, number_of_pending_tasks: 0, unassigned_shards: 753, initializing_shards: 2, number_of_nodes: 2, active_shards_percent_as_number: 85.40216550657385, number_of_in_flight_fetch: 0, task_max_waiting_
[09:58:54] <icinga-wm>	 0, number_of_data_nodes: 2, active_primary_shards: 2586, active_shards: 4417 https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:59:20] <logmsgbot>	 !log ayounsi@deploy1002 Started deploy [homer/deploy@759f82c]: Homer release v0.2.7
[09:59:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:44] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: active_shards_percent_as_number: 88.16705336426914, number_of_nodes: 2, active_shards: 4560, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, delayed_unassigned_shards: 0, number_of_data_nodes: 2, number_of_pending_tasks: 0, initializing_shards: 2, status: yellow, relocating_shards: 0, unassi
[09:59:44] <icinga-wm>	  timed_out: False, number_of_in_flight_fetch: 0, active_primary_shards: 2586 https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:00:24] <wikibugs>	 (PS1) Jcrespo: Release new v0.5 version [software/wmfbackups] - https://gerrit.wikimedia.org/r/682916 (https://phabricator.wikimedia.org/T281094)
[10:01:36] <logmsgbot>	 !log ayounsi@deploy1002 Finished deploy [homer/deploy@759f82c]: Homer release v0.2.7 (duration: 02m 16s)
[10:01:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:01] <XioNoX>	 !log standardize management routers ACLs with Capirca - mr1-eqiad (last one)
[10:06:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:54] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[10:13:56] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[10:17:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:18:35] <godog>	 !log swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836
[10:18:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:44] <stashbot>	 T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836
[10:22:47] <wikibugs>	 (PS1) Filippo Giunchedi: Decom ms-be[1019-1026] [puppet] - https://gerrit.wikimedia.org/r/682920 (https://phabricator.wikimedia.org/T272836)
[10:23:04] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[10:23:14] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[10:23:53] <wikibugs>	 (PS1) Urbanecm: WikiPageConfigValidation: Mentor lists and help desk can be null [extensions/GrowthExperiments] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/682812 (https://phabricator.wikimedia.org/T281229)
[10:30:14] <Urbanecm>	 jouncebot: now
[10:30:14] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 29 minute(s)
[10:30:21] <wikibugs>	 (CR) Urbanecm: [C: +2] WikiPageConfigValidation: Mentor lists and help desk can be null [extensions/GrowthExperiments] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/682812 (https://phabricator.wikimedia.org/T281229) (owner: Urbanecm)
[10:31:06] <godog>	 sth is wrong in upload eqsin, seeing lots of 5xx
[10:31:23] <godog>	 no please ignore me, wrong time on the dashboard :(
[10:31:23] <wikibugs>	 (PS1) Hnowlan: api-gateway: Create individual cluster definitions for read and write [deployment-charts] - https://gerrit.wikimedia.org/r/682921 (https://phabricator.wikimedia.org/T277585)
[10:32:18] <wikibugs>	 (PS1) Jbond: systemd::timer::job: quote command as it may contain arguments [puppet] - https://gerrit.wikimedia.org/r/682922
[10:33:00] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:33:22] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/682920 (https://phabricator.wikimedia.org/T272836) (owner: Filippo Giunchedi)
[10:33:38] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.9524 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[10:33:43] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s4 #page on db1143 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2013, Errmsg: error reconnecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Lost connection to MySQL server at waiting for initial communication packet, system error: 110 Connection timed out https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:33:48] <marostegui>	 what?
[10:33:50] <marostegui>	 checking
[10:34:00] <volans>	 marostegui: need help?
[10:34:07] <_joe_>	 in a meeting, but around if needed
[10:34:09] <marostegui>	 the master looks unreachable
[10:34:09] <Amir1>	 the write on s4 is now zero
[10:34:09] <jbond42>	 also here
[10:34:10] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me
[10:34:13] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s4 #page on db1141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2013, Errmsg: error reconnecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Lost connection to MySQL server at waiting for initial communication packet, system error: 110 Connection timed out https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:34:17] <moritzm>	 around
[10:34:18] <marostegui>	 they are all going to page
[10:34:23] <_joe_>	 yeah
[10:34:24] * volans acking the pages
[10:34:25] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s4 #page on db1146 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2013, Errmsg: error reconnecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Lost connection to MySQL server at waiting for initial communication packet, system error: 110 Connection timed out https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:34:26] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s4 #page on db1148 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2013, Errmsg: error connecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Lost connection to MySQL server at waiting for initial communication packet, system error: 110 Connection timed out https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:34:27] <marostegui>	 the master is overloaded
[10:34:30] <_joe_>	 also mw is in shambles
[10:34:32] <icinga-wm>	 PROBLEM - Apache HTTP on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:34:41] <Majavah>	 yeah (test)commons is ro
[10:34:51] <Amir1>	 my query is once every five seconds
[10:34:54] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29213/console" [puppet] - https://gerrit.wikimedia.org/r/682922 (owner: Jbond)
[10:34:59] <apergos>	 uh oh
[10:34:59] <godog>	 here too, checking
[10:35:00] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[10:35:04] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:35:07] <jynus>	 I cannot connect to the master
[10:35:10] <marostegui>	 I am in
[10:35:11] <_joe_>	 everything is down I'd say
[10:35:14] <marostegui>	 thousands of SELECTs
[10:35:18] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s4 on db1150 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2013, Errmsg: error connecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Lost connection to MySQL server at waiting for initial communication packet, system error: 110 Connection timed out https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:35:20] <icinga-wm>	 PROBLEM - Apache HTTP on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:35:30] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:35:31] <_joe_>	 marostegui: can we kill em all?
[10:35:32] <icinga-wm>	 PROBLEM - Apache HTTP on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:35:36] <marostegui>	 lots of | 2420632378 | wikiuser | 10.64.32.50:43834 | commonswiki | Query | 413 | statistics | SELECT /* MediaWiki\Extension\GlobalUsage\GlobalUsage::getLinksFromPage */ gil_to FROM `globalimagelinks` WHERE gil_wiki = 'ptwiki' AND gil_page = 396261 | 0.000 |
[10:35:36] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1410 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:35:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:35:44] <icinga-wm>	 PROBLEM - Apache HTTP on mw1390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:35:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:35:46] <icinga-wm>	 PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:35:48] <icinga-wm>	 PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[10:35:52] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.8769 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[10:35:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:35:56] <icinga-wm>	 PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:35:56] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:35:56] <icinga-wm>	 PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:35:57] <jynus>	 bad stuff on log
[10:35:58] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[10:35:58] <icinga-wm>	 PROBLEM - Apache HTTP on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:35:58] <icinga-wm>	 PROBLEM - Apache HTTP on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:00] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:00] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:02] <icinga-wm>	 PROBLEM - Apache HTTP on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:02] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:04] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:04] <icinga-wm>	 PROBLEM - Apache HTTP on mw1412 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:04] <icinga-wm>	 PROBLEM - Apache HTTP on mw1394 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:06] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1404 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:08] <icinga-wm>	 PROBLEM - Apache HTTP on mw1388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:08] <icinga-wm>	 PROBLEM - Apache HTTP on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:08] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:12] <icinga-wm>	 PROBLEM - Apache HTTP on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:12] <icinga-wm>	 PROBLEM - Apache HTTP on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:12] <icinga-wm>	 PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:12] <icinga-wm>	 PROBLEM - Apache HTTP on mw1406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:14] <icinga-wm>	 PROBLEM - Apache HTTP on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:14] <icinga-wm>	 PROBLEM - Apache HTTP on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:18] <icinga-wm>	 PROBLEM - Apache HTTP on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:18] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:36:18] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1394 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:20] <icinga-wm>	 PROBLEM - Apache HTTP on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:22] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:22] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:22] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:22] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:22] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:22] <icinga-wm>	 PROBLEM - Apache HTTP on mw1379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:23] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:23] <icinga-wm>	 PROBLEM - Apache HTTP on mw1382 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:24] <icinga-wm>	 PROBLEM - Apache HTTP on mw1360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:24] <marostegui>	 I am killing all the selects
[10:36:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:36:26] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1348.eqiad.wmnet, mw1386.eqiad.wmnet, mw1378.eqiad.wmnet, mw1390.eqiad.wmnet, mw1388.eqiad.wmnet, mw1345.eqiad.wmnet, mw1282.eqiad.wmnet, mw1408.eqiad.wmnet, mw1398.eqiad.wmnet, mw1357.eqiad.wmnet, mw1317.eqiad.wmnet, mw1290.eqiad.wmnet, mw1316.eqiad.wmnet, mw1342.eqiad.wmnet, mw1382.
[10:36:26] <icinga-wm>	 89.eqiad.wmnet, mw1341.eqiad.wmnet, mw1360.eqiad.wmnet, mw1313.eqiad.wmnet, mw1346.eqiad.wmnet, mw1361.eqiad.wmnet, mw1392.eqiad.wmnet, mw1287.eqiad.wmnet, mw1288.eqiad.wmnet, mw1281.eqiad.wmnet, mw1314.eqiad.wmnet, mw1412.eqiad.wmnet, mw1396.eqiad.wmnet, mw1404.eqiad.wmnet, mw1283.eqiad.wmnet, mw1381.eqiad.wmnet, mw1362.eqiad.wmnet, mw1340.eqiad.wmnet, mw1343.eqiad.wmnet, mw1347.eqiad.wmnet, mw1377.eqiad.wmnet, mw1286.eqiad.wmne
[10:36:26] <icinga-wm>	 mnet, mw1363.eqiad.wmnet, mw1359.eqiad.wmnet, mw1400.eqiad.wmnet, mw1383.eqiad.wmnet, mw1297.eqiad.wmnet, mw1375.eqiad.wmnet, mw1315.eqiad.wmnet, mw1285.eqiad.wmnet, mw1402.eqiad.wmnet, https://wikitech.wikimedia.org/wiki/PyBal
[10:36:28] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[10:36:28] <icinga-wm>	 PROBLEM - Apache HTTP on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:28] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:28] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[10:36:28] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:29] <icinga-wm>	 PROBLEM - Apache HTTP on mw1374 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:36:40] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:40] <icinga-wm>	 PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:40] <icinga-wm>	 PROBLEM - Apache HTTP on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:40] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:40] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:42] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:42] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:48] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[10:36:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:36:50] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata
[10:36:50] <icinga-wm>	 e on English Wikipedia returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML fo
[10:36:50] <icinga-wm>	 ned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[10:36:50] <icinga-wm>	 PROBLEM - Apache HTTP on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:50] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:51] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:36:51] <icinga-wm>	 PROBLEM - Apache HTTP on mw1376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:36:52] <icinga-wm>	 PROBLEM - Apache HTTP on mw1400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:37:08] <icinga-wm>	 PROBLEM - Apache HTTP on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:37:08] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1378 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:37:08] <icinga-wm>	 PROBLEM - Apache HTTP on mw1363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:37:10] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1284.eqiad.wmnet, mw1346.eqiad.wmnet, mw1380.eqiad.wmnet, mw1390.eqiad.wmnet, mw1357.eqiad.wmnet, mw1362.eqiad.wmnet, mw1361.eqiad.wmnet, mw1392.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1287.eqiad.wmnet, mw1348.eqiad.wmnet, mw1288.eqiad.wmnet, mw1281.eqiad.wmnet, mw1314.eqiad.wmnet, mw1386.eqiad.wmnet, mw1410
[10:37:10] <icinga-wm>	 402.eqiad.wmnet, mw1404.eqiad.wmnet, mw1283.eqiad.wmnet, mw1381.eqiad.wmnet, mw1388.eqiad.wmnet, mw1340.eqiad.wmnet, mw1343.eqiad.wmnet, mw1347.eqiad.wmnet, mw1377.eqiad.wmnet, mw1345.eqiad.wmnet, mw1339.eqiad.wmnet, mw1286.eqiad.wmnet, mw1282.eqiad.wmnet, mw1412.eqiad.wmnet, mw1398.eqiad.wmnet, mw1408.eqiad.wmnet, mw1384.eqiad.wmnet, mw1376.eqiad.wmnet, mw1363.eqiad.wmnet, mw1315.eqiad.wmnet, mw1317.eqiad.wmnet, mw1290.eqiad.wmn
[10:37:10] <icinga-wm>	 wmnet, mw1379.eqiad.wmnet, mw1396.eqiad.wmnet, mw1312.eqiad.wmnet, mw1394.eqiad.wmnet, mw1 https://wikitech.wikimedia.org/wiki/PyBal
[10:37:12] <icinga-wm>	 PROBLEM - Apache HTTP on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:37:20] <icinga-wm>	 RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 9.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:37:22] <icinga-wm>	 PROBLEM - Apache HTTP on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:37:22] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:37:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:37:34] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:37:34] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:37:34] <icinga-wm>	 RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[10:37:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:37:35] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1357 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:37:36] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:37:36] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1382 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:37:36] <icinga-wm>	 PROBLEM - Apache HTTP on mw1386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:37:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:37:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:37:38] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[10:37:38] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[10:37:40] <icinga-wm>	 PROBLEM - Apache HTTP on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:37:44] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:37:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:37:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:37:48] <icinga-wm>	 PROBLEM - Apache HTTP on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:37:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:37:50] <icinga-wm>	 PROBLEM - Apache HTTP on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:37:50] <icinga-wm>	 RECOVERY - Apache HTTP on mw1345 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.563 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:37:52] <icinga-wm>	 RECOVERY - Apache HTTP on mw1394 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.449 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:37:54] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1404 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 7.699 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:37:56] <icinga-wm>	 RECOVERY - Apache HTTP on mw1388 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.570 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:37:58] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1346 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 9.903 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:37:58] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1381 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:37:58] <icinga-wm>	 RECOVERY - Apache HTTP on mw1406 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 6.989 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:38:04] <icinga-wm>	 RECOVERY - Apache HTTP on mw1360 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:38:04] <icinga-wm>	 RECOVERY - Apache HTTP on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 7.793 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:38:04] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw1284.eqiad.wmnet, mw1346.eqiad.wmnet, mw1404.eqiad.wmnet, mw1357.eqiad.wmnet, mw1361.eqiad.wmnet, mw1392.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1287.eqiad.wmnet, mw1388.eqiad.wmnet, mw1288.eqiad.wmnet, mw1281.eqiad.wmnet, mw1314.eqiad.wmnet, mw1386.eqiad.wmnet, mw1348.eqiad.wmnet, mw1402.eqiad.wmnet, mw1390.
[10:38:04] <icinga-wm>	 83.eqiad.wmnet, mw1381.eqiad.wmnet, mw1362.eqiad.wmnet, mw1340.eqiad.wmnet, mw1343.eqiad.wmnet, mw1313.eqiad.wmnet, mw1347.eqiad.wmnet, mw1377.eqiad.wmnet, mw1345.eqiad.wmnet, mw1286.eqiad.wmnet, mw1282.eqiad.wmnet, mw1412.eqiad.wmnet, mw1398.eqiad.wmnet, mw1408.eqiad.wmnet, mw1376.eqiad.wmnet, mw1363.eqiad.wmnet, mw1359.eqiad.wmnet, mw1317.eqiad.wmnet, mw1290.eqiad.wmnet, mw1316.eqiad.wmnet, mw1379.eqiad.wmnet, mw1400.eqiad.wmne
[10:38:04] <icinga-wm>	 mnet, mw1406.eqiad.wmnet, mw1375.eqiad.wmnet, mw1342.eqiad.wmnet, mw1378.eqiad.wmnet, mw1382.eqiad.wmnet, mw1289.eqiad.wmnet, mw1315.eqiad.wmnet, mw1341.eqiad.wmnet, mw1285.eqiad.wmnet, https://wikitech.wikimedia.org/wiki/PyBal
[10:38:05] <icinga-wm>	 PROBLEM - Apache HTTP on mw1398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:38:06] <icinga-wm>	 RECOVERY - Apache HTTP on mw1382 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 3.992 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:38:06] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1356 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 5.250 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:38:06] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:38:06] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1394 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 9.584 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:38:08] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1289 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 8.350 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:38:10] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1342 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 9.075 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:38:10] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1398 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 9.398 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:38:14] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1340 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 9.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:38:15] <icinga-wm>	 RECOVERY - Apache HTTP on mw1342 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 9.244 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:38:15] <icinga-wm>	 RECOVERY - Apache HTTP on mw1374 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.456 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:38:22] <icinga-wm>	 RECOVERY - Apache HTTP on mw1315 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 5.198 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:38:24] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 4.205 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:38:24] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1345 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 7.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:38:24] <icinga-wm>	 RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 7.319 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:38:26] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1312 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 9.204 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:38:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:38:30] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most 
[10:38:30] <icinga-wm>	  January 1, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[10:38:32] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1347 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 2.708 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:38:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:38:38] <icinga-wm>	 RECOVERY - Apache HTTP on mw1377 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.990 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:38:40] <icinga-wm>	 RECOVERY - Apache HTTP on mw1400 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.762 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:38:40] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 7.513 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:38:52] <icinga-wm>	 RECOVERY - Apache HTTP on mw1363 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:38:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1341 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.193 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:38:54] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1378 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 2.163 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:39:00] <icinga-wm>	 RECOVERY - Apache HTTP on mw1380 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:00] <icinga-wm>	 RECOVERY - Apache HTTP on mw1287 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 3.071 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:10] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1383 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:39:14] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1376 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.298 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:39:14] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 990 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:39:14] <icinga-wm>	 RECOVERY - Apache HTTP on mw1339 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 2.725 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:18] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1410 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.749 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:39:26] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1288 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.422 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:39:28] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1357 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:39:28] <icinga-wm>	 RECOVERY - Apache HTTP on mw1343 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 0.265 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:39:28] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1283 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 2.188 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:39:28] <icinga-wm>	 RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:28] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1380 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:39:29] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1382 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:39:30] <icinga-wm>	 RECOVERY - Apache HTTP on mw1386 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 0.960 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:32] <icinga-wm>	 RECOVERY - Apache HTTP on mw1390 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 4.598 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:39:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:39:34] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[10:39:34] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[10:39:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:39:38] <icinga-wm>	 RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:39:40] <icinga-wm>	 RECOVERY - Apache HTTP on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 6.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:39:40] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1285 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.989 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:39:40] <icinga-wm>	 RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:40] <icinga-wm>	 RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 2.167 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:40] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1344 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 2.510 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:39:42] <icinga-wm>	 RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:42] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1390 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:39:42] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1317 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:39:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:39:44] <icinga-wm>	 RECOVERY - Apache HTTP on mw1340 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 3.343 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:44] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1363 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:39:44] <icinga-wm>	 RECOVERY - Apache HTTP on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.497 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:39:46] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:39:46] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:39:46] <icinga-wm>	 RECOVERY - Apache HTTP on mw1412 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:52] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1381 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 2.838 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:39:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1297 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 4.662 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:56] <icinga-wm>	 RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:56] <icinga-wm>	 RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.747 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:58] <icinga-wm>	 RECOVERY - Apache HTTP on mw1398 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 0.347 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:39:58] <icinga-wm>	 RECOVERY - Apache HTTP on mw1362 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.460 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:40:00] <icinga-wm>	 RECOVERY - Apache HTTP on mw1344 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.919 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:40:00] <icinga-wm>	 RECOVERY - Apache HTTP on mw1288 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:40:00] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.312 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:40:02] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1377 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.491 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:40:04] <icinga-wm>	 RECOVERY - Apache HTTP on mw1379 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.148 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:40:05] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1313 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.239 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:40:05] <icinga-wm>	 RECOVERY - Apache HTTP on mw1313 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 2.554 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:40:06] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1286 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.451 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:40:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:40:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:40:12] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1379 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.363 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:40:18] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:40:18] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[10:40:18] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[10:40:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:40:25] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s4 #page on db1142 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1040, Errmsg: error reconnecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Too many connections https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:40:25] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:40:26] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1358 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.765 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:40:26] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1339 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 3.296 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:40:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:40:38] <icinga-wm>	 RECOVERY - Apache HTTP on mw1376 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 2.965 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:40:38] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[10:40:38] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1362 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 3.726 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:40:40] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[10:40:57] <logmsgbot>	 !log volans@cumin1001 dbctl commit (dc=all): 'S4 RO, outage', diff saved to https://phabricator.wikimedia.org/P15585 and previous config saved to /var/cache/conftool/dbconfig/20210427-104057-volans.json
[10:41:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:50] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[10:42:30] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[10:42:45] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s4 #page on db1144 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1040, Errmsg: error connecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Too many connections https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:42:46] <icinga-wm>	 PROBLEM - MariaDB read only s4 #page on db1138 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:42:53] <icinga-wm>	 PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2548 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[10:43:06] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[10:43:34] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 1.746e+04 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:43:36] <wikibugs>	 (Merged) jenkins-bot: WikiPageConfigValidation: Mentor lists and help desk can be null [extensions/GrowthExperiments] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/682812 (https://phabricator.wikimedia.org/T281229) (owner: Urbanecm)
[10:43:41] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s4 #page on db1160 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1053, Errmsg: error connecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Server shutdown in progress https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:44:04] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s4 on db2090 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1053, Errmsg: error connecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Server shutdown in progress https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:44:24] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:44:28] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s4 on db1145 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error connecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1138.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:44:53] <icinga-wm>	 RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5643 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[10:45:00] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[10:45:13] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s4 #page on db1121 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error connecting to master repl@db1138.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1138.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:45:58] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.01538 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[10:46:44] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[10:46:48] <apergos>	 NOTE: all deployments are on hold until a further announcement is made
[10:47:12] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[10:47:30] <Urbanecm>	 apergos: I put that into the topic, this will get flooded by icinga
[10:47:34] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:47:43] <apergos>	 good call
[10:47:58] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:48:18] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[10:48:25] <Majavah>	 Urbanecm: can you put s4/commons RO being known there too? /me is surprised that no-one has been asking about that yet
[10:48:34] <Urbanecm>	 good point
[10:48:43] <Majavah>	 and maybe -tech if you have access there
[10:49:28] <Urbanecm>	 Majavah: done both chans
[10:49:32] <Majavah>	 ty
[10:49:35] <Urbanecm>	 np
[10:51:54] <wikibugs>	 SRE, CommRel-Specialists-Support (Apr-Jun-2021), Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (Elitre)
[10:52:40] <wikibugs>	 SRE, CommRel-Specialists-Support (Apr-Jun-2021), Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (Elitre) @sgrabarczuk @Trizek-WMF ^^^
[10:55:12] <wikibugs>	 (CR) Michael Große: [C: +1] "seems reasonable to me" [mediawiki-config] - https://gerrit.wikimedia.org/r/682608 (https://phabricator.wikimedia.org/T204031) (owner: Tonina Zhelyazkova)
[10:55:28] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 128 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:56:34] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:57:24] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:59:15] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s4 #page on db1143 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:59:43] <jem>	 Problems when deleting in eswiki
[10:59:48] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s4 on db1145 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:59:58] <marostegui>	 jem: we are having issues with commons, so might be related
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T1100). Please do the needful.
[11:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[11:00:05] <jem>	 Ok
[11:00:11] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s4 #page on db1160 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:00:19] <jem>	 Just to note that it's not just Commons :)
[11:00:19] <Amir1>	 nope, no deploy rn
[11:00:28] <Urbanecm>	 no deploys
[11:00:28] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on labstore1007 is CRITICAL: cluster=wmcs device=1I:1:5 instance=labstore1007 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1007&var-datasource=eqiad+prometheus/ops
[11:00:48] <Amir1>	 jem: do you have a concrete problem?
[11:00:52] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 554 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:01:10] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 140 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:01:32] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[11:02:01] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s4 #page on db1142 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:03:43] <Majavah>	 Amir1: maybe image global usage? for some reason deletion is not being done in a job, just directly in the hook https://github.com/wikimedia/mediawiki-extensions-GlobalUsage/blob/fd85afae25cab78bc40991fab79a61b5b42c1ed4/includes/Hooks.php#L125
[11:04:02] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[11:04:19] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s4 #page on db1148 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:05:08] <jem>	 Amir1: Error when deleting a page, let me try again
[11:05:20] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 42 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:05:27] <Amir1>	 jem: in which wiki?
[11:05:31] <jem>	 "[f72940ef-e506-4cf5-89cc-703d37572f9c] 2021-04-27 11:05:16: Fatal exception of type "Wikimedia\Rdbms\DBReadOnlyError"
[11:05:34] <jem>	 Amir1: eswiki
[11:06:18] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[11:06:31] <Majavah>	 Amir1: yeah confirmed on testwiki, globalusage blocks deletion because commons is ro
[11:06:32] <jem>	 Recentchanges has activity
[11:06:35] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s4 #page on db1141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:06:58] <Urbanecm>	 jem: yup, it's known, wait for a while until we fix the commons issue, and then it should work again :)
[11:07:19] <jem>	 Ok, thanks, Urbanecm :)
[11:07:27] <jem>	 I'll keep an eye here
[11:07:51] <wikibugs>	 (Abandoned) Hnowlan: site: set role for eventlog1003 to eventlog [puppet] - https://gerrit.wikimedia.org/r/681652 (https://phabricator.wikimedia.org/T280679) (owner: Hnowlan)
[11:09:37] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/CommonSettings.php: Disable GlobalUsage (duration: 01m 08s)
[11:09:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:49] <Amir1>	 marostegui: synced
[11:09:56] <marostegui>	 ok, going to remove RW
[11:10:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove RW from commonswiki', diff saved to https://phabricator.wikimedia.org/P15588 and previous config saved to /var/cache/conftool/dbconfig/20210427-111016-marostegui.json
[11:10:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:31] <Majavah>	 Amir1: nO DePLoYs :D
[11:10:49] <Majavah>	 (sorry)
[11:11:04] <Amir1>	 haha
[11:11:09] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s4 #page on db1144 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:11:54] <jem>	 Deletion worked :)
[11:12:03] <apergos>	 \o/
[11:12:37] <Majavah>	 commons recent changes is now flowing
[11:12:47] <_joe_>	 Urbanecm: update topic? commons is not ro anymore
[11:13:06] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:13:19] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s4 #page on db1121 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:13:20] <icinga-wm>	 RECOVERY - MariaDB read only s4 #page on db1138 is OK: Version 10.1.43-MariaDB, Uptime 935s, read_only: False, event_scheduler: True, 1768.29 QPS, connection latency: 0.001838s, query latency: 0.000246s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[11:13:20] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s4 on db1150 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:14:09] <_joe_>	 we should remember to kill changeprop when we go read-only maybe
[11:14:22] <_joe_>	 although tbf it will retry
[11:15:47] <Urbanecm>	 _joe_: sure, I'll update it
[11:16:09] <_joe_>	 just because I saw you had op already :)
[11:17:09] <Urbanecm>	 sure :)
[11:17:12] <Urbanecm>	 also updated in -tech
[11:17:30] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:17:50] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s4 on db2090 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:18:34] <jbond42>	 _joe_: should I add that as an action item (kill changeprop when we go read-only)?
[11:18:53] <_joe_>	 jbond42: yeah on second thoughts, it's not needed probably
[11:19:29] <jbond42>	 _joe_: ack added thanks
[11:22:17] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s4 #page on db1146 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:30:43] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker
[11:30:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:55] <elukey>	 jayme: --^
[11:31:08] <jayme>	 elukey: ack, thx!
[11:31:48] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on labstore1007 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1007&var-datasource=eqiad+prometheus/ops
[11:33:32] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:36:49] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0)
[11:36:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: Repool db1157', diff saved to https://phabricator.wikimedia.org/P15589 and previous config saved to /var/cache/conftool/dbconfig/20210427-114108-root.json
[11:41:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:53] <wikibugs>	 SRE, serviceops, Sustainability (Incident Followup): Ensure Changeprop is disabled when the databases are in read only mode - https://phabricator.wikimedia.org/T281240 (jbond)
[11:42:55] <wikibugs>	 (PS1) QChris: Add .gitreview [software/pipermail-redirector] - https://gerrit.wikimedia.org/r/682932
[11:42:57] <wikibugs>	 (CR) QChris: [V: +2 C: +2] Add .gitreview [software/pipermail-redirector] - https://gerrit.wikimedia.org/r/682932 (owner: QChris)
[11:47:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:50:45] <wikibugs>	 SRE, ChangeProp, serviceops, Sustainability (Incident Followup): Ensure Changeprop is disabled when the databases are in read only mode - https://phabricator.wikimedia.org/T281240 (hnowlan)
[11:51:03] <wikibugs>	 (PS1) Ladsgroup: Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/682813 (https://phabricator.wikimedia.org/T281238)
[11:51:18] <wikibugs>	 (PS1) Ladsgroup: Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.1) - https://gerrit.wikimedia.org/r/682814 (https://phabricator.wikimedia.org/T281238)
[11:54:47] <wikibugs>	 (CR) Ladsgroup: [C: +2] Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.1) - https://gerrit.wikimedia.org/r/682814 (https://phabricator.wikimedia.org/T281238) (owner: Ladsgroup)
[11:54:52] <wikibugs>	 (CR) Ladsgroup: [C: +2] Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/682813 (https://phabricator.wikimedia.org/T281238) (owner: Ladsgroup)
[11:56:06] <wikibugs>	 (CR) jerkins-bot: [V: -1] Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.1) - https://gerrit.wikimedia.org/r/682814 (https://phabricator.wikimedia.org/T281238) (owner: Ladsgroup)
[11:56:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: Repool db1157', diff saved to https://phabricator.wikimedia.org/P15590 and previous config saved to /var/cache/conftool/dbconfig/20210427-115612-root.json
[11:56:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:52] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 85 probes of 637 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:57:13] <wikibugs>	 (03PS1) 10Ladsgroup: URGENT: Disable GlobalUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682935
[11:57:55] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] URGENT: Disable GlobalUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682935 (owner: 10Ladsgroup)
[11:58:46] <wikibugs>	 (03PS2) 10Urbanecm: Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/682814 (https://phabricator.wikimedia.org/T281238) (owner: 10Ladsgroup)
[11:59:00] <wikibugs>	 (03Merged) 10jenkins-bot: URGENT: Disable GlobalUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682935 (owner: 10Ladsgroup)
[12:00:04] <jouncebot>	 Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T1200)
[12:00:33] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on labstore1007.wikimedia.org with reason: T281045
[12:00:33] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on labstore1007.wikimedia.org with reason: T281045
[12:00:39] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/682814 (https://phabricator.wikimedia.org/T281238) (owner: 10Ladsgroup)
[12:00:41] <wikibugs>	 10SRE, 10ChangeProp, 10serviceops, 10Sustainability (Incident Followup): Ensure Changeprop is disabled when the databases are in read only mode - https://phabricator.wikimedia.org/T281240 (10Joe) To be clear, the idea came out of the fact that during read-only time we had a lot of jobs failing, but given w...
[12:00:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:43] <stashbot>	 T281045: labstore1007 crashed after storage controller errors - https://phabricator.wikimedia.org/T281045
[12:00:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:32] <wikibugs>	 (03Merged) 10jenkins-bot: Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/682813 (https://phabricator.wikimedia.org/T281238) (owner: 10Ladsgroup)
[12:03:10] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 54 probes of 637 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:06:00] <wikibugs>	 (03Merged) 10jenkins-bot: Avoid reading primary unless absolutely necessary [extensions/GlobalUsage] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/682814 (https://phabricator.wikimedia.org/T281238) (owner: 10Ladsgroup)
[12:10:32] <wikibugs>	 (03CR) 10ZPapierski: "> Patch Set 12:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles)
[12:11:00] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[12:11:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: Repool db1157', diff saved to https://phabricator.wikimedia.org/P15591 and previous config saved to /var/cache/conftool/dbconfig/20210427-121115-root.json
[12:11:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:04] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[12:12:38] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.3/extensions/GlobalUsage: Backport: [[gerrit:682813|Avoid reading primary unless absolutely necessary (T281238)]] (duration: 01m 09s)
[12:12:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:49] <stashbot>	 T281238: GlobalUsage does selects on the master database - https://phabricator.wikimedia.org/T281238
[12:13:47] <wikibugs>	 (03CR) 10ArielGlenn: "Will this work for timers like this one: https://github.com/wikimedia/puppet/blob/production/modules/snapshot/manifests/cron/pagetitles.pp" [puppet] - 10https://gerrit.wikimedia.org/r/682922 (owner: 10Jbond)
[12:15:10] <icinga-wm>	 RECOVERY - snapshot of s3 in codfw on alert1001 is OK: Last snapshot for s3 at codfw (db2139.codfw.wmnet:3313) taken on 2021-04-27 08:43:32 (1037 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[12:20:12] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.1/extensions/GlobalUsage: Backport: [[gerrit:682814|Avoid reading primary unless absolutely necessary (T281238)]] (duration: 01m 09s)
[12:20:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:21] <stashbot>	 T281238: GlobalUsage does selects on the master database - https://phabricator.wikimedia.org/T281238
[12:23:24] <icinga-wm>	 RECOVERY - Disk space on mwlog1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops
[12:24:12] <apergos>	 fine fine. in two-three days you will be much happier, mwlog1001
[12:24:53] <wikibugs>	 10SRE, 10Sustainability (Incident Followup): ReEnable GlobalUsage - https://phabricator.wikimedia.org/T281242 (10jbond)
[12:25:12] <wikibugs>	 10SRE, 10Sustainability (Incident Followup): ReEnable GlobalUsage - https://phabricator.wikimedia.org/T281242 (10jbond)
[12:25:38] <wikibugs>	 10SRE, 10Sustainability (Incident Followup): ReEnable GlobalUsage - https://phabricator.wikimedia.org/T281242 (10jbond)
[12:26:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: Repool db1157', diff saved to https://phabricator.wikimedia.org/P15592 and previous config saved to /var/cache/conftool/dbconfig/20210427-122619-root.json
[12:26:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:33] <Urbanecm>	 apergos: you mean, replaced with mwlog1002? :)
[12:26:45] <moritzm>	 I'll be much happier if mwlog1001 doesn't exist anymore in 2-3 days...
[12:27:06] <wikibugs>	 10SRE, 10Sustainability (Incident Followup), 10User-Ladsgroup: ReEnable GlobalUsage - https://phabricator.wikimedia.org/T281242 (10Ladsgroup) a:03Ladsgroup
[12:27:31] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "URGENT: Disable GlobalUsage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682815
[12:27:54] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:28:04] <wikibugs>	 (03PS2) 10Ladsgroup: Revert "URGENT: Disable GlobalUsage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682815 (https://phabricator.wikimedia.org/T281242)
[12:28:07] <apergos>	 no, I mean that wmf.3 is this week's train, right? and it has the 'quit logging every cache miss for externalstore kthxbai" patch
[12:28:20] <apergos>	 ugh mismatched ' and " and not correctable, the worst
[12:28:38] <apergos>	 anyways that will save a few hundred gb right there
[12:29:15] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Revert "URGENT: Disable GlobalUsage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682815 (https://phabricator.wikimedia.org/T281242) (owner: 10Ladsgroup)
[12:29:59] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "URGENT: Disable GlobalUsage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682815 (https://phabricator.wikimedia.org/T281242) (owner: 10Ladsgroup)
[12:38:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Decom ms-be[1019-1026] [puppet] - 10https://gerrit.wikimedia.org/r/682920 (https://phabricator.wikimedia.org/T272836) (owner: 10Filippo Giunchedi)
[12:38:36] <wikibugs>	 (03PS2) 10Filippo Giunchedi: Decom ms-be[1019-1026] [puppet] - 10https://gerrit.wikimedia.org/r/682920 (https://phabricator.wikimedia.org/T272836)
[12:39:07] <wikibugs>	 10SRE, 10MediaWiki-General, 10Traffic, 10Browser-Support-Apple-Safari: File:Chessboard480.svg WEBP thumbnail version not visible on safari when size is fixed at 208px - https://phabricator.wikimedia.org/T280439 (10Daimona) 05Open→03Resolved a:03ema Working as expected now, thank you!
[12:44:05] <hashar>	 !log Restarted CI Jenkins for plugins upgrade
[12:44:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:27] <wikibugs>	 10SRE, 10observability, 10MW-1.37-notes (1.37.0-wmf.3; 2021-04-27), 10Patch-For-Review, and 2 others: mwlog1001 is running out of free space on /srv/mw-log - https://phabricator.wikimedia.org/T281048 (10AMooney)
[12:45:18] <wikibugs>	 (03PS2) 10Jbond: systemd::timer::job: quote command as it may contain arguments [puppet] - 10https://gerrit.wikimedia.org/r/682922
[12:46:25] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29214/console" [puppet] - 10https://gerrit.wikimedia.org/r/682922 (owner: 10Jbond)
[12:46:27] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:682815|Revert "URGENT: Disable GlobalUsage" (T281242)]] (duration: 01m 08s)
[12:46:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:36] <stashbot>	 T281242: ReEnable GlobalUsage - https://phabricator.wikimedia.org/T281242
[12:48:05] <wikibugs>	 (03CR) 10Effie Mouzeli: changeprop/changeprop-jobqueue/api-gateway: Use the new rdbs [deployment-charts] - 10https://gerrit.wikimedia.org/r/614901 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli)
[12:50:48] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks! I 'll merge and deploy 1 by 1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/614901 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli)
[12:52:40] <wikibugs>	 10SRE, 10DBA, 10Sustainability (Incident Followup): Collect metricts for Exec_Master_Log_Pos - https://phabricator.wikimedia.org/T281251 (10jbond)
[12:54:27] <wikibugs>	 10SRE, 10DBA, 10Sustainability (Incident Followup): Collect metricts for Exec_Master_Log_Pos - https://phabricator.wikimedia.org/T281251 (10jcrespo)
[12:55:08] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts ms-be1019.eqiad.wmnet
[12:55:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:46] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1020 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:57:29] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: remove ms-be2016.yml, host long gone [puppet] - 10https://gerrit.wikimedia.org/r/682936
[12:57:55] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] api-gateway: Move networkpolicy to shared values [deployment-charts] - 10https://gerrit.wikimedia.org/r/682905 (owner: 10Alexandros Kosiaris)
[12:59:03] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: nftables: basefirewall: play with the order of the last counter rule [puppet] - 10https://gerrit.wikimedia.org/r/682937
[12:59:05] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: nftables: basefirewall: introduce prometheus facility [puppet] - 10https://gerrit.wikimedia.org/r/682938 (https://phabricator.wikimedia.org/T281124)
[12:59:07] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: enable prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/682939 (https://phabricator.wikimedia.org/T281124)
[12:59:19] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: Move networkpolicy to shared values [deployment-charts] - 10https://gerrit.wikimedia.org/r/682905 (owner: 10Alexandros Kosiaris)
[12:59:21] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop/changeprop-jobqueue/api-gateway: Use the new rdbs [deployment-charts] - 10https://gerrit.wikimedia.org/r/614901 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli)
[13:00:04] <jouncebot>	 liw and longma: That opportune time is upon us again. Time for a MediaWiki train - European+American Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T1300).
[13:01:12] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:01:26] <Urbanecm>	 liw: longma: please not yet
[13:01:55] <wikibugs>	 (03PS1) 10Lars Wirzenius: group0 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682940
[13:01:57] <wikibugs>	 (03CR) 10Lars Wirzenius: [C: 03+2] group0 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682940 (owner: 10Lars Wirzenius)
[13:02:02] <wikibugs>	 (03PS2) 10JMeybohm: Swap zookeeper from conf2003 to conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/682667 (https://phabricator.wikimedia.org/T271573)
[13:02:19] <wikibugs>	 10SRE, 10Patch-For-Review, 10Sustainability (Incident Followup), 10User-Ladsgroup: ReEnable GlobalUsage - https://phabricator.wikimedia.org/T281242 (10Ladsgroup) 05Open→03Resolved
[13:02:20] <wikibugs>	 (03CR) 10Elukey: "Good start! The way that I approach it is by layers:" [puppet] - 10https://gerrit.wikimedia.org/r/682785 (https://phabricator.wikimedia.org/T278423) (owner: 10Razzi)
[13:02:27] <liw>	 Urbanecm, I killed deploy-promote, what's up?
[13:02:58] <Urbanecm>	 liw: mediawiki-staging is now in a weird state. I merged a patch there, then an incident came, and now I need to either sync or revert :)
[13:03:19] <liw>	 Urbanecm, which do you prefer?
[13:03:34] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682940 (owner: 10Lars Wirzenius)
[13:03:57] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/29215/" [puppet] - 10https://gerrit.wikimedia.org/r/682937 (owner: 10Arturo Borrero Gonzalez)
[13:04:17] <liw>	 hrmph, can't abandon https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/682940 - apparently change is merged already
[13:04:32] <Urbanecm>	 liw: sync, if possible. It fixes a bug you filed earlier today :)
[13:04:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: remove ms-be2016.yml, host long gone [puppet] - 10https://gerrit.wikimedia.org/r/682936 (owner: 10Filippo Giunchedi)
[13:04:59] <liw>	 Urbanecm, sync it is, what needs to be done?
[13:05:18] <Urbanecm>	 I'll sync it and ping you, if that's ok :)
[13:05:38] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be1019.eqiad.wmnet
[13:05:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:46] <liw>	 Urbanecm, absolutely - I'll go make a pot of tea meanwhile
[13:06:23] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts ms-be[1020-1026].eqiad.wmnet
[13:06:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:00] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on conf[2004-2006].codfw.wmnet with reason: for zookeeper migration
[13:07:02] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on conf[2004-2006].codfw.wmnet with reason: for zookeeper migration
[13:07:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:08] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10ops-monitoring-bot) Icinga downtime set by jayme@cumin1001 for 2:00:00 3 host(s) and their services with reason: for zookeeper migration ` conf[2004-2006].codfw.wmnet `
[13:07:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:04] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.3/extensions/GrowthExperiments/includes/Config/WikiPageConfigValidation.php: fe2a0420fd884df7046c0c283bcb2e961e74e8e9: WikiPageConfigValidation: Mentor lists and help desk can be null (T281229) (duration: 01m 06s)
[13:08:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:13] <stashbot>	 T281229: InvalidArgumentException: GrowthExperiments\Config\WikiPageConfigWriter::getCurrentWikiConfig failed to load config - https://phabricator.wikimedia.org/T281229
[13:09:18] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Swap zookeeper from conf2003 to conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/682667 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm)
[13:10:02] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on ms-be1022 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1022&var-datasource=eqiad+prometheus/ops
[13:10:33] <Urbanecm>	 liw: i'm done, thanks. floor is yours
[13:10:56] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[13:11:02] <wikibugs>	 (03PS3) 10Jbond: systemd::timer::job: quote command as it may contain arguments [puppet] - 10https://gerrit.wikimedia.org/r/682922
[13:11:04] <wikibugs>	 (03CR) 10Ppchelko: "gosh... cmon envoy. This is so horrible it's almost doing a full circle to beautiful..." [deployment-charts] - 10https://gerrit.wikimedia.org/r/682921 (https://phabricator.wikimedia.org/T277585) (owner: 10Hnowlan)
[13:11:14] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_zookeeper site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:11:39] <wikibugs>	 (03PS1) 10Lars Wirzenius: group0 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682941
[13:11:41] <wikibugs>	 (03CR) 10Lars Wirzenius: [C: 03+2] group0 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682941 (owner: 10Lars Wirzenius)
[13:11:57] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "I tested this on snapshot1008 pagetitles-ns0.service and the following command ended up getting issued" [puppet] - 10https://gerrit.wikimedia.org/r/682922 (owner: 10Jbond)
[13:12:04] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[13:12:29] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682941 (owner: 10Lars Wirzenius)
[13:13:43] <logmsgbot>	 !log liw@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.3
[13:13:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:54] <liw>	 train at group0
[13:19:46] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers
[13:19:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:24] <wikibugs>	 10SRE, 10ChangeProp, 10serviceops, 10Sustainability (Incident Followup): Ensure Changeprop is disabled when the databases are in read only mode - https://phabricator.wikimedia.org/T281240 (10Pchelolo) yeah, that's correct. We can increase the additional delay if needed. Also, this particular additional del...
[13:21:30] <wikibugs>	 10SRE, 10serviceops: Ubtade grafana link for mediawiki-error-rate-$cluster check - https://phabricator.wikimedia.org/T281261 (10jbond)
[13:21:55] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be[1020-1026].eqiad.wmnet
[13:22:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:28] <wikibugs>	 10SRE, 10serviceops: Ubtade grafana link for mediawiki-error-rate-$cluster check - https://phabricator.wikimedia.org/T281261 (10jbond) @jijiki perhaps?
[13:23:04] <wikibugs>	 10SRE, 10serviceops: Update grafana link for mediawiki-error-rate-$cluster in icinga check - https://phabricator.wikimedia.org/T281261 (10jbond)
[13:23:18] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers
[13:23:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:28] <wikibugs>	 10ops-eqiad, 10SRE-swift-storage, 10User-fgiunchedi: Decom ms-be[1019-1026] - https://phabricator.wikimedia.org/T272836 (10fgiunchedi) @Cmjohnson or @Jclark-ctr all yours, hosts ready for decom
[13:23:44] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T280961 (10fgiunchedi) 05Open→03Declined Hosts is decom
[13:23:46] <wikibugs>	 10ops-eqiad, 10SRE-swift-storage, 10User-fgiunchedi: Decom ms-be[1019-1026] - https://phabricator.wikimedia.org/T272836 (10fgiunchedi)
[13:24:50] <wikibugs>	 10SRE, 10SRE-swift-storage: Some object-replicator log lines not making it to centrallog - https://phabricator.wikimedia.org/T264998 (10fgiunchedi)
[13:24:52] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Patch-For-Review, 10User-fgiunchedi: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 (10fgiunchedi)
[13:26:32] <wikibugs>	 (03PS2) 10JMeybohm: configcluster: No longer include zookeeper in old configcluster role [puppet] - 10https://gerrit.wikimedia.org/r/682669 (https://phabricator.wikimedia.org/T271573)
[13:30:46] <hashar>	 !log Upgrading CI Jenkins from 2.263.3 to 2.277.2
[13:30:50] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: nftables: basefirewall: drop double-quote scaping [puppet] - 10https://gerrit.wikimedia.org/r/682942
[13:30:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:37] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] nftables: basefirewall: drop double-quote scaping [puppet] - 10https://gerrit.wikimedia.org/r/682942 (owner: 10Arturo Borrero Gonzalez)
[13:33:01] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[13:33:01] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[13:33:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:10] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[13:34:10] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[13:34:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:26] <liw>	 kostajh, I see a fix was merged for T281226; would you be able to do a backport of it for train?
[13:40:27] <stashbot>	 T281226: PHP Notice: Only variables should be assigned by reference - https://phabricator.wikimedia.org/T281226
[13:42:23] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: api-gateway: Clear-up the nutcracker configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/682945
[13:43:06] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: api-gateway: Clear-up the nutcracker configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/682945
[13:44:15] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
[13:44:15] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[13:44:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:13] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[13:45:13] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
[13:45:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:30] <akosiaris>	 !log switchover api-gateway, changeprop, cpjobqueue to use the new redis cluster servers (rdb2007-rdb2010)
[13:45:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:59] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
[13:45:59] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[13:46:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:50] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] api-gateway: Clear-up the nutcracker configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/682945 (owner: 10Alexandros Kosiaris)
[13:48:39] <moritzm>	 !log uploaded openjdk-8 8u292-b10-0~deb10u1 (buster forward port of latest Java 8 security release)
[13:48:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:08] <liw>	 or anyone else doing backports: would you be able to do a backport of it for train?
[13:50:41] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: api-gateway: Clear-up the nutcracker configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/682945
[13:50:43] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Remove rdb200{3,5} from netpols [deployment-charts] - 10https://gerrit.wikimedia.org/r/682912 (https://phabricator.wikimedia.org/T255250)
[13:50:45] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet
[13:50:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:24] <wikibugs>	 (03PS1) 10Andrew Bogott: Horizon: put into maintenance mode for Victoria upgrade [puppet] - 10https://gerrit.wikimedia.org/r/682948 (https://phabricator.wikimedia.org/T261137)
[13:54:26] <wikibugs>	 (03PS1) 10Andrew Bogott: cloud-vps eqiad1 -> version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/682949 (https://phabricator.wikimedia.org/T261137)
[13:54:28] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "Horizon: put into maintenance mode for Victoria upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/682950 (https://phabricator.wikimedia.org/T261137)
[13:55:19] <moritzm>	 !log imported jenkins 2.277.3 to thirdparty/ci
[13:55:24] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-codfw #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-codfw&var-topic=All&var-consumer_group=All
[13:55:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:42] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10observability: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) @fgiunchedi please see the document you requested  {F34429804}
[13:56:12] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1001.eqiad.wmnet
[13:56:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:40] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Horizon: put into maintenance mode for Victoria upgrade [puppet] - 10https://gerrit.wikimedia.org/r/682948 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott)
[13:58:54] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1002.eqiad.wmnet
[13:59:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:29] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 9 hosts with reason: upgrading                  openstack
[14:00:32] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 9 hosts with reason: upgrading                  openstack
[14:00:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "merging after talking to Antoine" [puppet] - 10https://gerrit.wikimedia.org/r/670990 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[14:01:18] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 105 hosts with reason: upgrading openstack
[14:01:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:56] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 105 hosts with reason: upgrading openstack
[14:02:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:21] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1002.eqiad.wmnet
[14:04:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:35] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[14:08:35] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[14:08:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:13] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1003.eqiad.wmnet
[14:09:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:10] <wikibugs>	 10SRE, 10observability: Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10MoritzMuehlenhoff)
[14:10:21] <wikibugs>	 10SRE, 10observability: Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10MoritzMuehlenhoff) p:05Triage→03High
[14:11:05] <wikibugs>	 (03PS1) 10Hashar: zuul-gearman.py: response must be decoded [puppet] - 10https://gerrit.wikimedia.org/r/682953
[14:11:08] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps eqiad1 -> version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/682949 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott)
[14:11:22] <wikibugs>	 (03CR) 10Hashar: "Follow up: https://gerrit.wikimedia.org/r/c/operations/puppet/+/682953 zuul-gearman.py: response must be decoded" [puppet] - 10https://gerrit.wikimedia.org/r/670990 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[14:11:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] systemd::timer::job: quote command as it may contain arguments [puppet] - 10https://gerrit.wikimedia.org/r/682922 (owner: 10Jbond)
[14:13:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] zuul-gearman.py: response must be decoded [puppet] - 10https://gerrit.wikimedia.org/r/682953 (owner: 10Hashar)
[14:14:34] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1003.eqiad.wmnet
[14:14:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:46] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
[14:15:46] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
[14:15:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:17] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
[14:16:17] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
[14:16:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:06] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
[14:17:06] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
[14:17:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:44] <moritzm>	 !log installing xen security updates
[14:19:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:24] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0)
[14:20:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:46] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore2001.codfw.wmnet
[14:20:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:57] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0)
[14:24:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:07] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2001.codfw.wmnet
[14:25:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:22] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore2002.codfw.wmnet
[14:28:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10RobH) These are failing to partition correctly during the initial imaging.  I ran out of bandwidth troubleshooting this yesterday evening, and will retu...
[14:31:10] <moritzm>	 !log installing imagemagick security updates
[14:31:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:34] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-me
[14:32:42] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[14:32:50] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2002.codfw.wmnet
[14:32:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:28] <bblack>	 !log dns2001 - depooling for T279457 (disable puppet + stop bird)
[14:33:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:38] <stashbot>	 T279457: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457
[14:33:46] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore2003.codfw.wmnet
[14:33:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:16] <icinga-wm>	 PROBLEM - Sessionstore codfw on sessionstore.svc.codfw.wmnet is CRITICAL: /sessions/v1/{key} (Get value for key) is CRITICAL: Test Get value for key returned the unexpected status 500 (expecting: 200): /sessions/v1/{key} (Store value for key) is CRITICAL: Test Store value for key returned the unexpected status 500 (expecting: 201) https://www.mediawiki.org/wiki/Kask
[14:34:58] <elukey>	 hnowlan: --^
[14:36:39] <bblack>	 !log cp203[56] - depool all etcd services via confctl - T279457
[14:36:44] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp203[56].codfw.wmnet
[14:36:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:53] <wikibugs>	 (03PS13) 10ZPapierski: rdf-streaming-updater: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles)
[14:36:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:24] <hnowlan>	 elukey: ack, thanks
[14:37:30] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: add newline at the end of file [puppet] - 10https://gerrit.wikimedia.org/r/682955
[14:38:42] <wikibugs>	 (03PS1) 10Ayounsi: cloudsw: manage OSPF [homer/public] - 10https://gerrit.wikimedia.org/r/682956
[14:38:44] <wikibugs>	 (03PS1) 10Ayounsi: cloudsw: policy-options [homer/public] - 10https://gerrit.wikimedia.org/r/682957
[14:38:58] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:39:10] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:39:12] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2003.codfw.wmnet
[14:39:19] <bblack>	 BFD is from the dns2001 depool earlier, will ack
[14:39:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloudsw: manage OSPF [homer/public] - 10https://gerrit.wikimedia.org/r/682956 (owner: 10Ayounsi)
[14:39:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloudsw: policy-options [homer/public] - 10https://gerrit.wikimedia.org/r/682957 (owner: 10Ayounsi)
[14:39:32] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on dns2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[14:40:24] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:40:40] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:41:24] <icinga-wm>	 ACKNOWLEDGEMENT - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast Brandon Black T279457 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:41:24] <icinga-wm>	 ACKNOWLEDGEMENT - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast Brandon Black T279457 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:41:24] <icinga-wm>	 ACKNOWLEDGEMENT - Bird Internet Routing Daemon on dns2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird Brandon Black T279457 https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[14:41:24] <icinga-wm>	 ACKNOWLEDGEMENT - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 Brandon Black T279457 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:41:24] <icinga-wm>	 ACKNOWLEDGEMENT - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 Brandon Black T279457 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:41:30] <icinga-wm>	 RECOVERY - Sessionstore codfw on sessionstore.svc.codfw.wmnet is OK: All endpoints are healthy https://www.mediawiki.org/wiki/Kask
[14:42:37] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudgw: add newline at the end of file [puppet] - 10https://gerrit.wikimedia.org/r/682955
[14:43:59] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: k8s::deployment_server: add ip addresses to discovery data [puppet] - 10https://gerrit.wikimedia.org/r/682959
[14:45:12] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] k8s::deployment_server: add ip addresses to discovery data [puppet] - 10https://gerrit.wikimedia.org/r/682959 (owner: 10Giuseppe Lavagetto)
[14:46:16] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Have rdb2010 replicate from rdb2009 [puppet] - 10https://gerrit.wikimedia.org/r/682891 (https://phabricator.wikimedia.org/T281216) (owner: 10Legoktm)
[14:47:20] <bblack>	 !log lvs2009 - disable puppet + stop pybal (internal services will move to lvs2010, please avoid LVS service definition changes for now!) - T279457
[14:47:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:29] <stashbot>	 T279457: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457
[14:47:34] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: site.pp: make rdb2007, rdb2008 a redis cluster [puppet] - 10https://gerrit.wikimedia.org/r/614897 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli)
[14:47:58] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Services have been migrated successfully, merging" [puppet] - 10https://gerrit.wikimedia.org/r/614897 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli)
[14:48:08] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: k8s::deployment_server: add ip addresses to discovery data [puppet] - 10https://gerrit.wikimedia.org/r/682959
[14:48:15] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Have rdb2010 replicate from rdb2009 [puppet] - 10https://gerrit.wikimedia.org/r/682891 (https://phabricator.wikimedia.org/T281216) (owner: 10Legoktm)
[14:48:17] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "I have cherry picked this change on the integration puppet master, ran puppet on integration-agent-pkgbuilder-1002 and then ran the servic" [puppet] - 10https://gerrit.wikimedia.org/r/676133 (owner: 10Jbond)
[14:48:35] <moritzm>	 !log installing file/libmagic updates from buster point release
[14:48:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:54] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Have rdb2010 replicate from rdb2009 [puppet] - 10https://gerrit.wikimedia.org/r/682891 (https://phabricator.wikimedia.org/T281216) (owner: 10Legoktm)
[14:49:14] <wikibugs>	 10SRE, 10Dumps-Generation, 10observability: various weekly and daily dumps run from systemd timers are broken - https://phabricator.wikimedia.org/T281267 (10jbond)
[14:49:32] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:49:42] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[14:49:55] <wikibugs>	 (03PS3) 10Hashar: R:pbuilder_base: add extra packages to updates as well [puppet] - 10https://gerrit.wikimedia.org/r/676133 (https://phabricator.wikimedia.org/T240430) (owner: 10Jbond)
[14:51:01] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: cloudgw: add newline at the end of file [puppet] - 10https://gerrit.wikimedia.org/r/682955
[14:51:10] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: k8s::deployment_server: add ip addresses to discovery data [puppet] - 10https://gerrit.wikimedia.org/r/682959
[14:51:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10ArielGlenn) Ah ok! I didn't mean to be hasty, just saw the reimaging script runs and got excited :-)
[14:52:31] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29219/console" [puppet] - 10https://gerrit.wikimedia.org/r/682959 (owner: 10Giuseppe Lavagetto)
[14:52:39] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: add newline at the end of file [puppet] - 10https://gerrit.wikimedia.org/r/682955 (owner: 10Arturo Borrero Gonzalez)
[14:52:48] <wikibugs>	 (03PS1) 10Hashar: cloud - hieradata: add eatmydata to sid/bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682962 (https://phabricator.wikimedia.org/T240430)
[14:53:55] <wikibugs>	 (03PS1) 10Ahmon Dancy: rcfeed: Remove reference assignment [core] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/682818 (https://phabricator.wikimedia.org/T281226)
[14:54:10] <wikibugs>	 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10BBlack) Traffic stuff (lvs/cp/dns) is depooled, downtimed, and ready for the network fixups.
[14:54:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] k8s::deployment_server: add ip addresses to discovery data [puppet] - 10https://gerrit.wikimedia.org/r/682959 (owner: 10Giuseppe Lavagetto)
[14:56:11] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "I have cherry picked it on the integration puppet master and ran puppet on integration-agent-pkgbuilder-1002 . That has properly updated c" [puppet] - 10https://gerrit.wikimedia.org/r/682962 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar)
[14:56:52] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Doing), 10Technical-Debt: debian-glue jobs ignored error messages about libeatmydata.so in LD_PRELOAD - https://phabricator.wikimedia.org/T240430 (10hashar)
[14:57:50] <wikibugs>	 (03PS1) 10Andrew Bogott: validatelabsfqdn.py: update to python3 and run through black [puppet] - 10https://gerrit.wikimedia.org/r/682965
[14:58:21] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: nftables: basefirewall: reverse quotation [puppet] - 10https://gerrit.wikimedia.org/r/682966
[15:01:45] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] R:pbuilder_base: add extra packages to updates as well [puppet] - 10https://gerrit.wikimedia.org/r/676133 (https://phabricator.wikimedia.org/T240430) (owner: 10Jbond)
[15:01:52] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] nftables: basefirewall: reverse quotation [puppet] - 10https://gerrit.wikimedia.org/r/682966 (owner: 10Arturo Borrero Gonzalez)
[15:01:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cloud - hieradata: add eatmydata to sid/bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682962 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar)
[15:02:03] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+1] rcfeed: Remove reference assignment [core] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/682818 (https://phabricator.wikimedia.org/T281226) (owner: 10Ahmon Dancy)
[15:02:51] <jbond42>	 arturo: i have merged your change as well, seemed pretty harmless
[15:03:01] <arturo>	 thanks
[15:03:05] <arturo>	 jbond42: 👍
[15:05:41] <wikibugs>	 (03PS1) 10Hashar: Do not merge: dummy change to test CI [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/682967
[15:06:38] <icinga-wm>	 PROBLEM - configured eth on sretest1002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.139: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[15:06:40] <icinga-wm>	 PROBLEM - Check systemd state on sretest1002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.139: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:09:09] <wikibugs>	 10SRE, 10ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10herron)
[15:10:03] <wikibugs>	 10SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (10MoritzMuehlenhoff)
[15:10:25] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: nftables: basefirewall: nftables doesn't like strings with single quotes [puppet] - 10https://gerrit.wikimedia.org/r/682969
[15:10:59] <wikibugs>	 10SRE, 10ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10jcrespo) `ms-backup2002` and `ms-backup2001`  are not yet fully into production -they will be soon (T276442), so they can be shutdown at any time.  I got confused with backup* hosts, which can be shutdown, b...
[15:11:04] <wikibugs>	 10SRE, 10ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10jcrespo)
[15:11:08] <moritzm>	 ^ sretest1002 is expected, fixing
[15:11:26] <icinga-wm>	 RECOVERY - Check systemd state on sretest1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:12:12] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: kubernetes::deployment_server: only pass ipv4 addresses to egress rules [puppet] - 10https://gerrit.wikimedia.org/r/682971
[15:12:38] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/29221/" [puppet] - 10https://gerrit.wikimedia.org/r/682969 (owner: 10Arturo Borrero Gonzalez)
[15:12:56] <wikibugs>	 (03PS11) 10Volans: clustershell: allow to choose different reporters [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel)
[15:12:58] <wikibugs>	 (03PS4) 10Volans: CLI/clustershell: allow to disable progress bars [software/cumin] - 10https://gerrit.wikimedia.org/r/681692 (https://phabricator.wikimedia.org/T212783)
[15:13:00] <wikibugs>	 (03PS4) 10Volans: setup.py: support more recent PyParsing versions [software/cumin] - 10https://gerrit.wikimedia.org/r/681758
[15:13:02] <wikibugs>	 (03PS3) 10Volans: clustershell: instantiate progress bar earlier [software/cumin] - 10https://gerrit.wikimedia.org/r/682588
[15:18:56] <hashar>	 !log Upgraded all Jenkins to 2.277.3 (latest LTS) # T279033
[15:23:03] <XioNoX>	 !log cr1-codfw# set interfaces ae3 disable (to asw-c2-codfw) - T279457
[15:24:03] <XioNoX>	 papaul: ^
[15:25:19] <papaul>	 XioNoX: ok
[15:25:27] <papaul>	 just waiting on bblack 
[15:28:01] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker
[15:28:14] <jayme>	 elukey: ^^
[15:28:17] <XioNoX>	 !log asw-c-codfw> request system power-off member 2 - T279457
[15:30:04] <elukey>	 jayme: ack!
[15:30:26] <elukey>	 jayme: now you are an owner of Kafka and Mirror Maker, this task gets better and better for you :D
[15:30:51] <jayme>	 ouch 
[15:31:04] <icinga-wm>	 PROBLEM - Host elastic2045 is DOWN: PING CRITICAL - Packet loss = 100%
[15:31:38] <icinga-wm>	 PROBLEM - Host ms-be2035 is DOWN: PING CRITICAL - Packet loss = 100%
[15:31:52] <icinga-wm>	 PROBLEM - Host elastic2046 is DOWN: PING CRITICAL - Packet loss = 100%
[15:32:12] <icinga-wm>	 PROBLEM - Host elastic2047 is DOWN: PING CRITICAL - Packet loss = 100%
[15:32:40] <icinga-wm>	 PROBLEM - Host ms-be2034 is DOWN: PING CRITICAL - Packet loss = 100%
[15:32:40] <icinga-wm>	 PROBLEM - Host ms-be2042 is DOWN: PING CRITICAL - Packet loss = 100%
[15:32:46] <jinxer-wm>	 (Emergency syslog message) firing: Emergency syslog message - https://alerts.wikimedia.org
[15:32:52] <icinga-wm>	 PROBLEM - Host ms-be2048 is DOWN: PING CRITICAL - Packet loss = 100%
[15:32:52] <icinga-wm>	 PROBLEM - Host ms-be2055 is DOWN: PING CRITICAL - Packet loss = 100%
[15:32:56] <icinga-wm>	 PROBLEM - Host ms-fe2007 is DOWN: PING CRITICAL - Packet loss = 100%
[15:33:02] <godog>	 these are all expected I think
[15:33:20] <godog>	 not the emergency syslog message perhaps
[15:33:39] <jynus>	 is eqiad back into pool, BTW?
[15:33:47] <jynus>	 eqiad ms?
[15:34:02] <XioNoX>	 emergency syslog is from librenms, most likely the switch saying that one node went down
[15:34:06] <godog>	 yeah we repooled yesterday
[15:34:07] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0)
[15:34:27] <godog>	 XioNoX: ack, thanks
[15:35:22] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[15:37:12] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:13:05] <wikibugs>	 (03PS2) 10Jbond: P:tlsproxy::envoy: refactor ssl configuertion [puppet] - 10https://gerrit.wikimedia.org/r/682982
[16:14:09] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloudsw: loopback firewall filter [homer/public] - 10https://gerrit.wikimedia.org/r/682972 (owner: 10Ayounsi)
[16:18:25] <effie>	 !log uploading cap_3.17.1-1
[16:18:30] <effie>	 !log uploading scap_3.17.1-1
[16:18:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:39] <wikibugs>	 (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/29227/" [puppet] - 10https://gerrit.wikimedia.org/r/682982 (owner: 10Jbond)
[16:19:56] <icinga-wm>	 RECOVERY - configured eth on lvs2008 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[16:20:43] <effie>	 jouncebot is missing :/
[16:21:06] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 35): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29227/console" [puppet] - 10https://gerrit.wikimedia.org/r/682982 (owner: 10Jbond)
[16:21:27] <effie>	 ah no!
[16:21:30] <effie>	  jouncebot now
[16:21:48] <effie>	 jouncebot now
[16:21:48] <jouncebot>	 For the next 1 hour(s) and 38 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T1600)
[16:21:52] <effie>	 jouncebot next
[16:21:52] <jouncebot>	 In 0 hour(s) and 38 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T1700)
[16:22:24] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:22:50] <effie>	 !log upgrading scap 3.17.1-1 on mediawiki canaries - T279695
[16:23:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:02] <stashbot>	 T279695: Deploy Scap version 3.17.1-1 - https://phabricator.wikimedia.org/T279695
[16:23:12] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 39 hosts with reason: upgrading openstack
[16:23:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:26] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 39 hosts with reason: upgrading openstack
[16:23:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:32] <wikibugs>	 (03CR) 10Ayounsi: "recheck" [homer/public] - 10https://gerrit.wikimedia.org/r/682956 (owner: 10Ayounsi)
[16:25:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloudsw: loopback firewall filter [homer/public] - 10https://gerrit.wikimedia.org/r/682972 (owner: 10Ayounsi)
[16:25:20] <wikibugs>	 (03CR) 10Ayounsi: "recheck" [homer/public] - 10https://gerrit.wikimedia.org/r/682957 (owner: 10Ayounsi)
[16:25:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloudsw: policy-options [homer/public] - 10https://gerrit.wikimedia.org/r/682957 (owner: 10Ayounsi)
[16:26:20] <wikibugs>	 (03PS1) 10WMDE-Fisch: Separate reference preview settings in beta & non-beta [extensions/Popups] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/682819 (https://phabricator.wikimedia.org/T281235)
[16:27:41] <wikibugs>	 (03PS3) 10Ayounsi: cloudsw: policy-options [homer/public] - 10https://gerrit.wikimedia.org/r/682957
[16:27:44] <wikibugs>	 (03PS3) 10Ayounsi: cloudsw: loopback firewall filter [homer/public] - 10https://gerrit.wikimedia.org/r/682972
[16:28:16] <wikibugs>	 10SRE, 10Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (10Dzahn) >>! In T280989#7036967, @jcrespo wrote:  > If this is temporary, no problem, if it is long term, it should be added to the list of ignoring monitoring for backups  It's definitely temporary and a fres...
[16:28:59] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Separate reference preview settings in beta & non-beta [extensions/Popups] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/682819 (https://phabricator.wikimedia.org/T281235) (owner: 10WMDE-Fisch)
[16:29:09] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: nftables: basefirewall: introduce prometheus facility [puppet] - 10https://gerrit.wikimedia.org/r/682938 (https://phabricator.wikimedia.org/T281124)
[16:29:11] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudgw: enable prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/682939 (https://phabricator.wikimedia.org/T281124)
[16:29:42] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[16:29:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:52] <wikibugs>	 (03CR) 10Ayounsi: "recheck" [homer/public] - 10https://gerrit.wikimedia.org/r/682957 (owner: 10Ayounsi)
[16:30:49] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloudsw: policy-options [homer/public] - 10https://gerrit.wikimedia.org/r/682957 (owner: 10Ayounsi)
[16:30:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloudsw: loopback firewall filter [homer/public] - 10https://gerrit.wikimedia.org/r/682972 (owner: 10Ayounsi)
[16:32:34] <wikibugs>	 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10BBlack) Note to our future selves: we forgot to consider the cross-row LVS connections in this downtime: lvs2008 and lvs2010 do not live in row C at all, but had cross-row connections via C2 to...
[16:34:34] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[16:34:37] <wikibugs>	 (03PS1) 10David Caro: ceph: pull all the packages except dbg [puppet] - 10https://gerrit.wikimedia.org/r/682988
[16:34:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:51] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ceph: pull all the packages except dbg [puppet] - 10https://gerrit.wikimedia.org/r/682988 (owner: 10David Caro)
[16:36:37] <wikibugs>	 10SRE, 10Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (10jcrespo) No problem. Sadly, it is my job to bother people from time to time, making sure backups are working 0:-).
[16:38:06] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] ceph: pull all the packages except dbg [puppet] - 10https://gerrit.wikimedia.org/r/682988 (owner: 10David Caro)
[16:38:17] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] ceph: pull all the packages except dbg [puppet] - 10https://gerrit.wikimedia.org/r/682988 (owner: 10David Caro)
[16:39:36] <dcaro>	 !log reprepro updating packages on thirdparty/ceph-nautilus-buster
[16:39:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:04] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[16:41:24] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: nftables: basefirewall: introduce prometheus facility [puppet] - 10https://gerrit.wikimedia.org/r/682938 (https://phabricator.wikimedia.org/T281124)
[16:41:24] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[16:44:21] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/29229/" [puppet] - 10https://gerrit.wikimedia.org/r/682938 (https://phabricator.wikimedia.org/T281124) (owner: 10Arturo Borrero Gonzalez)
[16:49:02] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: cloudgw: enable prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/682939 (https://phabricator.wikimedia.org/T281124)
[16:49:52] <papaul>	 !log powerdown ms-be2042 for maintenance 
[16:50:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:20] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[16:52:23] <papaul>	 !log powerdown elastic2045  for maintenance 
[16:52:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:15] <wikibugs>	 (03PS1) 10MSantos: wikifeeds: bump to 2021-04-24-180651-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/682990
[16:55:16] <wikibugs>	 (03PS1) 10Herron: remove all references to icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/682992 (https://phabricator.wikimedia.org/T279601)
[16:55:35] <wikibugs>	 (03PS1) 10MSantos: proton: bump to 2021-04-19-114221-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/682993
[16:57:19] <wikibugs>	 (03CR) 10MSantos: [C: 03+2] wikifeeds: bump to 2021-04-24-180651-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/682990 (owner: 10MSantos)
[16:59:16] <wikibugs>	 (03Merged) 10jenkins-bot: wikifeeds: bump to 2021-04-24-180651-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/682990 (owner: 10MSantos)
[16:59:28] <icinga-wm>	 PROBLEM - Host ms-be2042.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:04] <jouncebot>	 chrisalbon and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T1700).
[17:00:07] <wikibugs>	 (03PS1) 10Urbanecm: Add vrt-wiki.wikimedia.org and vrt-wiki.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/682996 (https://phabricator.wikimedia.org/T280400)
[17:01:20] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[17:03:04] <icinga-wm>	 RECOVERY - Host ms-be2042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.45 ms
[17:03:06] <wikibugs>	 (03PS1) 10Herron: remove all references to icinga2001 [puppet] - 10https://gerrit.wikimedia.org/r/682999 (https://phabricator.wikimedia.org/T279602)
[17:03:46] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[17:03:56] <icinga-wm>	 RECOVERY - Host ms-be2042 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms
[17:04:14] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
[17:04:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:13] <wikibugs>	 (03CR) 10MSantos: [C: 03+2] proton: bump to 2021-04-19-114221-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/682993 (owner: 10MSantos)
[17:05:27] <wikibugs>	 (03PS1) 10Urbanecm: Add vrt-wiki.wikimedia.org to mediawiki.yaml [puppet] - 10https://gerrit.wikimedia.org/r/683000 (https://phabricator.wikimedia.org/T280400)
[17:06:52] <wikibugs>	 (03Merged) 10jenkins-bot: proton: bump to 2021-04-19-114221-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/682993 (owner: 10MSantos)
[17:07:39] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[17:07:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:33] <wikibugs>	 (03CR) 10Volans: [C: 03+2] clustershell: allow to choose different reporters [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel)
[17:09:12] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[17:09:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:27] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "No diff since last +1, just rebase with conflict resolution. self-merging." [software/cumin] - 10https://gerrit.wikimedia.org/r/681692 (https://phabricator.wikimedia.org/T212783) (owner: 10Volans)
[17:09:36] <wikibugs>	 (03CR) 10Volans: [C: 03+2] setup.py: support more recent PyParsing versions [software/cumin] - 10https://gerrit.wikimedia.org/r/681758 (owner: 10Volans)
[17:09:36] <icinga-wm>	 PROBLEM - Host elastic2045.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:10:28] <wikibugs>	 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmne...
[17:10:35] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "No changes since last +1, just rebase conflict resolution, self-merging." [software/cumin] - 10https://gerrit.wikimedia.org/r/682588 (owner: 10Volans)
[17:10:58] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
[17:11:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:16] <wikibugs>	 (03PS1) 10Andrew Bogott: cloud-vps: set cloudvirt nodes to OpenStack U [puppet] - 10https://gerrit.wikimedia.org/r/683002 (https://phabricator.wikimedia.org/T261137)
[17:11:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloud-vps: set cloudvirt nodes to OpenStack U [puppet] - 10https://gerrit.wikimedia.org/r/683002 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott)
[17:12:28] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: cloudgw: enable prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/682939 (https://phabricator.wikimedia.org/T281124)
[17:13:53] <wikibugs>	 (03PS2) 10Andrew Bogott: cloud-vps: set cloudvirt nodes to OpenStack U [puppet] - 10https://gerrit.wikimedia.org/r/683002 (https://phabricator.wikimedia.org/T261137)
[17:14:07] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' .
[17:14:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:30] <icinga-wm>	 RECOVERY - Host elastic2045 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms
[17:14:39] <papaul>	 !log powerdown kafka-logging2003  for maintenance 
[17:14:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:30] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: cloudgw: enable prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/682939 (https://phabricator.wikimedia.org/T281124)
[17:16:04] <icinga-wm>	 RECOVERY - Host elastic2045.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.13 ms
[17:16:22] <wikibugs>	 (03PS1) 10MSantos: mobileapps: bump to 2021-04-27-171008-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/683003
[17:16:33] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' .
[17:16:37] <wikibugs>	 (03Merged) 10jenkins-bot: clustershell: allow to choose different reporters [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel)
[17:16:39] <wikibugs>	 (03Merged) 10jenkins-bot: CLI/clustershell: allow to disable progress bars [software/cumin] - 10https://gerrit.wikimedia.org/r/681692 (https://phabricator.wikimedia.org/T212783) (owner: 10Volans)
[17:16:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:12] <wikibugs>	 (03Merged) 10jenkins-bot: setup.py: support more recent PyParsing versions [software/cumin] - 10https://gerrit.wikimedia.org/r/681758 (owner: 10Volans)
[17:17:14] <wikibugs>	 (03Merged) 10jenkins-bot: clustershell: instantiate progress bar earlier [software/cumin] - 10https://gerrit.wikimedia.org/r/682588 (owner: 10Volans)
[17:17:33] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/29234/" [puppet] - 10https://gerrit.wikimedia.org/r/682939 (https://phabricator.wikimedia.org/T281124) (owner: 10Arturo Borrero Gonzalez)
[17:17:44] <wikibugs>	 (03PS3) 10Andrew Bogott: cloud-vps: set cloudvirt nodes to OpenStack U [puppet] - 10https://gerrit.wikimedia.org/r/683002 (https://phabricator.wikimedia.org/T261137)
[17:18:38] <wikibugs>	 (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2021-04-27-171008-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/683003 (owner: 10MSantos)
[17:19:34] <ryankemper>	 !log T281215 Banned `elastic2043` from codfw cirrussearch cluster
[17:19:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:43] <stashbot>	 T281215: elastic2043 doesn't power up - https://phabricator.wikimedia.org/T281215
[17:20:20] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: bump to 2021-04-27-171008-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/683003 (owner: 10MSantos)
[17:20:25] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery: elastic2043 doesn't power up - https://phabricator.wikimedia.org/T281215 (10RKemper) `ryankemper@elastic2044:~$ curl -s localhost:9600/_cluster/health {"cluster_name":"production-search-psi-codfw","status":"green","timed_out":false,"number_of_nodes":17,"number_of_data_nodes":...
[17:20:44] <icinga-wm>	 PROBLEM - Host kafka-logging2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:21:14] <papaul>	 robh: https://netbox.wikimedia.org/ipam/prefixes/132/ip-addresses/
[17:21:24] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[17:21:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:29] <wikibugs>	 (03Abandoned) 10Hashar: Do not merge: dummy change to test CI [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/682967 (owner: 10Hashar)
[17:23:01] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[17:23:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:41] <wikibugs>	 (03PS4) 10Andrew Bogott: cloud-vps: set cloudvirt nodes to OpenStack U [puppet] - 10https://gerrit.wikimedia.org/r/683002 (https://phabricator.wikimedia.org/T261137)
[17:24:21] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Doing), 10Technical-Debt: debian-glue jobs ignored error messages about libeatmydata.so in LD_PRELOAD - https://phabricator.wikimedia.org/T240430 (10hashar) 05Open→03Resolved I have confirmed that eatmydat...
[17:24:53] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "xD" [puppet] - 10https://gerrit.wikimedia.org/r/683002 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott)
[17:25:13] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: set cloudvirt nodes to OpenStack U [puppet] - 10https://gerrit.wikimedia.org/r/683002 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott)
[17:25:22] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[17:25:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:25] <wikibugs>	 (03CR) 10Jcrespo: "Solution worked nicely, prepare now is much faster (probably due to parallelism)." [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682916 (https://phabricator.wikimedia.org/T281094) (owner: 10Jcrespo)
[17:29:29] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Release new v0.5 version [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682916 (https://phabricator.wikimedia.org/T281094) (owner: 10Jcrespo)
[17:29:51] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.dns.netbox
[17:29:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:17] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: nftables: indicate that service has restart [puppet] - 10https://gerrit.wikimedia.org/r/683011
[17:30:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] nftables: indicate that service has restart [puppet] - 10https://gerrit.wikimedia.org/r/683011 (owner: 10Arturo Borrero Gonzalez)
[17:31:46] <wikibugs>	 (03PS1) 10Herron: kafka-logging: migrate logstash2001 broker to kafka-logging2001 [puppet] - 10https://gerrit.wikimedia.org/r/683012 (https://phabricator.wikimedia.org/T279342)
[17:31:48] <wikibugs>	 (03PS1) 10Herron: kafka-logging: migrate logstash2002 broker to kafka-logging2002 [puppet] - 10https://gerrit.wikimedia.org/r/683013 (https://phabricator.wikimedia.org/T279342)
[17:31:50] <wikibugs>	 (03PS1) 10Herron: kafka-logging: migrate logstash2003 broker to kafka-logging2003 [puppet] - 10https://gerrit.wikimedia.org/r/683014 (https://phabricator.wikimedia.org/T279342)
[17:32:07] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: nftables: indicate that service has restart [puppet] - 10https://gerrit.wikimedia.org/r/683011
[17:32:20] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[17:32:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:44] <icinga-wm>	 RECOVERY - Host kafka-logging2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms
[17:32:57] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Add vrt-wiki.wikimedia.org and vrt-wiki.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/682996 (https://phabricator.wikimedia.org/T280400) (owner: 10Urbanecm)
[17:34:01] <Urbanecm>	 thanks mutante :)
[17:34:42] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/29237/" [puppet] - 10https://gerrit.wikimedia.org/r/683011 (owner: 10Arturo Borrero Gonzalez)
[17:34:55] <papaul>	 !log powerdown moss-fe2001  for maintenance 
[17:35:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:08] <icinga-wm>	 PROBLEM - swift codfw container availability low on alert1001 is CRITICAL: cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw
[17:42:32] <icinga-wm>	 PROBLEM - Host moss-fe2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:56] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:45:06] <icinga-wm>	 RECOVERY - Host moss-fe2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.63 ms
[17:45:32] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:47:02] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:52:04] <wikibugs>	 (03PS2) 10Jdlrobson: Enable language in header for office and testwiki users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682757 (https://phabricator.wikimedia.org/T280526)
[17:54:01] <wikibugs>	 (03PS3) 10Jdlrobson: Enable language in header for office and testwiki users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682757 (https://phabricator.wikimedia.org/T280526)
[17:55:18] <wikibugs>	 (03PS1) 10Aaron Schulz: Add "mcrouter-master-dc" to $wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392)
[17:55:20] <wikibugs>	 (03PS1) 10Aaron Schulz: Set $wgChronologyProtectorStash to "mcrouter-master-dc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683023
[17:55:33] <wikibugs>	 (03PS3) 10Aaron Schulz: Use $region for default mcrouter routes [puppet] - 10https://gerrit.wikimedia.org/r/654330
[17:55:46] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Implement machine-local forwarding DNS caches - https://phabricator.wikimedia.org/T171498 (10mmodell)
[17:55:56] <legoktm>	 jouncebot: refresh
[17:55:57] <jouncebot>	 I refreshed my knowledge about deployments.
[17:56:11] <wikibugs>	 (03CR) 10Aaron Schulz: "Blocked on https://gerrit.wikimedia.org/r/c/operations/puppet/+/654330" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683023 (owner: 10Aaron Schulz)
[17:56:45] <AaronSchulz>	 elukey: can you CR https://gerrit.wikimedia.org/r/c/operations/puppet/+/654330 ?
[17:57:48] <Jdlrobson>	 jouncebot: refresh
[17:57:49] <jouncebot>	 I refreshed my knowledge about deployments.
[17:58:21] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1004.eqiad.wmnet with reason: REIMAGE
[17:58:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:10] <wikibugs>	 (03PS2) 10Jdlrobson: Rename RelatedArticlesFooterWhitelistedSkins to RelatedArticlesFooterAllowedSkins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681598 (https://phabricator.wikimedia.org/T277958) (owner: 10Phuedx)
[18:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T1800).
[18:00:04] <jouncebot>	 Jdlrobson: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:26] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1004.eqiad.wmnet with reason: REIMAGE
[18:00:29] <Jdlrobson>	 o/ present
[18:00:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:06] <icinga-wm>	 RECOVERY - Host ms-fe2007 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms
[18:01:21] <papaul>	 robh: logmsgbot: 
[18:02:32] <wikibugs>	 (03PS1) 10BBlack: [noop] remove eqiad upload storage override [puppet] - 10https://gerrit.wikimedia.org/r/683025
[18:02:40] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[18:02:46] <wikibugs>	 (03PS1) 10BBlack: Puppetize cp501[3456] [puppet] - 10https://gerrit.wikimedia.org/r/683026
[18:02:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:12] <wikibugs>	 (03PS2) 10BBlack: Puppetize cp501[3456] [puppet] - 10https://gerrit.wikimedia.org/r/683026 (https://phabricator.wikimedia.org/T278182)
[18:03:45] <Jdlrobson>	 Is anybody able to run the backport window?
[18:03:49] <Jdlrobson>	 Urbanecm: are you around?
[18:03:58] <Urbanecm>	 Jdlrobson: yes
[18:04:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Puppetize cp501[3456] [puppet] - 10https://gerrit.wikimedia.org/r/683026 (https://phabricator.wikimedia.org/T278182) (owner: 10BBlack)
[18:04:05] <Urbanecm>	 let's get the wheel out :)
[18:04:09] <Urbanecm>	 I can deploy today
[18:04:15] <Jdlrobson>	 (also is the list of deployers accurate? I'm pretty sure Niharika doesn't do backports any more)
[18:04:38] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable language in header for office and testwiki users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682757 (https://phabricator.wikimedia.org/T280526) (owner: 10Jdlrobson)
[18:04:59] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:05:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:27] <Urbanecm>	 Jdlrobson: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/681598 has a -2 by Phuedx
[18:06:03] <mutante>	 from a technical point of view, who _can_ deploy it is accurate in the public repo
[18:06:08] <mutante>	 and she is still in it
[18:06:33] <wikibugs>	 (03Merged) 10jenkins-bot: Enable language in header for office and testwiki users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682757 (https://phabricator.wikimedia.org/T280526) (owner: 10Jdlrobson)
[18:06:35] <mutante>	 if people do not actually use their deployment access though, it is a good idea to ask for it to be removed
[18:06:48] <Jdlrobson>	 Urbanecm: I'll ping him
[18:06:52] <mutante>	 there is little "offboarding" when it comes to that
[18:07:44] <Urbanecm>	 Jdlrobson: thanks, I'm reluctant to override an explicit -2.
[18:07:50] <Jdlrobson>	 Urbanecm: yeh we can skip that one if necessary
[18:08:11] <Jdlrobson>	 I think it's valid because of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/RelatedArticles/+/680812
[18:08:15] <Jdlrobson>	 i'll move this backport to thursday
[18:08:28] <Urbanecm>	 okay
[18:08:35] <Jdlrobson>	 thanks for noticing that :)
[18:08:39] <Jdlrobson>	 i clearly need more coffee
[18:08:41] <Urbanecm>	 Jdlrobson: the first patch is pulled onto mwdebug1001, please test :)
[18:08:45] <Jdlrobson>	 on it
[18:09:51] <wikibugs>	 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-main1004.eqiad.wmnet'] `  a...
[18:10:24] <Jdlrobson>	 LGTM!
[18:10:29] <Urbanecm>	 syncing it
[18:10:34] <Jdlrobson>	 oh wait...
[18:10:37] <Jdlrobson>	 wait wait wait
[18:10:40] <Urbanecm>	 okay, waiting
[18:10:44] <wikibugs>	 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmne...
[18:10:54] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[18:10:55] <Jdlrobson>	 something unexpected
[18:11:04] <bblack>	 !log dns2001 - restarting bird to repool, then re-enabling puppet - T279457
[18:11:06] <Urbanecm>	 take your time
[18:11:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:12] <Jdlrobson>	 `'default'` doesn't seem to be applying correctly
[18:11:13] <stashbot>	 T279457: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457
[18:11:24] <Jdlrobson>	 Could you check the value of `wgVectorLanguageInHeader` on English Wikipedia?
[18:11:30] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[18:11:34] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:11:53] <Urbanecm>	 give me a sec
[18:12:00] <Jdlrobson>	 When I visit https://en.wikipedia.org/wiki/Peter_D%C3%B6ring on mwdebug1001, for some strange reason I'm seeing a language button in the top right, and that's not expected
[18:12:13] <Urbanecm>	 this is what i see https://www.irccloud.com/pastebin/4G4lfCWX/
[18:12:32] <Jdlrobson>	 https://en.wikipedia.org/wiki/Peter_Döring?useskinversion=2 sorry
[18:12:40] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on dns2001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[18:12:46] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.dns.netbox
[18:12:47] <Jdlrobson>	 hmmmm very odd
[18:12:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:53] <Urbanecm>	 and that seems to match the default from your patch
[18:12:59] <Jdlrobson>	 when you visit https://en.wikipedia.org/wiki/Peter_Döring?useskinversion=2 in debug1001 do you see a button in the top right?
[18:13:12] <Jdlrobson>	 https://usercontent.irccloud-cdn.com/file/YYFMhbrf/Screen%20Shot%202021-04-27%20at%2011.13.07%20AM.png
[18:13:41] <Urbanecm>	 this is what i see https://usercontent.irccloud-cdn.com/file/sTSB3v0c/image.png
[18:13:53] <Majavah>	 I see that in the top right
[18:14:18] <Urbanecm>	 when i disable mwdebug, I don't see the "languages" thing
[18:14:24] <Urbanecm>	 not sure if that's what your patch is supposed to touch
[18:14:30] <Jdlrobson>	 ohhhhh I think i see what's happening
[18:14:51] <Jdlrobson>	 I think the config value changed. It needs to be a boolean on group 1+2 wikis.
[18:14:59] <Urbanecm>	 this is $wgVectorLanguageInHeader at mwdebug1002 https://www.irccloud.com/pastebin/6JpQHMGQ/
[18:15:01] <Jdlrobson>	 ah rats
[18:15:10] <Jdlrobson>	 Can I remove the default line?
[18:15:11] <Urbanecm>	 ...and you just noticed it as well :)
[18:15:17] <Jdlrobson>	 Or will that create more problems?
[18:15:38] <Urbanecm>	 Jdlrobson: I _think_ it should work. Let me livehack it on mwdebug1002, one second.
[18:15:38] <Jdlrobson>	 not entirely sure if the configuration is smart enough to not have a default but have officewiki and testwiki overrides
[18:16:25] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:16:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:14] <Urbanecm>	 Jdlrobson: I applied this change on mwdebug1002, can you test if it works as you would expect? https://www.irccloud.com/pastebin/RIcqpOpo/
[18:17:21] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.dns.netbox
[18:17:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:38] <logmsgbot>	 !log robh@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[18:17:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:46] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 1.047 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[18:18:05] <Jdlrobson>	 perfect Urbanecm 
[18:18:14] <wikibugs>	 (03PS1) 10Jdlrobson: Drop default value for wgVectorLanguageInHeader [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683036 (https://phabricator.wikimedia.org/T280526)
[18:18:24] <Urbanecm>	 sorry, does that mean "it works"?
[18:18:25] <Jdlrobson>	 ^ so here's the patch to do that
[18:18:33] <Jdlrobson>	 yep it works great on debug1002 and as expected
[18:19:16] <Urbanecm>	 cool
[18:19:44] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.dns.netbox
[18:19:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:55] <Urbanecm>	 Jdlrobson: before I merge it: what will happen if the train is rolled back? Will it cause more errors?
[18:20:32] <bblack>	 !log cp203[56] - repooling in etcd - T279457
[18:20:38] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp203[56].codfw.wmnet
[18:20:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:41] <stashbot>	 T279457: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457
[18:20:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:11] <Jdlrobson>	 Urbanecm: when the train rolls the default will change
[18:21:40] <Jdlrobson>	 officewiki and testwiki will be unaffected as train has already rolled for them
[18:22:02] <Jdlrobson>	 if we roll back the train, presumably office and test wiki will throw errors and we'd need to revert the change we already merged 
[18:22:54] <Urbanecm>	 hmm. I'm not sure if it is wise to deploy a change that makes a train rollback generate more errors.
[18:23:04] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:23:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:15] <Jdlrobson>	 Urbanecm: if you prefer we can use the boolean value for all of them
[18:23:26] <Jdlrobson>	 I'm just verifying but we should have backwards compatibility
[18:23:46] <Urbanecm>	 if it will work, I'd prefer that, as it will guarantee clean rollbacks.
[18:24:18] <Jdlrobson>	 yeh let's do that
[18:24:19] <Jdlrobson>	 1s
[18:24:23] <Urbanecm>	 sure
[18:26:15] <wikibugs>	 (03PS2) 10Jdlrobson: Use boolean values for wgVectorLanguageInHeader [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683036 (https://phabricator.wikimedia.org/T280526)
[18:26:22] <Jdlrobson>	 ^ that should do it
[18:26:36] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:26:56] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Use boolean values for wgVectorLanguageInHeader [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683036 (https://phabricator.wikimedia.org/T280526) (owner: 10Jdlrobson)
[18:26:58] <wikibugs>	 (03CR) 10Jdlrobson: [C: 04-1] "Blocked until 1.37.0-wmf.4 train." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682758 (https://phabricator.wikimedia.org/T280526) (owner: 10Jdlrobson)
[18:26:59] <Urbanecm>	 looks good, merging
[18:27:41] <wikibugs>	 (03Merged) 10jenkins-bot: Use boolean values for wgVectorLanguageInHeader [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683036 (https://phabricator.wikimedia.org/T280526) (owner: 10Jdlrobson)
[18:28:47] <Urbanecm>	 Jdlrobson: pulled onto mwdebug1001, can you test, please?
[18:28:52] <Jdlrobson>	 Urbanecm: on it
[18:29:52] <Jdlrobson>	 please sync
[18:29:57] <Urbanecm>	 syncing
[18:30:00] <icinga-wm>	 RECOVERY - Host ms-be2035 is UP: PING OK - Packet loss = 0%, RTA = 33.03 ms
[18:30:49] <Jdlrobson>	 and sorry this didn't go as smoothly as thought. I really appreciate your scrutiny and advice on this one. 
[18:31:33] <Urbanecm>	 no problem, this is the reason why we do testing before deploying a patch :)
[18:32:19] <bblack>	 !log lvs2009 - restart pybal + re-run puppet agent - T279457
[18:32:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:28] <stashbot>	 T279457: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457
[18:32:29] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 91a85f2: ac770bf: Enable language in header for office and testwiki users (T280526) (duration: 01m 19s)
[18:32:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:38] <stashbot>	 T280526: Deploy new language switching functionality to logged-in users  - https://phabricator.wikimedia.org/T280526
[18:32:39] <Urbanecm>	 Jdlrobson: should be live. Anything else (besides the -2'ed patch)?
[18:32:52] <Jdlrobson>	 hurray!
[18:32:57] <Jdlrobson>	 nope that's great. Thanks for all your help here!
[18:33:05] <Urbanecm>	 Any time :)
[18:33:06] <icinga-wm>	 PROBLEM - Host ms-fe2007 is DOWN: PING CRITICAL - Packet loss = 100%
[18:33:10] <Urbanecm>	 !log Morning B&C window done
[18:33:14] <mutante>	 !log people1003 - rebooting, trying to get new VM to work
[18:33:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:36] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 66, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:33:56] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 93, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:34:22] <icinga-wm>	 RECOVERY - Host ms-fe2007 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms
[18:35:32] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1005.eqiad.wmnet with reason: REIMAGE
[18:35:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:26] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:37:37] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1005.eqiad.wmnet with reason: REIMAGE
[18:37:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:55] <wikibugs>	 SRE, ops-codfw, netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (BBlack) Traffic lvs/cp/dns are all repooled, un-downtimed, and green.  Waiting until the other C2 hosts are fully reconfigured (network ports) before re-pooling codfw at the public traffic level.
[18:39:56] <icinga-wm>	 RECOVERY - Host ms-be2034 is UP: PING OK - Packet loss = 0%, RTA = 33.08 ms
[18:40:52] <icinga-wm>	 RECOVERY - Host ms-be2048 is UP: PING WARNING - Packet loss = 33%, RTA = 33.06 ms
[18:41:12] <icinga-wm>	 RECOVERY - Host elastic2046 is UP: PING OK - Packet loss = 0%, RTA = 33.14 ms
[18:45:58] <icinga-wm>	 RECOVERY - Host elastic2047 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms
[18:46:30] <icinga-wm>	 RECOVERY - Host ms-be2055 is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms
[18:46:40] <icinga-wm>	 PROBLEM - SSH on phab2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:46:57] <wikibugs>	 SRE, Analytics-Radar, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-main1005.eqiad.wmnet'] `  a...
[18:47:52] <icinga-wm>	 RECOVERY - swift codfw container availability low on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw
[18:48:12] <icinga-wm>	 PROBLEM - Host elastic2047 is DOWN: PING CRITICAL - Packet loss = 100%
[18:48:16] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[18:50:19] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts people1003.eqiad.wmnet
[18:50:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:48] <mutante>	 !log people1003 - destroying VM and recreating again from scratch to test if issue of no console and no access is repeatable
[18:50:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:19] <wikibugs>	 SRE, ops-eqiad, DC-Ops, decommission-hardware: decommission db1077.eqiad.wmnet - https://phabricator.wikimedia.org/T281075 (wiki_willy) a:wiki_willy→Cmjohnson
[18:58:52] <wikibugs>	 (CR) Andrew Bogott: [C: +2] validatelabsfqdn.py: update to python3 and run through black [puppet] - https://gerrit.wikimedia.org/r/682965 (owner: Andrew Bogott)
[19:00:04] <jouncebot>	 liw and longma: That opportune time is upon us again. Time for a MediaWiki train - European+American Version (secondary timeslot) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T1900).
[19:00:07] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts people1003.eqiad.wmnet
[19:00:13] <wikibugs>	 SRE, Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `people1003.eqiad.wmnet` - people1003.eqiad.wmnet (**PASS**)   - Downtimed host on Icinga   - Found Ganeti VM   - V...
[19:00:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:13] <longma>	 I will be deploying a backport during the train window
[19:03:31] <icinga-wm>	 RECOVERY - Host elastic2047 is UP: PING OK - Packet loss = 0%, RTA = 33.06 ms
[19:03:36] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host people1003.eqiad.wmnet
[19:03:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:11] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[19:07:08] <papaul>	 !log powerdown logstash2035  for maintenance 
[19:07:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:04] <wikibugs>	 SRE, Analytics-Radar, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (ops-monitoring-bot) Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmne...
[19:14:11] <wikibugs>	 (PS1) BBlack: Revert "Depool codfw traffic" [dns] - https://gerrit.wikimedia.org/r/683041 (https://phabricator.wikimedia.org/T279457)
[19:17:47] <jinxer-wm>	 (Primary inbound port utilisation over 80%  #page) firing: Primary inbound port utilisation over 80%  #page - https://alerts.wikimedia.org
[19:18:11] <legoktm>	 Is that a real page?
[19:19:05] <icinga-wm>	 PROBLEM - Host logstash2035.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:20:30] <wikibugs>	 (PS1) Herron: kafka-main: deploy kafka::main role to kafka-main[12]00[45] [puppet] - https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005)
[19:21:11] <cdanis>	 looking
[19:22:14] <cdanis>	 XioNoX: https://librenms.wikimedia.org/graphs/to=1619551200/id=8766/type=port_bits/from=1619529600/ ??
[19:22:47] <jinxer-wm>	 (Primary inbound port utilisation over 80%  #page) resolved: Primary inbound port utilisation over 80%  #page - https://alerts.wikimedia.org
[19:23:45] <cdanis>	 monitoring for the other side of that connection doesn't show the massive spike: https://librenms.wikimedia.org/graphs/to=1619551200/id=8328/type=port_bits/from=1619529600/
[19:24:02] <bblack>	 if I had to guess, I'd say an organic traffic spike on the newly-replaced C2 switch, from some cluster or other resyncing something after all the C2 hosts rejoined?
[19:24:15] <cdanis>	 et/ is juniper's prefix for a 40G interface, so that number on the switch side is physically possible...
[19:24:34] <cdanis>	 bblack: I thought that work was long done, though?
[19:24:50] <bblack>	 the last elastic host just came online for the last time ~20 minutes ago
[19:25:11] <bblack>	 still, it's hard to imagine one host joining a cluster driving more than the 10G of its own interface
[19:26:31] <cdanis>	 it is a mystery
[19:27:25] <wikibugs>	 (CR) Jeena Huneidi: [C: +2] rcfeed: Remove reference assignment [core] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/682818 (https://phabricator.wikimedia.org/T281226) (owner: Ahmon Dancy)
[19:29:10] <bblack>	 elastic2047 port:
[19:29:11] <bblack>	 https://librenms.wikimedia.org/graphs/to=1619551500/id=21523/type=port_bits/from=1619529900/
[19:29:29] <bblack>	 ~5.4Gbps and rising?
[19:29:40] <icinga-wm>	 RECOVERY - Host logstash2035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms
[19:30:26] <bblack>	 assuming those are bytes, now I don't remember
[19:30:39] <bblack>	 no it's bits but with a capital B
[19:30:50] <bblack>	 ms-fe had some bigger spikes:
[19:30:52] <bblack>	 https://librenms.wikimedia.org/graphs/to=1619551800/id=21527/type=port_bits/from=1619530200/
[19:32:27] <bblack>	 maybe some spike in cp2* -> ms-fe? I'm really at a loss, but it seems to have been transient in any case
[19:33:39] <wikibugs>	 (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler1003/29238/" [puppet] - https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: Herron)
[19:35:03] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2004.codfw.wmnet with reason: REIMAGE
[19:35:07] <bblack>	 will wait a bit longer before re-pooling codfw public traffic JIC
[19:35:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:17] <papaul>	 !log powerdown ms-backup2001  for maintenance 
[19:35:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:28] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[19:37:14] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2004.codfw.wmnet with reason: REIMAGE
[19:37:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:01] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[19:40:05] <icinga-wm>	 PROBLEM - Host ms-backup2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:40:47] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[19:42:30] <wikibugs>	 (PS1) Herron: eventgate-logging-external: add new codfw kafka-logging hosts to network policy [deployment-charts] - https://gerrit.wikimedia.org/r/683047 (https://phabricator.wikimedia.org/T279342)
[19:44:38] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host people1003.eqiad.wmnet
[19:44:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:45:35] <icinga-wm>	 RECOVERY - Host ms-backup2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms
[19:47:09] <icinga-wm>	 RECOVERY - SSH on phab2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:47:36] <wikibugs>	 SRE, Analytics-Radar, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-main2004.codfw.wmnet'] `  a...
[19:47:50] <wikibugs>	 (PS1) Dzahn: DHCP: update MAC address of people1003 [puppet] - https://gerrit.wikimedia.org/r/683049
[19:48:06] <wikibugs>	 (CR) jerkins-bot: [V: -1] DHCP: update MAC address of people1003 [puppet] - https://gerrit.wikimedia.org/r/683049 (owner: Dzahn)
[19:48:13] <wikibugs>	 SRE, Analytics-Radar, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (ops-monitoring-bot) Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmne...
[19:48:39] <papaul>	 leaving DC
[19:48:54] <wikibugs>	 (PS2) Dzahn: DHCP: update MAC address of people1003 [puppet] - https://gerrit.wikimedia.org/r/683049
[19:50:42] <wikibugs>	 (PS1) Herron: add kafka-logging200[123] to kafka term [homer/public] - https://gerrit.wikimedia.org/r/683050 (https://phabricator.wikimedia.org/T279342)
[19:55:53] <wikibugs>	 (Merged) jenkins-bot: rcfeed: Remove reference assignment [core] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/682818 (https://phabricator.wikimedia.org/T281226) (owner: Ahmon Dancy)
[19:56:34] <XioNoX>	 cdanis, bblack, I'd put that down as a monitoring glitch
[19:56:35] <wikibugs>	 (CR) Dzahn: [C: +2] DHCP: update MAC address of people1003 [puppet] - https://gerrit.wikimedia.org/r/683049 (owner: Dzahn)
[19:56:56] <cdanis>	 yeah, agreed, librenms has done it before
[20:06:07] <wikibugs>	 (PS1) Ottomata: test/data_purge - add drop_event job [puppet] - https://gerrit.wikimedia.org/r/683053 (https://phabricator.wikimedia.org/T273789)
[20:06:10] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2005.codfw.wmnet with reason: REIMAGE
[20:06:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:21] <wikibugs>	 (CR) jerkins-bot: [V: -1] test/data_purge - add drop_event job [puppet] - https://gerrit.wikimedia.org/r/683053 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[20:08:20] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2005.codfw.wmnet with reason: REIMAGE
[20:08:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:36] <logmsgbot>	 !log jhuneidi@deploy1002 Synchronized php-1.37.0-wmf.3/includes/rcfeed/IRCColourfulRCFeedFormatter.php: Backport rcfeed: Remove reference assignment (T281226) to 1.37.0-wmf.3 (duration: 01m 12s)
[20:11:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:45] <stashbot>	 T281226: PHP Notice: Only variables should be assigned by reference - https://phabricator.wikimedia.org/T281226
[20:17:30] <wikibugs>	 SRE, Analytics-Radar, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-main2005.codfw.wmnet'] `  a...
[20:24:12] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:27:18] <wikibugs>	 SRE, Wikimedia-Planet: Find a replacement for RSS aggregator for planet.wikimedia.org - https://phabricator.wikimedia.org/T281219 (Legoktm)
[20:31:54] <wikibugs>	 (CR) BBlack: [C: +2] Revert "Depool codfw traffic" [dns] - https://gerrit.wikimedia.org/r/683041 (https://phabricator.wikimedia.org/T279457) (owner: BBlack)
[20:32:46] <bblack>	 !log re-pooling codfw public traffic - T279457
[20:32:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:55] <stashbot>	 T279457: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457
[20:39:21] <wikibugs>	 (PS1) Bartosz Dziewoński: realm.pp: Add discussiontools_subscription to private tables [puppet] - https://gerrit.wikimedia.org/r/683070 (https://phabricator.wikimedia.org/T263817)
[20:40:32] <wikibugs>	 (PS1) Legoktm: site.pp: Decomission rdb200[3456].codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/683074 (https://phabricator.wikimedia.org/T273140)
[20:40:32] <icinga-wm>	 PROBLEM - Disk space on mwlog1001 is CRITICAL: DISK CRITICAL - free space: /srv 274385 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops
[20:42:56] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.decommission for hosts rdb[2003-2004].codfw.wmnet
[20:43:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:30] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:54:40] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts rdb[2003-2004].codfw.wmnet
[20:54:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:16] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.decommission for hosts rdb[2005-2006].codfw.wmnet
[20:55:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:54] <wikibugs>	 (CR) Ottomata: "Great, if these are active, they will also need to be added to the metadata.broker.list in values-codfw.wmnet." [deployment-charts] - https://gerrit.wikimedia.org/r/683047 (https://phabricator.wikimedia.org/T279342) (owner: Herron)
[20:57:52] <wikibugs>	 (PS2) Ottomata: test/data_purge - add drop_event job [puppet] - https://gerrit.wikimedia.org/r/683053 (https://phabricator.wikimedia.org/T273789)
[21:03:24] <wikibugs>	 (PS3) Ottomata: test/data_purge - add drop_event job [puppet] - https://gerrit.wikimedia.org/r/683053 (https://phabricator.wikimedia.org/T273789)
[21:04:59] <wikibugs>	 (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29240/console" [puppet] - https://gerrit.wikimedia.org/r/683053 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[21:06:17] <wikibugs>	 (CR) Ottomata: [V: +1] "Elukey it looks like you created this test/data_purge.pp class..but it was never applied!  Ok to apply it?" [puppet] - https://gerrit.wikimedia.org/r/683053 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[21:07:02] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts rdb[2005-2006].codfw.wmnet
[21:07:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:25] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[21:19:09] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[21:21:32] <wikibugs>	 (PS1) Tchanders: Enable partial action blocks on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/683088 (https://phabricator.wikimedia.org/T280528)
[21:21:34] <wikibugs>	 (PS1) Tchanders: Enable partial action blocks on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/683089 (https://phabricator.wikimedia.org/T280528)
[21:26:40] <wikibugs>	 SRE, ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (Papaul)
[21:32:46] <wikibugs>	 SRE, Traffic, decommission-hardware: decommission cescout1001.eqiad.wmnet - https://phabricator.wikimedia.org/T275696 (ssingh) p:Medium→High
[21:40:50] <wikibugs>	 (CR) Legoktm: [C: +2] site.pp: Decomission rdb200[3456].codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/683074 (https://phabricator.wikimedia.org/T273140) (owner: Legoktm)
[21:48:43] <wikibugs>	 (PS1) Andrew Bogott: Trove: set low default quotas per project. [puppet] - https://gerrit.wikimedia.org/r/683092 (https://phabricator.wikimedia.org/T212595)
[21:52:03] <wikibugs>	 (PS2) Andrew Bogott: Trove: set low default quotas per project but big potential DB size [puppet] - https://gerrit.wikimedia.org/r/683092 (https://phabricator.wikimedia.org/T212595)
[21:59:28] <wikibugs>	 ops-codfw, DC-Ops, decommission-hardware, serviceops: decommission rdb200[3456].codfw.wmnet - https://phabricator.wikimedia.org/T273140 (Legoktm) This is ready for #DC-ops now.
[22:08:16] <wikibugs>	 ops-codfw, DC-Ops, decommission-hardware, serviceops: decommission rdb200[3456].codfw.wmnet - https://phabricator.wikimedia.org/T273140 (Papaul) p:Medium→Low a:Papaul
[22:13:22] <icinga-wm>	 PROBLEM - HTTPS-peopleweb on people1003 is CRITICAL: connect to address 10.64.0.8 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/People.wikimedia.org
[22:16:00] <wikibugs>	 (CR) Dzahn: "Ran into this when trying bullseye on a host with envoy. Profile::Tlsproxy::Envoy/Sslcert::Certificate will fail because it uses this and " [puppet] - https://gerrit.wikimedia.org/r/670978 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[22:22:23] <wikibugs>	 SRE, ops-codfw, netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (Papaul) switch replace, onsite work complete and Netbox updated. Will be shipping the faulty switch tomorrow.
[22:22:41] <wikibugs>	 SRE, ops-codfw, netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (Papaul) p:High→Low
[22:22:48] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on people1003 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[22:25:32] <icinga-wm>	 PROBLEM - Check that envoy is running on people1003 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[22:25:59] <wikibugs>	 (PS2) Legoktm: site.pp: Setup rdb1011, rdb1012 [puppet] - https://gerrit.wikimedia.org/r/682892 (https://phabricator.wikimedia.org/T281217)
[22:26:01] <wikibugs>	 (PS2) Legoktm: Have rdb1012 replicate from rdb1011 [puppet] - https://gerrit.wikimedia.org/r/682893 (https://phabricator.wikimedia.org/T281217)
[22:27:56] <icinga-wm>	 PROBLEM - people.wikimedia.org requires authentication on people1003 is CRITICAL: connect to address 10.64.0.8 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[22:28:18] <mutante>	 ACK
[22:28:51] <wikibugs>	 (CR) Legoktm: [C: +2] site.pp: Setup rdb1011, rdb1012 [puppet] - https://gerrit.wikimedia.org/r/682892 (https://phabricator.wikimedia.org/T281217) (owner: Legoktm)
[22:29:15] <icinga-wm>	 ACKNOWLEDGEMENT - Check no envoy runtime configuration is left persistent on people1003 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused daniel_zahn bullseye - needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/670978 https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[22:29:15] <icinga-wm>	 ACKNOWLEDGEMENT - Check that envoy is running on people1003 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive daniel_zahn bullseye - needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/670978 https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[22:29:15] <icinga-wm>	 ACKNOWLEDGEMENT - HTTPS-peopleweb on people1003 is CRITICAL: connect to address 10.64.0.8 and port 443: Connection refused daniel_zahn bullseye - needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/670978 https://wikitech.wikimedia.org/wiki/People.wikimedia.org
[22:29:15] <icinga-wm>	 ACKNOWLEDGEMENT - people.wikimedia.org requires authentication on people1003 is CRITICAL: connect to address 10.64.0.8 and port 443: Connection refused daniel_zahn bullseye - needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/670978 https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[22:38:47] <wikibugs>	 SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (Papaul)
[22:42:56] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (people1003), Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[22:50:03] <wikibugs>	 (PS2) Legoktm: mariadb: Allow lists1001.wikimedia.org to talk to m5 [puppet] - https://gerrit.wikimedia.org/r/681753 (https://phabricator.wikimedia.org/T278614)
[23:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210427T2300)
[23:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[23:00:41] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[23:02:05] <wikibugs>	 (PS1) Legoktm: [WIP] Initial commit [software/pipermail-redirector] - https://gerrit.wikimedia.org/r/683108 (https://phabricator.wikimedia.org/T280731)
[23:04:01] <wikibugs>	 SRE, ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (Papaul)
[23:06:26] <wikibugs>	 (PS2) Legoktm: [WIP] Initial commit [software/pipermail-redirector] - https://gerrit.wikimedia.org/r/683108 (https://phabricator.wikimedia.org/T280731)
[23:10:01] <wikibugs>	 (CR) Dzahn: "I manually made the same changes this makes to x509-bundle on people1003 and then manually ran the command that puppet would run:" [puppet] - https://gerrit.wikimedia.org/r/670978 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[23:12:52] <wikibugs>	 (CR) Dzahn: "> TypeError: a bytes-like object is required, not 'str'" [puppet] - https://gerrit.wikimedia.org/r/670978 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[23:17:18] <wikibugs>	 (CR) Dzahn: x509-bundle.py: Port to Python 3 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/670978 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[23:18:19] <wikibugs>	 (CR) Dzahn: "It works when opening the file with "w" instead of "wb". in:  with open(args.output, "wb") as outfile:" [puppet] - https://gerrit.wikimedia.org/r/670978 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[23:20:16] <mutante>	 legoktm: I applied that change manually on people1003, then also manually ran the command puppet would run. found one more issue ^. But also the fix, i think
[23:20:48] <wikibugs>	 (PS2) Jdlrobson: Enable new language button for all logged in users outside test projects [mediawiki-config] - https://gerrit.wikimedia.org/r/682758 (https://phabricator.wikimedia.org/T280526)
[23:20:55] <wikibugs>	 (CR) Jdlrobson: [C: -1] "Probably blocked until Tues 4th." [mediawiki-config] - https://gerrit.wikimedia.org/r/682758 (https://phabricator.wikimedia.org/T280526) (owner: Jdlrobson)
[23:24:18] <mutante>	 the "unless" part of the puppet exec is too smart to be easily fooled into making puppet happy, though
[23:24:38] <mutante>	 it tests not only if chained cert exists but also which files are older than others
[23:33:07] <wikibugs>	 (CR) STran: Enable partial action blocks on testwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/683089 (https://phabricator.wikimedia.org/T280528) (owner: Tchanders)
[23:38:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['snapshot1011.eqiad.wmnet', 'snapshot1012.eqiad....
[23:47:18] <wikibugs>	 (PS1) Dzahn: Revert "site: add peopleweb role to people1003" [puppet] - https://gerrit.wikimedia.org/r/683126
[23:47:37] <wikibugs>	 (CR) Tchanders: "> Enable or disable?" [mediawiki-config] - https://gerrit.wikimedia.org/r/683089 (https://phabricator.wikimedia.org/T280528) (owner: Tchanders)
[23:51:32] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1011.eqiad.wmnet with reason: REIMAGE
[23:51:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:52:07] <wikibugs>	 (PS2) Dzahn: Revert "site: add peopleweb role to people1003" [puppet] - https://gerrit.wikimedia.org/r/683126
[23:52:26] <wikibugs>	 (CR) Dzahn: [C: +2] Revert "site: add peopleweb role to people1003" [puppet] - https://gerrit.wikimedia.org/r/683126 (owner: Dzahn)
[23:52:36] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1012.eqiad.wmnet with reason: REIMAGE
[23:52:39] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[23:52:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:52:49] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[23:53:35] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1011.eqiad.wmnet with reason: REIMAGE
[23:53:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:54:37] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1013.eqiad.wmnet with reason: REIMAGE
[23:54:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:41] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1012.eqiad.wmnet with reason: REIMAGE
[23:55:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:33] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1014.eqiad.wmnet with reason: REIMAGE
[23:57:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:50] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1013.eqiad.wmnet with reason: REIMAGE
[23:57:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:58:37] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1015.eqiad.wmnet with reason: REIMAGE
[23:58:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log