[00:00:28] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2015.codfw.wmnet [00:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:04] RECOVERY - MD RAID on wtp2015 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [00:02:12] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:02:38] RECOVERY - Check the NTP synchronisation status of timesyncd on wtp2015 is OK: OK: synced at Fri 2020-07-31 00:02:35 UTC. https://wikitech.wikimedia.org/wiki/NTP [00:02:42] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 200 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [00:06:17] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.1/extensions/Echo/modules/mobile/notificationsFilterOverlay.js: T258954 (duration: 01m 10s) [00:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:24] T258954: Special:Notifications filter overlay never closes in Minerva - https://phabricator.wikimedia.org/T258954 [00:06:54] RECOVERY - Check the last execution of php7.2-fpm_check_restart on wtp2015 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:07:24] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.2/extensions/Echo/modules/mobile/notificationsFilterOverlay.js: T258954 (duration: 01m 06s) [00:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - 
https://phabricator.wikimedia.org/T251627 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for... [00:13:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:15:48] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:19:32] RECOVERY - mediawiki-installation DSH group on wtp2015 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [00:40:48] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 138 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:44:24] (03PS1) 10Tim Starling: Revert "Re-enable LilyPond/Score in safe mode (2nd attempt)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617566 [00:44:45] (03PS2) 10Tim Starling: Revert "Re-enable LilyPond/Score in safe mode (2nd attempt)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617566 [00:44:59] (03CR) 10Tim Starling: [C: 03+2] Revert "Re-enable LilyPond/Score in safe mode (2nd attempt)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617566 (owner: 10Tim Starling) [00:46:00] (03Merged) 10jenkins-bot: Revert "Re-enable LilyPond/Score in safe mode (2nd attempt)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617566 (owner: 10Tim Starling) [00:46:36] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes 
of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:47:50] !log tstarling@deploy1001 Synchronized wmf-config/CommonSettings.php: disable lilypond execution again (duration: 01m 10s) [00:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:51] Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (tstarling) [00:49:43] Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (tstarling) Resolved→Open It's disabled again, since I found a new vulnerability. [01:20:38] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1031.eqiad.wmnet'] ` Of which...
[02:34:46] (PS1) Andrew Bogott: cloudvirt103[1-9]: use a simple one-drive raid config [puppet] - https://gerrit.wikimedia.org/r/617590 (https://phabricator.wikimedia.org/T251627) [02:34:48] (Abandoned) Andrew Bogott: Add records for cloudvirt103[1-9] [dns] - https://gerrit.wikimedia.org/r/617558 (https://phabricator.wikimedia.org/T251627) (owner: Andrew Bogott) [02:35:28] (PS2) Andrew Bogott: cloudvirt103[1-9]: use a simple one-volume raid config [puppet] - https://gerrit.wikimedia.org/r/617590 (https://phabricator.wikimedia.org/T251627) [02:36:19] (CR) Andrew Bogott: [C: +2] cloudvirt103[1-9]: use a simple one-volume raid config [puppet] - https://gerrit.wikimedia.org/r/617590 (https://phabricator.wikimedia.org/T251627) (owner: Andrew Bogott) [02:38:53] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for... [02:54:19] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1031.eqiad.wmnet'] ` Of which... [02:55:50] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [02:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:56:21] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for...
[02:57:53] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [02:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:10:52] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1032.eqiad.wmnet'] ` Of which... [03:12:20] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [03:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:14:24] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [03:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:15:34] (PS1) Andrew Bogott: cloudvirt103[1-9]: puppetize as thinvirts [puppet] - https://gerrit.wikimedia.org/r/617593 (https://phabricator.wikimedia.org/T251627) [03:17:00] (CR) jerkins-bot: [V: -1] cloudvirt103[1-9]: puppetize as thinvirts [puppet] - https://gerrit.wikimedia.org/r/617593 (https://phabricator.wikimedia.org/T251627) (owner: Andrew Bogott) [03:22:12] (PS2) Andrew Bogott: cloudvirt103[1-9]: puppetize as thinvirts [puppet] - https://gerrit.wikimedia.org/r/617593 (https://phabricator.wikimedia.org/T251627) [03:23:24] (CR) jerkins-bot: [V: -1] cloudvirt103[1-9]: puppetize as thinvirts [puppet] - https://gerrit.wikimedia.org/r/617593 (https://phabricator.wikimedia.org/T251627) (owner: Andrew Bogott) [03:24:16] (PS3) Andrew Bogott: cloudvirt103[1-9]: puppetize as thinvirts [puppet] - https://gerrit.wikimedia.org/r/617593 (https://phabricator.wikimedia.org/T251627) [03:26:03] (CR) Andrew Bogott: [C: +2] cloudvirt103[1-9]: puppetize as thinvirts [puppet] - https://gerrit.wikimedia.org/r/617593 (https://phabricator.wikimedia.org/T251627) (owner: Andrew Bogott) [03:31:03] (PS1) Andrew Bogott:
cloudvirt103[1-9] -> debian stretch [puppet] - https://gerrit.wikimedia.org/r/617595 (https://phabricator.wikimedia.org/T251627) [03:31:51] (CR) Andrew Bogott: [C: +2] cloudvirt103[1-9] -> debian stretch [puppet] - https://gerrit.wikimedia.org/r/617595 (https://phabricator.wikimedia.org/T251627) (owner: Andrew Bogott) [03:34:41] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for... [03:34:51] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1031.eqiad.wmnet'] ` Of which... [03:38:13] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for... [03:51:49] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1031.eqiad.wmnet'] ` Of which...
[03:53:13] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [03:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:55:14] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [03:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:56] (PS1) Andrew Bogott: nova-compute: Remove a reference to a (now-not-always-present) mountpoint [puppet] - https://gerrit.wikimedia.org/r/617596 (https://phabricator.wikimedia.org/T251627) [04:01:44] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: analytics1039.eqiad.wmnet, cloudvirt1032.eqiad.wmnet, wdqs1009.eqiad.wmnet, testreduce1001.eqiad.wmnet, cloudvirt1031.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [04:02:37] (CR) Andrew Bogott: [C: +2] nova-compute: Remove a reference to a (now-not-always-present) mountpoint [puppet] - https://gerrit.wikimedia.org/r/617596 (https://phabricator.wikimedia.org/T251627) (owner: Andrew Bogott) [04:05:03] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for...
[04:20:01] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [04:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:22:07] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [04:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:35:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1031.eqiad.wmnet'] ` Of which... [04:39:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for... [04:54:28] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [04:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:34] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [04:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:09:41] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:09:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1032.eqiad.wmnet'] ` Of which... [05:45:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/617529 (https://phabricator.wikimedia.org/T257016) (owner: 10Herron) [05:49:17] (03PS2) 10Muehlenhoff: Enable managed adduser/sysusers config also for WMCS [puppet] - 10https://gerrit.wikimedia.org/r/602286 (https://phabricator.wikimedia.org/T235162) [05:51:45] (03PS1) 10Privacybatm: Sphinx: Resolve unexpected intend error [software/transferpy] - 10https://gerrit.wikimedia.org/r/617600 (https://phabricator.wikimedia.org/T257601) [05:54:29] (03CR) 10Privacybatm: "This resoves the sphinx doc build error." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/617600 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [05:57:10] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [05:59:35] !log installing qemu updates on stretch [05:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:31] (03CR) 10Elukey: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/617479 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [06:02:48] (03PS5) 10Privacybatm: [POC4 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/616282 (https://phabricator.wikimedia.org/T259327) [06:02:50] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 200 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [06:02:55] (03CR) 10jerkins-bot: [V: 04-1] [POC4 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/616282 (https://phabricator.wikimedia.org/T259327) (owner: 10Privacybatm) [06:03:59] (03PS8) 10Privacybatm: [POC3 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/615179 (https://phabricator.wikimedia.org/T259327) [06:04:07] (03CR) 10jerkins-bot: [V: 04-1] [POC3 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/615179 (https://phabricator.wikimedia.org/T259327) (owner: 10Privacybatm) [06:04:39] (03Restored) 10Privacybatm: [POC2 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/614744 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [06:04:50] (03PS2) 10Privacybatm: [POC2 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/614744 
(https://phabricator.wikimedia.org/T259327) [06:04:58] (03CR) 10jerkins-bot: [V: 04-1] [POC2 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/614744 (https://phabricator.wikimedia.org/T259327) (owner: 10Privacybatm) [06:05:58] (03Abandoned) 10Privacybatm: [POC2 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/614744 (https://phabricator.wikimedia.org/T259327) (owner: 10Privacybatm) [06:06:18] (03Restored) 10Privacybatm: [POC1 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/614745 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [06:06:31] (03PS3) 10Privacybatm: [POC1 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/614745 (https://phabricator.wikimedia.org/T259327) [06:06:39] (03CR) 10jerkins-bot: [V: 04-1] [POC1 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/614745 (https://phabricator.wikimedia.org/T259327) (owner: 10Privacybatm) [06:06:49] (03Abandoned) 10Privacybatm: [POC1 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/614745 (https://phabricator.wikimedia.org/T259327) (owner: 10Privacybatm) [06:22:08] PROBLEM - nova-compute proc maximum on cloudvirt1031 is CRITICAL: connect to address 10.64.20.73 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [06:22:08] PROBLEM - ensure kvm processes are running on cloudvirt1031 is CRITICAL: connect to address 10.64.20.73 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [06:22:09] PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: connect to address 10.64.20.73 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting 
[06:24:50] (03PS1) 10Ayounsi: Configure transport links OSPF based on Netbox data [homer/public] - 10https://gerrit.wikimedia.org/r/617603 (https://phabricator.wikimedia.org/T200277) [06:26:51] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [06:26:52] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [06:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:35] (03PS1) 10Elukey: druid: add cache monitoring for 0.19 clusters [puppet] - 10https://gerrit.wikimedia.org/r/617604 (https://phabricator.wikimedia.org/T244482) [06:30:06] (03CR) 10Elukey: [C: 03+2] druid: add cache monitoring for 0.19 clusters [puppet] - 10https://gerrit.wikimedia.org/r/617604 (https://phabricator.wikimedia.org/T244482) (owner: 10Elukey) [06:30:28] (03CR) 10Ayounsi: "I'm not 100% satisfied with the Jinja code, so please let me know if you have suggestions on how to improve it." 
[homer/public] - 10https://gerrit.wikimedia.org/r/617603 (https://phabricator.wikimedia.org/T200277) (owner: 10Ayounsi) [06:32:46] !log roll restart of druid brokers on druid100[4-8] to pick up new changes [06:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:28] PROBLEM - Check the NTP synchronisation status of timesyncd on cloudvirt1031 is CRITICAL: connect to address 10.64.20.73 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:50:19] (03CR) 10Elukey: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/617479 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [06:55:01] (03CR) 10Elukey: [C: 03+2] profile::mariadb::misc::analytics::multiinstance: change ports [puppet] - 10https://gerrit.wikimedia.org/r/617479 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [06:57:24] PROBLEM - ensure kvm processes are running on cloudvirt1032 is CRITICAL: connect to address 10.64.20.74 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [06:57:34] PROBLEM - nova-compute proc maximum on cloudvirt1032 is CRITICAL: connect to address 10.64.20.74 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [06:57:41] PROBLEM - nova-compute proc minimum on cloudvirt1032 is CRITICAL: connect to address 10.64.20.74 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200731T0700) [07:04:34] PROBLEM - Check the NTP synchronisation status of timesyncd on cloudvirt1032 is CRITICAL: connect to address 10.64.20.74 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [07:07:12] !log stop mysql replication on db1108; update port config for mysql instances and restart them; restart replication on instances [07:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:51] (03PS1) 10Elukey: analytics-in[46]: add new ports for term mysql-replica [homer/public] - 10https://gerrit.wikimedia.org/r/617649 (https://phabricator.wikimedia.org/T234826) [07:22:57] (03CR) 10Jcrespo: [C: 03+2] "Thank you for the quick fix!" [software/transferpy] - 10https://gerrit.wikimedia.org/r/617600 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [07:23:25] (03Merged) 10jenkins-bot: Sphinx: Resolve unexpected intend error [software/transferpy] - 10https://gerrit.wikimedia.org/r/617600 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [07:39:30] (03Restored) 10Jcrespo: mariadb: Create ugly exception for port assignment for db1108 [puppet] - 10https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [07:39:41] (03PS2) 10Jcrespo: mariadb: Match port 3351 and 3352 to 2 analytics sections [puppet] - 10https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826) [07:39:43] (03PS1) 10Jcrespo: mariadb-backups: Move db1108 (analytics db) backups' ports [puppet] - 10https://gerrit.wikimedia.org/r/617650 (https://phabricator.wikimedia.org/T234826) [07:40:12] (03CR) 10Jcrespo: "New option." 
[puppet] - 10https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [07:41:48] (03CR) 10Elukey: [C: 03+1] mariadb: Match port 3351 and 3352 to 2 analytics sections [puppet] - 10https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [07:42:18] (03CR) 10Jcrespo: [C: 03+2] mariadb: Match port 3351 and 3352 to 2 analytics sections [puppet] - 10https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [07:42:20] (03CR) 10Elukey: [C: 03+1] mariadb-backups: Move db1108 (analytics db) backups' ports [puppet] - 10https://gerrit.wikimedia.org/r/617650 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [07:42:33] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Move db1108 (analytics db) backups' ports [puppet] - 10https://gerrit.wikimedia.org/r/617650 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [07:43:54] RECOVERY - MariaDB read only matomo on db1108 is OK: Version 10.4.13-MariaDB-log, Uptime 2715s, read_only: True, event_scheduler: True, 27.33 QPS, connection latency: 0.002670s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [07:48:06] elukey: can we run backups and I show you how to recover? [07:48:20] also you check backups look good [07:49:36] jynus: sure! [07:49:59] if the recovery is what we have on wikitech for bacula I have already used it for other stuff (like archiva etc..) 
[07:50:20] !log uploaded lilypond 2.19.81+really-2.18.2-13~bpo9+1+wmf1 to stretch-wikimedia T256877 [07:50:20] so I have an idea about the recovery process (but never done for mariadb) [07:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:25] T256877: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 [07:50:50] elukey: indeed https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Recovering_a_logical_backup [07:50:55] but it is not bacula [07:51:03] !log updating lilypond on mw* servers [07:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:07] as bacula is used for storage, but not for automatic database loading [07:51:24] ahh nice [07:51:25] let me pm [07:55:52] Operations, Graphoid, Code-Stewardship-Reviews, Release-Engineering-Team (Code Health), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (akosiaris) >>! In T211881#6350152, @kaldari wrote: > I'd like to propose that we close this ticket, since we've dec... [07:57:01] Operations, Release Pipeline, Release-Engineering-Team-TODO, Patch-For-Review, and 2 others: Create Graphoid .pipeline files - https://phabricator.wikimedia.org/T203092 (akosiaris) Stalled→Declined With graphoid being undeployed in T242855, this makes no sense anymore. Declining.
[07:57:08] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Pipeline), 10Services (watching): Move Graphoid to Kubernetes via the deployment pipeline - https://phabricator.wikimedia.org/T203091 (10akosiaris) [08:02:57] (03PS1) 10Muehlenhoff: Extend snapshot Cumin alias to also include the testbed role [puppet] - 10https://gerrit.wikimedia.org/r/617651 [08:10:50] RECOVERY - dump of analytics_meta in eqiad on icinga1001 is OK: Last dump for analytics_meta at eqiad (db1108.eqiad.wmnet:3352) taken on 2020-07-31 07:54:57 (1 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [08:14:51] (03PS1) 10JMeybohm: Switch helmfiles to use chartmuseum repository [deployment-charts] - 10https://gerrit.wikimedia.org/r/617652 (https://phabricator.wikimedia.org/T253843) [08:21:58] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:22:41] (03CR) 10Alexandros Kosiaris: [C: 04-1] Switch helmfiles to use chartmuseum repository (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/617652 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [08:25:42] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:35:21] (03PS1) 10Jcrespo: mariadb: Increase analytics binlog retention time to 14 days [puppet] - 10https://gerrit.wikimedia.org/r/617653 (https://phabricator.wikimedia.org/T234826) [08:38:06] (03PS2) 10Jcrespo: mariadb: Increase analytics binlog retention time to 14 days [puppet] - 10https://gerrit.wikimedia.org/r/617653 (https://phabricator.wikimedia.org/T234826) [08:39:51] (03CR) 10Elukey: [C: 03+1] mariadb: Increase analytics binlog retention time to 14 days [puppet] - 10https://gerrit.wikimedia.org/r/617653 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [08:42:12] (03CR) 10Jcrespo: [C: 03+2] mariadb: Increase analytics binlog retention time to 14 days [puppet] - 10https://gerrit.wikimedia.org/r/617653 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [08:44:27] (03CR) 10Alexandros Kosiaris: [C: 03+1] api-gateway: add helmfile.d configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [08:51:54] (03PS1) 10Jcrespo: mariadb: Increase misc db binlog retention to 14 days [puppet] - 10https://gerrit.wikimedia.org/r/617656 (https://phabricator.wikimedia.org/T234826) [08:51:56] (03PS2) 10JMeybohm: Remove the repository definition from helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/617652 (https://phabricator.wikimedia.org/T253843) [08:52:10] ACKNOWLEDGEMENT - OTRS SMTP on otrs1001 is CRITICAL: connect to address 10.64.16.39 and port 25: Connection refused alexandros kosiaris ignore, migration ongoing. 
https://wikitech.wikimedia.org/wiki/OTRS%23Troubleshooting [08:52:17] (03CR) 10Elukey: [C: 03+1] mariadb: Increase misc db binlog retention to 14 days [puppet] - 10https://gerrit.wikimedia.org/r/617656 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [08:53:40] (03CR) 10JMeybohm: Remove the repository definition from helmfiles (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/617652 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [08:53:53] (03CR) 10Jcrespo: "At least until binlog backup is centralized on dbprov hosts." [puppet] - 10https://gerrit.wikimedia.org/r/617656 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [08:54:26] (03CR) 10Jcrespo: [C: 03+2] mariadb: Increase misc db binlog retention to 14 days [puppet] - 10https://gerrit.wikimedia.org/r/617656 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [08:56:21] (03CR) 10Kormat: [C: 03+1] mariadb: Increase misc db binlog retention to 14 days [puppet] - 10https://gerrit.wikimedia.org/r/617656 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [08:59:22] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash7: increase SSD tier JVM heap to 32G [puppet] - 10https://gerrit.wikimedia.org/r/617526 (https://phabricator.wikimedia.org/T259219) (owner: 10Herron) [08:59:41] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: ensure only one webrequest host sends 5xx to logstash [puppet] - 10https://gerrit.wikimedia.org/r/617388 (https://phabricator.wikimedia.org/T247968) (owner: 10Filippo Giunchedi) [08:59:46] (03PS3) 10Filippo Giunchedi: profile: ensure only one webrequest host sends 5xx to logstash [puppet] - 10https://gerrit.wikimedia.org/r/617388 (https://phabricator.wikimedia.org/T247968) [08:59:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:07:04] RECOVERY - OTRS SMTP on otrs1001 is OK: SMTP OK - 0.007 sec. response time https://wikitech.wikimedia.org/wiki/OTRS%23Troubleshooting [09:07:24] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:07:55] 10Operations, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10fgiunchedi) [09:12:45] (03PS1) 10JMeybohm: chartmuseum: Change repository name to stable [puppet] - 10https://gerrit.wikimedia.org/r/617659 (https://phabricator.wikimedia.org/T253843) [09:14:39] (03CR) 10Alexandros Kosiaris: [C: 03+1] chartmuseum: Change repository name to stable [puppet] - 10https://gerrit.wikimedia.org/r/617659 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [09:16:21] (03CR) 10Filippo Giunchedi: [C: 03+1] "PCC https://puppet-compiler.wmflabs.org/compiler1001/24244/" [puppet] - 10https://gerrit.wikimedia.org/r/617083 (owner: 10Filippo Giunchedi) [09:16:23] (03CR) 10Filippo Giunchedi: [C: 03+2] rsync: listen for stunnel connections on v4/v6 [puppet] - 10https://gerrit.wikimedia.org/r/617083 (owner: 10Filippo Giunchedi) [09:16:49] 10Operations: gerrit.wm.o/r/changes/ has leading garbage in the output - https://phabricator.wikimedia.org/T259333 (10Kormat) [09:18:03] 10Operations, 10Gerrit: gerrit.wm.o/r/changes/ has leading garbage in the output - https://phabricator.wikimedia.org/T259333 (10Majavah) [09:18:27] (03CR) 10JMeybohm: [C: 03+2] chartmuseum: Change repository name to stable [puppet] - 10https://gerrit.wikimedia.org/r/617659 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [09:23:19] (03CR) 10Filippo Giunchedi: [C: 
03+1] "LGTM! Haven't tried building the package tho" [debs/prometheus-es-exporter] (debian/sid) - 10https://gerrit.wikimedia.org/r/617250 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [09:28:58] (03CR) 10Filippo Giunchedi: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [09:31:38] (03PS1) 10Jcrespo: mariadb: Add port analytics assignment to wmfmariadbpy and backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/617661 (https://phabricator.wikimedia.org/T234826) [09:33:08] (03PS2) 10Jcrespo: mariadb: Add port analytics assignment to wmfmariadbpy and backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/617661 (https://phabricator.wikimedia.org/T234826) [09:33:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:35:18] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:36:26] that's probably me ^ [09:36:31] (03CR) 10Kormat: [C: 04-1] mariadb: Add port analytics assignment to wmfmariadbpy and backups (033 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/617661 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [09:37:59] (03CR) 10Kormat: [C: 03+1] "My review crossed your updated. LGTM :)" (033 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/617661 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [09:41:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. 
The package name is too generic from my POV ("es" could be other things besides elasticsearch), but there's also an argument f" (031 comment) [debs/prometheus-es-exporter] (debian/sid) - 10https://gerrit.wikimedia.org/r/617250 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [09:41:33] (03CR) 10Jcrespo: [C: 03+2] mariadb: Add port analytics assignment to wmfmariadbpy and backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/617661 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [09:42:02] (03PS1) 10Jcrespo: mariadb-backups: Update backup automation to wmfmariadbpy's HEAD [puppet] - 10https://gerrit.wikimedia.org/r/617662 (https://phabricator.wikimedia.org/T234826) [09:43:20] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Update backup automation to wmfmariadbpy's HEAD [puppet] - 10https://gerrit.wikimedia.org/r/617662 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [09:44:38] (03PS2) 10Jcrespo: mariadb-backups: Update backup automation to wmfmariadbpy's HEAD [puppet] - 10https://gerrit.wikimedia.org/r/617662 (https://phabricator.wikimedia.org/T234826) [09:47:11] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Update backup automation to wmfmariadbpy's HEAD [puppet] - 10https://gerrit.wikimedia.org/r/617662 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [09:52:23] 10Operations: Rebase patches for VP9 support to ffmpeg 3.2.15 - https://phabricator.wikimedia.org/T259336 (10MoritzMuehlenhoff) [09:56:34] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:56:59] (03PS1) 10Jcrespo: mariadb-backups: Add _ to the list of characters alowed for section names [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/617668 (https://phabricator.wikimedia.org/T234826) [09:58:14] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Add _ to the list of characters alowed for section names [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/617668 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [09:58:29] (03PS1) 10Muehlenhoff: Fix typo in sources for older distros on package build host (and remove jessie) [puppet] - 10https://gerrit.wikimedia.org/r/617669 [09:58:36] (03PS15) 10Kormat: Create debian packages. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) [09:58:44] (03Merged) 10jenkins-bot: mariadb-backups: Add _ to the list of characters alowed for section names [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/617668 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [10:02:14] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:03:40] (03PS1) 10Jcrespo: mariadb-backups: Update backup_mariadb.py to HEAD [puppet] - 10https://gerrit.wikimedia.org/r/617670 (https://phabricator.wikimedia.org/T234826) [10:04:01] (03PS2) 10Jcrespo: mariadb-backups: Update backup_mariadb.py to HEAD [puppet] - 10https://gerrit.wikimedia.org/r/617670 (https://phabricator.wikimedia.org/T234826) [10:04:35] 10Operations, 10Gerrit, 10User-Kormat: gerrit.wm.o/r/changes/ has leading garbage in the output - 
https://phabricator.wikimedia.org/T259333 (10Kormat) [10:05:40] (03CR) 10Kormat: "I've done some light testing, and this seems to work. Lintian isn't overly thrilled with me, however:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) (owner: 10Kormat) [10:05:47] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Update backup_mariadb.py to HEAD [puppet] - 10https://gerrit.wikimedia.org/r/617670 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [10:07:33] (03CR) 10Jcrespo: "> Patch Set 15:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) (owner: 10Kormat) [10:07:56] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:07:59] (03PS1) 10JMeybohm: helm: Allow multiple helm repositories [puppet] - 10https://gerrit.wikimedia.org/r/617673 (https://phabricator.wikimedia.org/T253843) [10:10:13] (03CR) 10JMeybohm: "PCC https://puppet-compiler.wmflabs.org/compiler1001/24245/deploy1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/617673 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [10:13:10] (03PS3) 10JMeybohm: Remove the repository definition from helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/617652 (https://phabricator.wikimedia.org/T253843) [10:13:45] (03CR) 10Kormat: "> Nothing there seems too surprising, but do you know where "out-of-date-standards-version 4.1.2" comes from?" 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) (owner: 10Kormat) [10:17:48] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/617669 (owner: 10Muehlenhoff) [10:19:14] (03CR) 10Muehlenhoff: [C: 03+2] Fix typo in sources for older distros on package build host (and remove jessie) [puppet] - 10https://gerrit.wikimedia.org/r/617669 (owner: 10Muehlenhoff) [10:23:21] (03PS4) 10JMeybohm: Remove the repository definition from helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/617652 (https://phabricator.wikimedia.org/T253843) [10:32:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:40:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:50:25] (03PS1) 10Jbond: profile::gerrit::migrations: correct hiera name space [puppet] - 10https://gerrit.wikimedia.org/r/617676 (https://phabricator.wikimedia.org/T247956) [10:51:31] (03PS2) 10Jbond: profile::gerrit::migrations: correct hiera name space [puppet] - 10https://gerrit.wikimedia.org/r/617676 (https://phabricator.wikimedia.org/T247956) [10:53:54] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/24247/" [puppet] - 10https://gerrit.wikimedia.org/r/617676 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [10:58:01] (03CR) 10Muehlenhoff: [C: 03+1] debianization (031 comment) [debs/prometheus-es-exporter] (debian/sid) - 10https://gerrit.wikimedia.org/r/617250 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [10:59:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, two minor nits inline" (032 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) (owner: 10Kormat) [11:03:37] 10Operations, 10Acme-chief, 10Traffic: do not generate metadata for parts that aren't allowed - https://phabricator.wikimedia.org/T259338 (10Vgutierrez) p:05Triage→03Medium [11:04:19] (03PS1) 10Vgutierrez: api: Exclude not valid parts from get_directory_metadata output [software/acme-chief] - 10https://gerrit.wikimedia.org/r/617680 (https://phabricator.wikimedia.org/T259338) [11:11:13] 10Operations: Rebase patches for VP9 support to ffmpeg 3.2.15 - https://phabricator.wikimedia.org/T259336 (10MoritzMuehlenhoff) For the record, the steps to validate that VP9 multi-threading support works as expected in the new ffmpeg build: * Download https://upload.wikimedia.org/wikipedia/commons/6/69/Wall_of... 
[11:16:36] !log imported ffmpeg 3.2.15-0+deb9u1+wmf1 to component/vp9 for stretch-wikimedia T259336 [11:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:43] T259336: Rebase patches for VP9 support to ffmpeg 3.2.15 - https://phabricator.wikimedia.org/T259336 [11:19:54] (03PS1) 10Jbond: profile::gerrit::server: correct hiera name space [puppet] - 10https://gerrit.wikimedia.org/r/617683 (https://phabricator.wikimedia.org/T247956) [11:19:56] !log installing ffmpeg security updates for jessie (standard version from security.debian.org, not the VP9-enabled component) [11:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:19] !log restart dbstore1004 [11:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:50] (03PS2) 10Jbond: profile::gerrit::server: correct hiera name space [puppet] - 10https://gerrit.wikimedia.org/r/617683 (https://phabricator.wikimedia.org/T247956) [11:28:15] (03PS1) 10Jcrespo: mariadb: Reduce buffer cache memory footprint to prevent OOMs [puppet] - 10https://gerrit.wikimedia.org/r/617684 [11:29:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:29:48] (03CR) 10Jcrespo: [C: 03+2] mariadb: Reduce buffer cache memory footprint to prevent OOMs [puppet] - 10https://gerrit.wikimedia.org/r/617684 (owner: 10Jcrespo) [11:32:21] (03PS1) 10Ema: atskafka: librdkafka settings tuning [puppet] - 10https://gerrit.wikimedia.org/r/617685 (https://phabricator.wikimedia.org/T254317) [11:35:03] (03PS3) 10Jbond: profile::gerrit::server: correct hiera name space [puppet] - 10https://gerrit.wikimedia.org/r/617683 (https://phabricator.wikimedia.org/T247956) [11:35:59] (03CR) 10Jbond: "PCC: 
https://puppet-compiler.wmflabs.org/compiler1003/24250/" [puppet] - 10https://gerrit.wikimedia.org/r/617683 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [11:36:45] (03PS1) 10Andrew Bogott: cloudvirt103[1-9]: move to insetup until I can figure out what's happening [puppet] - 10https://gerrit.wikimedia.org/r/617686 (https://phabricator.wikimedia.org/T251627) [11:37:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:37:20] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt103[1-9]: move to insetup until I can figure out what's happening [puppet] - 10https://gerrit.wikimedia.org/r/617686 (https://phabricator.wikimedia.org/T251627) (owner: 10Andrew Bogott) [11:39:29] (03PS2) 10Ema: atskafka: librdkafka settings tuning [puppet] - 10https://gerrit.wikimedia.org/r/617685 (https://phabricator.wikimedia.org/T254317) [11:42:18] (03CR) 10Ema: [C: 03+1] "Other than for the comment I've left inline, this looks great." 
(031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) (owner: 10Kormat) [11:42:41] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/617685 (https://phabricator.wikimedia.org/T254317) (owner: 10Ema) [11:51:59] (03CR) 10Elukey: [C: 03+1] atskafka: librdkafka settings tuning [puppet] - 10https://gerrit.wikimedia.org/r/617685 (https://phabricator.wikimedia.org/T254317) (owner: 10Ema) [11:54:06] (03PS1) 10Filippo Giunchedi: alertmanager: add IRC notifier [puppet] - 10https://gerrit.wikimedia.org/r/617688 (https://phabricator.wikimedia.org/T258948) [11:54:08] (03PS1) 10Filippo Giunchedi: role: add alertmanager::irc to alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/617689 (https://phabricator.wikimedia.org/T258948) [11:54:34] (03PS1) 10Jbond: profile::gerrit::server: correct hiera name space [puppet] - 10https://gerrit.wikimedia.org/r/617690 (https://phabricator.wikimedia.org/T247956) [11:55:53] !log installing mercurial security updates [11:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:06] (03CR) 10jerkins-bot: [V: 04-1] profile::gerrit::server: correct hiera name space [puppet] - 10https://gerrit.wikimedia.org/r/617690 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [11:56:22] (03PS2) 10Jbond: profile::gerrit::server: correct hiera name space [puppet] - 10https://gerrit.wikimedia.org/r/617690 (https://phabricator.wikimedia.org/T247956) [11:57:00] (03CR) 10Filippo Giunchedi: "Things still TODO/pending:" [puppet] - 10https://gerrit.wikimedia.org/r/617688 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [11:57:17] 10Operations: Rebase patches for VP9 support to ffmpeg 3.2.15 - https://phabricator.wikimedia.org/T259336 (10MoritzMuehlenhoff) 05Open→03Resolved This is completed [12:02:52] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:33] (03PS3) 10Jbond: profile::gerrit::server: correct hiera name space [puppet] - 10https://gerrit.wikimedia.org/r/617690 (https://phabricator.wikimedia.org/T247956) [12:07:06] (03PS4) 10Jbond: profile::gerrit::server: correct hiera name space [puppet] - 10https://gerrit.wikimedia.org/r/617690 (https://phabricator.wikimedia.org/T247956) [12:08:48] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:09:15] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/24259/" [puppet] - 10https://gerrit.wikimedia.org/r/617690 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [12:10:36] (03PS16) 10Kormat: Create debian packages. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) [12:11:04] (03CR) 10Kormat: Create debian packages. (033 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) (owner: 10Kormat) [12:11:29] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/617673 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [12:12:43] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) (owner: 10Kormat) [12:14:26] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 200 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [12:15:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor nitpick but otherwise LGTM" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/617652 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [12:16:11] (03PS1) 10Jbond: profile::gerrit::server: rename profile [puppet] - 10https://gerrit.wikimedia.org/r/617691 [12:18:01] (03CR) 10JMeybohm: [C: 03+2] helm: Allow multiple helm repositories [puppet] - 10https://gerrit.wikimedia.org/r/617673 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [12:18:47] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/24261/" [puppet] - 10https://gerrit.wikimedia.org/r/617691 (owner: 10Jbond) [12:19:55] (03CR) 10Ayounsi: [C: 03+1] analytics-in[46]: add new ports for term mysql-replica [homer/public] - 10https://gerrit.wikimedia.org/r/617649 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [12:23:10] (03PS5) 10JMeybohm: Remove the repository definition from helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/617652 (https://phabricator.wikimedia.org/T253843) [12:23:12] (03PS1) 10JMeybohm: Remove the repository definition from helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/617693 (https://phabricator.wikimedia.org/T253843) [12:23:14] (03PS1) 10JMeybohm: changeprop: Update repository URL in requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/617694 (https://phabricator.wikimedia.org/T253843) [12:23:17] (03PS1) 
10JMeybohm: eventgate: Update repository URL in requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/617695 (https://phabricator.wikimedia.org/T253843) [12:25:26] (03CR) 10JMeybohm: Remove the repository definition from helmfiles (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/617652 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [12:25:59] (03CR) 10JMeybohm: [C: 04-2] "Needs testing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/617693 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [12:27:52] (03CR) 10JMeybohm: [C: 03+2] Remove the repository definition from helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/617652 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [12:28:44] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20897144 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:29:06] (03Merged) 10jenkins-bot: Remove the repository definition from helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/617652 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [12:29:33] (03Abandoned) 10Filippo Giunchedi: WIP prometheus::alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/354976 (owner: 10Filippo Giunchedi) [12:29:54] (03Abandoned) 10Filippo Giunchedi: role: use alertmanager in beta prometheus [puppet] - 10https://gerrit.wikimedia.org/r/354460 (owner: 10Filippo Giunchedi) [12:29:58] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:30:38] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 45456 
and 90 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:31:39] (03PS2) 10Ayounsi: Netbox: add circuits support [software/homer] - 10https://gerrit.wikimedia.org/r/617418 [12:31:51] (03Abandoned) 10Filippo Giunchedi: profile: install SMART checks after 'raid' fact is available. [puppet] - 10https://gerrit.wikimedia.org/r/428947 (https://phabricator.wikimedia.org/T132324) (owner: 10Filippo Giunchedi) [12:32:58] (03CR) 10Kormat: [C: 03+2] Create debian packages. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) (owner: 10Kormat) [12:33:29] (03Merged) 10jenkins-bot: Create debian packages. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) (owner: 10Kormat) [12:33:46] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:41:09] (03PS2) 10JMeybohm: Remove the repository definition from helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/617693 (https://phabricator.wikimedia.org/T253843) [12:41:28] (03PS2) 10JMeybohm: changeprop: Update repository URL in requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/617694 (https://phabricator.wikimedia.org/T253843) [12:41:30] (03PS2) 10JMeybohm: eventgate: Update repository URL in requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/617695 (https://phabricator.wikimedia.org/T253843) [12:41:32] (03PS1) 10JMeybohm: mathoid: Change staging chart back to stable [deployment-charts] - 10https://gerrit.wikimedia.org/r/617699 (https://phabricator.wikimedia.org/T253843) [12:41:49] (03PS1) 10Alexandros Kosiaris: otrs: Disallow outgoing emails from test instance [puppet] - 10https://gerrit.wikimedia.org/r/617700 
(https://phabricator.wikimedia.org/T187984) [12:44:12] (03PS1) 10JMeybohm: helm: Switch stable chart repository to chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/617701 (https://phabricator.wikimedia.org/T25384) [12:46:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] otrs: Disallow outgoing emails from test instance [puppet] - 10https://gerrit.wikimedia.org/r/617700 (https://phabricator.wikimedia.org/T187984) (owner: 10Alexandros Kosiaris) [12:48:15] (03CR) 10JMeybohm: [C: 04-1] helmfile: strawman refactoring (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [12:50:47] (03CR) 10Ema: [C: 03+2] atskafka: librdkafka settings tuning [puppet] - 10https://gerrit.wikimedia.org/r/617685 (https://phabricator.wikimedia.org/T254317) (owner: 10Ema) [12:53:10] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:58] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:55:52] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:02:31] (03PS1) 10Jbond: standard: move none standard class to profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617703 (https://phabricator.wikimedia.org/T247956) [13:02:33] (03PS1) 10Jbond: profile::standard::admin: manage admin groups in profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617704 (https://phabricator.wikimedia.org/T247956) [13:02:41] (03PS1) 10Alexandros Kosiaris: otrs: Fix ferm::rule syntax [puppet] - 10https://gerrit.wikimedia.org/r/617705 [13:03:47] (03CR) 10jerkins-bot: [V: 04-1] standard: move none standard class to profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617703 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [13:04:04] (03CR) 10jerkins-bot: [V: 04-1] profile::standard::admin: manage admin groups in profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617704 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [13:04:24] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10MoritzMuehlenhoff) [13:04:31] !log proudly uploaded version 0.1 of python3-wmfmariadbpy + wmfmariadbpy [13:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:43] (03CR) 10Muehlenhoff: [C: 03+2] Extend snapshot Cumin alias to also include the testbed role [puppet] - 10https://gerrit.wikimedia.org/r/617651 (owner: 10Muehlenhoff) [13:09:45] (03PS2) 10Jbond: standard: move none standard class to profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617703 (https://phabricator.wikimedia.org/T247956) [13:10:51] (03CR) 10Alexandros Kosiaris: [C: 03+2] otrs: Fix ferm::rule syntax [puppet] - 10https://gerrit.wikimedia.org/r/617705 (owner: 10Alexandros Kosiaris) [13:13:01] (03PS2) 10Jbond: profile::standard::admin: manage admin groups in profile::standard [puppet] - 
10https://gerrit.wikimedia.org/r/617704 (https://phabricator.wikimedia.org/T247956) [13:13:58] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:17:55] (03PS1) 10Jbond: ferm: ensure rules always end in a semi colon [puppet] - 10https://gerrit.wikimedia.org/r/617706 [13:17:57] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/617704 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [13:18:43] (03PS2) 10Jbond: ferm: ensure rules always end in a semi colon [puppet] - 10https://gerrit.wikimedia.org/r/617706 [13:20:09] !log installing openjpeg2 security updates [13:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:46] (03CR) 10Ottomata: [C: 03+1] eventgate: Update repository URL in requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/617695 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [13:26:28] (03PS1) 10Jbond: hieradata: drop apache::logrotate keys [puppet] - 10https://gerrit.wikimedia.org/r/617708 (https://phabricator.wikimedia.org/T247956) [13:27:19] (03CR) 10Jbond: [C: 03+2] hieradata: drop apache::logrotate keys [puppet] - 10https://gerrit.wikimedia.org/r/617708 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [13:30:24] (03PS1) 10Jbond: diamond: remove unused hiera key [puppet] - 10https://gerrit.wikimedia.org/r/617710 (https://phabricator.wikimedia.org/T247956) [13:30:55] (03CR) 10Jbond: [C: 03+2] diamond: remove unused hiera key [puppet] - 10https://gerrit.wikimedia.org/r/617710 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [13:39:37] (03CR) 10Alexandros Kosiaris: [C: 03+1] mathoid: Change staging chart back to stable (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/617699 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [13:42:33] (03PS1) 10Jbond: confd: use the default 
$::domain variable for cofd srv_dns [puppet] - 10https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956) [13:42:44] (03PS1) 10Urbanecm: New throttle rule for Czech editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617717 (https://phabricator.wikimedia.org/T259352) [13:43:37] (03CR) 10jerkins-bot: [V: 04-1] New throttle rule for Czech editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617717 (https://phabricator.wikimedia.org/T259352) (owner: 10Urbanecm) [13:44:20] (03PS2) 10Urbanecm: New throttle rule for Czech editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617717 (https://phabricator.wikimedia.org/T259352) [13:45:54] (03CR) 10Alexandros Kosiaris: [C: 03+1] changeprop: Update repository URL in requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/617694 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [13:46:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] eventgate: Update repository URL in requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/617695 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [13:47:12] (03CR) 10Alexandros Kosiaris: [C: 03+1] Remove the repository definition from helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/617693 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [13:49:46] (03PS2) 10Jbond: confd: use the default $::domain variable for cofd srv_dns [puppet] - 10https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956) [13:50:57] (03CR) 10jerkins-bot: [V: 04-1] confd: use the default $::domain variable for cofd srv_dns [puppet] - 10https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [13:51:37] !log installing cups security updates (client-side tools/libs only) [13:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:50] (03CR) 10Elukey: [C: 03+2] analytics-in[46]: add new ports for 
term mysql-replica [homer/public] - 10https://gerrit.wikimedia.org/r/617649 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [13:52:15] !log update cr1/cr2-eqiad's analytics filters (ref: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/617649/) [13:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:59] (03PS3) 10Jbond: confd: use the default $::domain variable for cofd srv_dns [puppet] - 10https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956) [13:59:41] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [14:00:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:03:54] (03PS4) 10Jbond: confd: pass srv_dns directly instead of loading confd::srv_dns [puppet] - 10https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956) [14:04:33] (03PS1) 10Jbond: hieradata: remove unused hiera file [puppet] - 10https://gerrit.wikimedia.org/r/617723 [14:05:10] (03CR) 10Jbond: [C: 03+2] hieradata: remove unused hiera file [puppet] - 10https://gerrit.wikimedia.org/r/617723 (owner: 10Jbond) [14:08:24] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:09:38] (03PS1) 10Jbond: discovery: clean up old hiera values [puppet] - 10https://gerrit.wikimedia.org/r/617724 (https://phabricator.wikimedia.org/T247956) [14:10:48] (03CR) 10Jbond: [C: 03+2] discovery: clean up old hiera values [puppet] - 10https://gerrit.wikimedia.org/r/617724 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [14:11:17] (03CR) 10JMeybohm: [C: 03+2] mathoid: Change staging chart back to stable [deployment-charts] - 10https://gerrit.wikimedia.org/r/617699 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [14:11:28] (03CR) 10JMeybohm: [C: 03+2] Remove the repository definition from helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/617693 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [14:12:19] (03Merged) 10jenkins-bot: mathoid: Change staging chart back to stable [deployment-charts] - 10https://gerrit.wikimedia.org/r/617699 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [14:12:43] (03Merged) 10jenkins-bot: Remove the repository definition from helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/617693 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [14:17:16] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.1213 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:21:21] jbond42: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Function lookup() did not find a value for the name 'discovery::app_routes' on node netbox1001.wikimedia.org [14:21:50] puppetboard is like a christmas tree :D [14:22:25] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) 
[14:24:25] there seems to be an app_routes = hiera('discovery::app_routes') in realm.pp, related to aqs, no idea why
[14:25:06] apparently used by restbase
[14:25:08] elukey: thanks looking
[14:26:05] (PS1) Jbond: Revert "discovery: clean up old hiera values" [puppet] - https://gerrit.wikimedia.org/r/617579
[14:26:48] (CR) Jbond: [C: +2] Revert "discovery: clean up old hiera values" [puppet] - https://gerrit.wikimedia.org/r/617579 (owner: Jbond)
[14:27:37] elukey thanks have reverted
[14:30:17] (PS1) Jbond: graphite: move graphite paramters under profile namespace [puppet] - https://gerrit.wikimedia.org/r/617725 (https://phabricator.wikimedia.org/T247956)
[14:31:59] (PS1) Jbond: discovery: clean up old hiera values [puppet] - https://gerrit.wikimedia.org/r/617580 (https://phabricator.wikimedia.org/T247956)
[14:40:00] (PS1) MSantos: Enable printBackground to fix style issues [deployment-charts] - https://gerrit.wikimedia.org/r/617728 (https://phabricator.wikimedia.org/T52178)
[14:40:26] (PS2) Jbond: discovery: clean up old hiera values [puppet] - https://gerrit.wikimedia.org/r/617580 (https://phabricator.wikimedia.org/T247956)
[14:40:28] (PS1) Jbond: profile::restbase: update aqs_uri to remove aqs_site variable [puppet] - https://gerrit.wikimedia.org/r/617729 (https://phabricator.wikimedia.org/T247956)
[14:40:54] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/617729 (https://phabricator.wikimedia.org/T247956) (owner: Jbond)
[14:42:30] Operations, OTRS, serviceops, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (eyazi) Not sure if you did, but you should also reset the Ticket::SearchIndexModule setting. Can be done on the interface if you have acce...
[14:42:48] (03PS3) 10Jbond: discovery: clean up old hiera values [puppet] - 10https://gerrit.wikimedia.org/r/617580 (https://phabricator.wikimedia.org/T247956) [14:45:54] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.002489 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:52:21] (03CR) 10Mholloway: "You'll also need to do a new chart release with this change to get it into production. The process is described in the README, but in a nu" [deployment-charts] - 10https://gerrit.wikimedia.org/r/617728 (https://phabricator.wikimedia.org/T52178) (owner: 10MSantos) [14:54:17] (03CR) 10Mholloway: "Lol, I'm failing badly at Gerrit this morning." [deployment-charts] - 10https://gerrit.wikimedia.org/r/617728 (https://phabricator.wikimedia.org/T52178) (owner: 10MSantos) [14:54:43] (03CR) 10Mholloway: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/617728 (https://phabricator.wikimedia.org/T52178) (owner: 10MSantos) [15:01:20] (03PS10) 10Ottomata: Initial debian commit [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) [15:07:43] (03CR) 10Herron: [C: 03+2] logstash7: increase SSD tier JVM heap to 32G [puppet] - 10https://gerrit.wikimedia.org/r/617526 (https://phabricator.wikimedia.org/T259219) (owner: 10Herron) [15:10:48] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Patch-For-Review, and 2 others: Create Graphoid .pipeline files - https://phabricator.wikimedia.org/T203092 (10kaldari) [15:10:53] 10Operations, 10Graphoid, 10Code-Stewardship-Reviews, 10Release-Engineering-Team (Code Health), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10kaldari) 05Open→03Resolved [15:15:38] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) 
rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for... [15:16:41] (03PS1) 10Elukey: Set spark deploy-mode client for all the Analytics Hive to Druid jobs [puppet] - 10https://gerrit.wikimedia.org/r/617735 (https://phabricator.wikimedia.org/T254493) [15:25:04] (03PS1) 10Elukey: Swap fake keytabs from an-launcher1001 to 1002 [labs/private] - 10https://gerrit.wikimedia.org/r/617736 [15:25:21] (03CR) 10Elukey: [V: 03+2 C: 03+2] Swap fake keytabs from an-launcher1001 to 1002 [labs/private] - 10https://gerrit.wikimedia.org/r/617736 (owner: 10Elukey) [15:28:41] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10RLazarus) The only user-impacting section of the process will be a read-only period for all wikis while we move MediaWiki itself -- that s... 
[15:29:08] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/24268/an-launcher1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/617735 (https://phabricator.wikimedia.org/T254493) (owner: Elukey)
[15:30:37] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[15:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:38] (PS11) Ottomata: Initial debian commit [debs/anaconda-wmf] (debian) - https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006)
[15:32:00] (CR) MSantos: "> Patch Set 1:" [deployment-charts] - https://gerrit.wikimedia.org/r/617728 (https://phabricator.wikimedia.org/T52178) (owner: MSantos)
[15:32:43] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:11] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1031.eqiad.wmnet'] ` and were...
[16:05:04] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for...
[16:07:33] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for...
[16:09:59] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:20:04] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [16:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:11] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:31] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [16:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:37] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:38] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1032.eqiad.wmnet'] ` and were... [16:30:05] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2016.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [16:30:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1031.eqiad.wmnet'] ` and were... 
[16:31:45] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2017.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [16:37:55] (03CR) 10Ottomata: Initial debian commit (031 comment) [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [16:38:05] (03PS4) 10Ahmon Dancy: Add mtail program for monitoring the Zuul error log [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) [16:39:16] (03CR) 10jerkins-bot: [V: 04-1] Add mtail program for monitoring the Zuul error log [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) (owner: 10Ahmon Dancy) [16:41:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:42:38] (03PS1) 10Andrew Bogott: Update role for cloudvirt1031 and 1032 [puppet] - 10https://gerrit.wikimedia.org/r/617742 (https://phabricator.wikimedia.org/T251627) [16:42:52] (03CR) 10jerkins-bot: [V: 04-1] Update role for cloudvirt1031 and 1032 [puppet] - 10https://gerrit.wikimedia.org/r/617742 (https://phabricator.wikimedia.org/T251627) (owner: 10Andrew Bogott) [16:43:33] (03PS2) 10Andrew Bogott: Update role for cloudvirt1031 and 1032 [puppet] - 10https://gerrit.wikimedia.org/r/617742 (https://phabricator.wikimedia.org/T251627) [16:43:36] (03CR) 10Ahmon Dancy: Add mtail program for monitoring the Zuul error log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) (owner: 10Ahmon Dancy) [16:46:09] (03CR) 10Andrew Bogott: [C: 03+2] Update role for cloudvirt1031 and 1032 [puppet] - 10https://gerrit.wikimedia.org/r/617742 (https://phabricator.wikimedia.org/T251627) (owner: 10Andrew Bogott) [16:49:11] (03CR) 10Dzahn: Add mtail program for monitoring the Zuul error log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) (owner: 10Ahmon Dancy) [16:53:19] (03CR) 10Dzahn: "the issue now is "Found hiera call in class 'zuul::monitoring::server' for 'prometheus_nodes'". 
So you should do the lookup() in the param" [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) (owner: 10Ahmon Dancy) [16:56:02] (03CR) 10Dzahn: [C: 04-1] "looks like there is a singular vs plural issue: migration/migrations" [puppet] - 10https://gerrit.wikimedia.org/r/617676 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [16:56:27] (03CR) 10Ahmon Dancy: Add mtail program for monitoring the Zuul error log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) (owner: 10Ahmon Dancy) [16:58:42] Toolforge bad gateway? :O [16:58:52] (03CR) 10Dzahn: Add mtail program for monitoring the Zuul error log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) (owner: 10Ahmon Dancy) [16:59:24] Bsadowski1: let's use the !help feature in -cloud [16:59:38] nvm :P [16:59:49] Bsadowski1: fine here [16:59:56] i could not repro, ack [17:02:01] (03CR) 10Ahmon Dancy: Add mtail program for monitoring the Zuul error log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) (owner: 10Ahmon Dancy) [17:02:09] (03PS1) 10Dzahn: switch xhgui1001 to new xhgui role [puppet] - 10https://gerrit.wikimedia.org/r/617744 (https://phabricator.wikimedia.org/T259206) [17:06:13] (03CR) 10Dzahn: "to solve the current reason for jerkins downvote:" [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) (owner: 10Ahmon Dancy) [17:09:13] (03PS1) 10Krinkle: mediawiki-cache-warmup: Reduce warmup URLs [puppet] - 10https://gerrit.wikimedia.org/r/617745 [17:09:15] (03PS1) 10Krinkle: mediawiki-cache-warmup: Add "dry" mode [puppet] - 10https://gerrit.wikimedia.org/r/617746 [17:09:17] (03PS1) 10Krinkle: mediawiki-cache-warmup: Limit warmup URLs to large wikis [puppet] - 10https://gerrit.wikimedia.org/r/617747 [17:11:56] !log dzahn@cumin1001 START - Cookbook 
sre.hosts.downtime [17:11:56] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [17:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:15] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:13:15] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [17:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:14] PROBLEM - Check systemd state on wtp2017 is CRITICAL: connect to address 10.192.32.32 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:14] PROBLEM - MD RAID on wtp2017 is CRITICAL: connect to address 10.192.32.32 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:21:14] PROBLEM - php7.2-fpm service on wtp2017 is CRITICAL: connect to address 10.192.32.32 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:21:42] ACK - wtp2017 [17:21:49] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:21:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:34] PROBLEM - configured eth on wtp2016 is CRITICAL: connect to address 10.192.32.31 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [17:23:34] PROBLEM - DPKG on wtp2016 is CRITICAL: connect to address 10.192.32.31 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [17:24:08] (03PS1) 10Andrew Bogott: cloudvirt103[1-9]: rename nics for 
Stretch [puppet] - 10https://gerrit.wikimedia.org/r/617748 (https://phabricator.wikimedia.org/T251627) [17:25:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:25:13] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:42] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt103[1-9]: rename nics for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/617748 (https://phabricator.wikimedia.org/T251627) (owner: 10Andrew Bogott) [17:30:48] (03CR) 10Chad: [C: 03+1] "Nuke from high orbit!" [puppet] - 10https://gerrit.wikimedia.org/r/616164 (owner: 10Dzahn) [17:31:35] (03CR) 10Dzahn: "Greetings Chad! Hope you are doing well :) and ok, will do" [puppet] - 10https://gerrit.wikimedia.org/r/616164 (owner: 10Dzahn) [17:32:08] (03CR) 10Greg Grossmeier: [C: 03+1] "per Chad's comment. /me pours one out" [puppet] - 10https://gerrit.wikimedia.org/r/616164 (owner: 10Dzahn) [17:33:27] (03CR) 10Dzahn: [C: 03+2] admins: remove demon from gerrit and phab root users [puppet] - 10https://gerrit.wikimedia.org/r/616164 (owner: 10Dzahn) [17:33:32] (03PS2) 10Dzahn: admins: remove demon from gerrit and phab root users [puppet] - 10https://gerrit.wikimedia.org/r/616164 [17:36:01] (03CR) 10Chad: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/616164 (owner: 10Dzahn) [17:40:53] (03PS2) 10Dzahn: switch xhgui1001 to new xhgui role [puppet] - 10https://gerrit.wikimedia.org/r/617744 (https://phabricator.wikimedia.org/T259206) [17:42:29] (03CR) 10Dzahn: [C: 03+2] switch xhgui1001 to new xhgui role [puppet] - 10https://gerrit.wikimedia.org/r/617744 (https://phabricator.wikimedia.org/T259206) (owner: 10Dzahn) [17:43:44] (03CR) 10Dave Pifke: [C: 03+1] switch xhgui1001 to new xhgui role [puppet] - 10https://gerrit.wikimedia.org/r/617744 
(https://phabricator.wikimedia.org/T259206) (owner: 10Dzahn) [17:45:22] !log rebooting / reinstalling OS on xhgui1001 [17:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:45:51] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:51:26] (03PS1) 10Chad: Revoke all remaining group memberships, etc [puppet] - 10https://gerrit.wikimedia.org/r/617749 [17:54:01] (03CR) 10Greg Grossmeier: [C: 03+1] Revoke all remaining group memberships, etc [puppet] - 10https://gerrit.wikimedia.org/r/617749 (owner: 10Chad) [17:54:52] (03CR) 10jerkins-bot: [V: 04-1] Revoke all remaining group memberships, etc [puppet] - 10https://gerrit.wikimedia.org/r/617749 (owner: 10Chad) [17:56:57] (03PS2) 10Greg Grossmeier: Revoke all remaining group memberships, etc [puppet] - 10https://gerrit.wikimedia.org/r/617749 (owner: 10Chad) [17:57:03] 10Operations, 10Gerrit, 10User-Kormat: gerrit.wm.o/r/changes/ has leading garbage in the output - https://phabricator.wikimedia.org/T259333 (10dpifke) From https://gerrit-review.googlesource.com/Documentation/rest-api.html#output > To prevent against Cross Site Script Inclusion (XSSI) attacks, the JSON resp... 
[17:57:05] (03PS3) 10Chad: Revoke all remaining group memberships, etc [puppet] - 10https://gerrit.wikimedia.org/r/617749 [18:01:21] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:13:14] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:18:50] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 200 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [18:19:01] (03CR) 10RLazarus: [C: 03+1] mediawiki-cache-warmup: Reduce warmup URLs [puppet] - 10https://gerrit.wikimedia.org/r/617745 (owner: 10Krinkle) [18:19:06] 10Operations, 10serviceops: reinstall xhgui* with buster - https://phabricator.wikimedia.org/T259206 (10Dzahn) 05Open→03Resolved both xhgui1001 and xhgui2001 are now on buster, have xhgui package installed and puppet is happy [18:23:18] PROBLEM - ensure kvm processes are running on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:26:14] PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:26:26] PROBLEM - ensure kvm processes are running on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:30:04] RECOVERY - nova-compute proc minimum on cloudvirt1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:33:11] (03CR) 10RLazarus: [C: 03+1] mediawiki-cache-warmup: Add "dry" mode (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617746 (owner: 10Krinkle) [18:36:46] RECOVERY - Check systemd state on wtp2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:36:48] RECOVERY - MD RAID on wtp2017 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:37:24] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:39:01] (03PS2) 10Krinkle: mediawiki-cache-warmup: Add "dry" mode [puppet] - 10https://gerrit.wikimedia.org/r/617746 [18:39:08] (03PS2) 10Krinkle: mediawiki-cache-warmup: Limit warmup URLs to large wikis [puppet] - 10https://gerrit.wikimedia.org/r/617747 [18:39:32] 10Operations, 10LDAP-Access-Requests: LDAP access to the 'wmf' group for Monte Hurd - https://phabricator.wikimedia.org/T259382 (10Mhurd) [18:40:31] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2017.codfw.wmnet'] ` and were **ALL** successful. 
[18:41:14] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:42:13] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2016.codfw.wmnet'] ` and were **ALL** successful. [18:47:45] (03CR) 10Nuria: [C: 03+1] "Nice, this will bring piece of mind" [puppet] - 10https://gerrit.wikimedia.org/r/617735 (https://phabricator.wikimedia.org/T254493) (owner: 10Elukey) [18:48:12] 10Operations, 10LDAP-Access-Requests: LDAP access to the 'wmf' group for Monte Hurd - https://phabricator.wikimedia.org/T259382 (10greg) Approved as his manager's manager. [18:48:17] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:49:21] redis connection errors ^ [18:52:25] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:52:35] PROBLEM - nova-compute proc minimum on cloudvirt1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:27] ACKNOWLEDGEMENT - nova-compute proc minimum on cloudvirt1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute andrew bogott restarting https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:39] RECOVERY - nova-compute proc minimum on cloudvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:55:27] RECOVERY - configured eth on wtp2016 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:55:27] RECOVERY - DPKG on wtp2016 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:55:50] (03CR) 10RLazarus: [C: 03+1] mediawiki-cache-warmup: Limit warmup URLs to large wikis [puppet] - 10https://gerrit.wikimedia.org/r/617747 (owner: 10Krinkle) [18:56:57] 10Operations, 10LDAP-Access-Requests: LDAP access to the 'wmf' group for Monte Hurd - https://phabricator.wikimedia.org/T259382 (10herron) p:05Triage→03Medium Hi @Mhurd, it looks like you are actually a member of the `wmf` ldap group already, so that should be working with existing memberships unless a dif... [18:57:26] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2016.codfw.wmnet [18:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:45] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2018.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [19:00:47] 10Operations, 10Gerrit, 10User-Kormat: gerrit.wm.o/r/changes/ has leading garbage in the output - https://phabricator.wikimedia.org/T259333 (10herron) p:05Triage→03Medium [19:00:53] 10Operations, 10SRE-tools: Exception raised while executing cookbook sre.hosts.downtime - https://phabricator.wikimedia.org/T259158 (10herron) p:05Triage→03Medium [19:02:56] 10Operations, 10Gerrit, 10User-Kormat: gerrit.wm.o/r/changes/ has leading garbage in the output - https://phabricator.wikimedia.org/T259333 (10greg) 05Open→03Invalid Boldly marking as invalid as this seems to be intended behavior. [19:12:39] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:13:04] (03CR) 10RLazarus: [C: 03+1] mediawiki-cache-warmup: Add "dry" mode [puppet] - 10https://gerrit.wikimedia.org/r/617746 (owner: 10Krinkle) [19:14:43] PROBLEM - puppet last run on otrs1001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:15:18] known that there is WIP on otrs [19:20:59] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2017.codfw.wmnet [19:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:25] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2019.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [19:23:39] ACKNOWLEDGEMENT - mediawiki-installation DSH group on wtp2017 is CRITICAL: Host wtp2017 is not in mediawiki-installation dsh group daniel_zahn reinstall https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:23:59] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:28:09] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2020.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[19:34:15] (03PS1) 10Andrew Bogott: Move cloudvirt1031 from virt_ceph to virt [puppet] - 10https://gerrit.wikimedia.org/r/617758 [19:34:51] (03CR) 10Andrew Bogott: [C: 03+2] Move cloudvirt1031 from virt_ceph to virt [puppet] - 10https://gerrit.wikimedia.org/r/617758 (owner: 10Andrew Bogott) [19:34:56] (03PS2) 10Andrew Bogott: Move cloudvirt1031 from virt_ceph to virt [puppet] - 10https://gerrit.wikimedia.org/r/617758 [19:35:16] 10Operations, 10Analytics-Clusters, 10Analytics-Radar, 10observability: Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10herron) [19:35:48] 10Operations, 10vm-requests: eqiad: 1 VM for kafkamon - https://phabricator.wikimedia.org/T257560 (10herron) 05Open→03Resolved [19:36:01] 10Operations, 10vm-requests: codfw: 1 VM for kafkamon - https://phabricator.wikimedia.org/T257561 (10herron) 05Open→03Resolved [19:36:12] 10Operations, 10vm-requests: eqiad: 1 VM for kafkamon - kafkamon1002 - https://phabricator.wikimedia.org/T257560 (10herron) [19:36:25] 10Operations, 10vm-requests: codfw: 1 VM for kafkamon - kafkamon2002 - https://phabricator.wikimedia.org/T257561 (10herron) [19:39:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:39:06] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:59] (03PS5) 10Ahmon Dancy: Add mtail program for monitoring the Zuul error log [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) [19:50:13] (03CR) 10jerkins-bot: [V: 04-1] Add mtail program for monitoring the Zuul error log [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) (owner: 10Ahmon Dancy) [19:51:18] PROBLEM - Check size of conntrack table on wtp2018 is CRITICAL: connect to address 10.192.32.33 port 5666: Connection 
refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:51:18] PROBLEM - parsoid on wtp2018 is CRITICAL: connect to address 10.192.32.33 and port 8000: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [19:51:35] ACK - wtp2018 [19:51:48] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:51:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:13] (03PS6) 10Ahmon Dancy: Add mtail program for monitoring the Zuul error log [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) [19:55:24] (03CR) 10jerkins-bot: [V: 04-1] Add mtail program for monitoring the Zuul error log [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) (owner: 10Ahmon Dancy) [20:01:05] (03PS1) 10Ahmon Dancy: Rplace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/617762 [20:01:07] (03PS2) 10Ahmon Dancy: Replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/617762 [20:02:58] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:02:58] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:24] (03PS1) 10Andrew Bogott: Revert "Move cloudvirt1031 from virt_ceph to virt" [puppet] - 10https://gerrit.wikimedia.org/r/617584 [20:04:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:05:24] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Move cloudvirt1031 from virt_ceph to virt" [puppet] - 10https://gerrit.wikimedia.org/r/617584 (owner: 10Andrew Bogott) [20:08:34] RECOVERY - ensure kvm processes are running on cloudvirt1031 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:08:43] (03PS7) 10Ahmon Dancy: Add mtail program for monitoring the Zuul error log [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) [20:08:52] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:09:16] RECOVERY - ensure kvm processes are running on cloudvirt1032 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:10:11] 10Operations, 10Discovery-Search: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (10DVrandecic) [20:11:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:11:12] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:46] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:18:46] 10Operations, 10Fundraising-Backlog: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 
(10DStrine) p:05Medium→03High [20:21:32] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 200 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [20:21:56] PROBLEM - PHP7 rendering on wtp2020 is CRITICAL: connect to address 10.192.32.35 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:21:58] PROBLEM - Check whether ferm is active by checking the default input chain on wtp2020 is CRITICAL: connect to address 10.192.32.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:21:58] PROBLEM - mcrouter process on wtp2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.34: Connection reset by peer https://wikitech.wikimedia.org/wiki/Mcrouter [20:22:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:22:06] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:11] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:22:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:18] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (10Nuria) [20:33:31] (03PS1) 10Andrew Bogott: Cloudvirt103[3-9] to Buster [puppet] - 10https://gerrit.wikimedia.org/r/617767 [20:34:03] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (10Nuria) [20:34:59] 
10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (10Nuria) [20:35:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10Andrew) [20:35:41] (03PS3) 10Ahmon Dancy: Replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/617762 [20:35:43] (03PS8) 10Ahmon Dancy: Add mtail program for monitoring the Zuul error log [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) [20:36:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10Andrew) I have cloudvirt1031 and 1032 running nova-compute, and things look right from the host... 
[20:36:47] (03CR) 10Andrew Bogott: [C: 03+2] Cloudvirt103[3-9] to Buster [puppet] - 10https://gerrit.wikimedia.org/r/617767 (owner: 10Andrew Bogott) [20:37:11] (03CR) 10jerkins-bot: [V: 04-1] Add mtail program for monitoring the Zuul error log [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) (owner: 10Ahmon Dancy) [20:55:30] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:55:30] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:55:30] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:55:30] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:55:30] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:55:31] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:55:31] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:55:33] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:55:33] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:55:33] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:55:33] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:55:33] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:55:34] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:00] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:40] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:43] RECOVERY - parsoid on wtp2018 is OK: HTTP OK: HTTP/1.1 200 OK - 1022 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [21:07:44] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2018.codfw.wmnet'] ` and were **ALL** successful. [21:13:14] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2018.codfw.wmnet [21:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:39] RECOVERY - Check size of conntrack table on wtp2018 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:22:43] 10Operations, 10Mail, 10OTRS, 10Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (10Dzahn) So the original request for privacy@wikidata is resolved. We can either close this ticket or talk about the domains that... 
[21:23:42] 10Operations, 10Mail, 10OTRS, 10Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (10Dzahn) @Emufarmers I guess the corresponding OTRS queues can be disabled or removed if that's appropriate and needed. [21:23:45] RECOVERY - mcrouter process on wtp2019 is OK: PROCS OK: 1 process with UID = 113 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter [21:24:58] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10Dzahn) a:05JMeybohm→03Dzahn [21:31:00] 10Operations, 10Fundraising-Backlog: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Dzahn) > At the moment we're just looking for a short-term solution Fwiw, creating a new wiki involves quite a few steps and people and might not a... [21:32:09] RECOVERY - PHP7 rendering on wtp2020 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 0.475 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:35:13] (03PS1) 10Andrew Bogott: Fix comment about cloudvirts and scheduling [puppet] - 10https://gerrit.wikimedia.org/r/617774 [21:35:15] (03PS1) 10Andrew Bogott: Make cloudvirt103[3-9] into cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/617775 [21:36:11] (03CR) 10Andrew Bogott: [C: 03+2] Make cloudvirt103[3-9] into cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/617775 (owner: 10Andrew Bogott) [21:36:17] (03CR) 10Andrew Bogott: [C: 03+2] Fix comment about cloudvirts and scheduling [puppet] - 10https://gerrit.wikimedia.org/r/617774 (owner: 10Andrew Bogott) [21:36:43] !log [wtp2019:~] $ sudo rm -rf /srv/deployment/parsoid/deploy-cache [21:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:21] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting
access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (10Nuria) [21:40:39] (03PS1) 10Andrew Bogott: Remove some old Ubuntu->Debian migration code [puppet] - 10https://gerrit.wikimedia.org/r/617777 [21:42:01] (03PS2) 10Andrew Bogott: Nova compute: remove some old Ubuntu->Debian migration code [puppet] - 10https://gerrit.wikimedia.org/r/617777 [21:42:55] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2019.codfw.wmnet'] ` and were **ALL** successful. [21:43:15] (03CR) 10Andrew Bogott: [C: 03+2] Nova compute: remove some old Ubuntu->Debian migration code [puppet] - 10https://gerrit.wikimedia.org/r/617777 (owner: 10Andrew Bogott) [21:46:23] (03CR) 10Dzahn: [C: 04-1] "in general good intention just the syntax for default values is a bit different and unfortunately can't be directly replaced like that" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617762 (owner: 10Ahmon Dancy) [21:47:20] (03CR) 10Dzahn: [C: 04-1] "nitpick: please start commit messages with the module name or topic, so "zuul::server: replace hiera() with lookup()" or so" [puppet] - 10https://gerrit.wikimedia.org/r/617762 (owner: 10Ahmon Dancy) [21:48:47] RECOVERY - Check whether ferm is active by checking the default input chain on wtp2020 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:50:12] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2019.codfw.wmnet [21:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:28] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2020.codfw.wmnet'] ` and were **ALL** successful. 
[21:55:44] (03PS1) 10Andrew Bogott: cloudvirt103[3-9]: rename nic yet again [puppet] - 10https://gerrit.wikimedia.org/r/617779 [21:56:07] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [21:56:56] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt103[3-9]: rename nic yet again [puppet] - 10https://gerrit.wikimedia.org/r/617779 (owner: 10Andrew Bogott) [21:57:05] (03PS21) 10CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [21:57:56] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=wtp2019.codfw.wmnet [21:58:04] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2020.codfw.wmnet [21:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:46] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 200 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [22:03:22] !log wtp2019 - parsoid could not start after reimaging - was missing /etc/parsoid/config.yaml which is a symbolic link deep onto /srv/deployment/parsoid/deploy-cache/.. like in some other cases before manually deleted deploy-cache dir and ran puppet again .. T258775 [22:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:28] T258775: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 [22:12:41] PROBLEM - MariaDB Replica Lag: s4 on db1145 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 910.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:13:11] PROBLEM - ensure kvm processes are running on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:14:53] RECOVERY - ensure kvm processes are running on cloudvirt1033 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:36:40] 10Operations, 10Platform Engineering, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 6 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10jeena) We attempted to run the tests using CI, but ran into errors deploying cassandra t... [22:46:25] PROBLEM - parsoid on wtp2019 is CRITICAL: connect to address 10.192.32.34 and port 8000: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [22:53:19] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:53:20] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:36] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:53:36] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:33] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 (10Dzahn) All wtp* and parse* servers have been reimaged. With the exception of wtp2019 they have also been tested with httpbb, parsoid service running, repooled and look fine in mo... [23:14:55] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10Dzahn) ` [cumin1001:~] $ sudo cumin wtp* 'df -h | grep mapper | cut -d "/" -f1,2' 43 hosts will be targeted: wtp[2001-2004,2006-2020].codfw.wmnet,wtp[1025-1048].eqiad.wmnet Confirm... [23:15:13] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10Dzahn) 05Open→03Resolved [23:42:42] (03PS1) 10Andrew Bogott: openstack::nova::compute::service::rocky::buster: use legacy ebtables [puppet] - 10https://gerrit.wikimedia.org/r/617791 (https://phabricator.wikimedia.org/T259399) [23:43:20] (03CR) 10jerkins-bot: [V: 04-1] openstack::nova::compute::service::rocky::buster: use legacy ebtables [puppet] - 10https://gerrit.wikimedia.org/r/617791 (https://phabricator.wikimedia.org/T259399) (owner: 10Andrew Bogott) [23:44:52] (03PS2) 10Andrew Bogott: openstack::nova::compute::service::rocky::buster: use legacy ebtables [puppet] - 10https://gerrit.wikimedia.org/r/617791 (https://phabricator.wikimedia.org/T259399) [23:45:32] (03CR) 10Andrew Bogott: [C: 03+2] openstack::nova::compute::service::rocky::buster: use legacy ebtables [puppet] - 10https://gerrit.wikimedia.org/r/617791 (https://phabricator.wikimedia.org/T259399) (owner: 10Andrew Bogott) [23:56:09] RECOVERY - MariaDB Replica Lag: s4 on db1145 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica