[00:00:05] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T0000).
[00:00:05] No GERRIT patches in the queue for this window AFAICS.
[00:02:20] (03CR) 10RLazarus: [C: 03+1] "I see this depends on the patch to enable forensic logging, which I haven't looked at yet -- but the changes here LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/570256 (owner: 10Giuseppe Lavagetto)
[00:04:47] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100%
[00:07:30] 10Operations, 10SRE-tools: Homer: commit> no causes stacktrace - https://phabricator.wikimedia.org/T244362 (10Volans) a:03Volans
[00:09:54] (03PS1) 10Volans: Handle commit abort separately [software/homer] - 10https://gerrit.wikimedia.org/r/570483 (https://phabricator.wikimedia.org/T244362)
[00:11:32] 10Operations, 10ops-eqiad, 10Dumps-Generation: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T241794 (10Jclark-ctr) Host is racked rack B5 U25 . Switchport 14
[00:12:01] 10Operations, 10ops-eqiad, 10Dumps-Generation: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T241794 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson
[00:12:18] 10Operations, 10ops-eqiad, 10Dumps-Generation: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T241794 (10Jclark-ctr)
[00:27:16] 10Operations, 10SRE-tools: Homer: commit timeout on MX104 and SRXs - https://phabricator.wikimedia.org/T244363 (10Volans) a:03Volans
[00:29:04] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 84.10 ms
[00:35:50] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10Krinkle) Looks like we've recovered about 5-6%, but still consistently regressed by 9-10% overall, e.g. 484ms...
[00:35:52] (03CR) 10Ayounsi: [C: 03+1] "Tested and works as expected." [software/homer] - 10https://gerrit.wikimedia.org/r/570483 (https://phabricator.wikimedia.org/T244362) (owner: 10Volans)
[00:37:34] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10Krinkle) Currently open patches: >>! From T238494#5750475: > [operations/puppet@production] mediawiki::webse...
[01:00:04] twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Phabricator update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T0100).
[01:14:46] (03CR) 10Dzahn: [C: 03+1] "@gehel sure you don't want to use the opportunity to test buster on one of them while there is hardware that is not in production yet?" [puppet] - 10https://gerrit.wikimedia.org/r/570465 (https://phabricator.wikimedia.org/T242301) (owner: 10Papaul)
[01:23:34] 10Operations, 10ops-codfw: codfw: new mw servers not getting an IP when default to Stretch - https://phabricator.wikimedia.org/T244438 (10Papaul)
[01:23:49] 10Operations, 10ops-codfw: codfw: new mw servers not getting an IP when default to Stretch - https://phabricator.wikimedia.org/T244438 (10Papaul) p:05Triage→03Medium
[01:31:21] (03CR) 10Dzahn: [C: 03+1] "the "wdqs*" line would match first and echo.
i think you need to move the special case before the wildcard case" [puppet] - 10https://gerrit.wikimedia.org/r/570465 (https://phabricator.wikimedia.org/T242301) (owner: 10Papaul) [01:37:01] (03PS2) 10Dzahn: add private IPs for new install servers [dns] - 10https://gerrit.wikimedia.org/r/570468 (https://phabricator.wikimedia.org/T224576) [01:37:06] (03PS2) 10Papaul: DHCP: Add wdqs200[7-8] to netboot.cfg and MAC address [puppet] - 10https://gerrit.wikimedia.org/r/570465 (https://phabricator.wikimedia.org/T242301) [01:39:26] (03CR) 10Papaul: [C: 03+2] DHCP: Add wdqs200[7-8] to netboot.cfg and MAC address [puppet] - 10https://gerrit.wikimedia.org/r/570465 (https://phabricator.wikimedia.org/T242301) (owner: 10Papaul) [01:53:36] (03PS1) 10Volans: uwsgi: fix removal of init.d links on buster [puppet] - 10https://gerrit.wikimedia.org/r/570489 [01:54:23] (03PS2) 10Volans: uwsgi: fix removal of init.d links on buster [puppet] - 10https://gerrit.wikimedia.org/r/570489 [01:56:09] (03CR) 10Volans: "The current logged error (visible only in debug mode, thanks Faidon for finding it) is:" [puppet] - 10https://gerrit.wikimedia.org/r/570489 (owner: 10Volans) [02:02:37] (03PS3) 10Volans: uwsgi: fix removal of init.d links on buster [puppet] - 10https://gerrit.wikimedia.org/r/570489 [02:02:39] (03PS1) 10Volans: uwsgi: update check for buster [puppet] - 10https://gerrit.wikimedia.org/r/570490 [02:04:06] (03CR) 10Papaul: [C: 03+1] add private IPs for new install servers [dns] - 10https://gerrit.wikimedia.org/r/570468 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [02:04:32] (03CR) 10Volans: "compiler results here:" [puppet] - 10https://gerrit.wikimedia.org/r/570489 (owner: 10Volans) [02:04:42] (03PS3) 10Dzahn: add private IPs for new install servers [dns] - 10https://gerrit.wikimedia.org/r/570468 (https://phabricator.wikimedia.org/T224576) [02:04:58] (03CR) 10Dzahn: [C: 03+2] add private IPs for new install servers [dns] - 10https://gerrit.wikimedia.org/r/570468 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [02:13:18] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) [02:13:54] 10Operations, 10ops-codfw, 10serviceops-radar: codfw: new mw servers not getting an IP when default to Stretch - https://phabricator.wikimedia.org/T244438 (10Dzahn) [02:15:19] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Dzahn) 05Open→03Stalled currently blocked on T244438 , an installer issue on stretch that only happens on stretch and buster would not have a problem [02:18:42] 10Operations, 10vm-requests: VM requests for install_server replacements - https://phabricator.wikimedia.org/T244390 (10Dzahn) https://gerrit.wikimedia.org/r/c/operations/dns/+/570468 [02:19:03] 10Operations, 10vm-requests: VM requests for install_server replacements - https://phabricator.wikimedia.org/T244390 (10Dzahn) reverted / removed public IPs, added private IPs [02:20:57] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [02:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:36] !log ganeti - Creating new VM named install1003.eqiad.wmnet in eqiad with row=C vcpu=1 memory=1 gigabytes disk=20 gigabytes link=private (T244390) [02:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:38] T244390: VM requests for install_server replacements - https://phabricator.wikimedia.org/T244390 [02:30:24] !log 
dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [02:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:39:00] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [02:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:42:57] !log ganeti - Creating new VM named install2003.codfw.wmnet in codfw with row=A vcpu=1 memory=1 gigabytes disk=20 gigabytes link=private (T244390) [02:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:00] T244390: VM requests for install_server replacements - https://phabricator.wikimedia.org/T244390 [02:44:55] 10Operations, 10vm-requests: VM requests for install_server replacements - https://phabricator.wikimedia.org/T244390 (10Dzahn) [02:49:21] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [02:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:51:03] (03PS2) 10Dzahn: DHCP: add install1003/install2003, using current install servers [puppet] - 10https://gerrit.wikimedia.org/r/569685 (https://phabricator.wikimedia.org/T224576) [02:52:41] (03PS1) 10Dzahn: DHCP: remove 'buster-installer' lines, now default and superfluous [puppet] - 10https://gerrit.wikimedia.org/r/570498 [02:54:11] (03CR) 10Dzahn: [C: 03+2] DHCP: add install1003/install2003, using current install servers [puppet] - 10https://gerrit.wikimedia.org/r/569685 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [02:54:22] (03PS3) 10Dzahn: DHCP: add install1003/install2003, using current install servers [puppet] - 10https://gerrit.wikimedia.org/r/569685 (https://phabricator.wikimedia.org/T224576) [02:56:39] (03PS2) 10Dzahn: DHCP: remove 'buster-installer' lines, now default and superfluous [puppet] - 10https://gerrit.wikimedia.org/r/570498 [02:59:56] (03PS3) 10Dzahn: DHCP: remove 'buster-installer' lines, now default and superfluous, linting [puppet] - 10https://gerrit.wikimedia.org/r/570498 [03:01:25] (03CR) 10Dzahn: [C: 03+2] DHCP: remove 'buster-installer' lines, now default and superfluous, linting [puppet] - 10https://gerrit.wikimedia.org/r/570498 (owner: 10Dzahn) [03:13:32] (03PS1) 10Dzahn: install: fremove next-server for new install servers for OS install [puppet] - 10https://gerrit.wikimedia.org/r/570505 (https://phabricator.wikimedia.org/T224576) [03:14:31] (03PS2) 10Dzahn: install: remove next-server for new install servers for OS install [puppet] - 10https://gerrit.wikimedia.org/r/570505 (https://phabricator.wikimedia.org/T224576) [03:16:11] (03PS1) 10Andrew Bogott: Keystone: set max_active_keys for fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570507 (https://phabricator.wikimedia.org/T243418) [03:16:23] (03CR) 10Dzahn: [C: 03+2] install: remove next-server for new install servers for OS install [puppet] - 10https://gerrit.wikimedia.org/r/570505 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [03:25:32] (03PS1) 10CDanis: fastnetmon: connect via NRPE to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) [03:26:21] (03PS1) 10Volans: junos: handle timeouts separately [software/homer] - 10https://gerrit.wikimedia.org/r/570510 (https://phabricator.wikimedia.org/T244363) [03:28:45] (03PS2) 10CDanis: fastnetmon: connect via NRPE to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) [03:31:34] (03PS3) 10CDanis: fastnetmon: connect via NRPE to Icinga [puppet] - 
10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) [03:35:14] (03PS4) 10CDanis: fastnetmon: connect via NRPE to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) [03:39:48] (03PS5) 10CDanis: fastnetmon: connect via NRPE to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) [03:40:08] (03CR) 10Dzahn: fastnetmon: connect via NRPE to Icinga (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) (owner: 10CDanis) [03:40:10] (03CR) 10CDanis: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/20644/netflow1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) (owner: 10CDanis) [03:43:15] (03CR) 10CDanis: fastnetmon: connect via NRPE to Icinga (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) (owner: 10CDanis) [03:45:43] (03PS6) 10CDanis: fastnetmon: connect via NRPE to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) [03:46:19] (03CR) 10Dzahn: fastnetmon: connect via NRPE to Icinga (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) (owner: 10CDanis) [03:51:36] (03PS7) 10CDanis: fastnetmon: connect to Icinga via NRPE [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) [03:58:42] ACKNOWLEDGEMENT - Host mw2311 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T241852 [04:32:40] (03CR) 10Ppchelko: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570393 (https://phabricator.wikimedia.org/T243106) (owner: 10Ppchelko) [04:35:18] (03PS1) 10KartikMistry: Update cxserver to 2020-02-05-051751-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/570515 (https://phabricator.wikimedia.org/T244230) [05:59:25] PROBLEM - Memory correctable errors -EDAC- on mw1239 is CRITICAL: 4.001 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops [06:28:57] PROBLEM - Check whether ferm is active by checking the default input chain on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:28:57] PROBLEM - configured eth on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:28:59] PROBLEM - configured eth on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:28:59] PROBLEM - ores uWSGI web app on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:28:59] PROBLEM - dhclient process on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:28:59] PROBLEM - dhclient process on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:28:59] PROBLEM - puppet last run on ores1003 is 
CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:29:00] PROBLEM - puppet last run on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:29:00] PROBLEM - puppet last run on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:29:01] PROBLEM - Check size of conntrack table on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:03] PROBLEM - dhclient process on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:04] PROBLEM - dhclient process on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:04] PROBLEM - dhclient process on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:05] PROBLEM - Check systemd state on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:05] PROBLEM - Check whether ferm is active by checking the default input chain on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:05] PROBLEM - DPKG on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:05] PROBLEM - configured eth on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:07] PROBLEM - Disk space on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1002&var-datasource=eqiad+prometheus/ops [06:29:07] PROBLEM - Check whether ferm is active by checking the default input chain on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:07] PROBLEM - Check whether ferm is active by checking the default input chain on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:07] PROBLEM - MD RAID on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:09] PROBLEM - configured eth on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:09] PROBLEM - Check the NTP synchronisation status of timesyncd on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:29:11] PROBLEM - dhclient process on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:11] PROBLEM - DPKG on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:11] PROBLEM - ores uWSGI web app on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:11] PROBLEM - Check whether ferm is active by checking the default input chain on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:11] PROBLEM - DPKG on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:13] PROBLEM - Check whether ferm is active by checking the default input chain on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:13] PROBLEM - configured eth on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:13] PROBLEM - Check whether ferm is active by checking the default input chain on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:15] PROBLEM - Check systemd state on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:15] PROBLEM - MD RAID on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:15] PROBLEM - Disk space on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1003&var-datasource=eqiad+prometheus/ops [06:29:17] PROBLEM - ores uWSGI web app on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:17] PROBLEM - Disk space on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2006&var-datasource=codfw+prometheus/ops [06:29:17] PROBLEM - dhclient process on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:19] PROBLEM - MD RAID on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:21] PROBLEM - MD RAID on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:21] PROBLEM - ores uWSGI web app on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:21] PROBLEM - Check systemd state on ores1009 is 
CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:23] PROBLEM - configured eth on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:25] PROBLEM - Check size of conntrack table on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:25] PROBLEM - Disk space on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2002&var-datasource=codfw+prometheus/ops [06:29:27] PROBLEM - dhclient process on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:29] PROBLEM - DPKG on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:29] PROBLEM - DPKG on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:29] PROBLEM - DPKG on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:29] PROBLEM - configured eth on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:31] PROBLEM - MD RAID on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:33] PROBLEM - Check size of conntrack table on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:33] PROBLEM - ores uWSGI web app on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:34] PROBLEM - Check size of conntrack table on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:37] PROBLEM - Check size of conntrack table on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:39] PROBLEM - Disk space on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1009&var-datasource=eqiad+prometheus/ops [06:29:39] PROBLEM - Disk space on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1008&var-datasource=eqiad+prometheus/ops [06:29:41] PROBLEM - configured eth on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:41] PROBLEM - puppet last run on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: 
Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:29:41] PROBLEM - Check whether ferm is active by checking the default input chain on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:41] PROBLEM - Check whether ferm is active by checking the default input chain on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:43] PROBLEM - DPKG on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:43] PROBLEM - DPKG on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:44] PROBLEM - Check size of conntrack table on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:45] PROBLEM - Check whether ferm is active by checking the default input chain on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:45] PROBLEM - ores uWSGI web app on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:47] PROBLEM - Disk space on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2009&var-datasource=codfw+prometheus/ops [06:29:47] PROBLEM - DPKG on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:47] PROBLEM - Disk space on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1005&var-datasource=eqiad+prometheus/ops [06:29:47] PROBLEM - DPKG on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:49] PROBLEM - Check size of conntrack table on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:49] PROBLEM - Check whether ferm is active by checking the default input chain on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:49] PROBLEM - Check size of conntrack table on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:49] PROBLEM - Check size of conntrack table on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:49] PROBLEM - Check systemd state on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:50] PROBLEM - Check systemd state on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:51] PROBLEM - Check whether ferm is active by checking the default input chain on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:51] PROBLEM - Check whether ferm is active by checking the default input chain on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:53] PROBLEM - configured eth on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:53] PROBLEM - ores uWSGI web app on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:53] PROBLEM - MD RAID on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:54] PROBLEM - Check size of conntrack table on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:54] PROBLEM - ores uWSGI web app on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:55] PROBLEM - dhclient process on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:57] PROBLEM - Check the NTP synchronisation status of timesyncd on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:29:57] PROBLEM - Check size of conntrack table on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:59] PROBLEM - configured eth on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:59] PROBLEM - DPKG on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:59] PROBLEM - Check whether ferm is active by checking the default input chain on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:01] PROBLEM - configured eth on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:01] PROBLEM - Disk space on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2003&var-datasource=codfw+prometheus/ops [06:30:01] PROBLEM - DPKG on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:03] PROBLEM - ores uWSGI web app on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:03] PROBLEM - Check systemd state on ores1004 
is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:03] PROBLEM - dhclient process on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:04] PROBLEM - MD RAID on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:05] PROBLEM - puppet last run on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:30:05] PROBLEM - dhclient process on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:07] PROBLEM - ores uWSGI web app on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:09] PROBLEM - DPKG on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:09] PROBLEM - MD RAID on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:09] PROBLEM - Disk space on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2008&var-datasource=codfw+prometheus/ops [06:30:09] PROBLEM - MD RAID on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:11] PROBLEM - Check size of conntrack table on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:13] PROBLEM - configured eth on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:13] PROBLEM - Check size of conntrack table on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:15] PROBLEM - ores uWSGI web app on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:15] PROBLEM - Disk space on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1006&var-datasource=eqiad+prometheus/ops [06:30:17] PROBLEM - DPKG on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:17] PROBLEM - MD RAID on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:17] PROBLEM - MD RAID on ores1001 
is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:18] PROBLEM - dhclient process on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:21] PROBLEM - Check size of conntrack table on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:21] PROBLEM - Check systemd state on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:21] PROBLEM - configured eth on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:23] PROBLEM - configured eth on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:23] PROBLEM - MD RAID on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:24] PROBLEM - ores uWSGI web app on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:25] PROBLEM - Check size of conntrack table on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:27] PROBLEM - Check systemd state on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:27] PROBLEM - Check systemd state on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:29] PROBLEM - Disk space on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1004&var-datasource=eqiad+prometheus/ops [06:30:29] PROBLEM - Disk space on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1001&var-datasource=eqiad+prometheus/ops [06:30:29] PROBLEM - Disk space on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2004&var-datasource=codfw+prometheus/ops [06:30:31] PROBLEM - dhclient process on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:31] PROBLEM - DPKG on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:35] PROBLEM - puppet last run on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun 
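[Editor's note] The CRITICAL alerts above and below all share the same failure mode: Icinga runs these checks (disk space, MD RAID, ferm, dhclient, puppet last run, systemd state, ...) against the ores hosts over NRPE, and nothing is answering on TCP port 5666 on those hosts at this point, so every NRPE-backed check goes CRITICAL at once and later recovers together once the agent is reachable again. The following is a minimal, illustrative Nagios-style TCP probe in Python; it is not the real check_nrpe plugin or Wikimedia's monitoring configuration, and the example host IP is simply taken from the alerts above.

#!/usr/bin/env python3
"""Minimal sketch of a Nagios-style TCP probe (NOT the real check_nrpe
plugin): it only illustrates why every NRPE-backed check on a host
fails at once when nothing accepts connections on port 5666."""
import socket
import sys

OK, CRITICAL = 0, 2  # Nagios plugin exit codes (1=WARNING, 3=UNKNOWN unused here)


def check_tcp(host: str, port: int = 5666, timeout: float = 10.0) -> int:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            print(f"TCP OK: connected to {host} port {port}")
            return OK
    except OSError as exc:
        # Produces the same shape of message as the alerts above, e.g.
        # "connect to address 10.64.16.94 port 5666: Connection refused"
        print(f"TCP CRITICAL: connect to address {host} port {port}: {exc}")
        return CRITICAL


if __name__ == "__main__":
    # The default host is illustrative; 10.64.16.94 is ores1003 per the alerts above.
    sys.exit(check_tcp(sys.argv[1] if len(sys.argv) > 1 else "10.64.16.94"))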
[06:30:35] PROBLEM - Check size of conntrack table on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:35] PROBLEM - Check systemd state on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:35] PROBLEM - Disk space on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2005&var-datasource=codfw+prometheus/ops [06:30:37] PROBLEM - Check systemd state on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:37] PROBLEM - Check whether ferm is active by checking the default input chain on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:39] PROBLEM - MD RAID on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:39] PROBLEM - Check systemd state on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:41] PROBLEM - ores uWSGI web app on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:43] PROBLEM - configured eth on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:43] PROBLEM - MD RAID on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:45] PROBLEM - Check the NTP synchronisation status of timesyncd on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:30:47] PROBLEM - configured eth on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:47] PROBLEM - Disk space on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2001&var-datasource=codfw+prometheus/ops [06:30:47] PROBLEM - puppet last run on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:30:47] PROBLEM - Check systemd state on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:49] PROBLEM - ores uWSGI web app on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:49] PROBLEM - Check systemd state on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:51] PROBLEM - dhclient process on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:51] PROBLEM - Check systemd state on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:55] PROBLEM - dhclient process on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:55] PROBLEM - Check whether ferm is active by checking the default input chain on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:59] RECOVERY - Check size of conntrack table on ores2006 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:59] PROBLEM - puppet last run on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:31:03] RECOVERY - dhclient process on ores2006 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:31:05] RECOVERY - Check whether ferm is active by checking the default input chain on ores2006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:31:09] RECOVERY - DPKG on ores2006 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:31:09] RECOVERY - Check whether ferm is active by checking the default input chain on ores1008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:31:11] RECOVERY - configured eth on ores2006 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:31:11] RECOVERY - Check systemd state on ores2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:13] RECOVERY - Disk space on ores2006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2006&var-datasource=codfw+prometheus/ops [06:31:27] PROBLEM - puppet last run on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:31:31] PROBLEM - Check systemd state on ores1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:35] RECOVERY - Disk space on ores1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1008&var-datasource=eqiad+prometheus/ops [06:31:43] RECOVERY - Check size of conntrack table on ores1008 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:31:47] RECOVERY - configured eth on ores1008 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:32:07] PROBLEM - puppet last run on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:32:31] PROBLEM - puppet last run on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:32:39] RECOVERY - configured eth on ores2002 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:33:13] RECOVERY - Disk space on ores2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2002&var-datasource=codfw+prometheus/ops [06:33:31] RECOVERY - DPKG on ores2002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:33:37] RECOVERY - Check systemd state on ores2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:33:43] PROBLEM - puppet last run on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:33:43] PROBLEM - puppet last run on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:33:47] RECOVERY - Check whether ferm is active by checking the default input chain on ores2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:33:51] RECOVERY - dhclient process on ores2008 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:33:57] RECOVERY - Disk space on ores2008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2008&var-datasource=codfw+prometheus/ops [06:33:57] RECOVERY - MD RAID on ores2002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:34:05] RECOVERY - dhclient process on ores2002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:34:07] RECOVERY - Check systemd state on ores2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:09] RECOVERY - configured eth on ores2008 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:34:11] RECOVERY - Check size of conntrack table on ores2002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:34:15] RECOVERY - Disk space on ores1004 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1004&var-datasource=eqiad+prometheus/ops [06:34:27] RECOVERY - MD RAID on ores2008 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:34:29] RECOVERY - configured eth on ores1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:34:43] PROBLEM - puppet last run on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:34:49] RECOVERY - configured eth on ores2009 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:34:49] PROBLEM - puppet last run on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:34:51] RECOVERY - Check whether ferm is active by checking the default input chain on ores2008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:34:54] RECOVERY - DPKG on ores2008 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:34:59] RECOVERY - MD RAID on ores1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:35:01] RECOVERY - dhclient process on ores2009 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:35:11] RECOVERY - dhclient process on ores1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:35:11] RECOVERY - DPKG on ores1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:35:17] RECOVERY - Check size of conntrack table on ores2008 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:35:23] RECOVERY - Check whether ferm is active by checking the default input chain on ores1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:35:27] RECOVERY - Check size of conntrack table on ores2009 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:35:29] RECOVERY - Disk space on ores2009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2009&var-datasource=codfw+prometheus/ops [06:35:29] RECOVERY - DPKG on ores2009 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:35:37] RECOVERY - Check size of conntrack table on ores1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:35:37] RECOVERY - puppet last run on ores2006 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:35:47] RECOVERY - Check systemd state on ores1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:47] RECOVERY - MD RAID on ores2009 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:35:59] RECOVERY - puppet last run on ores1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:36:07] RECOVERY - Check systemd state on ores2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:36:27] (03PS1) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [06:36:35] RECOVERY - Check whether ferm is active by checking the default input chain on ores2009 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:36:53] RECOVERY - puppet last run on ores2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:37:19] RECOVERY - puppet last run on ores2008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:38:39] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [06:38:59] PROBLEM - Check the NTP synchronisation status of timesyncd on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:39:11] RECOVERY - Check whether ferm is active by checking the default input chain on ores1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:39:19] RECOVERY - MD RAID on ores2003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:39:27] RECOVERY - Disk space on ores2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2003&var-datasource=codfw+prometheus/ops [06:39:35] RECOVERY - puppet last run on ores2009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:39:39] RECOVERY - Check size of conntrack table on ores2003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:39:41] RECOVERY - DPKG on ores1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:39:45] RECOVERY - Check size of conntrack table on ores1002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:39:47] RECOVERY - MD RAID on ores1002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:39:51] RECOVERY - Check systemd state on ores2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:59] RECOVERY - Check systemd state on ores2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:59] RECOVERY - Disk space on 
ores2005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2005&var-datasource=codfw+prometheus/ops [06:40:07] RECOVERY - MD RAID on ores2005 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:40:13] RECOVERY - Check systemd state on ores1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:40:17] RECOVERY - dhclient process on ores2003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:40:19] RECOVERY - Check whether ferm is active by checking the default input chain on ores2003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:40:19] RECOVERY - dhclient process on ores1002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:40:19] RECOVERY - configured eth on ores2003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:40:27] RECOVERY - Disk space on ores1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1002&var-datasource=eqiad+prometheus/ops [06:40:31] RECOVERY - configured eth on ores2005 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:40:37] RECOVERY - puppet last run on ores2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:40:43] RECOVERY - configured eth on ores1002 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:41:07] RECOVERY - DPKG on ores2003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:41:09] RECOVERY - Check whether ferm is active by checking the default input chain on ores2005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:41:13] RECOVERY - dhclient process on ores2005 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:41:17] RECOVERY - DPKG on ores2005 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:41:29] RECOVERY - Check size of conntrack table on ores2005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:41:47] RECOVERY - dhclient process on ores2001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:42:01] RECOVERY - Disk space on ores2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2001&var-datasource=codfw+prometheus/ops [06:42:05] RECOVERY - dhclient process on ores1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:42:21] RECOVERY - puppet last run on ores1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:42:21] RECOVERY - MD RAID on ores2001 is OK: OK: Active: 4, Working: 4, 
Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:42:23] PROBLEM - ores_workers_running on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES [06:42:23] PROBLEM - ores_workers_running on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES [06:42:25] PROBLEM - ores_workers_running on ores2001 is CRITICAL: PROCS CRITICAL: 39 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:42:41] RECOVERY - DPKG on ores1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:42:44] (03PS2) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [06:42:45] RECOVERY - Check size of conntrack table on ores2001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:42:53] RECOVERY - configured eth on ores2001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:42:53] RECOVERY - Check whether ferm is active by checking the default input chain on ores1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:42:59] RECOVERY - Check size of conntrack table on ores1006 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:43:01] RECOVERY - Check systemd state on ores2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:43:03] RECOVERY - Check whether ferm is active by checking the default input chain on ores2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:43:09] RECOVERY - Check size of conntrack table on ores1001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:43:11] RECOVERY - configured eth on ores1001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:43:13] RECOVERY - DPKG on ores2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:43:17] PROBLEM - ores_workers_running on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES [06:43:17] PROBLEM - ores_workers_running on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES [06:43:19] PROBLEM - ores_workers_running on ores1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:43:23] RECOVERY - configured eth on ores1006 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:43:27] RECOVERY - Disk space on ores1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1006&var-datasource=eqiad+prometheus/ops [06:43:27] RECOVERY - MD RAID on ores1001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:43:27] RECOVERY - MD RAID on ores1006 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:43:39] RECOVERY - Disk space on ores1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1001&var-datasource=eqiad+prometheus/ops [06:43:41] PROBLEM - ores_workers_running on ores1001 is CRITICAL: PROCS CRITICAL: 75 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:43:47] RECOVERY - Check systemd state on ores1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:43:49] RECOVERY - Check systemd state on ores1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:43:51] PROBLEM - Check the NTP synchronisation status of timesyncd on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:44:01] PROBLEM - ores_workers_running on ores1006 is CRITICAL: PROCS CRITICAL: 61 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:44:11] RECOVERY - dhclient process on ores1006 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:44:11] RECOVERY - DPKG on ores1006 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:44:15] RECOVERY - puppet last run on ores1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:44:19] RECOVERY - ores_workers_running on ores2001 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:44:19] RECOVERY - Check whether ferm is active by checking the default input chain on ores1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:44:41] 10Operations, 10Wikimedia-Mailing-lists, 10serviceops: Allow list admins to train spam filters - https://phabricator.wikimedia.org/T244241 (10Aklapper) Oh darrn! Thanks Reedy! I never realized that this is a custom patch in GNOME's Mailman instance, sorry! Feel free to decline if this is too much maintenanc... 
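The ores_workers_running checks above are simple process-count probes over the ORES celery workers (the "Connection refused ... port 5666" variants just mean the NRPE agent on that host was unreachable at that moment, so the probe could not run at all). A rough manual equivalent, assuming the stock monitoring-plugins check_procs and an illustrative threshold rather than whatever is actually configured for ORES:

    # Count celery worker processes the way the alert text reports them.
    pgrep -c -f celery
    # check_procs-style version: critical if fewer than 90 celery processes
    # are running (the "90:" range is an assumed, illustrative threshold).
    /usr/lib/nagios/plugins/check_procs -C celery -c 90: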
[06:45:10] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [06:45:27] RECOVERY - puppet last run on ores2005 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:45:34] RECOVERY - ores_workers_running on ores1001 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:45:55] RECOVERY - ores_workers_running on ores1006 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:46:15] !log run puppet on all ores[12]* nodes [06:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:34] RECOVERY - puppet last run on ores2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:46:37] RECOVERY - puppet last run on ores1006 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:46:37] RECOVERY - puppet last run on ores1005 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:46:41] RECOVERY - DPKG on ores1005 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:46:45] RECOVERY - Disk space on ores1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1005&var-datasource=eqiad+prometheus/ops [06:46:45] RECOVERY - Check size of conntrack table on ores1005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:46:45] RECOVERY - Check whether ferm is active by checking the default input chain on ores1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:47:04] RECOVERY - dhclient process on ores1009 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:47:04] RECOVERY - DPKG on ores1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:47:04] RECOVERY - MD RAID on ores1005 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:47:17] RECOVERY - configured eth on ores1005 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:47:29] RECOVERY - DPKG on ores1009 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:47:31] RECOVERY - Check size of conntrack table on ores1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:47:33] RECOVERY - Check whether ferm is active by checking the default input chain on ores1009 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:47:41] RECOVERY - Check systemd state on ores1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:45] RECOVERY - Check systemd state on ores1003 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:49] RECOVERY - configured eth on ores1003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:47:51] RECOVERY - dhclient process on ores1005 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:47:55] RECOVERY - dhclient process on ores1003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:48:03] RECOVERY - Check whether ferm is active by checking the default input chain on ores1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:48:07] RECOVERY - Disk space on ores1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1003&var-datasource=eqiad+prometheus/ops [06:48:11] RECOVERY - MD RAID on ores1009 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:48:11] RECOVERY - Check systemd state on ores1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:48:15] RECOVERY - Check size of conntrack table on ores1009 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:48:21] RECOVERY - configured eth on ores1009 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:48:23] RECOVERY - MD RAID on ores1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:48:24] RECOVERY - puppet last run on ores1009 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:48:27] RECOVERY - Check systemd state on ores1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:48:31] RECOVERY - Disk space on ores1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1009&var-datasource=eqiad+prometheus/ops [06:48:51] RECOVERY - configured eth on ores2004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:48:57] RECOVERY - ores_workers_running on ores1003 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:48:57] RECOVERY - ores_workers_running on ores1009 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:48:57] RECOVERY - ores_workers_running on ores1007 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:49:19] RECOVERY - Disk space on ores2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2004&var-datasource=codfw+prometheus/ops [06:49:45] RECOVERY - puppet last run on ores2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:49:48] (03PS1) 10Marostegui: Revert "db1098: Disable 
notifications" [puppet] - 10https://gerrit.wikimedia.org/r/570523 [06:49:51] RECOVERY - Check systemd state on ores2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:49:51] RECOVERY - Check whether ferm is active by checking the default input chain on ores2004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:49:55] RECOVERY - ores_workers_running on ores1005 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:49:57] RECOVERY - dhclient process on ores2004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:50:03] RECOVERY - MD RAID on ores2004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:50:13] RECOVERY - DPKG on ores2004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:50:21] RECOVERY - Check size of conntrack table on ores2004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:51:45] (03CR) 10Andrew Bogott: "Despite jenkins objections about cross-module includes, I don't feel great about moving these rsync bits into a profile. Overriding the -" [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [06:51:47] RECOVERY - ores_workers_running on ores2004 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:51:55] (03CR) 10Marostegui: [C: 03+2] Revert "db1098: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/570523 (owner: 10Marostegui) [06:52:29] RECOVERY - puppet last run on ores1003 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:52:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1098:3317 - T239453', diff saved to https://phabricator.wikimedia.org/P10318 and previous config saved to /var/cache/conftool/dbconfig/20200206-065238-marostegui.json [06:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:42] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:59:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3317 - T239453', diff saved to https://phabricator.wikimedia.org/P10319 and previous config saved to /var/cache/conftool/dbconfig/20200206-065906-marostegui.json [06:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:10] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [07:00:01] RECOVERY - Check the NTP synchronisation status of timesyncd on ores2002 is OK: OK: synced at Thu 2020-02-06 06:59:59 UTC. https://wikitech.wikimedia.org/wiki/NTP [07:00:23] (03PS1) 10Marostegui: db1101: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570525 (https://phabricator.wikimedia.org/T239453) [07:00:47] RECOVERY - Check the NTP synchronisation status of timesyncd on ores1003 is OK: OK: synced at Thu 2020-02-06 07:00:46 UTC. 
https://wikitech.wikimedia.org/wiki/NTP [07:01:31] (03CR) 10Marostegui: [C: 03+2] db1101: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570525 (https://phabricator.wikimedia.org/T239453) (owner: 10Marostegui) [07:01:37] RECOVERY - Check the NTP synchronisation status of timesyncd on ores2009 is OK: OK: synced at Thu 2020-02-06 07:01:34 UTC. https://wikitech.wikimedia.org/wiki/NTP [07:09:51] RECOVERY - Check the NTP synchronisation status of timesyncd on ores2003 is OK: OK: synced at Thu 2020-02-06 07:09:49 UTC. https://wikitech.wikimedia.org/wiki/NTP [07:12:57] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:14:43] RECOVERY - Check the NTP synchronisation status of timesyncd on ores2004 is OK: OK: synced at Thu 2020-02-06 07:14:42 UTC. https://wikitech.wikimedia.org/wiki/NTP [07:22:08] PROBLEM - Check systemd state on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:22:10] PROBLEM - Check whether ferm is active by checking the default input chain on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:22:28] PROBLEM - MD RAID on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [07:22:56] PROBLEM - Check size of conntrack table on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [07:23:02] PROBLEM - puppet last run on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:23:06] PROBLEM - configured eth on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [07:31:12] PROBLEM - ores_workers_running on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES [07:33:23] (03PS1) 10Matthias Mullie: Revert "Remove handler deleted from the MachineVision extension on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570571 [07:39:42] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:41:30] (03PS1) 10Elukey: presto: enable kerberos and TLS for Analytics Presto [puppet] - 10https://gerrit.wikimedia.org/r/570573 [07:41:46] RECOVERY - MD RAID on ores2001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [07:42:28] RECOVERY - Check size of conntrack table on ores2001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [07:42:40] RECOVERY - configured eth on ores2001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [07:42:52] RECOVERY - Check systemd state on ores2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:42:56] RECOVERY - 
Check whether ferm is active by checking the default input chain on ores2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:43:30] RECOVERY - ores_workers_running on ores2001 is OK: PROCS OK: 89 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [07:44:27] (03CR) 10Elukey: [C: 03+2] presto: enable kerberos and TLS for Analytics Presto [puppet] - 10https://gerrit.wikimedia.org/r/570573 (owner: 10Elukey) [07:44:54] RECOVERY - puppet last run on ores2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:47:47] (03PS1) 10Marostegui: mariadb: Productionize dbproxy1012 [puppet] - 10https://gerrit.wikimedia.org/r/570574 (https://phabricator.wikimedia.org/T202367) [07:59:46] (03PS2) 10Marostegui: mariadb: Productionize dbproxy1012 [puppet] - 10https://gerrit.wikimedia.org/r/570574 (https://phabricator.wikimedia.org/T202367) [08:03:01] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize dbproxy1012 [puppet] - 10https://gerrit.wikimedia.org/r/570574 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [08:05:25] (03PS1) 10Elukey: presto: set correct discovery URI for Analytics Presto [puppet] - 10https://gerrit.wikimedia.org/r/570577 [08:09:30] (03CR) 10Elukey: [C: 03+2] presto: set correct discovery URI for Analytics Presto [puppet] - 10https://gerrit.wikimedia.org/r/570577 (owner: 10Elukey) [08:10:13] PROBLEM - MediaWiki centralauth errors on graphite1004 is CRITICAL: CRITICAL: 93.33% of data above the critical threshold [1.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1 [08:17:56] !log switchover selectively eventgate-analytics.discovery.wmnet to codfw for mw1331 and mw1348 to [08:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:02] !log restarting blazegraph on wdqs1006: T242453 [08:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:04] T242453: wdqs1005 stopped to handle updates - https://phabricator.wikimedia.org/T242453 [08:23:03] !log Reboot dbproxy1012 and dbproxy1014 for upgrade [08:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:29] (03CR) 10Alexandros Kosiaris: [C: 03+2] uwsgi: update check for buster [puppet] - 10https://gerrit.wikimedia.org/r/570490 (owner: 10Volans) [08:23:53] (03PS1) 10Elukey: profile::presto::server: use kerberos auth between worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/570579 [08:24:25] (03PS2) 10Alexandros Kosiaris: uwsgi: update check for buster [puppet] - 10https://gerrit.wikimedia.org/r/570490 (owner: 10Volans) [08:24:28] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] uwsgi: update check for buster [puppet] - 10https://gerrit.wikimedia.org/r/570490 (owner: 10Volans) [08:26:47] (03CR) 10Elukey: [C: 03+2] profile::presto::server: use kerberos auth between worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/570579 (owner: 10Elukey) [08:27:53] RECOVERY - MediaWiki centralauth errors on graphite1004 is OK: OK: Less than 30.00% above the threshold [0.5] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1 [08:30:35] jouncebot: next [08:30:35] In 3 hour(s) and 29 minute(s): European Mid-day SWAT(Max 6 patches) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T1200) [08:30:56] (03PS1) 10Marostegui: wmnet: Failover dbproxy1001 to dbproxy1014 [dns] - 10https://gerrit.wikimedia.org/r/570580 (https://phabricator.wikimedia.org/T202367) [08:32:15] (03PS1) 10Marostegui: dbproxy1012: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570581 (https://phabricator.wikimedia.org/T202367) [08:32:31] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Double tested on ores2001 as well (both with files existing and without). LGTM, merging" [puppet] - 10https://gerrit.wikimedia.org/r/570489 (owner: 10Volans) [08:32:40] (03PS4) 10Alexandros Kosiaris: uwsgi: fix removal of init.d links on buster [puppet] - 10https://gerrit.wikimedia.org/r/570489 (owner: 10Volans) [08:33:47] (03CR) 10Marostegui: [C: 03+2] dbproxy1012: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570581 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [08:38:08] (03PS2) 10Addshore: Enable EntitySourceBasedFederation for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566795 (https://phabricator.wikimedia.org/T243395) (owner: 10Ladsgroup) [08:39:02] going to deploy wdqs if there are no objections [08:39:10] (03PS1) 10Addshore: Enable EntitySourceBasedFederation for group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570583 (https://phabricator.wikimedia.org/T243395) [08:41:21] (03PS1) 10Addshore: wmgUseEntitySourceBasedFederation everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570584 (https://phabricator.wikimedia.org/T243395) [08:44:28] (03CR) 10Muehlenhoff: "The patch looks fine, but the more fundamental fix would be what's outlined in T222874. Yuvi's original patch is from 2016 where we still " [puppet] - 10https://gerrit.wikimedia.org/r/570489 (owner: 10Volans) [08:44:33] !log dcausse@deploy1001 Started deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b to wdqs1010.eqiad.wmnet [08:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:03] !log dcausse@deploy1001 Finished deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b to wdqs1010.eqiad.wmnet (duration: 00m 29s) [08:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:46] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10akosiaris) I 've tried to reproduce this. It's easily reproducible after all. Just do what logrotate does and issue `systemctl reload uwsgi-ores`. CPU usage spikes and reache... [08:49:49] PROBLEM - Check systemd state on ores2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:56:20] !log dcausse@deploy1001 Started deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b [08:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] site: add new ganeti hosts for refresh/expansion with spare role [puppet] - 10https://gerrit.wikimedia.org/r/570390 (https://phabricator.wikimedia.org/T228924) (owner: 10Dzahn) [08:57:13] (03PS3) 10Alexandros Kosiaris: site: add new ganeti hosts for refresh/expansion with spare role [puppet] - 10https://gerrit.wikimedia.org/r/570390 (https://phabricator.wikimedia.org/T228924) (owner: 10Dzahn) [08:57:25] PROBLEM - ores_workers_running on ores2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [08:58:35] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01117 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:03:21] (03PS1) 10Vgutierrez: install_server: Reimage cp3065 as buster [puppet] - 10https://gerrit.wikimedia.org/r/570585 (https://phabricator.wikimedia.org/T242093) [09:03:55] (03PS1) 10Elukey: profile::presto::server: use FQDN when workers communicate with the coord [puppet] - 10https://gerrit.wikimedia.org/r/570586 [09:04:28] (03CR) 10jerkins-bot: [V: 04-1] profile::presto::server: use FQDN when workers communicate with the coord [puppet] - 10https://gerrit.wikimedia.org/r/570586 (owner: 10Elukey) [09:06:37] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I'm on the verge here." [puppet] - 10https://gerrit.wikimedia.org/r/570330 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [09:07:08] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/570586 (owner: 10Elukey) [09:07:49] (03CR) 10Marostegui: [C: 03+2] "After our chat on IRC...proceeding" [dns] - 10https://gerrit.wikimedia.org/r/570580 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [09:08:01] !log dcausse@deploy1001 Finished deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b (duration: 11m 41s) [09:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:34] (03CR) 10Ema: [C: 03+1] install_server: Reimage cp3065 as buster [puppet] - 10https://gerrit.wikimedia.org/r/570585 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [09:08:50] (03CR) 10Vgutierrez: [C: 03+2] install_server: Reimage cp3065 as buster [puppet] - 10https://gerrit.wikimedia.org/r/570585 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [09:09:33] (03CR) 10Elukey: [C: 03+2] profile::presto::server: use FQDN when workers communicate with the coord [puppet] - 10https://gerrit.wikimedia.org/r/570586 (owner: 10Elukey) [09:10:13] vgutierrez: hola, can I merge yours too? 
[09:10:20] elukey: go ahead <3 [09:10:43] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] Release 8.0.5-1wm14 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/566814 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [09:14:41] (03PS1) 10Muehlenhoff: Switch mw2311 to stretch bootif tftpboot environment [puppet] - 10https://gerrit.wikimedia.org/r/570587 (https://phabricator.wikimedia.org/T244438) [09:16:28] (03CR) 10Muehlenhoff: [C: 03+2] Switch mw2311 to stretch bootif tftpboot environment [puppet] - 10https://gerrit.wikimedia.org/r/570587 (https://phabricator.wikimedia.org/T244438) (owner: 10Muehlenhoff) [09:23:33] (03PS1) 10Marostegui: dbproxy1001: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570588 (https://phabricator.wikimedia.org/T244463) [09:24:45] (03CR) 10Marostegui: [C: 03+2] dbproxy1001: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570588 (https://phabricator.wikimedia.org/T244463) (owner: 10Marostegui) [09:26:24] (03PS1) 10Jcrespo: Revert "uwsgi: fix removal of init.d links on buster" [puppet] - 10https://gerrit.wikimedia.org/r/570589 [09:27:00] jouncebot: next [09:27:00] In 2 hour(s) and 32 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T1200) [09:27:33] (03PS2) 10Jcrespo: Revert "uwsgi: fix removal of init.d links on buster" [puppet] - 10https://gerrit.wikimedia.org/r/570589 [09:33:08] 10Operations, 10Traffic: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 (10Vgutierrez) [09:33:32] (03PS1) 10Giuseppe Lavagetto: uwsgi: fix removal of init.d script (revisited) [puppet] - 10https://gerrit.wikimedia.org/r/570590 [09:33:55] (03PS3) 10Filippo Giunchedi: elasticsearch: cirrus logs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/570374 (https://phabricator.wikimedia.org/T225125) [09:35:13] (03CR) 10Filippo Giunchedi: "I'm not sure whether 'type' in appender properties is case sensitive or not (syslog vs Gelf)." 
[puppet] - 10https://gerrit.wikimedia.org/r/570374 (https://phabricator.wikimedia.org/T225125) (owner: 10Filippo Giunchedi) [09:36:42] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/570590 (owner: 10Giuseppe Lavagetto) [09:37:44] (03CR) 10Filippo Giunchedi: [C: 03+1] uwsgi: fix removal of init.d script (revisited) [puppet] - 10https://gerrit.wikimedia.org/r/570590 (owner: 10Giuseppe Lavagetto) [09:37:55] (03CR) 10Giuseppe Lavagetto: [C: 03+2] uwsgi: fix removal of init.d script (revisited) [puppet] - 10https://gerrit.wikimedia.org/r/570590 (owner: 10Giuseppe Lavagetto) [09:38:28] (03Abandoned) 10Jcrespo: Revert "uwsgi: fix removal of init.d links on buster" [puppet] - 10https://gerrit.wikimedia.org/r/570589 (owner: 10Jcrespo) [09:38:46] (03PS3) 10Muehlenhoff: Pass down MAC address of to installing system via BOOTIF on Buster [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) [09:39:22] (03PS4) 10Muehlenhoff: Pass down MAC address of the installing system via BOOTIF on Buster [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) [09:41:15] RECOVERY - Check systemd state on ores2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:43:29] <_joe_> uhm [09:43:51] (03PS1) 10Vgutierrez: ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) [09:43:51] RECOVERY - ores_workers_running on ores2001 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [09:44:01] <_joe_> so I can confirm the new puppet run works and it removed the init.d link on acmechief1001 [09:44:10] <_joe_> vgutierrez: you might want to verify you wanted that :P [09:44:29] _joe_: wut? what? [09:44:46] <_joe_> don't freak out [09:44:55] <_joe_> that's a part of our uwsgi puppet code [09:45:00] <_joe_> that was broken on buster [09:45:09] <_joe_> now it works [09:45:15] hmm uwsgi on buster has been working fine for acmecheif instances [09:45:17] <_joe_> so we remove the init.d script for uwsgi [09:45:24] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Lea_Lacroix_WMDE) Over the past weeks, we noticed a huge increase of content in Wikidata. Maybe that's... 
[09:46:14] ack, I'll check for possible side effects, thanks for pingin [09:46:17] *pinging [09:47:25] but it shouldn't be a big issue considering this [09:47:27] Loaded: loaded (/lib/systemd/system/uwsgi-acme-chief.service; enabled; vendor preset: enabled) [09:49:07] (03PS2) 10Vgutierrez: ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) [09:49:13] (03PS1) 10Muehlenhoff: Switch elastic* to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/570596 (https://phabricator.wikimedia.org/T156955) [09:51:14] (03CR) 10jerkins-bot: [V: 04-1] ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [09:52:06] (03PS3) 10Vgutierrez: ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) [09:52:10] vgutierrez: this is typically only an issue during upgrades of the uwsgi package (and I don't think we've had one on buster since acmechief was setup) [09:52:28] ack [09:52:30] (03PS4) 10Vgutierrez: ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) [09:52:49] (03CR) 10Gehel: "A few questions:" [puppet] - 10https://gerrit.wikimedia.org/r/570374 (https://phabricator.wikimedia.org/T225125) (owner: 10Filippo Giunchedi) [09:53:30] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/570596 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [09:54:21] (03CR) 10jerkins-bot: [V: 04-1] ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [09:54:41] sigh.. what's wrong with jerkins [09:56:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "IIRC there were some side effects of using LVM with readahead that Erik discovered (CC'ing) but other than that LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/570596 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [09:56:04] (yet another L8 issue) [09:56:06] (03PS5) 10Vgutierrez: ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) [09:57:45] (03CR) 10Vgutierrez: "pcc shows a NOOP on records.config: https://puppet-compiler.wmflabs.org/compiler1003/20651/" [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [09:58:32] (03CR) 10Filippo Giunchedi: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/570374 (https://phabricator.wikimedia.org/T225125) (owner: 10Filippo Giunchedi) [09:58:59] 10Operations, 10Traffic, 10Patch-For-Review: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 (10Vgutierrez) p:05Triage→03Medium [09:59:16] 10Operations, 10Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (10Krenair) Last night I tried moving cache-text05 to use the new puppet master. It doesn't seem to work just yet as for some reason it's (the puppetm... 
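The uwsgi exchange above is about cleaning up leftover SysV init.d bits on buster hosts where systemd already owns the service (the 09:47 paste shows the unit file, /lib/systemd/system/uwsgi-acme-chief.service, is what is actually loaded). A hand-run sketch of the same cleanup, with the init script name and paths assumed rather than taken from the puppet patch:

    # Confirm systemd, not SysV, is managing the service before touching anything.
    systemctl status uwsgi-acme-chief.service | grep Loaded
    # Look for stale leftovers from the old packaging, then remove them.
    ls -l /etc/init.d/uwsgi /etc/rc?.d/*uwsgi 2>/dev/null
    sudo update-rc.d -f uwsgi remove   # drops the rc*.d symlinks
    sudo rm -f /etc/init.d/uwsgi       # drops the init script itself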
[09:59:46] !log upload trafficserver 8.0.5-1wm14 to apt.wm.o (buster) - T242093 [09:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:49] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [09:59:52] (03CR) 10Gehel: [C: 03+1] "We do have an explicit udev rule for readahead which should still work with this update: https://github.com/wikimedia/puppet/blob/producti" [puppet] - 10https://gerrit.wikimedia.org/r/570596 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [10:00:05] (03CR) 10Ema: [C: 03+1] ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:00:11] !log depool and reimage cp3065 as buster - T242093 [10:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:33] (03CR) 10Vgutierrez: [C: 03+2] ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:06:05] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.004885 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:10:28] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp3065.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [10:11:38] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10ArielGlenn) >>! In T243701#5855352, @Lea_Lacroix_WMDE wrote: > Over the past weeks, we noticed a huge i... 
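The cp3065 entries above (depool, then wmf-auto-reimage from cumin1001) follow the usual cache-host reinstall pattern: take the host out of its LVS pools, reinstall it with the new OS, and only pool it again once it is healthy. A simplified outline; the host and task come from the log, while the depool/pool helpers and the reimage flags are assumptions about the tooling rather than the exact commands that were run:

    # On cp3065 itself: take it out of rotation first.
    sudo depool
    # From the cumin host: reinstall with the buster installer (flag assumed).
    sudo wmf-auto-reimage -p T242093 cp3065.esams.wmnet
    # Once the host looks healthy again, put it back in rotation.
    sudo pool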
[10:12:11] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/570596 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [10:13:09] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10JeanFred) [10:14:02] (03PS1) 10Vgutierrez: ATS: Enable KA between ats-tls and varnish-fe on cp4031 [puppet] - 10https://gerrit.wikimedia.org/r/570599 (https://phabricator.wikimedia.org/T244464) [10:15:42] (03PS1) 10Muehlenhoff: Switch logstash hosts to standard Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/570600 (https://phabricator.wikimedia.org/T156955) [10:16:08] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/20652/" [puppet] - 10https://gerrit.wikimedia.org/r/570599 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:17:28] (03CR) 10Ema: [C: 03+1] ATS: Enable KA between ats-tls and varnish-fe on cp4031 [puppet] - 10https://gerrit.wikimedia.org/r/570599 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:18:13] (03CR) 10Vgutierrez: [C: 03+2] ATS: Enable KA between ats-tls and varnish-fe on cp4031 [puppet] - 10https://gerrit.wikimedia.org/r/570599 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:19:21] !log Enabling HTTP keepalive between ats-tls and varnish-frontend on cp4031 - T244464 [10:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:24] T244464: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 [10:20:43] !log undo "switchover selectively eventgate-analytics.discovery.wmnet to codfw for mw1331 and mw1348" [10:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:08] !log undo "switchover selectively eventgate-analytics.discovery.wmnet to codfw for mw1331 and mw1348". no effect observed [10:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] standard: Add linux-perf to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/570254 (owner: 10Alexandros Kosiaris) [10:25:29] (03PS4) 10Jbond: realm global: make the realm variable a global in labs [puppet] - 10https://gerrit.wikimedia.org/r/570331 (https://phabricator.wikimedia.org/T244222) [10:27:28] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10JeanFred) >>>! In T243701#5834751, @jcrespo wrote: >> While I understand the need of "slowing bot edit... [10:28:56] (03PS10) 10Jbond: wmflib::require_domains: add new function to to replace require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570343 [10:32:06] (03PS1) 10Alexandros Kosiaris: site.pp: Add new ganeti codfw hosts as role::spare [puppet] - 10https://gerrit.wikimedia.org/r/570601 (https://phabricator.wikimedia.org/T224603) [10:34:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] dynamicproxy: urlproxy: introduce support for domain-based routing [puppet] - 10https://gerrit.wikimedia.org/r/565556 (https://phabricator.wikimedia.org/T234617) (owner: 10Arturo Borrero Gonzalez) [10:35:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "thanks! 
way more elegant solution that my initial approach, which involved updating the regex." [puppet] - 10https://gerrit.wikimedia.org/r/570433 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [10:41:08] (03PS1) 10Vgutierrez: install_server: Fix cescout syntax error [puppet] - 10https://gerrit.wikimedia.org/r/570602 [10:41:19] it looks like we need a linter for netboot.cfg :) [10:42:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570602 (owner: 10Vgutierrez) [10:43:15] (03CR) 10Vgutierrez: [C: 03+2] install_server: Fix cescout syntax error [puppet] - 10https://gerrit.wikimedia.org/r/570602 (owner: 10Vgutierrez) [10:44:18] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3065.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3065.esams.wmnet'] ` [10:45:49] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp3065.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002061045_vgutie... [10:47:31] <_joe_> vgutierrez: if confd fails, please wait before rerunning puppet [10:47:55] on cp3065? [10:47:59] ack [10:48:03] (03PS5) 10Jbond: puppet_compiler: add rich_data support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/557050 [10:48:15] I'll let you know [10:48:24] but yesterday we only had issues on a text node, cp3065 is upload [10:55:04] <_joe_> that shouldn't change [11:06:08] 10Operations, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10hnowlan) Unfortunately it seems I don't have permissions to issue commands. I attempted to downtime a service on a host that's n... [11:11:54] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [11:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:04] Updating cxserver.. [11:12:10] 10Operations, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10MoritzMuehlenhoff) @hnowlan There's an error in the username configured in https://gerrit.wikimedia.org/r/566823, let me fix tha... [11:13:18] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2020-02-05-051751-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/570515 (https://phabricator.wikimedia.org/T244230) (owner: 10KartikMistry) [11:13:38] (03Merged) 10jenkins-bot: Update cxserver to 2020-02-05-051751-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/570515 (https://phabricator.wikimedia.org/T244230) (owner: 10KartikMistry) [11:14:10] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:54] akosiaris: Oh, I see some changes pending on cxserver, safe to upgrade? 
[11:18:16] (03PS1) 10Muehlenhoff: Fix username in Icinga authorization config for Hugh [puppet] - 10https://gerrit.wikimedia.org/r/570611 (https://phabricator.wikimedia.org/T242309) [11:18:23] akosiaris: charts 0.0.9->0.0.11 [11:18:48] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3065.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3065.esams.wmnet'] ` [11:22:50] kart_: lemme have a look [11:23:38] (03CR) 10Muehlenhoff: [C: 03+2] Fix username in Icinga authorization config for Hugh [puppet] - 10https://gerrit.wikimedia.org/r/570611 (https://phabricator.wikimedia.org/T242309) (owner: 10Muehlenhoff) [11:23:48] <_joe_> kart_: that might have been me, sorry, taking a look as well [11:24:01] <_joe_> I think I added the ability to terminate TLS [11:24:22] <_joe_> but I didn't deploy TLS to production still so I didn't do the deploy [11:25:21] <_joe_> lemme check [11:25:23] (03PS1) 10Ema: tlsproxy::localssl: allow setting keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570612 (https://phabricator.wikimedia.org/T241145) [11:25:25] (03PS1) 10Ema: profile::mediawiki::webserver: increase nginx keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570613 (https://phabricator.wikimedia.org/T241145) [11:25:52] <_joe_> uhm no it wasn't me :) [11:25:57] _joe_: Sure. I'm not sure how-to handle it, akosiaris should be able to find way here :) [11:26:16] _joe_: looks like you did that in December, and seems unrelated by log. [11:26:19] 10Operations, 10Traffic, 10Inuka-Team (Kanban), 10MW-1.35-notes (1.35.0-wmf.16; 2020-01-21), 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (10phuedx) >>! In T238029#5819088, @SBisson wrote: > A very special thanks to @phuedx for reviewing a... [11:26:45] 0.0.10 -> 0.0.11 is 4f179cf333486e05d930d7da98c7dbe945c28d6f and is just a removal of comments [11:27:25] and 0.0.9 -> 0.0.10 is dae344baf57a995f4239ed4b11f3e4ba4628c3ed and is just minor changes to NOTES.txt [11:27:36] kart_: go ahead, both by me, both innocuous [11:27:47] akosiaris: cool. Thanks! [11:28:06] (03CR) 10Giuseppe Lavagetto: [C: 03+1] tlsproxy::localssl: allow setting keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570612 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [11:28:36] !log kartik@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [11:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:42] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "let's test this on the canaries first, that would be role::mediawiki::canary_appserver and role::mediawiki::appserver::canary_api" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570613 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [11:31:22] !log kartik@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [11:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:01] !log upload etherpad-lite_1.8.0-1 to apt.wikimedia.org buster-wikimedia/main [11:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:05] !log kartik@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . 
[11:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:32] !log Updated cxserver to 2020-02-05-051751-production (T244230, T234323) [11:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:36] T234323: Load a single section in Content translation's editor - https://phabricator.wikimedia.org/T234323 [11:38:36] T244230: Add Yandex translation for Chuvash - https://phabricator.wikimedia.org/T244230 [11:41:05] !log upgrade etherpad-lite on etherpad1002 to 1.8.0-1 [11:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:54] (03CR) 10Arturo Borrero Gonzalez: "Thanks for working on this! Somme comments inline." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [11:44:39] 10Operations, 10observability, 10serviceops: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (10jijiki) [11:50:27] (03PS3) 10Cparle: Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570287 (https://phabricator.wikimedia.org/T241072) (owner: 10Matthias Mullie) [11:51:13] (03PS4) 10Cparle: Remove handler deleted from the MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566860 (https://phabricator.wikimedia.org/T241242) (owner: 10Matthias Mullie) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T1200). [12:00:04] kart_, matthiasmullie, cparle, and addshore: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:13] * kart_ is here [12:00:28] Anyone want to deploy my patch or should I go ahead? [12:00:40] kart_: if you want, I can do so - but feel free to deploy yourself! [12:01:04] Urbanecm: Please deploy my patch :) [12:01:08] will do! [12:01:22] oh, I see 9 patches in SWAT. [12:01:48] (03PS2) 10Urbanecm: Enable CX in te, kn, gu, mr and pawiki as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570051 (https://phabricator.wikimedia.org/T243271) (owner: 10KartikMistry) [12:01:51] we'll deploy whatever we have time for [12:01:58] I can deploy mine [12:01:58] that sounds like a lot of patches [12:01:59] o/ [12:02:00] OK! [12:02:06] * addshore can deploy mine [12:02:12] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570051 (https://phabricator.wikimedia.org/T243271) (owner: 10KartikMistry) [12:02:35] cormacparle__: please +2 your backport(s), so we don't have to wait for jenkins later on [12:02:41] addshore: ^^ [12:02:44] already done [12:02:47] already merged ;) [12:02:48] cool! [12:03:07] (03Merged) 10jenkins-bot: Enable CX in te, kn, gu, mr and pawiki as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570051 (https://phabricator.wikimedia.org/T243271) (owner: 10KartikMistry) [12:03:26] kart_: please test on mwdebug1001 and lmk [12:03:34] OK! [12:06:08] Urbanecm: looks good. Go ahead! 
[12:06:15] deploying [12:06:49] (03PS5) 10Matthias Mullie: Remove handler deleted from the MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566860 (https://phabricator.wikimedia.org/T241242) [12:07:38] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 5e1cbb2: Enable CX in te, kn, gu, mr and pawiki as a default tool (T243271, T243272, T243273, T243274, T243275) (duration: 01m 09s) [12:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:46] T243273: Enable Content Translation in Gujarati Wikipedia as a default tool - https://phabricator.wikimedia.org/T243273 [12:07:46] T243274: Enable Content Translation in Marathi Wikipedia as a default tool - https://phabricator.wikimedia.org/T243274 [12:07:46] T243271: Enable Content Translation in Telugu Wikipedia as a default tool - https://phabricator.wikimedia.org/T243271 [12:07:47] T243275: Enable Content Translation in Punjabi Wikipedia as a default tool - https://phabricator.wikimedia.org/T243275 [12:07:47] T243272: Enable Content Translation in Kannada Wikipedia as a default tool - https://phabricator.wikimedia.org/T243272 [12:08:03] kart_: done [12:08:10] cormacparle__: go ahead [12:08:22] ok cool [12:08:35] Urbanecm: thanks a lot! [12:08:38] yw [12:12:43] 10Operations, 10Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (10jbond) I have preformed the following actions * copied the CA from `deployment-puppetmaster03` to `deployment-puppetmaster04` * on `deployment-pu... [12:14:04] 10Operations, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10hnowlan) Moritz clarified how case sensitive logins affect Icinga - I've since logged in as Hnowlan and I can confirm I can run... [12:14:04] (03PS4) 10Krinkle: mediawiki: Add reqId/file/line to php7-fatal-error.php's 'message' field [puppet] - 10https://gerrit.wikimedia.org/r/554599 [12:15:50] (03PS5) 10Krinkle: mediawiki: Add reqId/file/line to php7-fatal-error.php's 'message' field [puppet] - 10https://gerrit.wikimedia.org/r/554599 [12:15:58] (03CR) 10Cparle: [C: 03+2] Remove handler deleted from the MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566860 (https://phabricator.wikimedia.org/T241242) (owner: 10Matthias Mullie) [12:17:04] (03Merged) 10jenkins-bot: Remove handler deleted from the MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566860 (https://phabricator.wikimedia.org/T241242) (owner: 10Matthias Mullie) [12:18:06] 10Operations, 10serviceops: Test and deploy mcrouter 0.41 - https://phabricator.wikimedia.org/T244476 (10jijiki) [12:18:07] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/562852 (owner: 10Jbond) [12:18:21] cormacparle__: matthiasmullie give me a ping when yours are done :) [12:18:42] addshore: will do! 
[12:19:49] 10Operations, 10serviceops: Test and deploy mcrouter 0.41 - https://phabricator.wikimedia.org/T244476 (10jijiki) [12:19:55] 10Operations, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T240684 (10jijiki) [12:24:54] !log cparle@deploy1001 Synchronized php-1.35.0-wmf.18/extensions/MachineVision: Use the wbsetclaim API to add depicts statements (duration: 01m 09s) [12:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:39] !log remove full-duplex statement from eqsin Tata link (not supported on Junos 18, as 10G is full duplex anyway) [12:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:25] !log cparle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove handler deleted from the MachineVision extension (duration: 01m 05s) [12:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:53] (03PS4) 10Matthias Mullie: Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570287 (https://phabricator.wikimedia.org/T241072) [12:29:14] (03CR) 10Jbond: [C: 03+2] ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - 10https://gerrit.wikimedia.org/r/562852 (owner: 10Jbond) [12:30:40] (03CR) 10Cparle: [C: 03+2] Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570287 (https://phabricator.wikimedia.org/T241072) (owner: 10Matthias Mullie) [12:31:11] (03PS2) 10Matthias Mullie: Revert "Remove handler deleted from the MachineVision extension on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570571 [12:31:40] (03Merged) 10jenkins-bot: Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570287 (https://phabricator.wikimedia.org/T241072) (owner: 10Matthias Mullie) [12:33:57] (03CR) 10Cparle: [C: 03+2] Revert "Remove handler deleted from the MachineVision extension on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570571 (owner: 10Matthias Mullie) [12:34:16] !log cparle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Re-enable delayed new upload jobs for MachineVision extension (duration: 01m 08s) [12:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:50] (03Merged) 10jenkins-bot: Revert "Remove handler deleted from the MachineVision extension on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570571 (owner: 10Matthias Mullie) [12:36:20] ok all done - addshore you're up [12:36:36] sweet [12:36:50] (03PS3) 10Addshore: Enable EntitySourceBasedFederation for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566795 (https://phabricator.wikimedia.org/T243395) (owner: 10Ladsgroup) [12:36:52] (03CR) 10Addshore: [C: 03+2] Enable EntitySourceBasedFederation for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566795 (https://phabricator.wikimedia.org/T243395) (owner: 10Ladsgroup) [12:37:01] (03PS2) 10Addshore: Enable EntitySourceBasedFederation for group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570583 (https://phabricator.wikimedia.org/T243395) [12:37:10] (03PS2) 10Addshore: wmgUseEntitySourceBasedFederation everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570584 (https://phabricator.wikimedia.org/T243395) [12:37:52] (03Merged) 10jenkins-bot: Enable 
EntitySourceBasedFederation for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566795 (https://phabricator.wikimedia.org/T243395) (owner: 10Ladsgroup) [12:39:41] (03PS1) 10Vgutierrez: ATS: Allow configuring via hiera server session sharing settings [puppet] - 10https://gerrit.wikimedia.org/r/570622 (https://phabricator.wikimedia.org/T244464) [12:39:50] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation for group0 T243395 (duration: 01m 07s) [12:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:54] T243395: wmgUseEntitySourceBasedFederation true for Wikimedia clients (all sites) - https://phabricator.wikimedia.org/T243395 [12:40:14] !log pooling cp3065 - T242093 [12:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:17] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [12:42:41] (03CR) 10Filippo Giunchedi: "Thanks for looking into this!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570600 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [12:43:17] (03CR) 10Filippo Giunchedi: [C: 03+1] Pass down MAC address of the installing system via BOOTIF on Buster [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) (owner: 10Muehlenhoff) [12:43:37] (03PS2) 10Vgutierrez: ATS: Allow configuring via hiera server session sharing settings [puppet] - 10https://gerrit.wikimedia.org/r/570622 (https://phabricator.wikimedia.org/T244464) [12:44:41] !log addshore@deploy1001 sync-file aborted: Fetch central babel information over SQL query, not API (T243726) (duration: 01m 04s) [12:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:44] T243726: Babel should get cross-wiki languages via DB instead of making an HTTP request - https://phabricator.wikimedia.org/T243726 [12:45:49] (03PS5) 10Muehlenhoff: Pass down MAC address of the installing system via BOOTIF on Buster [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) [12:46:03] (03CR) 10Vgutierrez: "pcc shows a NOOP on records.config (except for the removal of an empty line): https://puppet-compiler.wmflabs.org/compiler1002/20656/" [puppet] - 10https://gerrit.wikimedia.org/r/570622 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [12:46:20] reverting that last one [12:46:20] (03CR) 10Ema: "pcc here: https://puppet-compiler.wmflabs.org/compiler1001/20655/" [puppet] - 10https://gerrit.wikimedia.org/r/570612 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [12:46:50] (03PS1) 10Polishdeveloper: Clean-up decommisioned Print schema configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570625 (https://phabricator.wikimedia.org/T196159) [12:46:51] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.18/extensions/Babel: REVERT Fetch central babel information over SQL query, not API (T243726) (duration: 01m 07s) [12:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:01] will investigate that after swat... 
[12:47:27] (03CR) 10Addshore: [C: 03+2] Enable EntitySourceBasedFederation for group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570583 (https://phabricator.wikimedia.org/T243395) (owner: 10Addshore) [12:47:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:47:46] ^^ that was me and already fixed (reverted the change) [12:48:20] (03Merged) 10jenkins-bot: Enable EntitySourceBasedFederation for group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570583 (https://phabricator.wikimedia.org/T243395) (owner: 10Addshore) [12:49:21] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:51:03] (03PS3) 10Vgutierrez: ATS: Allow configuring via hiera server session sharing settings [puppet] - 10https://gerrit.wikimedia.org/r/570622 (https://phabricator.wikimedia.org/T244464) [12:52:45] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation for group1 T243395 (duration: 01m 06s) [12:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:48] T243395: wmgUseEntitySourceBasedFederation true for Wikimedia clients (all sites) - https://phabricator.wikimedia.org/T243395 [12:54:47] (03PS1) 10Addshore: Revert "Enable EntitySourceBasedFederation for group1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570627 [12:55:22] (03CR) 10Addshore: [C: 03+2] Revert "Enable EntitySourceBasedFederation for group1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570627 (owner: 10Addshore) [12:56:08] (03CR) 10Ema: [C: 03+2] tlsproxy::localssl: allow setting keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570612 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [12:56:19] (03Merged) 10jenkins-bot: Revert "Enable EntitySourceBasedFederation for group1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570627 (owner: 10Addshore) [12:56:53] (03PS4) 10Vgutierrez: ATS: Allow configuring via hiera server session sharing settings [puppet] - 10https://gerrit.wikimedia.org/r/570622 (https://phabricator.wikimedia.org/T244464) [12:57:17] (03CR) 10Addshore: [C: 04-2] "T244479 needs fixing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570584 (https://phabricator.wikimedia.org/T243395) (owner: 10Addshore) [12:58:03] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: REVERT Enable EntitySourceBasedFederation for group1 T243395, due to T244479 (duration: 01m 07s) [12:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:07] T243395: wmgUseEntitySourceBasedFederation true for Wikimedia clients (all sites) - https://phabricator.wikimedia.org/T243395 [12:58:07] T244479: Argument 5 passed to Wikibase\Lexeme\Search\Elastic\LexemeSearchEntity::__construct() must be an instance of Wikibase\Lib\Store\PrefetchingTermLookup, instance of Wikibase\DataAccess\ByTypeDispatchingPrefetchingTermLookup given, called in /srv/mediawiki/php-1.35.0-wmf.18/extensions/WikibaseLexemeCirrusSearch/WikibaseSearch.entitytypes.repo.php on line 41 - 
https://phabricator.wikimedia.org/T244479 [12:59:38] !log deactivate BGP transits on cr2-eqsin [12:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:43] (03CR) 10Vgutierrez: "100% NOOP now: https://puppet-compiler.wmflabs.org/compiler1002/20657/" [puppet] - 10https://gerrit.wikimedia.org/r/570622 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [13:00:06] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: resync REVERT Enable EntitySourceBasedFederation for group1 (duration: 01m 07s) [13:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:45] !log SWAT done [13:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:47] !log reboot cr2-eqsin for sw upgrade [13:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:54] (03PS2) 10Ema: profile::mediawiki::webserver: increase canary keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570613 (https://phabricator.wikimedia.org/T241145) [13:03:05] 10Operations, 10Traffic: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 (10Vgutierrez) @bblack confirmed that we can drop RSA support on ncredir during the All Hands [13:03:16] and it's booting up [13:03:17] Seems like I've been hit by something now occuring at eqsin [13:04:02] (03CR) 10Ema: [C: 03+1] ATS: Allow configuring via hiera server session sharing settings [puppet] - 10https://gerrit.wikimedia.org/r/570622 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [13:05:20] (03PS1) 10Vgutierrez: ncredir: Drop RSA support [puppet] - 10https://gerrit.wikimedia.org/r/570629 (https://phabricator.wikimedia.org/T243391) [13:05:42] PROBLEM - LVS HTTP IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:05:44] (03CR) 10Vgutierrez: [C: 03+2] ATS: Allow configuring via hiera server session sharing settings [puppet] - 10https://gerrit.wikimedia.org/r/570622 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [13:05:49] PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:05:58] <_joe_> XioNoX: ^^ [13:06:00] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /private-info/info.json (private tile service info for osm-intl) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /v4/marker/pin-m+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m+ffffff@2x.png (Untitled test) timed [13:06:00] onse was received: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /osm-intl/info.json (tile service info for osm-intl) timed out before a response was received: /_info (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:06:03] XioNoX: I guess that's expected? 
:) [13:06:09] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:06:10] yepp looking [13:06:13] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 78, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:06:17] the host is coming back up [13:06:18] PROBLEM - LVS HTTPS IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:06:21] PROBLEM - LVS HTTPS IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:06:25] but no reason that triggerd [13:06:41] PROBLEM - LVS HTTP IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:06:51] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.eqsin.wikimedia.org, port=443): Read timed out. (read timeout=15),): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [13:06:59] * vgutierrez orders a t-shirt for XioNoX O:) [13:07:04] <_joe_> should we depool eqsin? [13:07:19] here if needed [13:07:23] loading for me, but bit slower [13:07:37] RECOVERY - LVS HTTP IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 432 bytes in 7.607 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:07:37] RECOVERY - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:07:37] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 246 probes of 604 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:07:39] <_joe_> revi: ack :) [13:07:45] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:07:47] <_joe_> it's all coming back anyways [13:07:50] * apergos peeking in and following along [13:08:01] yeah, wtf [13:08:06] RECOVERY - LVS HTTPS IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 843 bytes in 0.949 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:08:07] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:08:07] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:08:07] * akosiaris around as well, since the page [13:08:09] RECOVERY - LVS HTTPS IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15066 bytes in 1.647 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:08:12] my friend had more PITA time 
getting disturbed while doing CU lol [13:08:13] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:08:13] :P [13:08:20] XioNoX: BGP coalescing? [13:08:20] everything is back to normal afaik [13:08:25] * effie o/ [13:08:27] RECOVERY - LVS HTTP IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.470 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:08:40] this was just a reboot of cr4 ? [13:09:03] ok cool [13:09:06] akosiaris: reboot of cr2-eqsin, which was vrrp backup, doesn't have the transport links, and had its transits disabled [13:09:27] weird [13:09:42] exactly [13:10:05] no traffic should be flowing through it from what you say. How about monitoring? would it flow via it for whatever reason ? [13:10:21] !log rollback: deactivate BGP transits on cr2-eqsin [13:10:22] traffic would still be coming in its transits [13:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:29] akosiaris: finishing up and will look at it [13:10:56] but that wouldn't explain the alerts that look more like transport flap, since we lost reachability from the core dcs [13:10:58] (03CR) 10Ema: "pcc follows." [puppet] - 10https://gerrit.wikimedia.org/r/570613 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [13:11:24] it did have the backup transport tunnel, is it possible we were somehow routing over that? [13:11:49] (03CR) 10Ema: profile::mediawiki::webserver: increase canary keepalive_requests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570613 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [13:12:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] site.pp: Add new ganeti codfw hosts as role::spare [puppet] - 10https://gerrit.wikimedia.org/r/570601 (https://phabricator.wikimedia.org/T224603) (owner: 10Alexandros Kosiaris) [13:12:26] bblack: nah, it was going over the main one ( https://librenms.wikimedia.org/graphs/to=1580994600/id=16794/type=port_bits/from=1580908200/ ) [13:12:57] (03PS1) 10Vgutierrez: ATS: Allow server session sharing by ip on ats-tls in cp4031 [puppet] - 10https://gerrit.wikimedia.org/r/570633 (https://phabricator.wikimedia.org/T244464) [13:13:35] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 5 probes of 604 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:13:44] did it have pybal traffic perhaps? [13:14:05] smokeping was fine https://smokeping.wikimedia.org/?target=eqsin.Hosts.bast5001 [13:14:33] heh, we never did finish switching all the pybals to dual bgp sessions [13:14:46] mark: ah maybe, I forgot to fail pybal over manually [13:14:49] damn [13:14:51] in eqsin, 5001->cr1, 500[23]->cr2 [13:15:07] so text should've been ok, but upload would've failed, due to that?
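In the exchange above, each eqsin pybal peers with only one of the two routers (lvs5001 with cr1, lvs500[23] with cr2), so rebooting cr2 withdrew the routes for one of the services. That is what the "pybal to both routers" patches later in this log address. A rough pybal.conf sketch of the idea follows; the key names and especially the list syntax for multiple peers are assumptions, not the exact production configuration:

    [global]
    bgp = yes
    # illustrative ASN, not the production value
    bgp-local-asn = 64600
    # assumed list syntax: peer with both routers so that rebooting one of them
    # no longer withdraws the LVS service routes
    bgp-peer-address = ['10.132.0.2', '10.132.0.3']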
[13:15:23] we had alerts for both though [13:15:39] (03CR) 10Giuseppe Lavagetto: [C: 03+1] profile::mediawiki::webserver: increase canary keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570613 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [13:16:17] although only v6 for text, maybe [13:17:03] also the icinga checks go through the transports so it can't be due to transit BGP convergence [13:17:37] (03CR) 10Ema: [C: 03+1] ATS: Allow server session sharing by ip on ats-tls in cp4031 [puppet] - 10https://gerrit.wikimedia.org/r/570633 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [13:17:50] (03CR) 10Vgutierrez: [C: 03+2] ATS: Allow server session sharing by ip on ats-tls in cp4031 [puppet] - 10https://gerrit.wikimedia.org/r/570633 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [13:18:39] 10Operations, 10Puppet: confd fails to start after a reimage - https://phabricator.wikimedia.org/T244477 (10jbond) p:05Triage→03Medium [13:18:41] cr2-eqsin is now fully healthy [13:19:03] let's fix the pybal->both thing in eqsin before cr1? [13:19:41] bblack: I'm not doing cr1 today [13:19:43] but yeah :) [13:20:05] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:20:23] going to do cr3-knams now [13:22:04] (03PS1) 10Jbond: sslcert: ensure we run update-ca-certificates managing any services [puppet] - 10https://gerrit.wikimedia.org/r/570637 (https://phabricator.wikimedia.org/T244477) [13:22:19] (03CR) 10Ema: [C: 03+2] profile::mediawiki::webserver: increase canary keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570613 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [13:22:32] !log Enable server session sharing on ats-tls in cp4031 - T244464 [13:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:36] T244464: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 [13:27:29] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:27:35] !log deactivate BGP transits on cr3-knams [13:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:50] !log depool mw1347 to test some mcrouter settings [13:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:38] !log reboot cr3-knams [13:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:52] !log repool mw1347 with mcrouter running with 10 proxy threads (was: 5) [13:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:37] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:35:41] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:35:55] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:36:35] 
PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:38:25] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:38:25] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:39:09] (03PS1) 10Ema: profile::mediawiki::webserver: increase keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570640 (https://phabricator.wikimedia.org/T241145) [13:39:19] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:39:23] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:39:37] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:40:05] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:41:46] 10Operations: Integrate Stretch 9.6 point update - https://phabricator.wikimedia.org/T209260 (10MoritzMuehlenhoff) 05Open→03Resolved This is done for quite a while, closing. [13:43:19] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [13:45:49] !log rollback deactivate BGP transits on cr3-knams [13:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:12] (03CR) 10Ema: [C: 03+1] ncredir: Drop RSA support [puppet] - 10https://gerrit.wikimedia.org/r/570629 (https://phabricator.wikimedia.org/T243391) (owner: 10Vgutierrez) [13:50:09] (03PS1) 10Addshore: Revert "Revert "Enable EntitySourceBasedFederation for group1"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570642 [13:50:25] (03PS2) 10Addshore: Enable EntitySourceBasedFederation for group1 again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570642 [13:50:49] (03PS3) 10Addshore: wmgUseEntitySourceBasedFederation everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570584 (https://phabricator.wikimedia.org/T243395) [13:51:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1019 for onsite maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10321 and previous config saved to /var/cache/conftool/dbconfig/20200206-135157-marostegui.json [13:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:01] T243963: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 [13:53:14] (03PS1) 10Marostegui: es1019: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570643 (https://phabricator.wikimedia.org/T243963) [13:53:48] !log depool eqiad eventgate-analytics for testing purposes. Requests will flow to codfw, monitoring https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-30m&to=now for issues. 
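The depool announced just above shows up moments later in the SAL as a conftool action against the eventgate-analytics discovery record. A hedged sketch of the matching confctl invocation; the exact flags are not in the log, so treat the command shape as an assumption:

    # depool the eqiad side of the discovery record so resolvers hand out codfw
    sudo confctl --object-type discovery select 'dnsdisc=eventgate-analytics,name=eqiad' set/pooled=false
    # and the corresponding repool once the experiment is over
    sudo confctl --object-type discovery select 'dnsdisc=eventgate-analytics,name=eqiad' set/pooled=true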
[13:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:51] _joe_: ^ [13:54:09] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics [13:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:15] <_joe_> akosiaris: I'm at lunch though :P [13:54:21] (03CR) 10Marostegui: [C: 03+2] es1019: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570643 (https://phabricator.wikimedia.org/T243963) (owner: 10Marostegui) [13:54:28] I 'll revert immediately if everything breaks, no worries [13:54:46] and restart php-fpms in a rolling fashion but I hope it doesn't come to it [13:55:11] !log Stop MySQL on es1019, upgrade and poweroff for on-site maintenance - T243963 [13:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:33] (03Abandoned) 10Filippo Giunchedi: elasticsearch: cirrus logs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/570374 (https://phabricator.wikimedia.org/T225125) (owner: 10Filippo Giunchedi) [13:58:48] <_joe_> i see a small bump in p95, which is expected [13:59:24] it's within normal parameters yet though [13:59:37] I mean we had more 5m before that [13:59:40] <_joe_> sure, that's what I am saying [13:59:42] also, weren't you at lunch ? [13:59:48] go have your lunch in peace [13:59:52] <_joe_> :D sure ttyl [14:01:12] akosiaris: is the current increase in api_appserver avg response time related to your tests? [14:01:25] I'm looking at https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-m [14:02:23] ema: probably not [14:02:29] it was like that before I started [14:03:11] appservers should also be affected btw, but I see nothing there [14:03:26] akosiaris: ack so probably an api issue? [14:03:44] looks like it? [14:04:06] I don't see elevated request rates though, neither memcached rates being elevated [14:04:24] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Marostegui) @Cmjohnson es1019 is off. Once you are done, just start it back and I will it from there. Thank you! [14:04:34] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) @Jclark-ctr @wiki_willy @Cmjohnson any rough estimation on when we can expect these hosts to be online? As I said, I am... 
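The es1019 maintenance above (depool via a dbctl commit, then stop MySQL and power off for the on-site work) follows the usual depool-then-commit pattern. A sketch assuming the standard dbctl subcommands were used; the log only shows the resulting commit message:

    # mark the instance as depooled in the dbctl object store
    sudo dbctl instance es1019 depool
    # write the change out to the MediaWiki database configuration
    sudo dbctl config commit -m 'Depool es1019 for onsite maintenance T243963'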
[14:04:41] I am watching both RED appserver cluster and api_appserver cluster [14:04:45] <_joe_> no, it's the canary api [14:04:53] <_joe_> https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=All [14:05:02] <_joe_> ema: I think it was related to your change somehow [14:05:12] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [14:05:39] it does look unrelated, but I think I am gonna revert, just to be on the safe side [14:05:52] <_joe_> akosiaris: you change took effect way too late to be the cause [14:05:56] <_joe_> and only on the canaries [14:05:59] _joe_: interesting, reverting [14:06:13] ok, if ema is going to revert, probably better we don't both revert. [14:06:16] <_joe_> ema: wait [14:06:21] * ema waits [14:06:24] <_joe_> it's apparently recovering [14:06:35] <_joe_> or not [14:06:36] (03CR) 10Jhedden: Keystone: rotate and sync fernet tokens (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [14:06:43] also interesting that appserver canaries aren't affected, only api [14:06:54] <_joe_> yeah no, better to revert :/ [14:07:26] _joe_: ok to revert only for api canaries, or do you want me to revert both? [14:07:32] (03CR) 10Elukey: WIP - Introduce profile::mariadb::misc::analytics (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [14:07:37] <_joe_> just for the apis for now [14:07:44] alright, on it [14:07:48] <_joe_> we can try to find out what's going on afterwards [14:07:56] sorry for interrupting your lunch [14:08:35] (03PS2) 10Elukey: WIP - Introduce profile::mariadb::misc::analytics [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) [14:11:08] (03PS1) 10Ema: profile::mediawiki::webserver: revert api canary keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570655 (https://phabricator.wikimedia.org/T241145) [14:12:50] (03Abandoned) 10Jbond: wmflib::end_with: create String.end_with function [puppet] - 10https://gerrit.wikimedia.org/r/570330 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [14:13:03] (03PS5) 10Jbond: realm global: make the realm variable a global in labs [puppet] - 10https://gerrit.wikimedia.org/r/570331 (https://phabricator.wikimedia.org/T244222) [14:13:18] (03CR) 10Ema: [C: 03+2] profile::mediawiki::webserver: revert api canary keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570655 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [14:14:33] !log run puppet on mw-api-canary to revert nginx keepalive_requests bump T241145 [14:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:36] T241145: Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 [14:16:06] !log 20mins in with eventgate-analytics/eqiad depooled from discovery, no issues yet. 
[14:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:34] if ema's revert fixes the api latency as well, I 'll be happy. ema won't on the other hand [14:16:36] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:16:48] here we go [14:17:11] fascinating [14:17:39] ema: it's like immediate: Look at https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-15m&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-m&refresh=1m&fullscreen&panelId=9 [14:17:43] I mean... wow [14:18:12] what on earth? I wouldn't expect keepalive_requests to change that so dramatically [14:18:50] me neither, and indeed bumping the setting didn't cause any trouble on non-api appservers [14:20:00] * akosiaris wonders what else can I depool from eqiad [14:20:16] akosiaris: context? [14:20:49] indeed, sorry. So since the incident with eventgate-analytics being turned over to https and everything collapsing [14:21:07] we 've been wondering why. One of the possible explanations is/was latency increase [14:21:14] (03PS1) 10Vgutierrez: install_server: Remove already decommissioned cp40[09,10,17,18] [puppet] - 10https://gerrit.wikimedia.org/r/570659 [14:21:27] so we were fearing moving eventgate-analytics over to codfw would cause a total collapse of the heart [14:21:58] so I 've been experimenting in small batches and mw servers to verify the hypothesis [14:22:17] turns out that the 40ms RTT that eqiad -> codfw introduces don't trigger an issue [14:22:21] (03PS1) 10Vgutierrez: install_server: Reimage upload@ulsfo as buster [puppet] - 10https://gerrit.wikimedia.org/r/570660 (https://phabricator.wikimedia.org/T242093) [14:22:33] 10Operations, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10hnowlan) [14:23:07] now, I wouldn't want to test with say 300ms, but at least this is a relief. If this did cause a big issue, we would have severe problems with the switchover [14:23:13] 10Operations, 10Elasticsearch, 10Wikimedia-Logstash, 10observability, and 2 others: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (10fgiunchedi) Status update: out of the box json logging support has been introduced i... 
[14:23:13] (03CR) 10Vgutierrez: [C: 03+2] ncredir: Drop RSA support [puppet] - 10https://gerrit.wikimedia.org/r/570629 (https://phabricator.wikimedia.org/T243391) (owner: 10Vgutierrez) [14:25:10] (03PS6) 10Muehlenhoff: Pass down MAC address of the installing system via BOOTIF on Buster [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) [14:26:09] (03CR) 10Ema: [C: 03+1] install_server: Reimage upload@ulsfo as buster [puppet] - 10https://gerrit.wikimedia.org/r/570660 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [14:26:24] (03CR) 10Ayounsi: "INFO:homer:Committing config for query mr1-ulsfo* with message: test" [software/homer] - 10https://gerrit.wikimedia.org/r/570510 (https://phabricator.wikimedia.org/T244363) (owner: 10Volans) [14:27:07] (03PS2) 10Vgutierrez: install_server: Remove already decommissioned cp40[09,10,17,18] [puppet] - 10https://gerrit.wikimedia.org/r/570659 (https://phabricator.wikimedia.org/T178815) [14:27:10] (03PS2) 10Vgutierrez: install_server: Reimage upload@ulsfo as buster [puppet] - 10https://gerrit.wikimedia.org/r/570660 (https://phabricator.wikimedia.org/T242093) [14:27:54] (03CR) 10Ema: [C: 03+1] install_server: Remove already decommissioned cp40[09,10,17,18] [puppet] - 10https://gerrit.wikimedia.org/r/570659 (https://phabricator.wikimedia.org/T178815) (owner: 10Vgutierrez) [14:31:02] PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:33:10] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 114984152 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:33:25] (03CR) 10Vgutierrez: [C: 03+2] install_server: Remove already decommissioned cp40[09,10,17,18] [puppet] - 10https://gerrit.wikimedia.org/r/570659 (https://phabricator.wikimedia.org/T178815) (owner: 10Vgutierrez) [14:34:03] (03CR) 10Muehlenhoff: [C: 03+2] Pass down MAC address of the installing system via BOOTIF on Buster [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) (owner: 10Muehlenhoff) [14:35:02] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 104544 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:37:06] <_joe_> akosiaris: try eventgate-main next :) [14:37:17] (03CR) 10Vgutierrez: [C: 03+2] install_server: Reimage upload@ulsfo as buster [puppet] - 10https://gerrit.wikimedia.org/r/570660 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [14:37:19] I am thinking about it [14:37:31] (03PS3) 10Vgutierrez: install_server: Reimage upload@ulsfo as buster [puppet] - 10https://gerrit.wikimedia.org/r/570660 (https://phabricator.wikimedia.org/T242093) [14:39:02] (03CR) 10Ayounsi: [C: 03+1] "Overall LGTM at the condition of testing it manually with at least 2 parallel "attacks"." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) (owner: 10CDanis) [14:39:21] (03PS2) 10Jbond: sslcert: ensure we run update-ca-certificates managing any services [puppet] - 10https://gerrit.wikimedia.org/r/570637 (https://phabricator.wikimedia.org/T244477) [14:39:36] _joe_: that being said, eventgate-main in codfw produces to a different topic [14:39:43] (03CR) 10Andrew Bogott: Keystone: rotate and sync fernet tokens (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [14:39:47] topics* more correctly [14:39:57] so I 'd like andrew's buy in before doing that [14:40:37] <_joe_> akosiaris: sure but it's kind-of the point [14:40:52] jouncebot: next [14:40:52] In 2 hour(s) and 19 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T1700) [14:40:53] it is indeed, but I remember andrew having to switch something around [14:40:55] <_joe_> I think eventstreams needed to follow it [14:41:05] <_joe_> but that should not be the case anymore [14:41:12] why ? [14:41:23] (03CR) 10jerkins-bot: [V: 04-1] sslcert: ensure we run update-ca-certificates managing any services [puppet] - 10https://gerrit.wikimedia.org/r/570637 (https://phabricator.wikimedia.org/T244477) (owner: 10Jbond) [14:41:30] <_joe_> because andrew fixed things IIRC [14:41:42] I guess we will wait and see [14:41:44] <_joe_> but sure let's wait for him [14:43:01] (03PS3) 10Jbond: sslcert: ensure we run update-ca-certificates managing any services [puppet] - 10https://gerrit.wikimedia.org/r/570637 (https://phabricator.wikimedia.org/T244477) [14:43:18] _joe_: so it turns out that restbase origins have great connection reuse given that they're behind envoy, hence they don't close connections after 100 reqs [14:44:23] _joe_: a possible explanation of why api canaries were negatively affected by the keepalive_requests bump is that api has decent connection reuse, so it happens often that connections are reused for > 100 reqs [14:45:14] <_joe_> ema: so the TLDR is we need envoy? [14:45:21] <_joe_> did you see what made the cpu explode? [14:45:36] <_joe_> gimme 5 mins and I'm with you [14:45:43] this is not the case for appservers; on cp3050 I've counted the number of times we reached 99 requests over one single connection for different origins for a few seconds; it happened 321 times for api and 6 times for appservers [14:46:00] (03CR) 10Giuseppe Lavagetto: "Overall LGTM, a few small corrections and it should be good to merge." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) (owner: 10Hnowlan) [14:46:12] !log depool & reimage cp4025 as buster - T242093 [14:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:16] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [14:46:58] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4025.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima... 
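Background for the 100-versus-200 discussion above: keepalive_requests is a stock nginx directive that caps how many requests one keep-alive connection may serve before nginx closes it, and every forced close makes ATS open (and TLS-handshake) a fresh backend connection. A minimal illustrative vhost, not the production puppet template; the names and values are placeholders:

    server {
        listen 443 ssl;
        server_name api.svc.example.wmnet;   # placeholder vhost name

        # let ATS reuse each connection for more requests before nginx closes it
        # (the nginx default at the time was 100)
        keepalive_requests 200;
        keepalive_timeout  60s;

        location / {
            # hand the request off to the local application layer (illustrative)
            proxy_pass http://127.0.0.1:80;
        }
    }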
[14:47:02] 10Operations, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T240684 (10elukey) Just updated https://grafana.wikimedia.org/d/000000317/memcache-slabs adding a new row at the bottom '1.5.x metrics' with all the new m... [14:47:12] _joe_: so my (conspiracy) theory is that nginx doesn't behave well with heavy connection reuse, and that's the actual reason why keepalive_requests is 100 by default [14:47:45] (03PS1) 10Vgutierrez: install_server: Reimage ncredir@ulsfo as buster [puppet] - 10https://gerrit.wikimedia.org/r/570665 (https://phabricator.wikimedia.org/T243391) [14:47:47] (03PS1) 10Jdlrobson: Set wgLogoHD before adding wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570666 (https://phabricator.wikimedia.org/T244405) [14:48:09] <_joe_> ema: did you dig deeper in the stats of those servers? [14:48:55] _joe_: I've seen cpu usage, load, and temperature going up. Memory not affected [14:49:04] ema: there are some memory cleanup routines/functions on nginx triggered only on connection close... [14:49:07] <_joe_> ok lemme check something [14:49:14] (03CR) 10Jdlrobson: "its late and i am unable to swat this but this should fix the issue with the unexpected wordmark key in wgLogoHD" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570666 (https://phabricator.wikimedia.org/T244405) (owner: 10Jdlrobson) [14:49:19] (03CR) 10Andrew Bogott: Keystone: rotate and sync fernet tokens (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [14:49:23] at least those involving TLS session cache [14:50:08] <_joe_> ema: I think what happened is [14:50:21] <_joe_> ats funneled through those servers a ton of requests [14:50:43] <_joe_> https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-node=mw1277&fullscreen&panelId=52 [14:50:53] <_joe_> from 150 rps to 300 rps [14:50:55] <_joe_> lol [14:51:23] <_joe_> so we might want to either bump up to like 200 everywhere first [14:53:18] _joe_: when you say "everywhere", do you mean both canaries and non-canaries? [14:53:48] (03CR) 10Vgutierrez: [C: 03+2] install_server: Reimage ncredir@ulsfo as buster [puppet] - 10https://gerrit.wikimedia.org/r/570665 (https://phabricator.wikimedia.org/T243391) (owner: 10Vgutierrez) [14:54:40] PROBLEM - nova-compute proc minimum on cloudvirt1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:54:53] BTW, now that we are talking nginx @ applayer.. if it's using the same TLS session cache settings as the old edge TLS termination.. maybe it's worth some tuning.. 
cause right now it would be just wasting memory on those servers [14:56:42] !log depool and reimage ncredir4002 as buster - T243391 [14:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:45] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 [14:59:32] !log extend graphite1004 / graphite2003 fs +200G [14:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:13] <_joe_> vgutierrez: I want to move to use envoy soon [15:00:20] <_joe_> ema: yes that's what I mean [15:00:27] _joe_: ack, on it [15:00:27] _joe_: that's going to reveal new issues [15:00:44] <_joe_> vgutierrez: possibly [15:00:53] I'm not opposing to that BTW [15:00:57] RECOVERY - nova-compute proc minimum on cloudvirt1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:01:00] _joe_: another q. If eventgate-analytics caused such an issue moving to https [15:01:06] just saying that having 3 different HTTP implementations on the chain is always tricky [15:01:13] <_joe_> vgutierrez: but we use envoy exensively for other services [15:01:18] what's going to stop sessionstore from creating the same issue? [15:01:22] <_joe_> so I'm not sure what new issues you expect [15:01:26] <_joe_> akosiaris: sheer volume [15:01:31] even better, what's stopping echostore? [15:01:35] just the volume? [15:01:37] <_joe_> ^^ [15:01:39] <_joe_> yes [15:02:17] ok, but let's keep an eye out on those. I have fear [15:02:28] <_joe_> akosiaris: we can discuss later [15:02:33] <_joe_> in the ops meeting [15:02:42] https://grafana.wikimedia.org/d/IfJykaTZk/echostore?orgId=1 [15:02:43] <_joe_> serviceops [15:02:54] is already at 1.1k [15:04:02] https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1 is also at 4k [15:04:37] and https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-analytics&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All is at 10k [15:05:03] so, it isn't that far... [15:06:19] (03CR) 10Alexandros Kosiaris: profile::services_proxy: add temporarily entries for k8s services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570306 (owner: 10Giuseppe Lavagetto) [15:07:45] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [15:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:29] (03PS3) 10Jhedden: openstack: switch cloudvirt101[56] to ceph storage [puppet] - 10https://gerrit.wikimedia.org/r/570363 (https://phabricator.wikimedia.org/T243327) [15:08:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] "I think that's in the correct direction. It interferes indeed with the sharing we want to do of _helpers.tpl between the charts. 
Let's rev" [deployment-charts] - 10https://gerrit.wikimedia.org/r/570290 (owner: 10Giuseppe Lavagetto) [15:09:01] (03PS2) 10Alexandros Kosiaris: Revert "Update scaffold template names to use chart name" [deployment-charts] - 10https://gerrit.wikimedia.org/r/570290 (owner: 10Giuseppe Lavagetto) [15:09:33] (03PS1) 10Ema: profile::mediawiki::webserver: set api keepalive_requests to 200 [puppet] - 10https://gerrit.wikimedia.org/r/570670 (https://phabricator.wikimedia.org/T241145) [15:09:58] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:05] (03CR) 10Jhedden: [C: 03+2] "PCC results https://puppet-compiler.wmflabs.org/compiler1002/20661/" [puppet] - 10https://gerrit.wikimedia.org/r/570363 (https://phabricator.wikimedia.org/T243327) (owner: 10Jhedden) [15:10:29] _joe_: like this? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570670/ [15:10:48] (03PS3) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [15:11:00] (03CR) 10Giuseppe Lavagetto: [C: 03+1] profile::mediawiki::webserver: set api keepalive_requests to 200 [puppet] - 10https://gerrit.wikimedia.org/r/570670 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [15:11:04] <_joe_> ema: let's try [15:11:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] "@Jeena, fyi, we had unfortunately to revert this as we want to share _helpers.tpl across all charts and enforce their consistency. I 'll a" [deployment-charts] - 10https://gerrit.wikimedia.org/r/570290 (owner: 10Giuseppe Lavagetto) [15:12:35] _joe_: merging, perhaps we should speed up things a bit by forcing a puppet run to get a more uniform distribution of reqs? [15:13:02] 10Operations, 10LDAP-Access-Requests: Get access to Superset - https://phabricator.wikimedia.org/T244490 (10alexhollender) [15:13:06] <_joe_> yes [15:13:07] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [15:13:11] <_joe_> but do like -b 25 [15:13:14] ack [15:13:20] (03CR) 10Ema: [C: 03+2] profile::mediawiki::webserver: set api keepalive_requests to 200 [puppet] - 10https://gerrit.wikimedia.org/r/570670 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [15:14:17] (03PS2) 10Alexandros Kosiaris: helpers: Move most charts to common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/570064 [15:14:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] "https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/570064 fwiw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/570290 (owner: 10Giuseppe Lavagetto) [15:14:52] !log A:mw-api: force puppet run to increase keepalive_requests from 100 to 200 https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570670/ T241145 [15:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:55] T241145: Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 [15:17:37] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4025.ulsfo.wmnet'] ` and were **ALL** successful.
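The forced puppet run above ('A:mw-api', batches of 25 as suggested by _joe_) would be driven from a cluster management host with cumin. A sketch of the command shape; the alias and batch size come from the conversation, the rest is assumed:

    # roll the keepalive_requests change out evenly across the mw API servers,
    # 25 hosts at a time
    sudo cumin -b 25 'A:mw-api' 'run-puppet-agent'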
[15:17:45] (03PS4) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [15:18:01] (03PS8) 10Hnowlan: mediawiki: check mw versions match those on the deploy server [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) [15:18:58] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [15:20:54] (03CR) 10Hnowlan: mediawiki: check mw versions match those on the deploy server (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) (owner: 10Hnowlan) [15:21:06] (03PS5) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [15:23:03] !log pooling cp4025 with buster - T242093 [15:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:06] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [15:23:29] _joe_: change applied to all api servers, there's been a response time spike but it seems to be recovering now [15:23:29] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [15:24:12] I'm looking at https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-m&refresh=1m&fullscreen&panelId=9 [15:25:21] (03PS3) 10Mholloway: WIP: Proton charts first draft [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [15:27:26] !log installing sudo security updates on jessie [15:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:22] !log pooling ncredir4002 running buster - T243391 [15:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:24] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 [15:29:43] !log depool & reimage cp4024 as buster - T242093 [15:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:46] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [15:30:21] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4024.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [15:30:30] !log depool & reimage ncredir4001 as buster - T243391 [15:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:39] (03CR) 10Mholloway: "I'm keeping this moving while Mateus is out on vacation. 
Some comments (specifically the ones about using internal and not external endpoi" (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [15:32:09] (03CR) 10Krinkle: [C: 03+1] Set wgLogoHD before adding wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570666 (https://phabricator.wikimedia.org/T244405) (owner: 10Jdlrobson) [15:32:19] (03CR) 10Krinkle: [C: 03+1] "I can verify it today in beta and prod." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570666 (https://phabricator.wikimedia.org/T244405) (owner: 10Jdlrobson) [15:35:28] (03CR) 10Mholloway: "> Please add TLS termination, see cxserver or termbox as examples." [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [15:36:04] 10Operations, 10Traffic: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 (10Vgutierrez) [15:38:46] RECOVERY - Host mw2311 is UP: PING OK - Packet loss = 0%, RTA = 36.27 ms [15:39:06] PROBLEM - configured eth on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:41:01] !log installing jsoup security updates [15:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:36] PROBLEM - Check systemd state on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:50] PROBLEM - dhclient process on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [15:43:18] PROBLEM - DPKG on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:43:50] (03PS1) 10BBlack: pybal to both routers for eqsin [puppet] - 10https://gerrit.wikimedia.org/r/570672 (https://phabricator.wikimedia.org/T165765) [15:43:52] (03PS1) 10BBlack: pybal to both routers for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/570673 (https://phabricator.wikimedia.org/T165765) [15:43:55] (03PS1) 10BBlack: pybal to both routers for codfw primary [puppet] - 10https://gerrit.wikimedia.org/r/570674 (https://phabricator.wikimedia.org/T165765) [15:43:58] (03PS1) 10BBlack: pybal to both routers for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/570675 (https://phabricator.wikimedia.org/T165765) [15:45:19] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [15:45:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] "couple of typos, but otherwise LGTM" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570331 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [15:46:01] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [15:47:16] o/ akosiaris [15:47:26] 10Operations, 10ops-codfw, 10serviceops-radar: codfw: new mw servers not getting an IP when default to Stretch - https://phabricator.wikimedia.org/T244438 (10Papaul) Did the re-install on mw2311 it works . 
Thanks [15:47:50] (03PS2) 10Mholloway: Add chart for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/570162 (https://phabricator.wikimedia.org/T218733) [15:47:52] I'm ready to do a rollback as elukey suggested [15:48:14] Wanted to check if you are around since I'm out of the deploy window and wanted to make sure I'd have someone to help out if something goes weirdly. [15:48:15] (03PS6) 10Jbond: realm global: make the realm variable a global in labs [puppet] - 10https://gerrit.wikimedia.org/r/570331 (https://phabricator.wikimedia.org/T244222) [15:48:16] 10Operations, 10ops-codfw, 10serviceops-radar: codfw: new mw servers not getting an IP when default to Stretch - https://phabricator.wikimedia.org/T244438 (10MoritzMuehlenhoff) The ethernet adapter is slightly different than the BCM5720 we otherwise already run on stretch. E.g. on ms-be2050 it reports as... [15:48:29] (03CR) 10Jbond: "thanks updated" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570331 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [15:48:31] I don't expect it to but you never know. [15:48:38] PROBLEM - Check systemd state on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:45] PROBLEM - configured eth on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:48:51] PROBLEM - dhclient process on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [15:49:02] (03CR) 10BBlack: [C: 03+2] pybal to both routers for eqsin [puppet] - 10https://gerrit.wikimedia.org/r/570672 (https://phabricator.wikimedia.org/T165765) (owner: 10BBlack) [15:49:35] PROBLEM - Disk space on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw2311&var-datasource=codfw+prometheus/ops [15:49:43] PROBLEM - puppet last run on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:50:06] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [15:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:19] !log installing python-ecdsa security updates [15:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "+1, aside from 2 typos" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570343 (owner: 10Jbond) [15:50:53] halfak: here to help if needed [15:51:02] Thanks elukey [15:51:08] Did you create a task by any chance? [15:51:17] Oh I see it in the email [15:51:18] :) [15:51:40] (03CR) 10Alexandros Kosiaris: [C: 03+1] wmflib::require_domains: use require_domains instead of require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570348 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [15:51:45] PROBLEM - MD RAID on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:51:46] Oh that task is super old and probably unrelated. 
[15:51:47] https://phabricator.wikimedia.org/T242705 [15:52:10] * akosiaris around as well [15:52:16] !log lvs5003 - restart pybal for dual bgp session config - T180069 [15:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:19] T180069: Pybal should be able to advertise to multiple routers - https://phabricator.wikimedia.org/T180069 [15:52:24] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:04] !log halfak@deploy1001 Started deploy [ores/deploy@50a101a]: T242705 [15:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:07] T242705: Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 [15:53:14] !log lvs5002 - restart pybal for dual bgp session config - T180069 [15:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:54] (03CR) 10Vgutierrez: [C: 03+1] sslcert: ensure we run update-ca-certificates managing any services [puppet] - 10https://gerrit.wikimedia.org/r/570637 (https://phabricator.wikimedia.org/T244477) (owner: 10Jbond) [15:54:01] halfak: fwiw, that deploy is just exacerbating the issue. The trigger is uwsgi going a bit haywire on every reload/restart. Not sure why [15:54:16] RECOVERY - configured eth on mw2311 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:54:16] Gotcha. [15:54:17] hmm I wonder if it's on shutdown or startup [15:54:22] RECOVERY - dhclient process on mw2311 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [15:54:32] !log lvs5001 - restart pybal for dual bgp session config - T180069 [15:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:39] lemme know when you are doing with the deploy, I 'd like to verify that, but better not step on your toes [15:54:44] s/doing/done/ [15:55:00] RECOVERY - Disk space on mw2311 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw2311&var-datasource=codfw+prometheus/ops [15:55:05] RECOVERY - MD RAID on mw2311 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:55:08] RECOVERY - Check systemd state on mw2311 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:32] !log installing qemu security updates [15:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:30] !log pooling ncredir4001 running buster - T243391 [15:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:32] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 [15:56:55] (03PS1) 10Ema: profile::mediawiki::webserver: set keepalive_requests to 200 [puppet] - 10https://gerrit.wikimedia.org/r/570677 (https://phabricator.wikimedia.org/T241145) [15:57:25] _joe_: as things look stable on the api hosts, we should be fine setting keepalive_requests to 200 elsewhere too I hope? 
:) [15:57:39] !log halfak@deploy1001 Finished deploy [ores/deploy@50a101a]: T242705 (duration: 04m 35s) [15:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:46] <_joe_> sure [15:57:52] (03PS11) 10Jbond: wmflib::require_domains: add new function to to replace require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570343 [15:58:22] see this for the beneficial impact on ats-be new connections rate: https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&from=now-3h&to=now&fullscreen&panelId=6 [15:58:22] (03CR) 10BBlack: [C: 03+2] pybal to both routers for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/570673 (https://phabricator.wikimedia.org/T165765) (owner: 10BBlack) [15:58:36] you have to click on 'api-rw.discovery.wmnet' to see the interesting part [15:58:42] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventgate-analytics [15:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:56] (03CR) 10Jhedden: [C: 03+1] "LGTM, non-blocking thought: The host specific bits in openstack::keystone::fernet_tokens makes me think it should be a profile. I'm not su" [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [15:59:25] !log repool eventgate-analytics/eqiad. Experiment proved the failover wouldn't cause (on it's own) a problem. Experiment done. [15:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:32] OK looks like we're in the clear with ORES. This memory usage is really absurd. I'll investigate today. [15:59:37] _joe_: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570677/ [15:59:41] elukey, ^ [15:59:52] halfak: thanks! [16:00:29] (03PS1) 10Vgutierrez: install_server: Reimage ncredir@eqsin as buster [puppet] - 10https://gerrit.wikimedia.org/r/570678 (https://phabricator.wikimedia.org/T243391) [16:00:30] (03CR) 10CDanis: [C: 03+1] sslcert: ensure we run update-ca-certificates managing any services [puppet] - 10https://gerrit.wikimedia.org/r/570637 (https://phabricator.wikimedia.org/T244477) (owner: 10Jbond) [16:00:50] (03Abandoned) 10Ema: profile::mediawiki::webserver: increase keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570640 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [16:00:55] 10Operations, 10netops: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (10ayounsi) p:05Triage→03High [16:00:57] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4024.ulsfo.wmnet'] ` and were **ALL** successful. 
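[Editor's note: ema's Grafana link above shows the drop in new ats-be connections toward the MediaWiki origins after keepalive_requests was raised to 200. As a rough back-of-the-envelope check of why that helps, the sketch below uses an invented request volume and assumes connections are recycled because they hit the per-connection request limit rather than an idle timeout; only the ratio matters.]

```python
import math

def new_connections(requests: int, keepalive_requests: int) -> int:
    """Connections a proxy must open if the origin closes each persistent
    connection after serving `keepalive_requests` requests."""
    return math.ceil(requests / keepalive_requests)

requests = 10_000  # illustrative volume only; the real rates are in the dashboard above
print(new_connections(requests, 100))  # 100 new connections
print(new_connections(requests, 200))  # 50 new connections -- churn roughly halves
```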
[16:01:25] ACKNOWLEDGEMENT - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T244497 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:01:26] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [16:01:33] (03CR) 10Jbond: "thanks" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570343 (owner: 10Jbond) [16:02:05] (03CR) 10Jbond: [C: 03+2] sslcert: ensure we run update-ca-certificates managing any services [puppet] - 10https://gerrit.wikimedia.org/r/570637 (https://phabricator.wikimedia.org/T244477) (owner: 10Jbond) [16:03:03] halfak: thanks! [16:03:05] Memory usage in eqiad was way worse than in codfw this time. What's yp with that? [16:03:06] !log pooling cp4024 with buster - T242093 [16:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:08] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [16:03:11] https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&refresh=1m&from=now-7d&to=now-1m [16:03:28] I'm looking at memory usage under "Saturation" [16:03:36] !log depool & reimage cp4023 as buster - T242093 [16:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:59] 10Operations, 10netops: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (10ayounsi) Other side doesn't receive the light though: ` ayounsi@asw2-esams> show interfaces diagnostics optics xe-6/0/4 Physical interface: xe-6/0/4 Laser output power : 1.3540 mW / 1.3... [16:04:03] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4023.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [16:04:38] (03CR) 10Vgutierrez: [C: 03+2] install_server: Reimage ncredir@eqsin as buster [puppet] - 10https://gerrit.wikimedia.org/r/570678 (https://phabricator.wikimedia.org/T243391) (owner: 10Vgutierrez) [16:06:12] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 (10ema) >>! In T241145#5856431, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://too... 
[16:06:14] !log lvs4007 - restart pybal for dual bgp session config - T180069 [16:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:17] T180069: Pybal should be able to advertise to multiple routers - https://phabricator.wikimedia.org/T180069 [16:06:42] !log lvs4006 - restart pybal for dual bgp session config - T180069 [16:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:11] !log lvs4005 - restart pybal for dual bgp session config - T180069 [16:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:40] !log depool and reimage ncredir5002 as buster - T243391 [16:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:42] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 [16:09:26] (03CR) 10BBlack: [C: 03+2] pybal to both routers for codfw primary [puppet] - 10https://gerrit.wikimedia.org/r/570674 (https://phabricator.wikimedia.org/T165765) (owner: 10BBlack) [16:10:01] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10Halfak) OK rolled back. Looking at what happened, it seems like memory pressure was *way worse* in Eqiad than in Codfw * Eqiad: https://grafana.wikimedia.org/d/HIRrxQ6mk/o... [16:10:50] (03PS1) 10Superzerocool: Throttle rule for National Gallery of Canada Library and Archives edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570680 (https://phabricator.wikimedia.org/T244488) [16:15:20] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [16:17:49] RECOVERY - DPKG on mw2311 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [16:17:55] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10Halfak) When I start up the deployment ORES config locally with 4 workers, I can see that we are using ~2516000 bytes of RES for two processes. It looks like my available RA... [16:19:15] !log lvs2003 - restart pybal for dual bgp session config - T180069 [16:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:18] T180069: Pybal should be able to advertise to multiple routers - https://phabricator.wikimedia.org/T180069 [16:19:34] !log lvs2002 - restart pybal for dual bgp session config - T180069 [16:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:31] !log lvs2001 - restart pybal for dual bgp session config - T180069 [16:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:34] !log installing cyrus-sasl2 security updates on jessie [16:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:45] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [16:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:19] (03CR) 10BBlack: [C: 03+2] pybal to both routers for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/570675 (https://phabricator.wikimedia.org/T165765) (owner: 10BBlack) [16:23:34] (03PS2) 10BBlack: pybal to both routers for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/570675 (https://phabricator.wikimedia.org/T165765) [16:24:12] (03CR) 10Herron: "Is the commit message accurate about this creating a mixed raid level LVM layout? 
It looks like it would result in a non-LVM config (whic" [puppet] - 10https://gerrit.wikimedia.org/r/570600 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [16:24:58] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:34] (03CR) 10BBlack: [V: 03+2 C: 03+2] pybal to both routers for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/570675 (https://phabricator.wikimedia.org/T165765) (owner: 10BBlack) [16:25:39] 10Operations, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10Dzahn) Aww, thanks for conforming and thanks Moritz for fixing it. This is exactly why i wanted to test it. The capitalization c... [16:28:39] !log restarting apache on bromine to pick up SASL security updates [16:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:09] !log lvs1016 - restart pybal for dual bgp session config - T180069 [16:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:12] T180069: Pybal should be able to advertise to multiple routers - https://phabricator.wikimedia.org/T180069 [16:29:25] PROBLEM - DPKG on install1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [16:30:03] !log lvs1015 - restart pybal for dual bgp session config - T180069 [16:30:05] 10Operations, 10ops-codfw, 10ops-eqiad, 10ops-eqsin, and 3 others: Audit & update spares part tracking for all sites - https://phabricator.wikimedia.org/T243450 (10RobH) I'm adding in each site's project. Once an on-site engineer has audited and updated the spares tracking sheet for hardware, this task sh... [16:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:39] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 (10Gilles) therockapplauds We will keep an eye on the trend in coming days to check how much of a dent it made in the perf regres... [16:30:46] !log lvs1014 - restart pybal for dual bgp session config - T180069 [16:30:47] Aha! Our code for dropping those assets from memory isn't working. [16:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:00] Fixing this'll definitely help! [16:31:37] !log lvs1013 - restart pybal for dual bgp session config - T180069 [16:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:30] !log remove AS prepending in esams/knams [16:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:42] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@524be2b]: airflow: Update ores data transfer from drafttopic -> articletopic [16:35:49] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4023.ulsfo.wmnet'] ` and were **ALL** successful. 
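[Editor's note: halfak's "our code for dropping those assets from memory isn't working" above refers to releasing large model artifacts once the in-memory scorers have been built, so each worker's resident set stays small (see the T242705 comments earlier in this log). The sketch below is a generic illustration of that pattern, not the actual ORES code; the names and sizes are invented.]

```python
import gc

def load_model_assets():
    """Stand-in for reading large scoring-model artifacts from disk (invented)."""
    return {"embeddings": bytearray(50 * 1024 * 1024)}  # pretend multi-hundred-MB blob

def build_scorer(assets):
    """Pretend only a small derived structure is needed at request time."""
    return {"vocab_size": len(assets["embeddings"])}

assets = load_model_assets()
scorer = build_scorer(assets)

# The step under discussion: drop the raw assets once the derived structures
# exist. If a stray reference keeps the blob alive, every celery/uwsgi worker
# that forks from (or reloads within) this process carries the full asset in
# its resident set, which is how hosts end up under memory pressure and OOM.
del assets
gc.collect()

print(scorer["vocab_size"])
```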
[16:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:02] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@524be2b]: airflow: Update ores data transfer from drafttopic -> articletopic (duration: 00m 19s) [16:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:52] (03PS3) 10Clarakosi: Add restbase202[123] to hiera [puppet] - 10https://gerrit.wikimedia.org/r/570094 (https://phabricator.wikimedia.org/T244178) [16:38:15] !log pooling cp4023 with buster - T242093 [16:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:18] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [16:38:39] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765 (10BBlack) Status update: what's missing here is codfw, which will happen when we finish its hardware upgrade switch to lvs2007-10 [16:38:44] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [16:39:23] (03PS6) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [16:41:10] RECOVERY - puppet last run on mw2311 is OK: OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:41:41] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [16:43:35] (03CR) 10Jhedden: [C: 03+1] "should use lookup instead of hiera, but overall looks great!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [16:47:04] (03CR) 10Arturo Borrero Gonzalez: Keystone: rotate and sync fernet tokens (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [16:49:12] !log pooling ncredir5002 running buster - T243391 [16:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:15] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 [16:49:33] 10Operations, 10Traffic: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 (10Vgutierrez) [16:52:23] (03PS7) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [16:54:44] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [16:56:49] (03PS8) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [16:59:08] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [17:00:04] godog and _joe_: Time to snap out of that daydream and deploy Puppet SWAT(Max 6 patches). Get on with it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:00:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [17:01:00] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2001 is OK: SSL OK - Certificate centrallog2001.codfw.wmnet valid until 2024-11-16 16:04:24 +0000 (expires in 1744 days) https://wikitech.wikimedia.org/wiki/Logs [17:03:53] (03CR) 10Effie Mouzeli: [C: 03+1] "worth a shot" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570255 (owner: 10Giuseppe Lavagetto) [17:06:23] (03PS2) 10Andrew Bogott: Keystone: set max_active_keys for fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570507 (https://phabricator.wikimedia.org/T243418) [17:06:25] (03PS9) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [17:08:37] 10Operations, 10serviceops: Test and deploy mcrouter 0.41 - https://phabricator.wikimedia.org/T244476 (10jijiki) [17:08:55] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [17:09:06] (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/569564 (https://phabricator.wikimedia.org/T242585) (owner: 10Filippo Giunchedi) [17:11:26] (03PS6) 10Filippo Giunchedi: cassandra: restbase-dev logs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/569564 (https://phabricator.wikimedia.org/T242585) [17:18:37] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10Krinkle) In an IRC conversation with @volans we considered whether it's possible that the 50ms regression com... 
[17:19:02] (03PS3) 10Mholloway: Add chart for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/570162 (https://phabricator.wikimedia.org/T218733) [17:19:35] (03PS1) 10Elukey: profile::cdh::apt: add bigtop repository [puppet] - 10https://gerrit.wikimedia.org/r/570685 (https://phabricator.wikimedia.org/T244499) [17:23:23] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10RobH) p:05Triage→03Medium [17:23:34] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10RobH) [17:25:52] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10wiki_willy) test [17:26:16] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 78522104 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:26:57] (03PS4) 10Mholloway: WIP: Proton charts first draft [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [17:27:55] (03PS4) 10Mholloway: Add chart for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/570162 (https://phabricator.wikimedia.org/T218733) [17:28:08] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 29776 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:30:33] (03CR) 10Mholloway: "> Please add TLS termination, see cxserver or termbox as examples." [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [17:32:15] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/20663/" [puppet] - 10https://gerrit.wikimedia.org/r/570685 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [17:32:17] !log set performance cpu scaling governor on maps* [17:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:06] PROBLEM - Host es2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:10] PROBLEM - Host ganeti2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:29] uh? [17:44:56] Looks like only mgmt [17:45:04] PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:45:06] PROBLEM - Host db2077.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:06] PROBLEM - Host db2087.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:12] papaul: you working on mgmt stuff? 
^ [17:45:20] PROBLEM - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:22] XioNoX: ^ [17:45:26] PROBLEM - Host elastic2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:26] PROBLEM - Host ores2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:34] PROBLEM - Host db2112.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:44] PROBLEM - Host scs-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:46:18] PROBLEM - Host ganeti2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:46:54] (03CR) 10Ppchelko: "Per Joe adding it to conftool-data can be done at any point, so let's come back to PS2" [puppet] - 10https://gerrit.wikimedia.org/r/570094 (https://phabricator.wikimedia.org/T244178) (owner: 10Clarakosi) [17:46:54] PROBLEM - Host es2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:47:20] PROBLEM - Host restbase2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:47:27] PROBLEM - Host mc2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:47:27] PROBLEM - Host mc2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:47:38] PROBLEM - Host restbase2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:47:38] PROBLEM - Host restbase2022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:48:16] PROBLEM - Host pc2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:49:01] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2312.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [17:49:30] PROBLEM - Host cloudservices2002-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:07] 10Operations, 10Gerrit-Privilege-Requests, 10SRE-Access-Requests: Request for +2 access to mediawiki-config - https://phabricator.wikimedia.org/T244508 (10Eevans) [17:53:36] looks like rack C1 [17:54:00] yea.. all of those are in the same rack i think [17:54:11] papaul, wiki_willy ^ [18:00:04] cscott, arlolra, subbu, halfak, and accraze: That opportune time is upon us again. Time for a Services – Graphoid / Parsoid / Citoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T1800). 
[18:01:01] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10elukey) These hosts need to be in 10G racks :) [18:03:52] RECOVERY - Host elastic2031.mgmt is UP: PING WARNING - Packet loss = 64%, RTA = 357.42 ms [18:03:52] RECOVERY - Host ores2005.mgmt is UP: PING WARNING - Packet loss = 64%, RTA = 357.34 ms [18:03:52] RECOVERY - Host db2112.mgmt is UP: PING WARNING - Packet loss = 37%, RTA = 37.05 ms [18:03:58] RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.60 ms [18:04:02] RECOVERY - Host scs-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.79 ms [18:04:04] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [18:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:34] RECOVERY - Host ganeti2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.00 ms [18:04:54] RECOVERY - Host db2077.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.78 ms [18:05:10] RECOVERY - Host es2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.80 ms [18:05:30] RECOVERY - Host mc2027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms [18:05:38] RECOVERY - Host restbase2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.45 ms [18:05:44] RECOVERY - Host mc2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.62 ms [18:05:54] RECOVERY - Host restbase2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.62 ms [18:05:54] RECOVERY - Host restbase2022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.30 ms [18:06:19] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:07] RECOVERY - Host cloudservices2002-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.92 ms [18:07:27] RECOVERY - Host pc2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.98 ms [18:07:40] RECOVERY - Host es2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.34 ms [18:07:40] RECOVERY - Host ganeti2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.00 ms [18:08:35] RECOVERY - Host db2087.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.00 ms [18:08:49] RECOVERY - Host db2125.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.56 ms [18:11:13] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2312.codfw.wmnet'] ` and were **ALL** successful. [18:19:31] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2313.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[18:28:49] (03PS1) 10Jforrester: [trwiki] Enable the WikidataPageBanner extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570691 (https://phabricator.wikimedia.org/T244369) [18:30:05] (03PS10) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [18:30:58] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [18:34:29] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [18:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:45] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:56] (03PS11) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [18:39:20] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [18:40:27] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2313.codfw.wmnet'] ` and were **ALL** successful. [18:41:40] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2314.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [18:41:46] (03PS12) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [18:46:56] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@8e15868]: Update mobileapps to ceeb950 [18:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:00] 10Operations, 10ops-codfw: codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` wdqs2007.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020020... [18:50:05] 10Operations, 10ops-codfw: codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wdqs2007.codfw.wmnet'] ` Of which those **FAILED**: ` ['wdqs2007.codfw.wmnet'] ` [18:51:54] (03CR) 10Jhedden: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [18:52:15] 10Operations, 10ops-codfw: codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` wdqs2007.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020020... 
[18:52:18] 10Operations, 10ops-codfw: codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wdqs2007.codfw.wmnet'] ` Of which those **FAILED**: ` ['wdqs2007.codfw.wmnet'] ` [18:52:56] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: set max_active_keys for fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570507 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [18:53:23] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@8e15868]: Update mobileapps to ceeb950 (duration: 06m 27s) [18:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:29] 10Operations, 10ops-codfw, 10fundraising-tech-ops: codfw: rack/setup/install 3 new payments server for frack - https://phabricator.wikimedia.org/T244169 (10Papaul) [18:55:58] !log restarting apache on tungsten/dbmonitor to pick up cyrus-sasl security updates [18:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:41] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [18:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:39] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [18:58:56] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T1900). [19:00:04] Addshore: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:30] o/ [19:00:59] Pchelolo: you're up first :) [19:01:24] addshore: oh gosh sorry, I thought I removed it.. [19:01:26] !log restarting exim on mendelevium to pick up cyrus-sasl security updates [19:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:42] Pchelolo: no worries :D If you dont have anything I'll get started right away! [19:02:04] go for it. I've took my change off [19:02:10] (03CR) 10Addshore: [C: 03+2] Enable EntitySourceBasedFederation for group1 again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570642 (owner: 10Addshore) [19:03:10] (03Merged) 10jenkins-bot: Enable EntitySourceBasedFederation for group1 again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570642 (owner: 10Addshore) [19:03:40] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2314.codfw.wmnet'] ` and were **ALL** successful. [19:05:12] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2315.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[19:05:19] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation for group1 T243395 (duration: 01m 10s) [19:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:22] T243395: wmgUseEntitySourceBasedFederation true for Wikimedia clients (all sites) - https://phabricator.wikimedia.org/T243395 [19:06:43] (03CR) 10Jforrester: "Someone from the Web team should sign this off before deployment, in case there are any special requirements." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570691 (https://phabricator.wikimedia.org/T244369) (owner: 10Jforrester) [19:07:06] addshore: oh, can i add one instead? some no-op cleanup [19:07:11] yupp [19:07:26] (03CR) 10Addshore: [C: 03+2] wmgUseEntitySourceBasedFederation everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570584 (https://phabricator.wikimedia.org/T243395) (owner: 10Addshore) [19:07:39] We should sling out the Jdlrobson UBN one. [19:07:45] But I should review it first. [19:07:47] oooh, whtas that one? [19:07:55] wmf.18 i/r/ResourceLoaderSkinModule.php:264 PHP Notice: Undefined offset: 1 [19:08:07] aaah, thats all i see in logstash right now ;) [19:08:20] T244405 -> T244405 [19:08:21] T244405: PHP Notice: Undefined offset: 1 in ResourceLoaderSkinModule.php - https://phabricator.wikimedia.org/T244405 [19:08:22] Err. [19:08:24] (03Merged) 10jenkins-bot: wmgUseEntitySourceBasedFederation everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570584 (https://phabricator.wikimedia.org/T243395) (owner: 10Addshore) [19:08:28] T244405 -> https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/570666 [19:09:14] It looks good to me. addshore, want to deploy? [19:09:22] (When you have time. ;-)) [19:09:23] James_F: [19:09:27] ughh. [19:09:29] Yo. [19:09:33] that doesn't look like it fixes the issue? [19:10:06] It moves the creation of wgLogos['wordmark'] to /after/ it's copied into $wgLogosHD. [19:10:07] i was able to repro that bug locally without using $wgLogoHD [19:10:25] So the old code processing $wgLogosHD won't find a key it doesn't understand. [19:10:31] hmm, or maybe [19:10:33] Hmm. [19:10:37] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation everywhere T243395 (duration: 01m 07s) [19:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:43] i did not try having them both set. maybe that works [19:10:44] T243395: wmgUseEntitySourceBasedFederation true for Wikimedia clients (all sites) - https://phabricator.wikimedia.org/T243395 [19:10:59] we should try deploying that, at worst it's harmless [19:11:04] MatmaRex: Were you testing with master or with wmf.18? [19:11:17] Yeah. Current error log is grim: [19:11:27] master [19:11:34] https://www.irccloud.com/pastebin/B00H27uE/ [19:11:48] 233+12 = sad James. [19:12:34] also, is the whole syncing IS.php thing all sorted? or still something that might happen? [19:12:42] (sync but not actually get loaded) [19:12:43] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕑☕ sudo cumin A:netflow "disable-puppet 'rollout of I60692f0e8 T237587 cdanis'" [19:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:48] addshore: If you're unsure, sync it twice. [19:12:50] T237587: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 [19:12:52] I will then :D [19:13:29] addshore: Still occasionally happening. 
Roan was masterfully debugging the last time we noticed it in production, but we didn't determine the cause, just eliminated some potentials. [19:14:09] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation everywhere T243395, sync again for luck (duration: 01m 06s) [19:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:23] (03CR) 10CDanis: [C: 03+2] "> Patch Set 7: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) (owner: 10CDanis) [19:14:24] cool, my config changes are done, just waiting for backports to merge [19:14:38] addshore: Did you touch IS before the second sync? [19:14:45] Not sure if that'd be necessary, but… [19:14:57] no (I didnt have to when I first encountered this issue) [19:15:03] Hmm, OK. [19:15:30] RECOVERY - DPKG on install1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:16:40] MatmaRex: shall i do yours? [19:16:51] (03PS2) 10Addshore: Fix incorrect spellings of "RESTBase" in config variables (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570409 (owner: 10Bartosz Dziewoński) [19:16:54] addshore: please do [19:17:01] (03PS2) 10Addshore: Fix incorrect spellings of "RESTBase" in config variables (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570410 (owner: 10Bartosz Dziewoński) [19:17:03] (03CR) 10Jforrester: [C: 03+1] Fix incorrect spellings of "RESTBase" in config variables (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570409 (owner: 10Bartosz Dziewoński) [19:17:08] addshore: have to sync them separately [19:17:11] ack [19:17:16] (03CR) 10Addshore: [C: 03+2] Fix incorrect spellings of "RESTBase" in config variables (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570409 (owner: 10Bartosz Dziewoński) [19:17:17] IS/CS/IS. [19:17:26] Otherwise prod will be sad. [19:17:38] yupp [19:17:43] (03CR) 10jerkins-bot: [V: 04-1] Fix incorrect spellings of "RESTBase" in config variables (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570409 (owner: 10Bartosz Dziewoński) [19:17:49] i dont enjoy making prod sad [19:18:08] (03CR) 10Addshore: [C: 03+2] Fix incorrect spellings of "RESTBase" in config variables (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570409 (owner: 10Bartosz Dziewoński) [19:18:20] jenkins had a little sneeze there [19:18:31] Theoretically scap will stop you making prod sad. [19:18:40] "remote: fatal: Not a git repository". sounds ominous [19:18:49] But that's like "theoretically, Wikipedia doesn't work". 
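[Editor's note: the "IS/CS/IS ... Otherwise prod will be sad" exchange above concerns sync order when a change touches both InitialiseSettings.php (definitions) and CommonSettings.php (consumers): the files are synced separately, so every intermediate state has to stay consistent. A minimal sketch of the failure mode, using hypothetical setting names and Python dicts in place of the PHP config files:]

```python
initialise_old = {"wmgExampleRestbaseUrl": "https://restbase.example"}   # old name only
initialise_new = {"wmgExampleRESTBaseUrl": "https://restbase.example"}   # new name only

def common_settings(initialise, reader_uses_new_name):
    """Stand-in for CommonSettings.php reading a value that
    InitialiseSettings.php is expected to define."""
    key = "wmgExampleRESTBaseUrl" if reader_uses_new_name else "wmgExampleRestbaseUrl"
    return initialise[key]  # KeyError here is the "prod will be sad" case

# If the consumer file is switched to the new name while some appserver still
# has only the old definition, requests fail during that window:
try:
    common_settings(initialise_old, reader_uses_new_name=True)
except KeyError:
    print("inconsistent intermediate state between the two syncs")

# A safe sequence keeps every intermediate state valid: define the new name
# alongside the old and sync, switch the reader and sync, then drop the old
# name and sync again -- presumably why the rename above shipped as two patches.
both_names = {**initialise_old, **initialise_new}
print(common_settings(both_names, reader_uses_new_name=True))
```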
[19:19:01] (03Merged) 10jenkins-bot: Fix incorrect spellings of "RESTBase" in config variables (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570409 (owner: 10Bartosz Dziewoński) [19:20:16] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [19:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:40] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Fix incorrect spellings of "RESTBase" in config variables (1/2) (duration: 01m 06s) [19:20:41] (03CR) 10Addshore: [C: 03+2] Fix incorrect spellings of "RESTBase" in config variables (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570410 (owner: 10Bartosz Dziewoński) [19:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:35] (03Merged) 10jenkins-bot: Fix incorrect spellings of "RESTBase" in config variables (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570410 (owner: 10Bartosz Dziewoński) [19:22:32] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:31] !log manual puppet run on netflow1001 looked good; ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕑☕ sudo cumin A:netflow "run-puppet-agent --enable 'rollout of I60692f0e8 T237587 cdanis'" [19:23:32] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: Fix incorrect spellings of "RESTBase" in config variables (2/2) 1.CS (duration: 01m 07s) [19:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:33] T237587: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 [19:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:01] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Fix incorrect spellings of "RESTBase" in config variables (2/2) 2.IS (duration: 01m 06s) [19:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:06] MatmaRex: done [19:25:21] thanks addshore [19:27:16] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2315.codfw.wmnet'] ` and were **ALL** successful. [19:28:40] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.18/extensions/Babel/includes/Babel.php: T243713 Timeout for meta api call from 10 to 2 seconds. (duration: 01m 07s) [19:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:43] T243713: Time-out error; Babel/WikibaseRepo being somehow uncached, overloading the API, and causing general outage - https://phabricator.wikimedia.org/T243713 [19:29:58] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.16/extensions/Babel/includes/Babel.php: T243713 Timeout for meta api call from 10 to 2 seconds. 
(duration: 01m 07s) [19:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:42] !log depool cp1075 (eqiad text) for minor experimentation [19:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:58] 10Operations, 10ops-codfw: codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (10Papaul) [19:32:17] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.18/extensions/WikibaseLexemeCirrusSearch: T244479 Update namespace for PrefetchingTermLookup & fix tests (duration: 01m 06s) [19:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:19] T244479: Argument 5 passed to Wikibase\Lexeme\Search\Elastic\LexemeSearchEntity::__construct() must be an instance of Wikibase\Lib\Store\PrefetchingTermLookup, instance of Wikibase\DataAccess\ByTypeDispatchingPrefetchingTermLookup given, called in /srv/mediawiki/php-1.35.0-wmf.18/extensions/WikibaseLexemeCirrusSearch/WikibaseSearch.entitytypes.repo.php on line 41 - https://phabricator.wikimedia.org/T244479 [19:33:00] 10Operations, 10ops-codfw: codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (10Papaul) a:05Papaul→03Gehel @Gehel All yours [19:33:46] papaul: thanks ! [19:33:56] !log SWAT done! [19:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:02] !log re-pool cp1075 (eqiad text) [19:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:38] oh it went live, yay [19:39:13] addshore: Were you not going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/570666 ? :-) [19:39:19] jouncebot: next [19:39:19] In 0 hour(s) and 20 minute(s): Mediawiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T2000) [19:39:27] (I'll do it.) [19:39:34] (03PS2) 10Jforrester: Set wgLogoHD before adding wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570666 (https://phabricator.wikimedia.org/T244405) (owner: 10Jdlrobson) [19:39:42] (03CR) 10Jforrester: [C: 03+2] Set wgLogoHD before adding wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570666 (https://phabricator.wikimedia.org/T244405) (owner: 10Jdlrobson) [19:40:37] (03Merged) 10jenkins-bot: Set wgLogoHD before adding wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570666 (https://phabricator.wikimedia.org/T244405) (owner: 10Jdlrobson) [19:42:19] James_F: sorry, I didnt see the consensus about if it was right or not while i was deploying the other things! :) [19:42:35] No worries. :-) [19:42:47] Want to get it out before the train, to make the dashboard more readable. [19:43:03] yupp, very understandable [19:45:13] oh the logo fix... that would be nice, it's still flooding my logs last I looked [19:45:19] Yeah, sorry about that [19:45:24] eh things happen [19:45:28] I deployed the fix for the previous bug. [19:45:30] (03PS1) 10BryanDavis: Add "migrate" action for 2020 Kubernetes migration [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570702 (https://phabricator.wikimedia.org/T244293) [19:45:51] bit by bit [19:45:55] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T244405 Set wgLogoHD before adding wordmark (duration: 01m 06s) [19:45:57] Which replaced a frequent error with one that's just common. 
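[Editor's note: the wgLogoHD/wordmark fix that just went out was explained earlier in this log: creating wgLogos['wordmark'] was moved to after wgLogos is copied into $wgLogosHD, so older code walking the HD copy never sees a key it does not understand. A toy Python rendering of that ordering, with dicts standing in for the PHP config arrays and invented values:]

```python
def make_logos():
    return {"1x": "logo.png", "2x": "logo@2x.png"}

# Buggy order: the wordmark entry is added before the copy, so legacy code
# iterating the HD copy trips over a key shaped unlike the others (T244405).
logos = make_logos()
logos["wordmark"] = {"src": "wordmark.svg", "width": 120}
logos_hd_buggy = dict(logos)

# Fixed order (as the deployed change was described above): copy first, then
# add the key that only newer code knows how to handle.
logos = make_logos()
logos_hd_fixed = dict(logos)
logos["wordmark"] = {"src": "wordmark.svg", "width": 120}

assert "wordmark" in logos_hd_buggy
assert "wordmark" not in logos_hd_fixed
print("HD copy no longer carries the wordmark entry")
```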
[19:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:59] T244405: PHP Notice: Undefined offset: 1 in ResourceLoaderSkinModule.php - https://phabricator.wikimedia.org/T244405 [19:46:36] heh [19:48:44] Fixed. Good. [19:49:09] 10Puppet, 10Release-Engineering-Team-TODO, 10User-brennen: logspam-watch: Some exceptions may be missing from logspam - https://phabricator.wikimedia.org/T244528 (10brennen) [19:49:14] (03PS2) 10BryanDavis: Add "migrate" action for 2020 Kubernetes migration [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570702 (https://phabricator.wikimedia.org/T244293) [19:50:21] 10Puppet, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), 10User-brennen: logspam-watch: Add interactive sorting / filtering - https://phabricator.wikimedia.org/T242882 (10brennen) [19:50:23] yay! [19:51:22] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10RobH) p:05Triage→03Medium [19:51:35] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10RobH) [19:52:05] 10Operations, 10hardware-requests: Expand Eqiad Ganeti row_A capacity - https://phabricator.wikimedia.org/T242885 (10RobH) 05Open→03Resolved memory ordered on T243442 and implementation tracking on T244530. resolving this task [19:52:27] (Prod clear) [19:54:41] 10Puppet, 10Release-Engineering-Team, 10Release-Engineering-Team-TODO, 10User-brennen: logspam-watch: Some exceptions may be missing from logspam - https://phabricator.wikimedia.org/T244528 (10brennen) p:05Triage→03Low [19:56:31] 10Puppet, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Deployment services), 10User-brennen: logspam-watch: Some exceptions may be missing from logspam - https://phabricator.wikimedia.org/T244528 (10brennen) [19:56:56] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10Dzahn) As of today these hosts are in site.pp with spare::system role and not in production yet. So while standard Icinga alerts should be downtimed, no actual Ganeti service would be affected... [19:59:44] (03PS1) 10Jforrester: [cswikisource] Enable VisualEditor in the 'Edice' (102) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570706 (https://phabricator.wikimedia.org/T244133) [20:00:04] twentyafterfour and marxarelli: May I have your attention please! Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T2000) [20:00:58] (03CR) 10BryanDavis: "Completely untested at this point. As all the changes stayed within the webservice script iteself I should be able to test this out with s" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570702 (https://phabricator.wikimedia.org/T244293) (owner: 10BryanDavis) [20:04:13] (03PS2) 10Dzahn: site: define 2 codfw appservers as canary_appservers [puppet] - 10https://gerrit.wikimedia.org/r/570405 (https://phabricator.wikimedia.org/T242606) [20:08:57] twentyafterfour: o/ [20:10:12] (03CR) 10Dzahn: [C: 04-2] "ugh https://puppet-compiler.wmflabs.org/compiler1002/20670/mw2163.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/570405 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [20:12:16] (03CR) 10Dzahn: [C: 04-2] "so.. 
"canary API" vs "API" roles was no actual difference in puppet but "canary app" vs "app" are making a difference and a bunch of extra" [puppet] - 10https://gerrit.wikimedia.org/r/570405 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [20:13:35] marxarelli: hey, everything looks clear for the train [20:13:57] I'm about to deploy to all wikis shortly [20:14:43] right on. i'll keep an eye on errors [20:18:58] (03PS1) 1020after4: all wikis to 1.35.0-wmf.18 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570710 [20:19:00] (03CR) 1020after4: [C: 03+2] all wikis to 1.35.0-wmf.18 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570710 (owner: 1020after4) [20:20:17] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.18 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570710 (owner: 1020after4) [20:21:48] Whee. [20:24:02] (03PS1) 10EBernhardson: Give NS_HELP same weight as NS_MAIN in search on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570712 [20:24:53] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [20:24:57] uh oh [20:25:17] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [20:25:17] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [20:25:17] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:25:18] reverting to .16 [20:25:24] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.18 refs T233866 [20:25:25] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be [20:25:25] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:27] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: 
/en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [20:25:27] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:28] T233866: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 [20:25:31] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [20:25:31] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:31] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain} [20:25:31] ections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was [20:25:31] n}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:25:32] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be [20:25:32] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:33] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain} [20:25:33] ections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was [20:25:34] n}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was 
received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:25:34] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: [20:25:35] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{ [20:25:35] eatured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [20:25:36] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [20:25:36] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [20:25:37] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:25:37] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [20:25:38] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [20:25:38] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:25:39] PROBLEM - PHP7 rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:25:39] PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: 
/{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:25:40] PROBLEM - PHP7 rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:25:40] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be [20:25:41] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:41] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [20:25:42] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [20:25:42] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:25:43] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [20:25:43] PROBLEM - Apache HTTP on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:25:44] PROBLEM - Apache HTTP on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:25:44] PROBLEM - PHP7 rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:25:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:25:45] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: 
/en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed ou [20:25:46] se was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:46] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /en.wikipedia.org/v1/pag [20:25:47] Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:25:47] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [20:25:48] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:48] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most [20:25:49] January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/ra [20:25:49] eve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [20:25:50] PROBLEM - Nginx local proxy to apache on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:25:50] PROBLEM - PHP7 rendering on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:25:51] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:25:51] PROBLEM - PHP7 rendering on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:25:54] (03CR) 10jerkins-bot: [V: 04-1] Give NS_HELP same weight as NS_MAIN in search on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570712 (owner: 10EBernhardson) [20:26:44] https://meta.wikimedia.org/w/api.php?action=sitematrix&format=json&smtype=language&smlangprop=dir%7Ccode%7Csite&smsiteprop=dbname&formatversion=2 seems to be broken [20:26:47] 
(03PS1) 1020after4: group2 wikis to 1.35.0-wmf.16 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570713 [20:26:49] (03CR) 1020after4: [C: 03+2] group2 wikis to 1.35.0-wmf.16 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570713 (owner: 1020after4) [20:27:30] (03CR) 10EBernhardson: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570712 (owner: 10EBernhardson) [20:27:38] Reedy: wmf.18 seems to have broken a lot of things, rolling back [20:27:41] that's a lot of pages [20:27:42] just got a pile of pages [20:27:42] revert? [20:27:44] sadface [20:27:50] TFW interrupted from writing an incident report by another page 🙃 [20:27:52] im here [20:27:52] timeouts [20:27:54] ack [20:28:11] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:12] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:13] jynus: yes reverting [20:28:14] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:16] let's revert? [20:28:20] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:24] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:26] (03CR) 10jerkins-bot: [V: 04-1] group2 wikis to 1.35.0-wmf.16 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570713 (owner: 1020after4) [20:28:26] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:26] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:31] mukunda already is. 
[13:26:49] (CR) 20after4: [C: +2] group2 wikis to 1.35.0-wmf.16 refs T233866 [mediawiki-config] - https://gerrit.wikimedia.org/r/570713 (owner: 20after4) [20:28:34] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:37] (03CR) 1020after4: [V: 03+2 C: 03+2] group2 wikis to 1.35.0-wmf.16 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570713 (owner: 1020after4) [20:28:38] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:38] PROBLEM - Apache HTTP on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:28:38] PROBLEM - Apache HTTP on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:28:38] PROBLEM - PHP7 rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:28:38] PROBLEM - Apache HTTP on mw1243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:28:40] wow [20:28:41] PROBLEM - Apache HTTP on mw1255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:28:42] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:42] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:42] PROBLEM - Nginx local proxy to apache on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:28:42] PROBLEM - Apache HTTP on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:28:42] PROBLEM - PHP7 rendering on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:28:42] PROBLEM - Apache HTTP on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:28:42] PROBLEM - Apache HTTP on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:28:44] PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:28:52] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ReadTimeoutError(HTTPConnectionPool(host=wikifeeds.svc.codfw.wmnet, port=8889): Read timed out. 
(read timeout=15),): /?spec https://wikitech.wikimedia.org/wiki/Wikifeeds [20:28:56] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:29:01] :( [20:29:14] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:29:14] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:29:14] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:29:22] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: [20:29:22] g/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:29:22] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [mostread] https://wikit [20:29:22] /wiki/Services/Monitoring/restbase [20:29:24] PROBLEM - graphoid endpoints health on scb2003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:29:25] <_joe_> grafana is not responding [20:29:25] #rip [20:29:41] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [20:29:44] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:29:47] no idea what went wrong with wmf.18 - I'm syncing the revert back to wmf.16 right now [20:29:50] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1344.eqiad.wmnet, mw1272.eqiad.wmnet, mw1320.eqiad.wmnet, mw1250.eqiad.wmnet, mw1266.eqiad.wmnet, mw1223.eqiad.wmnet, mw1282.eqiad.wmnet, mw1333.eqiad.wmnet, mw1241.eqiad.wmnet, mw1221.eqiad.wmnet, mw1317.eqiad.wmnet, mw1224.eqiad.wmnet, mw1316.eqiad.wmnet, mw1325.eqiad.wmnet, mw1312.eqiad.wmnet, mw1347.eqiad.wmnet, mw1342 [20:29:51] 270.eqiad.wmnet, mw1341.eqiad.wmnet, mw1332.eqiad.wmnet, mw1313.eqiad.wmnet, mw1346.eqiad.wmnet, mw1246.eqiad.wmnet, mw1322.eqiad.wmnet, mw1288.eqiad.wmnet, 
mw1281.eqiad.wmnet, mw1314.eqiad.wmnet, mw1323.eqiad.wmnet, mw1227.eqiad.wmnet, mw1233.eqiad.wmnet, mw1327.eqiad.wmnet, mw1245.eqiad.wmnet, mw1340.eqiad.wmnet, mw1258.eqiad.wmnet, mw1225.eqiad.wmnet, mw1264.eqiad.wmnet, mw1255.eqiad.wmnet, mw1257.eqiad.wmnet, mw1244.eqiad.wmn [20:29:51] wmnet, mw1234.eqiad.wmnet, mw1235.eqiad.wmnet, mw1231.eqiad.wmnet, mw1315.eqiad.wmnet, mw1 https://wikitech.wikimedia.org/wiki/PyBal [20:29:54] PROBLEM - LVS HTTP IPv4 on wikifeeds.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:30:07] !log twentyafterfour@deploy1001 Scap failed!: 9/11 canaries failed their endpoint checks(http://en.wikipedia.org) [20:30:08] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:10] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/media/math/check/{ty [20:30:10] eck test formula) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out befor [20:30:10] received: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response wa [20:30:10] rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:30:12] doh! 
[20:30:14] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:17] --force [20:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:22] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:22] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:22] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:23] <_joe_> twentyafterfour: use --force [20:30:27] <_joe_> sigh [20:30:34] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:40] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3052 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 9.958 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:30:42] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp3052 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 8.475 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:30:48] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 9.298 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:30:49] !log sync-wikiversions --force [20:30:50] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:54] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3062 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 8.225 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:30:56] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:56] PROBLEM - Nginx local proxy to apache on mw1263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:30:56] PROBLEM - Nginx local proxy to apache on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:30:56] PROBLEM - Nginx local proxy to apache on mw1271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:30:56] PROBLEM - Nginx local proxy to apache on mw1275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:30:56] PROBLEM - PHP7 rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:30:56] PROBLEM - Nginx local proxy to apache on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:30:57] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:30:57] PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:30:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:58] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:58] PROBLEM - Nginx local proxy to apache on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:31:00] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:31:02] _joe_: yeah, should have used that the first time [20:31:04] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:31:06] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:31:10] we really need to make a rollback-wikiversions that always does --force [20:31:15] Yeah. [20:31:16] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:31:20] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3064 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 6.330 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:21] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 6.529 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:24] RECOVERY - PHP7 rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 8.895 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:24] RECOVERY - Apache HTTP on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.740 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:24] RECOVERY - Nginx local proxy to apache on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.836 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:26] RECOVERY - PHP7 rendering on mw1344 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 8.486 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:26] RECOVERY - PHP7 rendering on mw1341 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 7.447 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:26] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.155 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:26] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 7.175 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:28] RECOVERY - Apache HTTP on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.783 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:28] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 2.176 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:30] RECOVERY - Apache HTTP on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.504 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [20:31:30] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.709 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:30] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp3064 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 1.634 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:31] RECOVERY - Apache HTTP on mw1340 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.161 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:31] RECOVERY - PHP7 rendering on mw1324 is OK: HTTP OK: HTTP/1.1 200 OK - 75267 bytes in 8.202 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:31] RECOVERY - Nginx local proxy to apache on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.231 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:32] RECOVERY - PHP7 rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 8.812 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:32] RECOVERY - Apache HTTP on mw1255 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.994 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:34] RECOVERY - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15057 bytes in 9.499 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:31:34] RECOVERY - Apache HTTP on mw1326 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.571 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:34] RECOVERY - Apache HTTP on mw1339 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.943 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:34] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain} [20:31:35] ections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was [20:31:35] n}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:31:35] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not [20:31:35] xistent 
title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [20:31:36] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:31:38] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15036 bytes in 9.619 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:31:38] RECOVERY - Apache HTTP on mw1247 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.289 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:38] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:38] RECOVERY - PHP7 rendering on mw1345 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 8.529 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:38] RECOVERY - PHP7 rendering on mw1274 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 8.219 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:38] RECOVERY - Nginx local proxy to apache on mw1320 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.656 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:42] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:43] marxarelli: yeah a fast rollback command would be nice [20:31:44] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/availability (Retrieve feed content availability from \wikipedia.org\) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve featured article info for [20:31:44] (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a response was received: /{domain}/v1/page/news (get In the News content for unsupported language (with aggregated=true)) timed out before a response was received: / (spec from root) timed out b [20:31:44] was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before [20:31:44] eceived: /{domain}/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [20:31:46] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3064 
is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:46] RECOVERY - PHP7 rendering on mw1239 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 5.071 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:46] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:46] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.240 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:48] RECOVERY - Nginx local proxy to apache on mw1249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.562 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:48] RECOVERY - Nginx local proxy to apache on mw1247 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.940 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:48] RECOVERY - PHP7 rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 7.723 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:48] RECOVERY - Nginx local proxy to apache on mw1251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.911 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:49] that recovery sure was quick [20:31:50] RECOVERY - Apache HTTP on mw1297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.981 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:50] RECOVERY - Apache HTTP on mw1253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.593 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:52] RECOVERY - Apache HTTP on mw1242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.622 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:54] RECOVERY - PHP7 rendering on mw1256 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 4.186 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:55] RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.451 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:55] RECOVERY - PHP7 rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 7.830 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:55] RECOVERY - PHP7 rendering on mw1339 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 8.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:55] RECOVERY - Apache HTTP on mw1320 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.110 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:55] RECOVERY - Apache HTTP on mw1327 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.240 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:55] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3052 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:56] RECOVERY - PHP7 rendering on mw1223 is OK: HTTP 
OK: HTTP/1.1 200 OK - 75328 bytes in 8.717 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:56] RECOVERY - Nginx local proxy to apache on mw1268 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:56] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.068 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:56] RECOVERY - Apache HTTP on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.182 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:57] RECOVERY - PHP7 rendering on mw1258 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 6.816 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:58] RECOVERY - LVS HTTP IPv4 #page on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 14502 bytes in 9.632 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:31:58] RECOVERY - Nginx local proxy to apache on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.743 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:58] RECOVERY - PHP7 rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 7.086 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:59] that is _definitely_ a broken branch :-/ [20:31:59] RECOVERY - Apache HTTP on mw1248 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 3.477 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:59] RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.930 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:01] RECOVERY - Apache HTTP on mw1324 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 5.927 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:01] RECOVERY - Apache HTTP on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.612 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:01] RECOVERY - Apache HTTP on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.780 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:01] RECOVERY - PHP7 rendering on mw1320 is OK: HTTP OK: HTTP/1.1 200 OK - 75335 bytes in 7.137 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:32:02] RECOVERY - Nginx local proxy to apache on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.278 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:02] RECOVERY - Apache HTTP on mw1245 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.346 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:03] RECOVERY - Nginx local proxy to apache on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.962 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:03] RECOVERY - Apache HTTP on mw1249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.293 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:04] 
RECOVERY - PHP7 rendering on mw1246 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 3.847 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:32:04] RECOVERY - PHP7 rendering on mw1240 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 3.896 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:32:05] RECOVERY - PHP7 rendering on mw1254 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 4.156 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:32:05] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.219 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:06] RECOVERY - Nginx local proxy to apache on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.562 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:06] RECOVERY - PHP7 rendering on mw1252 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 5.916 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:32:07] RECOVERY - Nginx local proxy to apache on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 5.922 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:07] RECOVERY - PHP7 rendering on mw1329 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 6.361 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:32:08] RECOVERY - Apache HTTP on mw1251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 5.416 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:08] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3062 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:32:09] RECOVERY - Nginx local proxy to apache on mw1342 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.890 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:09] RECOVERY - Apache HTTP on mw1231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.872 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:10] RECOVERY - PHP7 rendering on mw1287 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 7.007 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:32:10] RECOVERY - Apache HTTP on mw1271 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.143 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:11] RECOVERY - Nginx local proxy to apache on mw1339 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.587 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:11] RECOVERY - Nginx local proxy to apache on mw1274 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.742 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:28] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: (no justification provided) [20:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:42] sorry everyone, the error logs were perfectly clean on group1 so I have no idea how that failed so spectacularly on group2 [20:33:16] Yeah. 
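
A minimal sketch, assuming only what is visible in this log, of the "rollback-wikiversions that always does --force" helper floated above: it simply wraps the scap sync-wikiversions --force invocation the deployer ran by hand. The wrapper name, the Python packaging, and the default justification message are illustrative assumptions, not an existing scap subcommand.

#!/usr/bin/env python3
# Hypothetical "rollback-wikiversions" wrapper (illustrative only, not a real scap
# subcommand). It shells out to the same command used manually above, always passing
# --force so a rollback is not blocked by failing canary endpoint checks. The trailing
# argument is the justification message that ends up in the SAL entries seen in this log.
import subprocess
import sys

def rollback_wikiversions(message):
    cmd = ["scap", "sync-wikiversions", "--force", message]
    return subprocess.call(cmd)  # returns scap's exit code

if __name__ == "__main__":
    msg = " ".join(sys.argv[1:]) or "rollback: revert wikiversions to previous branch"
    sys.exit(rollback_wikiversions(msg))

The --force flag is the whole point here: as seen at 20:30:07, a plain sync aborts when the canary endpoint checks fail, which is exactly the state a rollback is trying to recover from.
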
[20:33:44] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:33:44] RECOVERY - graphoid endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:33:45] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:33:45] RECOVERY - Nginx local proxy to apache on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:33:46] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:33:50] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:33:52] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3052 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:33:52] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:33:52] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:33:54] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:33:58] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:33:58] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:33:58] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:33:58] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:02] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 71.93 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:34:04] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3064 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:10] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time 
https://wikitech.wikimedia.org/wiki/Varnish [20:34:18] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:18] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:34:24] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3064 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:24] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:26] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [20:34:26] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3064 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:28] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:34:30] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:32] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:34:36] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:38] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:44] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:46] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:34:53] (03PS2) 10EBernhardson: Give NS_HELP same weight as NS_MAIN in search on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570712 [20:35:00] RECOVERY - ATS TLS has reduced HTTP availability #page on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [20:35:00] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:35:04] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:35:18] RECOVERY - phpfpm_up reduced availability on icinga1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:35:28] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:35:30] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [20:35:32] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:35:34] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:36:12] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.8875 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:37:06] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:37:50] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:38:42] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:42:50] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:52] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:43] !log restart restbase on restbase1027 [20:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:34] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2316.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [20:48:49] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [20:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:16] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikifeeds_8889: Servers kubernetes1001.eqiad.wmnet, kubernetes1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:49:22] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:49:32] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /{domain}/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /{domain}/v1/feed/availability (Retrieve feed content availability from \wikipedia.org\) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: / [20:49:32] /image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a response was received: /{domain}/v1/page/news (get In the News content for unsupported language (with aggregated=true)) timed out [20:49:32] e was received: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve featured article info for unsupported site (with aggregated=true)) timed out [20:49:32] was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve most-read articles for date with no data (with [20:49:32] ) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [20:51:06] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:52:00] !log restart all wikifeeds pods [20:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:20] RECOVERY - 
restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:56] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:52:56] RECOVERY - LVS HTTP IPv4 on wikifeeds.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 945 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:53:00] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [20:53:01] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:01] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:01] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:01] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:53:01] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:02] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:02] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:02] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:03] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:03] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:04] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:04] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:05] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:05] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:06] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:53:46] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:55:45] 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Release-Engineering-Team-TODO, and 2 others: Horizon hiera UI: investigate data type handling - https://phabricator.wikimedia.org/T243422 (10Andrew) I've confirmed that the behavior with the yaml-based UI is correct. For the guided interface, st... 
[20:59:16] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [21:03:23] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [21:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:40] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:23] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2316.codfw.wmnet'] ` and were **ALL** successful. [21:12:43] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2317.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [21:17:29] (03PS1) 10Alexandros Kosiaris: wikifeeds: Redefine CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/570726 [21:18:27] (03PS2) 10Alexandros Kosiaris: wikifeeds: Redefine CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/570726 (https://phabricator.wikimedia.org/T244535) [21:22:46] (03CR) 10Jforrester: wikifeeds: Redefine CPU limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/570726 (https://phabricator.wikimedia.org/T244535) (owner: 10Alexandros Kosiaris) [21:23:09] 10Operations, 10Scap, 10serviceops-radar, 10User-brennen, 10User-jijiki: Introduce state to Scap - https://phabricator.wikimedia.org/T209881 (10brennen) [21:27:46] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [21:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:03] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:56] Is the train moving tonight or blocked? Not seen anything on phab [21:34:45] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2317.codfw.wmnet'] ` and were **ALL** successful. [21:35:13] (03CR) 10Bstorm: [C: 03+1] "It definitely *looks* like it would work. It may even be useful for end users. Not merging to wait for test results :)" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570702 (https://phabricator.wikimedia.org/T244293) (owner: 10BryanDavis) [21:37:02] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2318.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [21:38:31] twentyafterfour: see my Q above [21:39:18] RhinosF1: blocked due to a massive outage after deploying wmf.18. The root cause is still under investigation [21:40:15] twentyafterfour: is there going to be a phab update on the task? Cause it’s looking like we’ll miss tonight’s window [21:40:33] !log train blocked due to serious incident related to deploying the latest branch. 
Incident documentation: https://wikitech.wikimedia.org/wiki/Incident_documentation/20200206-mediawiki refs T233866 [21:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:37] T233866: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 [21:42:21] Thanks twentyafterfour [21:43:10] (03PS4) 10Clarakosi: Add restbase202[123] to hiera [puppet] - 10https://gerrit.wikimedia.org/r/570094 (https://phabricator.wikimedia.org/T244178) [21:52:03] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [21:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:19] 10Operations, 10Traffic: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (10Vgutierrez) [21:52:59] 10Operations, 10Traffic: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (10Vgutierrez) p:05Triage→03High [21:54:17] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:00] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2318.codfw.wmnet'] ` and were **ALL** successful. [21:58:59] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2319.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [22:03:24] 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Release-Engineering-Team-TODO, and 2 others: Horizon hiera UI: investigate data type handling - https://phabricator.wikimedia.org/T243422 (10Andrew) This is happening because yaml.safe_dump() (and yaml.dump()) does some weird arbitrary quoting of... [22:08:37] (03PS1) 10Jforrester: Don't trying to assign $wgLogos to $wgLogoHD if it's unset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570729 (https://phabricator.wikimedia.org/T244405) [22:08:46] (03CR) 10Dzahn: [C: 03+1] Give NS_HELP same weight as NS_MAIN in search on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570712 (owner: 10EBernhardson) [22:10:07] (03CR) 10Dzahn: [C: 03+1] "the difference is normal since a) mediawiki-testers user group gets added b) scap sql scripts get removed intentionally" [puppet] - 10https://gerrit.wikimedia.org/r/570405 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [22:11:27] twentyafterfour: that’s window over so train resumes Monday assuming unblocked right? [22:12:01] (03CR) 10Dzahn: [C: 03+2] site: define 2 codfw appservers as canary_appservers [puppet] - 10https://gerrit.wikimedia.org/r/570405 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [22:12:18] RhinosF1: yes I believe you are correct, unless a miraculous discovery determines the root cause of the issues somewhat soon [22:12:57] twentyafterfour: cool. 
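
For context on the yaml.safe_dump() quoting Andrew describes at 22:03 (T243422): a minimal standalone demonstration of that PyYAML behaviour, assuming PyYAML is installed; this is illustrative only and not the Horizon/hiera code. PyYAML quotes strings that a plain YAML scalar would otherwise reload as another type (booleans, numbers) and leaves other strings unquoted, which looks arbitrary when the dumped text is shown back to users.

```python
# Standalone demonstration of PyYAML's context-dependent quoting (cf. T243422).
# Not the Horizon/hiera code itself, just the library behaviour mentioned above.
import yaml

data = {
    "plain": "some text",   # an ordinary string: dumped without quotes
    "looks_bool": "yes",    # a plain `yes` would reload as a YAML 1.1 boolean...
    "looks_int": "123",     # ...and a plain `123` as an integer, so both get quoted
    "real_bool": True,
    "real_int": 123,
}

print(yaml.safe_dump(data, default_flow_style=False))
# looks_bool: 'yes'      <- quoted to keep it a string
# looks_int: '123'       <- quoted to keep it a string
# plain: some text       <- no quotes added
# real_bool: true
# real_int: 123
```
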
[22:13:40] !log turning mw2271 and mw2163 into canary appservers for codfw, this adds mediawiki-testers shell users and removes scap sql scripts, rest stays as is (T242606) [22:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:44] T242606: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 [22:15:04] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [22:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:50] (03CR) 10Dzahn: "things done by puppet:" [puppet] - 10https://gerrit.wikimedia.org/r/570405 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [22:18:55] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:30] 10Operations, 10serviceops, 10Patch-For-Review: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10Dzahn) [22:23:03] 10Operations, 10serviceops, 10Patch-For-Review: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10Dzahn) mw2163 and mw2271 have been turned into canary appservers now. As opposed to canary API appservers this means actual puppet changes which are: - mediawiki-testers shell access... [22:23:14] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2320.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [22:23:37] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2319.codfw.wmnet'] ` and were **ALL** successful. [22:24:02] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2321.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [22:24:05] 10Operations, 10serviceops, 10Patch-For-Review: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10Dzahn) @jijiki What do you think ? Is this good now? 4 of each type and in different rows/racks. [22:27:10] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) @Muehlenhoff Added them with private IPs, created VMs with the cookbook, then attempted OS install but on both of them it failed at the very end with GRUB install. I don't think this happen... 
[22:38:15] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [22:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:04] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [22:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:33] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:46] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:48] (03PS1) 10Bstorm: kubernetes: resource requests should be proportional to limits [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570734 (https://phabricator.wikimedia.org/T244289) [22:46:21] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2320.codfw.wmnet'] ` and were **ALL** successful. [22:47:10] I'm live on mwdebug1001. [22:47:28] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2321.codfw.wmnet'] ` and were **ALL** successful. [22:50:37] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.18/extensions/VisualEditor: T242184 Change tags method so anon edits will go through (duration: 01m 08s) [22:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:42] T242184: Create a change tag for edits made using DiscussionTools - https://phabricator.wikimedia.org/T242184 [22:51:37] (03CR) 10Jforrester: [C: 03+2] Don't trying to assign $wgLogos to $wgLogoHD if it's unset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570729 (https://phabricator.wikimedia.org/T244405) (owner: 10Jforrester) [22:52:52] (03Merged) 10jenkins-bot: Don't trying to assign $wgLogos to $wgLogoHD if it's unset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570729 (https://phabricator.wikimedia.org/T244405) (owner: 10Jforrester) [22:58:08] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2322.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [22:58:12] (03PS1) 10Dzahn: httpd: add x-request-id to apache httpd logs [puppet] - 10https://gerrit.wikimedia.org/r/570735 (https://phabricator.wikimedia.org/T244545) [22:58:25] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2323.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[22:58:30] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2323.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2323.codfw.wmnet'] ` [23:02:42] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2320.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [23:03:16] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2320.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2320.codfw.wmnet'] ` [23:04:38] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2323.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [23:05:17] (03PS3) 10Dzahn: switch webproxy CNAMEs to new install servers [dns] - 10https://gerrit.wikimedia.org/r/569680 (https://phabricator.wikimedia.org/T224576) [23:08:20] (03CR) 10Dzahn: switch apt.wikimedia.org from install1002 to install1003 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/569682 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [23:10:04] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T244405 Don't trying to assign to if it's unset (duration: 01m 07s) [23:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:08] T244405: PHP Notice: Undefined offset: 1 in ResourceLoaderSkinModule.php - https://phabricator.wikimedia.org/T244405 [23:10:26] (03CR) 10Dzahn: [C: 04-2] "needed instead from old servers to new separate VM for apt repo" [puppet] - 10https://gerrit.wikimedia.org/r/569691 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [23:11:13] (03PS1) 10BBlack: depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/570736 [23:11:39] * Reedy hands James_F a \ [23:13:11] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [23:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:23] Oh, oops, yeah. [23:13:28] * James_F coughs. 
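
The exchange at 23:10-23:13 (the sync log line reading "Don't trying to assign to if it's unset", Reedy handing James_F a backslash, James_F coughing) most likely comes down to shell expansion: an unescaped $wgLogos or $wgLogoHD inside a double-quoted command-line message is expanded by the shell (to an empty string when unset) before scap ever logs it, so the variable names vanish. A small reproduction, assuming a bash shell is available; the message text here is illustrative, not the exact command that was run:

```python
# Reproduce how unescaped $variables disappear from a double-quoted shell message.
import subprocess

unescaped = 'echo "T244405 Don\'t try to assign $wgLogos to $wgLogoHD if it\'s unset"'
escaped = 'echo "T244405 Don\'t try to assign \\$wgLogos to \\$wgLogoHD if it\'s unset"'

print(subprocess.run(["bash", "-c", unescaped], capture_output=True, text=True).stdout)
# T244405 Don't try to assign  to  if it's unset   (variables silently expanded to nothing)
print(subprocess.run(["bash", "-c", escaped], capture_output=True, text=True).stdout)
# T244405 Don't try to assign $wgLogos to $wgLogoHD if it's unset
```
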
[23:15:27] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:39] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10Jclark-ctr) [23:17:47] (03PS3) 10Dzahn: install_server: switch tftp server in DHCP to new install servers [puppet] - 10https://gerrit.wikimedia.org/r/569684 (https://phabricator.wikimedia.org/T224576) [23:18:09] (03CR) 10BryanDavis: [C: 03+1] "Yes, please :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570712 (owner: 10EBernhardson) [23:19:36] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [23:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:41] (03PS2) 10Jforrester: [nlwiki] Enable VisualEditor in the Project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570419 (https://phabricator.wikimedia.org/T159711) [23:19:53] (03CR) 10Jforrester: [C: 03+2] [nlwiki] Enable VisualEditor in the Project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570419 (https://phabricator.wikimedia.org/T159711) (owner: 10Jforrester) [23:20:10] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2322.codfw.wmnet'] ` and were **ALL** successful. [23:20:13] (03PS2) 10Jforrester: [cswikisource] Enable VisualEditor in the 'Edice' (102) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570706 (https://phabricator.wikimedia.org/T244133) [23:20:18] (03CR) 10Jforrester: [C: 03+2] [cswikisource] Enable VisualEditor in the 'Edice' (102) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570706 (https://phabricator.wikimedia.org/T244133) (owner: 10Jforrester) [23:20:33] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2324.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[23:20:57] (03Merged) 10jenkins-bot: [nlwiki] Enable VisualEditor in the Project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570419 (https://phabricator.wikimedia.org/T159711) (owner: 10Jforrester) [23:21:47] (03Merged) 10jenkins-bot: [cswikisource] Enable VisualEditor in the 'Edice' (102) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570706 (https://phabricator.wikimedia.org/T244133) (owner: 10Jforrester) [23:21:53] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:42] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T159711 T161365 T164435 [nlwiki] Enable VisualEditor in the Project namespace (duration: 01m 08s) [23:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:46] T164435: Enable visual editor for the Wikipedia namespace on nl.wikipedia.org - https://phabricator.wikimedia.org/T164435 [23:22:47] T161365: Enable VisualEditor by default for all users of the Dutch Wikipedia - https://phabricator.wikimedia.org/T161365 [23:22:47] T159711: Enable visual editor on Dutch Wikipedia in User namespace and possibly Wikipedia namespace - https://phabricator.wikimedia.org/T159711 [23:25:33] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T244133 [cswikisource] Enable VisualEditor in the Edice namespace (duration: 01m 07s) [23:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:36] T244133: Enable VIsualEditor for ns:102 on cs.wikisource - https://phabricator.wikimedia.org/T244133 [23:26:37] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 37 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:26:37] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2323.codfw.wmnet'] ` and were **ALL** successful. [23:27:35] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2325.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [23:31:57] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:35:37] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [23:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:56] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:41] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2324.codfw.wmnet'] ` and were **ALL** successful. 
[23:42:36] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [23:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:53] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:23] (03CR) 10BryanDavis: [C: 03+2] kubernetes: resource requests should be proportional to limits (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570734 (https://phabricator.wikimedia.org/T244289) (owner: 10Bstorm) [23:48:06] (03Merged) 10jenkins-bot: kubernetes: resource requests should be proportional to limits [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570734 (https://phabricator.wikimedia.org/T244289) (owner: 10Bstorm) [23:48:35] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2325.codfw.wmnet'] ` and were **ALL** successful. [23:49:22] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:53:57] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2326.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [23:54:20] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2327.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [23:54:34] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:56:44] PROBLEM - Host mw2327 is DOWN: PING CRITICAL - Packet loss = 100%
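
Regarding the tools-webservice change merged at 23:48 ("kubernetes: resource requests should be proportional to limits", T244289): the commit subject suggests deriving each container's Kubernetes resource requests as a fixed fraction of its configured limits rather than hard-coding them. A rough sketch of that idea follows; the ratio, helper names and unit handling are assumptions for illustration and not the actual patch.

```python
# Illustrative sketch only: derive Kubernetes resource *requests* as a fixed
# fraction of the configured *limits* (the idea named in the commit subject).

REQUEST_RATIO = 0.5  # hypothetical fraction of the limit to request up front


def parse_cpu(value: str) -> float:
    """Return CPU in cores, accepting plain cores ('1') or millicores ('500m')."""
    return float(value[:-1]) / 1000 if value.endswith("m") else float(value)


def parse_memory(value: str) -> int:
    """Return memory in bytes for a small subset of Kubernetes suffixes."""
    units = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}
    for suffix, factor in units.items():
        if value.endswith(suffix):
            return int(float(value[: -len(suffix)]) * factor)
    return int(value)


def requests_for(limits: dict) -> dict:
    """Compute proportional requests for a {'cpu': ..., 'memory': ...} limits dict."""
    cpu_millicores = int(parse_cpu(limits["cpu"]) * REQUEST_RATIO * 1000)
    memory_mib = int(parse_memory(limits["memory"]) * REQUEST_RATIO / 1024 ** 2)
    return {"cpu": f"{cpu_millicores}m", "memory": f"{memory_mib}Mi"}


if __name__ == "__main__":
    print(requests_for({"cpu": "500m", "memory": "2Gi"}))
    # {'cpu': '250m', 'memory': '1024Mi'}
```
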