[00:00:05] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T0000).
[00:00:05] No GERRIT patches in the queue for this window AFAICS.
[00:02:20] (03CR) 10RLazarus: [C: 03+1] "I see this depends on the patch to enable forensic logging, which I haven't looked at yet -- but the changes here LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/570256 (owner: 10Giuseppe Lavagetto)
[00:04:47] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100%
[00:07:30] 10Operations, 10SRE-tools: Homer: commit> no causes stacktrace - https://phabricator.wikimedia.org/T244362 (10Volans) a:03Volans
[00:09:54] (03PS1) 10Volans: Handle commit abort separately [software/homer] - 10https://gerrit.wikimedia.org/r/570483 (https://phabricator.wikimedia.org/T244362)
[00:11:32] 10Operations, 10ops-eqiad, 10Dumps-Generation: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T241794 (10Jclark-ctr) Host is racked rack B5 U25 . Switchport 14
[00:12:01] 10Operations, 10ops-eqiad, 10Dumps-Generation: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T241794 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson
[00:12:18] 10Operations, 10ops-eqiad, 10Dumps-Generation: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T241794 (10Jclark-ctr)
[00:27:16] 10Operations, 10SRE-tools: Homer: commit timeout on MX104 and SRXs - https://phabricator.wikimedia.org/T244363 (10Volans) a:03Volans
[00:29:04] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 84.10 ms
[00:35:50] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10Krinkle) Looks like we've recovered about 5-6%, but still consistently regressed by 9-10% overall, e.g. 484ms...
[00:35:52] (03CR) 10Ayounsi: [C: 03+1] "Tested and works as expected." [software/homer] - 10https://gerrit.wikimedia.org/r/570483 (https://phabricator.wikimedia.org/T244362) (owner: 10Volans)
[00:37:34] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10Krinkle) Currently open patches: >>! From T238494#5750475: > [operations/puppet@production] mediawiki::webse...
[01:00:04] twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Phabricator update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T0100).
[01:14:46] (03CR) 10Dzahn: [C: 03+1] "@gehel sure you don't want to use the opportunity to test buster on one of them while there is hardware that is not in production yet?" [puppet] - 10https://gerrit.wikimedia.org/r/570465 (https://phabricator.wikimedia.org/T242301) (owner: 10Papaul)
[01:23:34] 10Operations, 10ops-codfw: codfw: new mw servers not getting an IP when default to Stretch - https://phabricator.wikimedia.org/T244438 (10Papaul)
[01:23:49] 10Operations, 10ops-codfw: codfw: new mw servers not getting an IP when default to Stretch - https://phabricator.wikimedia.org/T244438 (10Papaul) p:05Triage→03Medium
[01:31:21] (03CR) 10Dzahn: [C: 03+1] "the "wdqs*" line would match first and echo.
i think you need to move the special case before the wildcard case" [puppet] - 10https://gerrit.wikimedia.org/r/570465 (https://phabricator.wikimedia.org/T242301) (owner: 10Papaul) [01:37:01] (03PS2) 10Dzahn: add private IPs for new install servers [dns] - 10https://gerrit.wikimedia.org/r/570468 (https://phabricator.wikimedia.org/T224576) [01:37:06] (03PS2) 10Papaul: DHCP: Add wdqs200[7-8] to netboot.cfg and MAC address [puppet] - 10https://gerrit.wikimedia.org/r/570465 (https://phabricator.wikimedia.org/T242301) [01:39:26] (03CR) 10Papaul: [C: 03+2] DHCP: Add wdqs200[7-8] to netboot.cfg and MAC address [puppet] - 10https://gerrit.wikimedia.org/r/570465 (https://phabricator.wikimedia.org/T242301) (owner: 10Papaul) [01:53:36] (03PS1) 10Volans: uwsgi: fix removal of init.d links on buster [puppet] - 10https://gerrit.wikimedia.org/r/570489 [01:54:23] (03PS2) 10Volans: uwsgi: fix removal of init.d links on buster [puppet] - 10https://gerrit.wikimedia.org/r/570489 [01:56:09] (03CR) 10Volans: "The current logged error (visible only in debug mode, thanks Faidon for finding it) is:" [puppet] - 10https://gerrit.wikimedia.org/r/570489 (owner: 10Volans) [02:02:37] (03PS3) 10Volans: uwsgi: fix removal of init.d links on buster [puppet] - 10https://gerrit.wikimedia.org/r/570489 [02:02:39] (03PS1) 10Volans: uwsgi: update check for buster [puppet] - 10https://gerrit.wikimedia.org/r/570490 [02:04:06] (03CR) 10Papaul: [C: 03+1] add private IPs for new install servers [dns] - 10https://gerrit.wikimedia.org/r/570468 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [02:04:32] (03CR) 10Volans: "compiler results here:" [puppet] - 10https://gerrit.wikimedia.org/r/570489 (owner: 10Volans) [02:04:42] (03PS3) 10Dzahn: add private IPs for new install servers [dns] - 10https://gerrit.wikimedia.org/r/570468 (https://phabricator.wikimedia.org/T224576) [02:04:58] (03CR) 10Dzahn: [C: 03+2] add private IPs for new install servers [dns] - 10https://gerrit.wikimedia.org/r/570468 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [02:13:18] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) [02:13:54] 10Operations, 10ops-codfw, 10serviceops-radar: codfw: new mw servers not getting an IP when default to Stretch - https://phabricator.wikimedia.org/T244438 (10Dzahn) [02:15:19] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Dzahn) 05Open→03Stalled currently blocked on T244438 , an installer issue on stretch that only happens on stretch and buster would not have a problem [02:18:42] 10Operations, 10vm-requests: VM requests for install_server replacements - https://phabricator.wikimedia.org/T244390 (10Dzahn) https://gerrit.wikimedia.org/r/c/operations/dns/+/570468 [02:19:03] 10Operations, 10vm-requests: VM requests for install_server replacements - https://phabricator.wikimedia.org/T244390 (10Dzahn) reverted / removed public IPs, added private IPs [02:20:57] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [02:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:36] !log ganeti - Creating new VM named install1003.eqiad.wmnet in eqiad with row=C vcpu=1 memory=1 gigabytes disk=20 gigabytes link=private (T244390) [02:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:38] T244390: VM requests for install_server replacements - https://phabricator.wikimedia.org/T244390 [02:30:24] !log 
dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [02:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:39:00] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [02:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:42:57] !log ganeti - Creating new VM named install2003.codfw.wmnet in codfw with row=A vcpu=1 memory=1 gigabytes disk=20 gigabytes link=private (T244390) [02:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:00] T244390: VM requests for install_server replacements - https://phabricator.wikimedia.org/T244390 [02:44:55] 10Operations, 10vm-requests: VM requests for install_server replacements - https://phabricator.wikimedia.org/T244390 (10Dzahn) [02:49:21] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [02:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:51:03] (03PS2) 10Dzahn: DHCP: add install1003/install2003, using current install servers [puppet] - 10https://gerrit.wikimedia.org/r/569685 (https://phabricator.wikimedia.org/T224576) [02:52:41] (03PS1) 10Dzahn: DHCP: remove 'buster-installer' lines, now default and superfluous [puppet] - 10https://gerrit.wikimedia.org/r/570498 [02:54:11] (03CR) 10Dzahn: [C: 03+2] DHCP: add install1003/install2003, using current install servers [puppet] - 10https://gerrit.wikimedia.org/r/569685 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [02:54:22] (03PS3) 10Dzahn: DHCP: add install1003/install2003, using current install servers [puppet] - 10https://gerrit.wikimedia.org/r/569685 (https://phabricator.wikimedia.org/T224576) [02:56:39] (03PS2) 10Dzahn: DHCP: remove 'buster-installer' lines, now default and superfluous [puppet] - 10https://gerrit.wikimedia.org/r/570498 [02:59:56] (03PS3) 10Dzahn: DHCP: remove 'buster-installer' lines, now default and superfluous, linting [puppet] - 10https://gerrit.wikimedia.org/r/570498 [03:01:25] (03CR) 10Dzahn: [C: 03+2] DHCP: remove 'buster-installer' lines, now default and superfluous, linting [puppet] - 10https://gerrit.wikimedia.org/r/570498 (owner: 10Dzahn) [03:13:32] (03PS1) 10Dzahn: install: fremove next-server for new install servers for OS install [puppet] - 10https://gerrit.wikimedia.org/r/570505 (https://phabricator.wikimedia.org/T224576) [03:14:31] (03PS2) 10Dzahn: install: remove next-server for new install servers for OS install [puppet] - 10https://gerrit.wikimedia.org/r/570505 (https://phabricator.wikimedia.org/T224576) [03:16:11] (03PS1) 10Andrew Bogott: Keystone: set max_active_keys for fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570507 (https://phabricator.wikimedia.org/T243418) [03:16:23] (03CR) 10Dzahn: [C: 03+2] install: remove next-server for new install servers for OS install [puppet] - 10https://gerrit.wikimedia.org/r/570505 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [03:25:32] (03PS1) 10CDanis: fastnetmon: connect via NRPE to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) [03:26:21] (03PS1) 10Volans: junos: handle timeouts separately [software/homer] - 10https://gerrit.wikimedia.org/r/570510 (https://phabricator.wikimedia.org/T244363) [03:28:45] (03PS2) 10CDanis: fastnetmon: connect via NRPE to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) [03:31:34] (03PS3) 10CDanis: fastnetmon: connect via NRPE to Icinga [puppet] - 
10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) [03:35:14] (03PS4) 10CDanis: fastnetmon: connect via NRPE to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) [03:39:48] (03PS5) 10CDanis: fastnetmon: connect via NRPE to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) [03:40:08] (03CR) 10Dzahn: fastnetmon: connect via NRPE to Icinga (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) (owner: 10CDanis) [03:40:10] (03CR) 10CDanis: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/20644/netflow1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) (owner: 10CDanis) [03:43:15] (03CR) 10CDanis: fastnetmon: connect via NRPE to Icinga (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) (owner: 10CDanis) [03:45:43] (03PS6) 10CDanis: fastnetmon: connect via NRPE to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) [03:46:19] (03CR) 10Dzahn: fastnetmon: connect via NRPE to Icinga (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) (owner: 10CDanis) [03:51:36] (03PS7) 10CDanis: fastnetmon: connect to Icinga via NRPE [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) [03:58:42] ACKNOWLEDGEMENT - Host mw2311 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T241852 [04:32:40] (03CR) 10Ppchelko: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570393 (https://phabricator.wikimedia.org/T243106) (owner: 10Ppchelko) [04:35:18] (03PS1) 10KartikMistry: Update cxserver to 2020-02-05-051751-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/570515 (https://phabricator.wikimedia.org/T244230) [05:59:25] PROBLEM - Memory correctable errors -EDAC- on mw1239 is CRITICAL: 4.001 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops [06:28:57] PROBLEM - Check whether ferm is active by checking the default input chain on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:28:57] PROBLEM - configured eth on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:28:59] PROBLEM - configured eth on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:28:59] PROBLEM - ores uWSGI web app on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:28:59] PROBLEM - dhclient process on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:28:59] PROBLEM - dhclient process on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:28:59] PROBLEM - puppet last run on ores1003 is 
CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:29:00] PROBLEM - puppet last run on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:29:00] PROBLEM - puppet last run on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:29:01] PROBLEM - Check size of conntrack table on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:03] PROBLEM - dhclient process on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:04] PROBLEM - dhclient process on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:04] PROBLEM - dhclient process on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:05] PROBLEM - Check systemd state on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:05] PROBLEM - Check whether ferm is active by checking the default input chain on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:05] PROBLEM - DPKG on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:05] PROBLEM - configured eth on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:07] PROBLEM - Disk space on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1002&var-datasource=eqiad+prometheus/ops [06:29:07] PROBLEM - Check whether ferm is active by checking the default input chain on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:07] PROBLEM - Check whether ferm is active by checking the default input chain on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:07] PROBLEM - MD RAID on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:09] PROBLEM - configured eth on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:09] PROBLEM - Check the NTP synchronisation status of timesyncd on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:29:11] PROBLEM - dhclient process on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:11] PROBLEM - DPKG on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:11] PROBLEM - ores uWSGI web app on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:11] PROBLEM - Check whether ferm is active by checking the default input chain on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:11] PROBLEM - DPKG on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:13] PROBLEM - Check whether ferm is active by checking the default input chain on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:13] PROBLEM - configured eth on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:13] PROBLEM - Check whether ferm is active by checking the default input chain on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:15] PROBLEM - Check systemd state on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:15] PROBLEM - MD RAID on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:15] PROBLEM - Disk space on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1003&var-datasource=eqiad+prometheus/ops [06:29:17] PROBLEM - ores uWSGI web app on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:17] PROBLEM - Disk space on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2006&var-datasource=codfw+prometheus/ops [06:29:17] PROBLEM - dhclient process on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:19] PROBLEM - MD RAID on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:21] PROBLEM - MD RAID on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:21] PROBLEM - ores uWSGI web app on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:21] PROBLEM - Check systemd state on ores1009 is 
CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:23] PROBLEM - configured eth on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:25] PROBLEM - Check size of conntrack table on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:25] PROBLEM - Disk space on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2002&var-datasource=codfw+prometheus/ops [06:29:27] PROBLEM - dhclient process on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:29] PROBLEM - DPKG on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:29] PROBLEM - DPKG on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:29] PROBLEM - DPKG on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:29] PROBLEM - configured eth on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:31] PROBLEM - MD RAID on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:33] PROBLEM - Check size of conntrack table on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:33] PROBLEM - ores uWSGI web app on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:34] PROBLEM - Check size of conntrack table on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:37] PROBLEM - Check size of conntrack table on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:39] PROBLEM - Disk space on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1009&var-datasource=eqiad+prometheus/ops [06:29:39] PROBLEM - Disk space on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1008&var-datasource=eqiad+prometheus/ops [06:29:41] PROBLEM - configured eth on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:41] PROBLEM - puppet last run on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: 
Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:29:41] PROBLEM - Check whether ferm is active by checking the default input chain on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:41] PROBLEM - Check whether ferm is active by checking the default input chain on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:43] PROBLEM - DPKG on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:43] PROBLEM - DPKG on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:44] PROBLEM - Check size of conntrack table on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:45] PROBLEM - Check whether ferm is active by checking the default input chain on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:45] PROBLEM - ores uWSGI web app on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:47] PROBLEM - Disk space on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2009&var-datasource=codfw+prometheus/ops [06:29:47] PROBLEM - DPKG on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:47] PROBLEM - Disk space on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1005&var-datasource=eqiad+prometheus/ops [06:29:47] PROBLEM - DPKG on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:49] PROBLEM - Check size of conntrack table on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:49] PROBLEM - Check whether ferm is active by checking the default input chain on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:49] PROBLEM - Check size of conntrack table on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:49] PROBLEM - Check size of conntrack table on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:49] PROBLEM - Check systemd state on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:50] PROBLEM - Check systemd state on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:51] PROBLEM - Check whether ferm is active by checking the default input chain on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:51] PROBLEM - Check whether ferm is active by checking the default input chain on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:53] PROBLEM - configured eth on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:53] PROBLEM - ores uWSGI web app on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:53] PROBLEM - MD RAID on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:54] PROBLEM - Check size of conntrack table on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:54] PROBLEM - ores uWSGI web app on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:55] PROBLEM - dhclient process on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:57] PROBLEM - Check the NTP synchronisation status of timesyncd on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:29:57] PROBLEM - Check size of conntrack table on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:59] PROBLEM - configured eth on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:59] PROBLEM - DPKG on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:59] PROBLEM - Check whether ferm is active by checking the default input chain on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:01] PROBLEM - configured eth on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:01] PROBLEM - Disk space on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2003&var-datasource=codfw+prometheus/ops [06:30:01] PROBLEM - DPKG on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:03] PROBLEM - ores uWSGI web app on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:03] PROBLEM - Check systemd state on ores1004 
is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:03] PROBLEM - dhclient process on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:04] PROBLEM - MD RAID on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:05] PROBLEM - puppet last run on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:30:05] PROBLEM - dhclient process on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:07] PROBLEM - ores uWSGI web app on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:09] PROBLEM - DPKG on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:09] PROBLEM - MD RAID on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:09] PROBLEM - Disk space on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2008&var-datasource=codfw+prometheus/ops [06:30:09] PROBLEM - MD RAID on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:11] PROBLEM - Check size of conntrack table on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:13] PROBLEM - configured eth on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:13] PROBLEM - Check size of conntrack table on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:15] PROBLEM - ores uWSGI web app on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:15] PROBLEM - Disk space on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1006&var-datasource=eqiad+prometheus/ops [06:30:17] PROBLEM - DPKG on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:17] PROBLEM - MD RAID on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:17] PROBLEM - MD RAID on ores1001 
is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:18] PROBLEM - dhclient process on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:21] PROBLEM - Check size of conntrack table on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:21] PROBLEM - Check systemd state on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:21] PROBLEM - configured eth on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:23] PROBLEM - configured eth on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:23] PROBLEM - MD RAID on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:24] PROBLEM - ores uWSGI web app on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:25] PROBLEM - Check size of conntrack table on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:27] PROBLEM - Check systemd state on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:27] PROBLEM - Check systemd state on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:29] PROBLEM - Disk space on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1004&var-datasource=eqiad+prometheus/ops [06:30:29] PROBLEM - Disk space on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1001&var-datasource=eqiad+prometheus/ops [06:30:29] PROBLEM - Disk space on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2004&var-datasource=codfw+prometheus/ops [06:30:31] PROBLEM - dhclient process on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:31] PROBLEM - DPKG on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:35] PROBLEM - puppet last run on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun 
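[Editor's note] The CRITICAL alerts above and below all share the same failure mode: Icinga runs these checks (disk space, MD RAID, ferm, dhclient, puppet last run, systemd state, ...) against the ores hosts over NRPE, and nothing is answering on TCP port 5666 on those hosts at this point, so every NRPE-backed check goes CRITICAL at once and later recovers together once the agent is reachable again. The following is a minimal, illustrative Nagios-style TCP probe in Python; it is not the real check_nrpe plugin or Wikimedia's monitoring configuration, and the example host IP is simply taken from the alerts above.

#!/usr/bin/env python3
"""Minimal sketch of a Nagios-style TCP probe (NOT the real check_nrpe
plugin): it only illustrates why every NRPE-backed check on a host
fails at once when nothing accepts connections on port 5666."""
import socket
import sys

OK, CRITICAL = 0, 2  # Nagios plugin exit codes (1=WARNING, 3=UNKNOWN unused here)


def check_tcp(host: str, port: int = 5666, timeout: float = 10.0) -> int:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            print(f"TCP OK: connected to {host} port {port}")
            return OK
    except OSError as exc:
        # Produces the same shape of message as the alerts above, e.g.
        # "connect to address 10.64.16.94 port 5666: Connection refused"
        print(f"TCP CRITICAL: connect to address {host} port {port}: {exc}")
        return CRITICAL


if __name__ == "__main__":
    # The default host is illustrative; 10.64.16.94 is ores1003 per the alerts above.
    sys.exit(check_tcp(sys.argv[1] if len(sys.argv) > 1 else "10.64.16.94"))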
[06:30:35] PROBLEM - Check size of conntrack table on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:35] PROBLEM - Check systemd state on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:35] PROBLEM - Disk space on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2005&var-datasource=codfw+prometheus/ops [06:30:37] PROBLEM - Check systemd state on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:37] PROBLEM - Check whether ferm is active by checking the default input chain on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:39] PROBLEM - MD RAID on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:39] PROBLEM - Check systemd state on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:41] PROBLEM - ores uWSGI web app on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:43] PROBLEM - configured eth on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:43] PROBLEM - MD RAID on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:45] PROBLEM - Check the NTP synchronisation status of timesyncd on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:30:47] PROBLEM - configured eth on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:47] PROBLEM - Disk space on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2001&var-datasource=codfw+prometheus/ops [06:30:47] PROBLEM - puppet last run on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:30:47] PROBLEM - Check systemd state on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:49] PROBLEM - ores uWSGI web app on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:49] PROBLEM - Check systemd state on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:51] PROBLEM - dhclient process on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:51] PROBLEM - Check systemd state on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:55] PROBLEM - dhclient process on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:55] PROBLEM - Check whether ferm is active by checking the default input chain on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:59] RECOVERY - Check size of conntrack table on ores2006 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:59] PROBLEM - puppet last run on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:31:03] RECOVERY - dhclient process on ores2006 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:31:05] RECOVERY - Check whether ferm is active by checking the default input chain on ores2006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:31:09] RECOVERY - DPKG on ores2006 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:31:09] RECOVERY - Check whether ferm is active by checking the default input chain on ores1008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:31:11] RECOVERY - configured eth on ores2006 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:31:11] RECOVERY - Check systemd state on ores2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:13] RECOVERY - Disk space on ores2006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2006&var-datasource=codfw+prometheus/ops [06:31:27] PROBLEM - puppet last run on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:31:31] PROBLEM - Check systemd state on ores1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:35] RECOVERY - Disk space on ores1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1008&var-datasource=eqiad+prometheus/ops [06:31:43] RECOVERY - Check size of conntrack table on ores1008 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:31:47] RECOVERY - configured eth on ores1008 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:32:07] PROBLEM - puppet last run on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:32:31] PROBLEM - puppet last run on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:32:39] RECOVERY - configured eth on ores2002 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:33:13] RECOVERY - Disk space on ores2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2002&var-datasource=codfw+prometheus/ops [06:33:31] RECOVERY - DPKG on ores2002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:33:37] RECOVERY - Check systemd state on ores2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:33:43] PROBLEM - puppet last run on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:33:43] PROBLEM - puppet last run on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:33:47] RECOVERY - Check whether ferm is active by checking the default input chain on ores2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:33:51] RECOVERY - dhclient process on ores2008 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:33:57] RECOVERY - Disk space on ores2008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2008&var-datasource=codfw+prometheus/ops [06:33:57] RECOVERY - MD RAID on ores2002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:34:05] RECOVERY - dhclient process on ores2002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:34:07] RECOVERY - Check systemd state on ores2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:09] RECOVERY - configured eth on ores2008 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:34:11] RECOVERY - Check size of conntrack table on ores2002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:34:15] RECOVERY - Disk space on ores1004 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1004&var-datasource=eqiad+prometheus/ops [06:34:27] RECOVERY - MD RAID on ores2008 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:34:29] RECOVERY - configured eth on ores1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:34:43] PROBLEM - puppet last run on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:34:49] RECOVERY - configured eth on ores2009 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:34:49] PROBLEM - puppet last run on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:34:51] RECOVERY - Check whether ferm is active by checking the default input chain on ores2008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:34:54] RECOVERY - DPKG on ores2008 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:34:59] RECOVERY - MD RAID on ores1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:35:01] RECOVERY - dhclient process on ores2009 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:35:11] RECOVERY - dhclient process on ores1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:35:11] RECOVERY - DPKG on ores1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:35:17] RECOVERY - Check size of conntrack table on ores2008 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:35:23] RECOVERY - Check whether ferm is active by checking the default input chain on ores1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:35:27] RECOVERY - Check size of conntrack table on ores2009 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:35:29] RECOVERY - Disk space on ores2009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2009&var-datasource=codfw+prometheus/ops [06:35:29] RECOVERY - DPKG on ores2009 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:35:37] RECOVERY - Check size of conntrack table on ores1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:35:37] RECOVERY - puppet last run on ores2006 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:35:47] RECOVERY - Check systemd state on ores1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:47] RECOVERY - MD RAID on ores2009 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:35:59] RECOVERY - puppet last run on ores1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:36:07] RECOVERY - Check systemd state on ores2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:36:27] (03PS1) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [06:36:35] RECOVERY - Check whether ferm is active by checking the default input chain on ores2009 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:36:53] RECOVERY - puppet last run on ores2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:37:19] RECOVERY - puppet last run on ores2008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:38:39] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [06:38:59] PROBLEM - Check the NTP synchronisation status of timesyncd on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:39:11] RECOVERY - Check whether ferm is active by checking the default input chain on ores1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:39:19] RECOVERY - MD RAID on ores2003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:39:27] RECOVERY - Disk space on ores2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2003&var-datasource=codfw+prometheus/ops [06:39:35] RECOVERY - puppet last run on ores2009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:39:39] RECOVERY - Check size of conntrack table on ores2003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:39:41] RECOVERY - DPKG on ores1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:39:45] RECOVERY - Check size of conntrack table on ores1002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:39:47] RECOVERY - MD RAID on ores1002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:39:51] RECOVERY - Check systemd state on ores2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:59] RECOVERY - Check systemd state on ores2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:59] RECOVERY - Disk space on 
ores2005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2005&var-datasource=codfw+prometheus/ops [06:40:07] RECOVERY - MD RAID on ores2005 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:40:13] RECOVERY - Check systemd state on ores1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:40:17] RECOVERY - dhclient process on ores2003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:40:19] RECOVERY - Check whether ferm is active by checking the default input chain on ores2003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:40:19] RECOVERY - dhclient process on ores1002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:40:19] RECOVERY - configured eth on ores2003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:40:27] RECOVERY - Disk space on ores1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1002&var-datasource=eqiad+prometheus/ops [06:40:31] RECOVERY - configured eth on ores2005 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:40:37] RECOVERY - puppet last run on ores2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:40:43] RECOVERY - configured eth on ores1002 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:41:07] RECOVERY - DPKG on ores2003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:41:09] RECOVERY - Check whether ferm is active by checking the default input chain on ores2005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:41:13] RECOVERY - dhclient process on ores2005 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:41:17] RECOVERY - DPKG on ores2005 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:41:29] RECOVERY - Check size of conntrack table on ores2005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:41:47] RECOVERY - dhclient process on ores2001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:42:01] RECOVERY - Disk space on ores2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2001&var-datasource=codfw+prometheus/ops [06:42:05] RECOVERY - dhclient process on ores1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:42:21] RECOVERY - puppet last run on ores1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:42:21] RECOVERY - MD RAID on ores2001 is OK: OK: Active: 4, Working: 4, 
Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:42:23] PROBLEM - ores_workers_running on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES [06:42:23] PROBLEM - ores_workers_running on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES [06:42:25] PROBLEM - ores_workers_running on ores2001 is CRITICAL: PROCS CRITICAL: 39 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:42:41] RECOVERY - DPKG on ores1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:42:44] (03PS2) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [06:42:45] RECOVERY - Check size of conntrack table on ores2001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:42:53] RECOVERY - configured eth on ores2001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:42:53] RECOVERY - Check whether ferm is active by checking the default input chain on ores1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:42:59] RECOVERY - Check size of conntrack table on ores1006 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:43:01] RECOVERY - Check systemd state on ores2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:43:03] RECOVERY - Check whether ferm is active by checking the default input chain on ores2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:43:09] RECOVERY - Check size of conntrack table on ores1001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:43:11] RECOVERY - configured eth on ores1001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:43:13] RECOVERY - DPKG on ores2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:43:17] PROBLEM - ores_workers_running on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES [06:43:17] PROBLEM - ores_workers_running on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES [06:43:19] PROBLEM - ores_workers_running on ores1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:43:23] RECOVERY - configured eth on ores1006 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:43:27] RECOVERY - Disk space on ores1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1006&var-datasource=eqiad+prometheus/ops [06:43:27] RECOVERY - MD RAID on ores1001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:43:27] RECOVERY - MD RAID on ores1006 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:43:39] RECOVERY - Disk space on ores1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1001&var-datasource=eqiad+prometheus/ops [06:43:41] PROBLEM - ores_workers_running on ores1001 is CRITICAL: PROCS CRITICAL: 75 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:43:47] RECOVERY - Check systemd state on ores1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:43:49] RECOVERY - Check systemd state on ores1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:43:51] PROBLEM - Check the NTP synchronisation status of timesyncd on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:44:01] PROBLEM - ores_workers_running on ores1006 is CRITICAL: PROCS CRITICAL: 61 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:44:11] RECOVERY - dhclient process on ores1006 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:44:11] RECOVERY - DPKG on ores1006 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:44:15] RECOVERY - puppet last run on ores1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:44:19] RECOVERY - ores_workers_running on ores2001 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:44:19] RECOVERY - Check whether ferm is active by checking the default input chain on ores1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:44:41] 10Operations, 10Wikimedia-Mailing-lists, 10serviceops: Allow list admins to train spam filters - https://phabricator.wikimedia.org/T244241 (10Aklapper) Oh darrn! Thanks Reedy! I never realized that this is a custom patch in GNOME's Mailman instance, sorry! Feel free to decline if this is too much maintenanc... 
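The ores_workers_running checks above are simple process-count probes over the ORES celery workers (the "Connection refused ... port 5666" variants just mean the NRPE agent on that host was unreachable at that moment, so the probe could not run at all). A rough manual equivalent, assuming the stock monitoring-plugins check_procs and an illustrative threshold rather than whatever is actually configured for ORES:

    # Count celery worker processes the way the alert text reports them.
    pgrep -c -f celery
    # check_procs-style version: critical if fewer than 90 celery processes
    # are running (the "90:" range is an assumed, illustrative threshold).
    /usr/lib/nagios/plugins/check_procs -C celery -c 90: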
[06:45:10] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [06:45:27] RECOVERY - puppet last run on ores2005 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:45:34] RECOVERY - ores_workers_running on ores1001 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:45:55] RECOVERY - ores_workers_running on ores1006 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:46:15] !log run puppet on all ores[12]* nodes [06:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:34] RECOVERY - puppet last run on ores2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:46:37] RECOVERY - puppet last run on ores1006 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:46:37] RECOVERY - puppet last run on ores1005 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:46:41] RECOVERY - DPKG on ores1005 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:46:45] RECOVERY - Disk space on ores1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1005&var-datasource=eqiad+prometheus/ops [06:46:45] RECOVERY - Check size of conntrack table on ores1005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:46:45] RECOVERY - Check whether ferm is active by checking the default input chain on ores1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:47:04] RECOVERY - dhclient process on ores1009 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:47:04] RECOVERY - DPKG on ores1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:47:04] RECOVERY - MD RAID on ores1005 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:47:17] RECOVERY - configured eth on ores1005 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:47:29] RECOVERY - DPKG on ores1009 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:47:31] RECOVERY - Check size of conntrack table on ores1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:47:33] RECOVERY - Check whether ferm is active by checking the default input chain on ores1009 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:47:41] RECOVERY - Check systemd state on ores1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:45] RECOVERY - Check systemd state on ores1003 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:49] RECOVERY - configured eth on ores1003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:47:51] RECOVERY - dhclient process on ores1005 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:47:55] RECOVERY - dhclient process on ores1003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:48:03] RECOVERY - Check whether ferm is active by checking the default input chain on ores1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:48:07] RECOVERY - Disk space on ores1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1003&var-datasource=eqiad+prometheus/ops [06:48:11] RECOVERY - MD RAID on ores1009 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:48:11] RECOVERY - Check systemd state on ores1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:48:15] RECOVERY - Check size of conntrack table on ores1009 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:48:21] RECOVERY - configured eth on ores1009 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:48:23] RECOVERY - MD RAID on ores1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:48:24] RECOVERY - puppet last run on ores1009 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:48:27] RECOVERY - Check systemd state on ores1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:48:31] RECOVERY - Disk space on ores1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1009&var-datasource=eqiad+prometheus/ops [06:48:51] RECOVERY - configured eth on ores2004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:48:57] RECOVERY - ores_workers_running on ores1003 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:48:57] RECOVERY - ores_workers_running on ores1009 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:48:57] RECOVERY - ores_workers_running on ores1007 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:49:19] RECOVERY - Disk space on ores2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2004&var-datasource=codfw+prometheus/ops [06:49:45] RECOVERY - puppet last run on ores2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:49:48] (03PS1) 10Marostegui: Revert "db1098: Disable 
notifications" [puppet] - 10https://gerrit.wikimedia.org/r/570523 [06:49:51] RECOVERY - Check systemd state on ores2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:49:51] RECOVERY - Check whether ferm is active by checking the default input chain on ores2004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:49:55] RECOVERY - ores_workers_running on ores1005 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:49:57] RECOVERY - dhclient process on ores2004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:50:03] RECOVERY - MD RAID on ores2004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:50:13] RECOVERY - DPKG on ores2004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:50:21] RECOVERY - Check size of conntrack table on ores2004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:51:45] (03CR) 10Andrew Bogott: "Despite jenkins objections about cross-module includes, I don't feel great about moving these rsync bits into a profile. Overriding the -" [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [06:51:47] RECOVERY - ores_workers_running on ores2004 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:51:55] (03CR) 10Marostegui: [C: 03+2] Revert "db1098: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/570523 (owner: 10Marostegui) [06:52:29] RECOVERY - puppet last run on ores1003 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:52:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1098:3317 - T239453', diff saved to https://phabricator.wikimedia.org/P10318 and previous config saved to /var/cache/conftool/dbconfig/20200206-065238-marostegui.json [06:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:42] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:59:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3317 - T239453', diff saved to https://phabricator.wikimedia.org/P10319 and previous config saved to /var/cache/conftool/dbconfig/20200206-065906-marostegui.json [06:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:10] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [07:00:01] RECOVERY - Check the NTP synchronisation status of timesyncd on ores2002 is OK: OK: synced at Thu 2020-02-06 06:59:59 UTC. https://wikitech.wikimedia.org/wiki/NTP [07:00:23] (03PS1) 10Marostegui: db1101: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570525 (https://phabricator.wikimedia.org/T239453) [07:00:47] RECOVERY - Check the NTP synchronisation status of timesyncd on ores1003 is OK: OK: synced at Thu 2020-02-06 07:00:46 UTC. 
https://wikitech.wikimedia.org/wiki/NTP [07:01:31] (03CR) 10Marostegui: [C: 03+2] db1101: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570525 (https://phabricator.wikimedia.org/T239453) (owner: 10Marostegui) [07:01:37] RECOVERY - Check the NTP synchronisation status of timesyncd on ores2009 is OK: OK: synced at Thu 2020-02-06 07:01:34 UTC. https://wikitech.wikimedia.org/wiki/NTP [07:09:51] RECOVERY - Check the NTP synchronisation status of timesyncd on ores2003 is OK: OK: synced at Thu 2020-02-06 07:09:49 UTC. https://wikitech.wikimedia.org/wiki/NTP [07:12:57] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:14:43] RECOVERY - Check the NTP synchronisation status of timesyncd on ores2004 is OK: OK: synced at Thu 2020-02-06 07:14:42 UTC. https://wikitech.wikimedia.org/wiki/NTP [07:22:08] PROBLEM - Check systemd state on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:22:10] PROBLEM - Check whether ferm is active by checking the default input chain on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:22:28] PROBLEM - MD RAID on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [07:22:56] PROBLEM - Check size of conntrack table on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [07:23:02] PROBLEM - puppet last run on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:23:06] PROBLEM - configured eth on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [07:31:12] PROBLEM - ores_workers_running on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES [07:33:23] (03PS1) 10Matthias Mullie: Revert "Remove handler deleted from the MachineVision extension on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570571 [07:39:42] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:41:30] (03PS1) 10Elukey: presto: enable kerberos and TLS for Analytics Presto [puppet] - 10https://gerrit.wikimedia.org/r/570573 [07:41:46] RECOVERY - MD RAID on ores2001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [07:42:28] RECOVERY - Check size of conntrack table on ores2001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [07:42:40] RECOVERY - configured eth on ores2001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [07:42:52] RECOVERY - Check systemd state on ores2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:42:56] RECOVERY - 
Check whether ferm is active by checking the default input chain on ores2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:43:30] RECOVERY - ores_workers_running on ores2001 is OK: PROCS OK: 89 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [07:44:27] (03CR) 10Elukey: [C: 03+2] presto: enable kerberos and TLS for Analytics Presto [puppet] - 10https://gerrit.wikimedia.org/r/570573 (owner: 10Elukey) [07:44:54] RECOVERY - puppet last run on ores2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:47:47] (03PS1) 10Marostegui: mariadb: Productionize dbproxy1012 [puppet] - 10https://gerrit.wikimedia.org/r/570574 (https://phabricator.wikimedia.org/T202367) [07:59:46] (03PS2) 10Marostegui: mariadb: Productionize dbproxy1012 [puppet] - 10https://gerrit.wikimedia.org/r/570574 (https://phabricator.wikimedia.org/T202367) [08:03:01] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize dbproxy1012 [puppet] - 10https://gerrit.wikimedia.org/r/570574 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [08:05:25] (03PS1) 10Elukey: presto: set correct discovery URI for Analytics Presto [puppet] - 10https://gerrit.wikimedia.org/r/570577 [08:09:30] (03CR) 10Elukey: [C: 03+2] presto: set correct discovery URI for Analytics Presto [puppet] - 10https://gerrit.wikimedia.org/r/570577 (owner: 10Elukey) [08:10:13] PROBLEM - MediaWiki centralauth errors on graphite1004 is CRITICAL: CRITICAL: 93.33% of data above the critical threshold [1.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1 [08:17:56] !log switchover selectively eventgate-analytics.discovery.wmnet to codfw for mw1331 and mw1348 to [08:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:02] !log restarting blazegraph on wdqs1006: T242453 [08:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:04] T242453: wdqs1005 stopped to handle updates - https://phabricator.wikimedia.org/T242453 [08:23:03] !log Reboot dbproxy1012 and dbproxy1014 for upgrade [08:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:29] (03CR) 10Alexandros Kosiaris: [C: 03+2] uwsgi: update check for buster [puppet] - 10https://gerrit.wikimedia.org/r/570490 (owner: 10Volans) [08:23:53] (03PS1) 10Elukey: profile::presto::server: use kerberos auth between worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/570579 [08:24:25] (03PS2) 10Alexandros Kosiaris: uwsgi: update check for buster [puppet] - 10https://gerrit.wikimedia.org/r/570490 (owner: 10Volans) [08:24:28] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] uwsgi: update check for buster [puppet] - 10https://gerrit.wikimedia.org/r/570490 (owner: 10Volans) [08:26:47] (03CR) 10Elukey: [C: 03+2] profile::presto::server: use kerberos auth between worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/570579 (owner: 10Elukey) [08:27:53] RECOVERY - MediaWiki centralauth errors on graphite1004 is OK: OK: Less than 30.00% above the threshold [0.5] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1 [08:30:35] jouncebot: next [08:30:35] In 3 hour(s) and 29 minute(s): European Mid-day SWAT(Max 6 patches) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T1200) [08:30:56] (03PS1) 10Marostegui: wmnet: Failover dbproxy1001 to dbproxy1014 [dns] - 10https://gerrit.wikimedia.org/r/570580 (https://phabricator.wikimedia.org/T202367) [08:32:15] (03PS1) 10Marostegui: dbproxy1012: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570581 (https://phabricator.wikimedia.org/T202367) [08:32:31] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Double tested on ores2001 as well (both with files existing and without). LGTM, merging" [puppet] - 10https://gerrit.wikimedia.org/r/570489 (owner: 10Volans) [08:32:40] (03PS4) 10Alexandros Kosiaris: uwsgi: fix removal of init.d links on buster [puppet] - 10https://gerrit.wikimedia.org/r/570489 (owner: 10Volans) [08:33:47] (03CR) 10Marostegui: [C: 03+2] dbproxy1012: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570581 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [08:38:08] (03PS2) 10Addshore: Enable EntitySourceBasedFederation for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566795 (https://phabricator.wikimedia.org/T243395) (owner: 10Ladsgroup) [08:39:02] going to deploy wdqs if there are no objections [08:39:10] (03PS1) 10Addshore: Enable EntitySourceBasedFederation for group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570583 (https://phabricator.wikimedia.org/T243395) [08:41:21] (03PS1) 10Addshore: wmgUseEntitySourceBasedFederation everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570584 (https://phabricator.wikimedia.org/T243395) [08:44:28] (03CR) 10Muehlenhoff: "The patch looks fine, but the more fundamental fix would be what's outlined in T222874. Yuvi's original patch is from 2016 where we still " [puppet] - 10https://gerrit.wikimedia.org/r/570489 (owner: 10Volans) [08:44:33] !log dcausse@deploy1001 Started deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b to wdqs1010.eqiad.wmnet [08:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:03] !log dcausse@deploy1001 Finished deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b to wdqs1010.eqiad.wmnet (duration: 00m 29s) [08:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:46] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10akosiaris) I 've tried to reproduce this. It's easily reproducible after all. Just do what logrotate does and issue `systemctl reload uwsgi-ores`. CPU usage spikes and reache... [08:49:49] PROBLEM - Check systemd state on ores2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:56:20] !log dcausse@deploy1001 Started deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b [08:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] site: add new ganeti hosts for refresh/expansion with spare role [puppet] - 10https://gerrit.wikimedia.org/r/570390 (https://phabricator.wikimedia.org/T228924) (owner: 10Dzahn) [08:57:13] (03PS3) 10Alexandros Kosiaris: site: add new ganeti hosts for refresh/expansion with spare role [puppet] - 10https://gerrit.wikimedia.org/r/570390 (https://phabricator.wikimedia.org/T228924) (owner: 10Dzahn) [08:57:25] PROBLEM - ores_workers_running on ores2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [08:58:35] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01117 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:03:21] (03PS1) 10Vgutierrez: install_server: Reimage cp3065 as buster [puppet] - 10https://gerrit.wikimedia.org/r/570585 (https://phabricator.wikimedia.org/T242093) [09:03:55] (03PS1) 10Elukey: profile::presto::server: use FQDN when workers communicate with the coord [puppet] - 10https://gerrit.wikimedia.org/r/570586 [09:04:28] (03CR) 10jerkins-bot: [V: 04-1] profile::presto::server: use FQDN when workers communicate with the coord [puppet] - 10https://gerrit.wikimedia.org/r/570586 (owner: 10Elukey) [09:06:37] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I'm on the verge here." [puppet] - 10https://gerrit.wikimedia.org/r/570330 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [09:07:08] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/570586 (owner: 10Elukey) [09:07:49] (03CR) 10Marostegui: [C: 03+2] "After our chat on IRC...proceeding" [dns] - 10https://gerrit.wikimedia.org/r/570580 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [09:08:01] !log dcausse@deploy1001 Finished deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b (duration: 11m 41s) [09:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:34] (03CR) 10Ema: [C: 03+1] install_server: Reimage cp3065 as buster [puppet] - 10https://gerrit.wikimedia.org/r/570585 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [09:08:50] (03CR) 10Vgutierrez: [C: 03+2] install_server: Reimage cp3065 as buster [puppet] - 10https://gerrit.wikimedia.org/r/570585 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [09:09:33] (03CR) 10Elukey: [C: 03+2] profile::presto::server: use FQDN when workers communicate with the coord [puppet] - 10https://gerrit.wikimedia.org/r/570586 (owner: 10Elukey) [09:10:13] vgutierrez: hola, can I merge yours too? 
[09:10:20] elukey: go ahead <3 [09:10:43] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] Release 8.0.5-1wm14 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/566814 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [09:14:41] (03PS1) 10Muehlenhoff: Switch mw2311 to stretch bootif tftpboot environment [puppet] - 10https://gerrit.wikimedia.org/r/570587 (https://phabricator.wikimedia.org/T244438) [09:16:28] (03CR) 10Muehlenhoff: [C: 03+2] Switch mw2311 to stretch bootif tftpboot environment [puppet] - 10https://gerrit.wikimedia.org/r/570587 (https://phabricator.wikimedia.org/T244438) (owner: 10Muehlenhoff) [09:23:33] (03PS1) 10Marostegui: dbproxy1001: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570588 (https://phabricator.wikimedia.org/T244463) [09:24:45] (03CR) 10Marostegui: [C: 03+2] dbproxy1001: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570588 (https://phabricator.wikimedia.org/T244463) (owner: 10Marostegui) [09:26:24] (03PS1) 10Jcrespo: Revert "uwsgi: fix removal of init.d links on buster" [puppet] - 10https://gerrit.wikimedia.org/r/570589 [09:27:00] jouncebot: next [09:27:00] In 2 hour(s) and 32 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T1200) [09:27:33] (03PS2) 10Jcrespo: Revert "uwsgi: fix removal of init.d links on buster" [puppet] - 10https://gerrit.wikimedia.org/r/570589 [09:33:08] 10Operations, 10Traffic: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 (10Vgutierrez) [09:33:32] (03PS1) 10Giuseppe Lavagetto: uwsgi: fix removal of init.d script (revisited) [puppet] - 10https://gerrit.wikimedia.org/r/570590 [09:33:55] (03PS3) 10Filippo Giunchedi: elasticsearch: cirrus logs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/570374 (https://phabricator.wikimedia.org/T225125) [09:35:13] (03CR) 10Filippo Giunchedi: "I'm not sure whether 'type' in appender properties is case sensitive or not (syslog vs Gelf)." 
[puppet] - 10https://gerrit.wikimedia.org/r/570374 (https://phabricator.wikimedia.org/T225125) (owner: 10Filippo Giunchedi) [09:36:42] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/570590 (owner: 10Giuseppe Lavagetto) [09:37:44] (03CR) 10Filippo Giunchedi: [C: 03+1] uwsgi: fix removal of init.d script (revisited) [puppet] - 10https://gerrit.wikimedia.org/r/570590 (owner: 10Giuseppe Lavagetto) [09:37:55] (03CR) 10Giuseppe Lavagetto: [C: 03+2] uwsgi: fix removal of init.d script (revisited) [puppet] - 10https://gerrit.wikimedia.org/r/570590 (owner: 10Giuseppe Lavagetto) [09:38:28] (03Abandoned) 10Jcrespo: Revert "uwsgi: fix removal of init.d links on buster" [puppet] - 10https://gerrit.wikimedia.org/r/570589 (owner: 10Jcrespo) [09:38:46] (03PS3) 10Muehlenhoff: Pass down MAC address of to installing system via BOOTIF on Buster [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) [09:39:22] (03PS4) 10Muehlenhoff: Pass down MAC address of the installing system via BOOTIF on Buster [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) [09:41:15] RECOVERY - Check systemd state on ores2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:43:29] <_joe_> uhm [09:43:51] (03PS1) 10Vgutierrez: ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) [09:43:51] RECOVERY - ores_workers_running on ores2001 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [09:44:01] <_joe_> so I can confirm the new puppet run works and it removed the init.d link on acmechief1001 [09:44:10] <_joe_> vgutierrez: you might want to verify you wanted that :P [09:44:29] _joe_: wut? what? [09:44:46] <_joe_> don't freak out [09:44:55] <_joe_> that's a part of our uwsgi puppet code [09:45:00] <_joe_> that was broken on buster [09:45:09] <_joe_> now it works [09:45:15] hmm uwsgi on buster has been working fine for acmecheif instances [09:45:17] <_joe_> so we remove the init.d script for uwsgi [09:45:24] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Lea_Lacroix_WMDE) Over the past weeks, we noticed a huge increase of content in Wikidata. Maybe that's... 
[09:46:14] ack, I'll check for possible side effects, thanks for pingin [09:46:17] *pinging [09:47:25] but it shouldn't be a big issue considering this [09:47:27] Loaded: loaded (/lib/systemd/system/uwsgi-acme-chief.service; enabled; vendor preset: enabled) [09:49:07] (03PS2) 10Vgutierrez: ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) [09:49:13] (03PS1) 10Muehlenhoff: Switch elastic* to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/570596 (https://phabricator.wikimedia.org/T156955) [09:51:14] (03CR) 10jerkins-bot: [V: 04-1] ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [09:52:06] (03PS3) 10Vgutierrez: ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) [09:52:10] vgutierrez: this is typically only an issue during upgrades of the uwsgi package (and I don't think we've had one on buster since acmechief was setup) [09:52:28] ack [09:52:30] (03PS4) 10Vgutierrez: ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) [09:52:49] (03CR) 10Gehel: "A few questions:" [puppet] - 10https://gerrit.wikimedia.org/r/570374 (https://phabricator.wikimedia.org/T225125) (owner: 10Filippo Giunchedi) [09:53:30] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/570596 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [09:54:21] (03CR) 10jerkins-bot: [V: 04-1] ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [09:54:41] sigh.. what's wrong with jerkins [09:56:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "IIRC there were some side effects of using LVM with readahead that Erik discovered (CC'ing) but other than that LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/570596 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [09:56:04] (yet another L8 issue) [09:56:06] (03PS5) 10Vgutierrez: ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) [09:57:45] (03CR) 10Vgutierrez: "pcc shows a NOOP on records.config: https://puppet-compiler.wmflabs.org/compiler1003/20651/" [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [09:58:32] (03CR) 10Filippo Giunchedi: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/570374 (https://phabricator.wikimedia.org/T225125) (owner: 10Filippo Giunchedi) [09:58:59] 10Operations, 10Traffic, 10Patch-For-Review: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 (10Vgutierrez) p:05Triage→03Medium [09:59:16] 10Operations, 10Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (10Krenair) Last night I tried moving cache-text05 to use the new puppet master. It doesn't seem to work just yet as for some reason it's (the puppetm... 
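The uwsgi exchange above is about cleaning up leftover SysV init.d bits on buster hosts where systemd already owns the service (the 09:47 paste shows the unit file, /lib/systemd/system/uwsgi-acme-chief.service, is what is actually loaded). A hand-run sketch of the same cleanup, with the init script name and paths assumed rather than taken from the puppet patch:

    # Confirm systemd, not SysV, is managing the service before touching anything.
    systemctl status uwsgi-acme-chief.service | grep Loaded
    # Look for stale leftovers from the old packaging, then remove them.
    ls -l /etc/init.d/uwsgi /etc/rc?.d/*uwsgi 2>/dev/null
    sudo update-rc.d -f uwsgi remove   # drops the rc*.d symlinks
    sudo rm -f /etc/init.d/uwsgi       # drops the init script itself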
[09:59:46] !log upload trafficserver 8.0.5-1wm14 to apt.wm.o (buster) - T242093 [09:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:49] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [09:59:52] (03CR) 10Gehel: [C: 03+1] "We do have an explicit udev rule for readahead which should still work with this update: https://github.com/wikimedia/puppet/blob/producti" [puppet] - 10https://gerrit.wikimedia.org/r/570596 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [10:00:05] (03CR) 10Ema: [C: 03+1] ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:00:11] !log depool and reimage cp3065 as buster - T242093 [10:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:33] (03CR) 10Vgutierrez: [C: 03+2] ATS: Allow configuring via hiera KA against origin servers [puppet] - 10https://gerrit.wikimedia.org/r/570594 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:06:05] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.004885 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:10:28] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp3065.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [10:11:38] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10ArielGlenn) >>! In T243701#5855352, @Lea_Lacroix_WMDE wrote: > Over the past weeks, we noticed a huge i... 
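The cp3065 entries above (depool, then wmf-auto-reimage from cumin1001) follow the usual cache-host reinstall pattern: take the host out of its LVS pools, reinstall it with the new OS, and only pool it again once it is healthy. A simplified outline; the host and task come from the log, while the depool/pool helpers and the reimage flags are assumptions about the tooling rather than the exact commands that were run:

    # On cp3065 itself: take it out of rotation first.
    sudo depool
    # From the cumin host: reinstall with the buster installer (flag assumed).
    sudo wmf-auto-reimage -p T242093 cp3065.esams.wmnet
    # Once the host looks healthy again, put it back in rotation.
    sudo pool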
[10:12:11] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/570596 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [10:13:09] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10JeanFred) [10:14:02] (03PS1) 10Vgutierrez: ATS: Enable KA between ats-tls and varnish-fe on cp4031 [puppet] - 10https://gerrit.wikimedia.org/r/570599 (https://phabricator.wikimedia.org/T244464) [10:15:42] (03PS1) 10Muehlenhoff: Switch logstash hosts to standard Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/570600 (https://phabricator.wikimedia.org/T156955) [10:16:08] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/20652/" [puppet] - 10https://gerrit.wikimedia.org/r/570599 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:17:28] (03CR) 10Ema: [C: 03+1] ATS: Enable KA between ats-tls and varnish-fe on cp4031 [puppet] - 10https://gerrit.wikimedia.org/r/570599 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:18:13] (03CR) 10Vgutierrez: [C: 03+2] ATS: Enable KA between ats-tls and varnish-fe on cp4031 [puppet] - 10https://gerrit.wikimedia.org/r/570599 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:19:21] !log Enabling HTTP keepalive between ats-tls and varnish-frontend on cp4031 - T244464 [10:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:24] T244464: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 [10:20:43] !log undo "switchover selectively eventgate-analytics.discovery.wmnet to codfw for mw1331 and mw1348" [10:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:08] !log undo "switchover selectively eventgate-analytics.discovery.wmnet to codfw for mw1331 and mw1348". no effect observed [10:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] standard: Add linux-perf to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/570254 (owner: 10Alexandros Kosiaris) [10:25:29] (03PS4) 10Jbond: realm global: make the realm variable a global in labs [puppet] - 10https://gerrit.wikimedia.org/r/570331 (https://phabricator.wikimedia.org/T244222) [10:27:28] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10JeanFred) >>>! In T243701#5834751, @jcrespo wrote: >> While I understand the need of "slowing bot edit... [10:28:56] (03PS10) 10Jbond: wmflib::require_domains: add new function to to replace require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570343 [10:32:06] (03PS1) 10Alexandros Kosiaris: site.pp: Add new ganeti codfw hosts as role::spare [puppet] - 10https://gerrit.wikimedia.org/r/570601 (https://phabricator.wikimedia.org/T224603) [10:34:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] dynamicproxy: urlproxy: introduce support for domain-based routing [puppet] - 10https://gerrit.wikimedia.org/r/565556 (https://phabricator.wikimedia.org/T234617) (owner: 10Arturo Borrero Gonzalez) [10:35:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "thanks! 
way more elegant solution that my initial approach, which involved updating the regex." [puppet] - 10https://gerrit.wikimedia.org/r/570433 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [10:41:08] (03PS1) 10Vgutierrez: install_server: Fix cescout syntax error [puppet] - 10https://gerrit.wikimedia.org/r/570602 [10:41:19] it looks like we need a linter for netboot.cfg :) [10:42:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570602 (owner: 10Vgutierrez) [10:43:15] (03CR) 10Vgutierrez: [C: 03+2] install_server: Fix cescout syntax error [puppet] - 10https://gerrit.wikimedia.org/r/570602 (owner: 10Vgutierrez) [10:44:18] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3065.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3065.esams.wmnet'] ` [10:45:49] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp3065.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002061045_vgutie... [10:47:31] <_joe_> vgutierrez: if confd fails, please wait before rerunning puppet [10:47:55] on cp3065? [10:47:59] ack [10:48:03] (03PS5) 10Jbond: puppet_compiler: add rich_data support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/557050 [10:48:15] I'll let you know [10:48:24] but yesterday we only had issues on a text node, cp3065 is upload [10:55:04] <_joe_> that shouldn't change [11:06:08] 10Operations, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10hnowlan) Unfortunately it seems I don't have permissions to issue commands. I attempted to downtime a service on a host that's n... [11:11:54] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [11:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:04] Updating cxserver.. [11:12:10] 10Operations, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10MoritzMuehlenhoff) @hnowlan There's an error in the username configured in https://gerrit.wikimedia.org/r/566823, let me fix tha... [11:13:18] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2020-02-05-051751-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/570515 (https://phabricator.wikimedia.org/T244230) (owner: 10KartikMistry) [11:13:38] (03Merged) 10jenkins-bot: Update cxserver to 2020-02-05-051751-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/570515 (https://phabricator.wikimedia.org/T244230) (owner: 10KartikMistry) [11:14:10] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:54] akosiaris: Oh, I see some changes pending on cxserver, safe to upgrade? 
[11:18:16] (03PS1) 10Muehlenhoff: Fix username in Icinga authorization config for Hugh [puppet] - 10https://gerrit.wikimedia.org/r/570611 (https://phabricator.wikimedia.org/T242309) [11:18:23] akosiaris: charts 0.0.9->0.0.11 [11:18:48] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3065.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3065.esams.wmnet'] ` [11:22:50] kart_: lemme have a look [11:23:38] (03CR) 10Muehlenhoff: [C: 03+2] Fix username in Icinga authorization config for Hugh [puppet] - 10https://gerrit.wikimedia.org/r/570611 (https://phabricator.wikimedia.org/T242309) (owner: 10Muehlenhoff) [11:23:48] <_joe_> kart_: that might have been me, sorry, taking a look as well [11:24:01] <_joe_> I think I added the ability to terminate TLS [11:24:22] <_joe_> but I didn't deploy TLS to production still so I didn't do the deploy [11:25:21] <_joe_> lemme check [11:25:23] (03PS1) 10Ema: tlsproxy::localssl: allow setting keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570612 (https://phabricator.wikimedia.org/T241145) [11:25:25] (03PS1) 10Ema: profile::mediawiki::webserver: increase nginx keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570613 (https://phabricator.wikimedia.org/T241145) [11:25:52] <_joe_> uhm no it wasn't me :) [11:25:57] _joe_: Sure. I'm not sure how-to handle it, akosiaris should be able to find way here :) [11:26:16] _joe_: looks like you did that in December, and seems unrelated by log. [11:26:19] 10Operations, 10Traffic, 10Inuka-Team (Kanban), 10MW-1.35-notes (1.35.0-wmf.16; 2020-01-21), 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (10phuedx) >>! In T238029#5819088, @SBisson wrote: > A very special thanks to @phuedx for reviewing a... [11:26:45] 0.0.10 -> 0.0.11 is 4f179cf333486e05d930d7da98c7dbe945c28d6f and is just a removal of comments [11:27:25] and 0.0.9 -> 0.0.10 is dae344baf57a995f4239ed4b11f3e4ba4628c3ed and is just minor changes to NOTES.txt [11:27:36] kart_: go ahead, both by me, both innocuous [11:27:47] akosiaris: cool. Thanks! [11:28:06] (03CR) 10Giuseppe Lavagetto: [C: 03+1] tlsproxy::localssl: allow setting keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570612 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [11:28:36] !log kartik@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [11:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:42] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "let's test this on the canaries first, that would be role::mediawiki::canary_appserver and role::mediawiki::appserver::canary_api" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570613 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [11:31:22] !log kartik@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [11:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:01] !log upload etherpad-lite_1.8.0-1 to apt.wikimedia.org buster-wikimedia/main [11:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:05] !log kartik@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . 
[11:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:32] !log Updated cxserver to 2020-02-05-051751-production (T244230, T234323) [11:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:36] T234323: Load a single section in Content translation's editor - https://phabricator.wikimedia.org/T234323 [11:38:36] T244230: Add Yandex translation for Chuvash - https://phabricator.wikimedia.org/T244230 [11:41:05] !log upgrade etherpad-lite on etherpad1002 to 1.8.0-1 [11:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:54] (03CR) 10Arturo Borrero Gonzalez: "Thanks for working on this! Somme comments inline." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [11:44:39] 10Operations, 10observability, 10serviceops: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (10jijiki) [11:50:27] (03PS3) 10Cparle: Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570287 (https://phabricator.wikimedia.org/T241072) (owner: 10Matthias Mullie) [11:51:13] (03PS4) 10Cparle: Remove handler deleted from the MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566860 (https://phabricator.wikimedia.org/T241242) (owner: 10Matthias Mullie) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T1200). [12:00:04] kart_, matthiasmullie, cparle, and addshore: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:13] * kart_ is here [12:00:28] Anyone want to deploy my patch or should I go ahead? [12:00:40] kart_: if you want, I can do so - but feel free to deploy yourself! [12:01:04] Urbanecm: Please deploy my patch :) [12:01:08] will do! [12:01:22] oh, I see 9 patches in SWAT. [12:01:48] (03PS2) 10Urbanecm: Enable CX in te, kn, gu, mr and pawiki as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570051 (https://phabricator.wikimedia.org/T243271) (owner: 10KartikMistry) [12:01:51] we'll deploy whatever we have time for [12:01:58] I can deploy mine [12:01:58] that sounds like a lot of patches [12:01:59] o/ [12:02:00] OK! [12:02:06] * addshore can deploy mine [12:02:12] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570051 (https://phabricator.wikimedia.org/T243271) (owner: 10KartikMistry) [12:02:35] cormacparle__: please +2 your backport(s), so we don't have to wait for jenkins later on [12:02:41] addshore: ^^ [12:02:44] already done [12:02:47] already merged ;) [12:02:48] cool! [12:03:07] (03Merged) 10jenkins-bot: Enable CX in te, kn, gu, mr and pawiki as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570051 (https://phabricator.wikimedia.org/T243271) (owner: 10KartikMistry) [12:03:26] kart_: please test on mwdebug1001 and lmk [12:03:34] OK! [12:06:08] Urbanecm: looks good. Go ahead! 
[12:06:15] deploying [12:06:49] (03PS5) 10Matthias Mullie: Remove handler deleted from the MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566860 (https://phabricator.wikimedia.org/T241242) [12:07:38] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 5e1cbb2: Enable CX in te, kn, gu, mr and pawiki as a default tool (T243271, T243272, T243273, T243274, T243275) (duration: 01m 09s) [12:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:46] T243273: Enable Content Translation in Gujarati Wikipedia as a default tool - https://phabricator.wikimedia.org/T243273 [12:07:46] T243274: Enable Content Translation in Marathi Wikipedia as a default tool - https://phabricator.wikimedia.org/T243274 [12:07:46] T243271: Enable Content Translation in Telugu Wikipedia as a default tool - https://phabricator.wikimedia.org/T243271 [12:07:47] T243275: Enable Content Translation in Punjabi Wikipedia as a default tool - https://phabricator.wikimedia.org/T243275 [12:07:47] T243272: Enable Content Translation in Kannada Wikipedia as a default tool - https://phabricator.wikimedia.org/T243272 [12:08:03] kart_: done [12:08:10] cormacparle__: go ahead [12:08:22] ok cool [12:08:35] Urbanecm: thanks a lot! [12:08:38] yw [12:12:43] 10Operations, 10Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (10jbond) I have preformed the following actions * copied the CA from `deployment-puppetmaster03` to `deployment-puppetmaster04` * on `deployment-pu... [12:14:04] 10Operations, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10hnowlan) Moritz clarified how case sensitive logins affect Icinga - I've since logged in as Hnowlan and I can confirm I can run... [12:14:04] (03PS4) 10Krinkle: mediawiki: Add reqId/file/line to php7-fatal-error.php's 'message' field [puppet] - 10https://gerrit.wikimedia.org/r/554599 [12:15:50] (03PS5) 10Krinkle: mediawiki: Add reqId/file/line to php7-fatal-error.php's 'message' field [puppet] - 10https://gerrit.wikimedia.org/r/554599 [12:15:58] (03CR) 10Cparle: [C: 03+2] Remove handler deleted from the MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566860 (https://phabricator.wikimedia.org/T241242) (owner: 10Matthias Mullie) [12:17:04] (03Merged) 10jenkins-bot: Remove handler deleted from the MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566860 (https://phabricator.wikimedia.org/T241242) (owner: 10Matthias Mullie) [12:18:06] 10Operations, 10serviceops: Test and deploy mcrouter 0.41 - https://phabricator.wikimedia.org/T244476 (10jijiki) [12:18:07] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/562852 (owner: 10Jbond) [12:18:21] cormacparle__: matthiasmullie give me a ping when yours are done :) [12:18:42] addshore: will do! 
[12:19:49] 10Operations, 10serviceops: Test and deploy mcrouter 0.41 - https://phabricator.wikimedia.org/T244476 (10jijiki) [12:19:55] 10Operations, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T240684 (10jijiki) [12:24:54] !log cparle@deploy1001 Synchronized php-1.35.0-wmf.18/extensions/MachineVision: Use the wbsetclaim API to add depicts statements (duration: 01m 09s) [12:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:39] !log remove full-duplex statement from eqsin Tata link (not supported on Junos 18, as 10G is full duplex anyway) [12:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:25] !log cparle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove handler deleted from the MachineVision extension (duration: 01m 05s) [12:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:53] (03PS4) 10Matthias Mullie: Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570287 (https://phabricator.wikimedia.org/T241072) [12:29:14] (03CR) 10Jbond: [C: 03+2] ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - 10https://gerrit.wikimedia.org/r/562852 (owner: 10Jbond) [12:30:40] (03CR) 10Cparle: [C: 03+2] Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570287 (https://phabricator.wikimedia.org/T241072) (owner: 10Matthias Mullie) [12:31:11] (03PS2) 10Matthias Mullie: Revert "Remove handler deleted from the MachineVision extension on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570571 [12:31:40] (03Merged) 10jenkins-bot: Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570287 (https://phabricator.wikimedia.org/T241072) (owner: 10Matthias Mullie) [12:33:57] (03CR) 10Cparle: [C: 03+2] Revert "Remove handler deleted from the MachineVision extension on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570571 (owner: 10Matthias Mullie) [12:34:16] !log cparle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Re-enable delayed new upload jobs for MachineVision extension (duration: 01m 08s) [12:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:50] (03Merged) 10jenkins-bot: Revert "Remove handler deleted from the MachineVision extension on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570571 (owner: 10Matthias Mullie) [12:36:20] ok all done - addshore you're up [12:36:36] sweet [12:36:50] (03PS3) 10Addshore: Enable EntitySourceBasedFederation for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566795 (https://phabricator.wikimedia.org/T243395) (owner: 10Ladsgroup) [12:36:52] (03CR) 10Addshore: [C: 03+2] Enable EntitySourceBasedFederation for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566795 (https://phabricator.wikimedia.org/T243395) (owner: 10Ladsgroup) [12:37:01] (03PS2) 10Addshore: Enable EntitySourceBasedFederation for group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570583 (https://phabricator.wikimedia.org/T243395) [12:37:10] (03PS2) 10Addshore: wmgUseEntitySourceBasedFederation everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570584 (https://phabricator.wikimedia.org/T243395) [12:37:52] (03Merged) 10jenkins-bot: Enable 
EntitySourceBasedFederation for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566795 (https://phabricator.wikimedia.org/T243395) (owner: 10Ladsgroup) [12:39:41] (03PS1) 10Vgutierrez: ATS: Allow configuring via hiera server session sharing settings [puppet] - 10https://gerrit.wikimedia.org/r/570622 (https://phabricator.wikimedia.org/T244464) [12:39:50] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation for group0 T243395 (duration: 01m 07s) [12:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:54] T243395: wmgUseEntitySourceBasedFederation true for Wikimedia clients (all sites) - https://phabricator.wikimedia.org/T243395 [12:40:14] !log pooling cp3065 - T242093 [12:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:17] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [12:42:41] (03CR) 10Filippo Giunchedi: "Thanks for looking into this!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570600 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [12:43:17] (03CR) 10Filippo Giunchedi: [C: 03+1] Pass down MAC address of the installing system via BOOTIF on Buster [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) (owner: 10Muehlenhoff) [12:43:37] (03PS2) 10Vgutierrez: ATS: Allow configuring via hiera server session sharing settings [puppet] - 10https://gerrit.wikimedia.org/r/570622 (https://phabricator.wikimedia.org/T244464) [12:44:41] !log addshore@deploy1001 sync-file aborted: Fetch central babel information over SQL query, not API (T243726) (duration: 01m 04s) [12:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:44] T243726: Babel should get cross-wiki languages via DB instead of making an HTTP request - https://phabricator.wikimedia.org/T243726 [12:45:49] (03PS5) 10Muehlenhoff: Pass down MAC address of the installing system via BOOTIF on Buster [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) [12:46:03] (03CR) 10Vgutierrez: "pcc shows a NOOP on records.config (except for the removal of an empty line): https://puppet-compiler.wmflabs.org/compiler1002/20656/" [puppet] - 10https://gerrit.wikimedia.org/r/570622 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [12:46:20] reverting that last one [12:46:20] (03CR) 10Ema: "pcc here: https://puppet-compiler.wmflabs.org/compiler1001/20655/" [puppet] - 10https://gerrit.wikimedia.org/r/570612 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [12:46:50] (03PS1) 10Polishdeveloper: Clean-up decommisioned Print schema configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570625 (https://phabricator.wikimedia.org/T196159) [12:46:51] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.18/extensions/Babel: REVERT Fetch central babel information over SQL query, not API (T243726) (duration: 01m 07s) [12:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:01] will investigate that after swat... 
[12:47:27] (03CR) 10Addshore: [C: 03+2] Enable EntitySourceBasedFederation for group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570583 (https://phabricator.wikimedia.org/T243395) (owner: 10Addshore) [12:47:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:47:46] ^^ that was me and already fixed (reverted the change) [12:48:20] (03Merged) 10jenkins-bot: Enable EntitySourceBasedFederation for group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570583 (https://phabricator.wikimedia.org/T243395) (owner: 10Addshore) [12:49:21] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:51:03] (03PS3) 10Vgutierrez: ATS: Allow configuring via hiera server session sharing settings [puppet] - 10https://gerrit.wikimedia.org/r/570622 (https://phabricator.wikimedia.org/T244464) [12:52:45] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation for group1 T243395 (duration: 01m 06s) [12:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:48] T243395: wmgUseEntitySourceBasedFederation true for Wikimedia clients (all sites) - https://phabricator.wikimedia.org/T243395 [12:54:47] (03PS1) 10Addshore: Revert "Enable EntitySourceBasedFederation for group1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570627 [12:55:22] (03CR) 10Addshore: [C: 03+2] Revert "Enable EntitySourceBasedFederation for group1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570627 (owner: 10Addshore) [12:56:08] (03CR) 10Ema: [C: 03+2] tlsproxy::localssl: allow setting keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570612 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [12:56:19] (03Merged) 10jenkins-bot: Revert "Enable EntitySourceBasedFederation for group1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570627 (owner: 10Addshore) [12:56:53] (03PS4) 10Vgutierrez: ATS: Allow configuring via hiera server session sharing settings [puppet] - 10https://gerrit.wikimedia.org/r/570622 (https://phabricator.wikimedia.org/T244464) [12:57:17] (03CR) 10Addshore: [C: 04-2] "T244479 needs fixing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570584 (https://phabricator.wikimedia.org/T243395) (owner: 10Addshore) [12:58:03] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: REVERT Enable EntitySourceBasedFederation for group1 T243395, due to T244479 (duration: 01m 07s) [12:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:07] T243395: wmgUseEntitySourceBasedFederation true for Wikimedia clients (all sites) - https://phabricator.wikimedia.org/T243395 [12:58:07] T244479: Argument 5 passed to Wikibase\Lexeme\Search\Elastic\LexemeSearchEntity::__construct() must be an instance of Wikibase\Lib\Store\PrefetchingTermLookup, instance of Wikibase\DataAccess\ByTypeDispatchingPrefetchingTermLookup given, called in /srv/mediawiki/php-1.35.0-wmf.18/extensions/WikibaseLexemeCirrusSearch/WikibaseSearch.entitytypes.repo.php on line 41 - 
https://phabricator.wikimedia.org/T244479 [12:59:38] !log deactivate BGP transits on cr2-eqsin [12:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:43] (03CR) 10Vgutierrez: "100% NOOP now: https://puppet-compiler.wmflabs.org/compiler1002/20657/" [puppet] - 10https://gerrit.wikimedia.org/r/570622 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [13:00:06] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: resync REVERT Enable EntitySourceBasedFederation for group1 (duration: 01m 07s) [13:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:45] !log SWAT done [13:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:47] !log reboot cr2-eqsin for sw upgrade [13:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:54] (03PS2) 10Ema: profile::mediawiki::webserver: increase canary keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570613 (https://phabricator.wikimedia.org/T241145) [13:03:05] 10Operations, 10Traffic: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 (10Vgutierrez) @bblack confirmed that we can drop RSA support on ncredir during the All Hands [13:03:16] and it's booting up [13:03:17] Seems like I've been hit by something now occuring at eqsin [13:04:02] (03CR) 10Ema: [C: 03+1] ATS: Allow configuring via hiera server session sharing settings [puppet] - 10https://gerrit.wikimedia.org/r/570622 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [13:05:20] (03PS1) 10Vgutierrez: ncredir: Drop RSA support [puppet] - 10https://gerrit.wikimedia.org/r/570629 (https://phabricator.wikimedia.org/T243391) [13:05:42] PROBLEM - LVS HTTP IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:05:44] (03CR) 10Vgutierrez: [C: 03+2] ATS: Allow configuring via hiera server session sharing settings [puppet] - 10https://gerrit.wikimedia.org/r/570622 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [13:05:49] PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:05:58] <_joe_> XioNoX: ^^ [13:06:00] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /private-info/info.json (private tile service info for osm-intl) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /v4/marker/pin-m+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m+ffffff@2x.png (Untitled test) timed [13:06:00] onse was received: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /osm-intl/info.json (tile service info for osm-intl) timed out before a response was received: /_info (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:06:03] XioNoX: I guess that's expected? 
:) [13:06:09] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:06:10] yepp looking [13:06:13] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 78, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:06:17] the host is coming back up [13:06:18] PROBLEM - LVS HTTPS IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:06:21] PROBLEM - LVS HTTPS IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:06:25] but no reason that triggerd [13:06:41] PROBLEM - LVS HTTP IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:06:51] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.eqsin.wikimedia.org, port=443): Read timed out. (read timeout=15),): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [13:06:59] * vgutierrez orders a t-shirt for XioNoX O:) [13:07:04] <_joe_> should we depool eqsin? [13:07:19] here if needed [13:07:23] loading for me, but bit slower [13:07:37] RECOVERY - LVS HTTP IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 432 bytes in 7.607 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:07:37] RECOVERY - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:07:37] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 246 probes of 604 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:07:39] <_joe_> revi: ack :) [13:07:45] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:07:47] <_joe_> it's all coming back anyways [13:07:50] * apergos peeking in and following along [13:08:01] yeah, wtf [13:08:06] RECOVERY - LVS HTTPS IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 843 bytes in 0.949 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:08:07] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:08:07] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:08:07] * akosiaris around as well, since the page [13:08:09] RECOVERY - LVS HTTPS IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15066 bytes in 1.647 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:08:12] my friend had more PITA time 
getting disturbed while doing CU lol [13:08:13] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:08:13] :P [13:08:20] XioNoX: BGP coalescing? [13:08:20] everything is back to normal afaik [13:08:25] * effie o/ [13:08:27] RECOVERY - LVS HTTP IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.470 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:08:40] this was just a reboot of cr4 ? [13:09:03] ok cool [13:09:06] akosiaris: reboot of cr2-eqsin, which was vrrp backup, doesn't have the transport links, and had its transits disabled [13:09:27] weird [13:09:42] exactly [13:10:05] no traffic should be flowing through it from what you say. How about monitoring? would it flow via it for whatever reason ? [13:10:21] !log rollback: deactivate BGP transits on cr2-eqsin [13:10:22] traffic would still be coming in its transits [13:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:29] akosiaris: finishing up and will look at it [13:10:56] but that wouldn't explain the alerts that look more like transport flap, since we lost reachability from the core dcs [13:10:58] (03CR) 10Ema: "pcc follows." [puppet] - 10https://gerrit.wikimedia.org/r/570613 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [13:11:24] it did have the backup transport tunnel, is it possible we were somehow routing over that? [13:11:49] (03CR) 10Ema: profile::mediawiki::webserver: increase canary keepalive_requests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570613 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [13:12:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] site.pp: Add new ganeti codfw hosts as role::spare [puppet] - 10https://gerrit.wikimedia.org/r/570601 (https://phabricator.wikimedia.org/T224603) (owner: 10Alexandros Kosiaris) [13:12:26] bblack: nah, it was going over the main one ( https://librenms.wikimedia.org/graphs/to=1580994600/id=16794/type=port_bits/from=1580908200/ ) [13:12:57] (03PS1) 10Vgutierrez: ATS: Allow server session sharing by ip on ats-tls in cp4031 [puppet] - 10https://gerrit.wikimedia.org/r/570633 (https://phabricator.wikimedia.org/T244464) [13:13:35] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 5 probes of 604 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:13:44] did it have pybal traffic perhaps? [13:14:05] smokeping was fine https://smokeping.wikimedia.org/?target=eqsin.Hosts.bast5001 [13:14:33] heh, we never did finish switching all the pybals to dual bgp sessions [13:14:46] mark: ah maybe, I forgot to fail pybal over manually [13:14:49] damn [13:14:51] in eqsin, 5001->cr1, 500[23]->cr2 [13:15:07] so text should've been ok, but upload would've failed, due to that?
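In the exchange above, each eqsin pybal peers with only one of the two routers (lvs5001 with cr1, lvs500[23] with cr2), so rebooting cr2 withdrew the routes for one of the services. That is what the "pybal to both routers" patches later in this log address. A rough pybal.conf sketch of the idea follows; the key names and especially the list syntax for multiple peers are assumptions, not the exact production configuration:

    [global]
    bgp = yes
    # illustrative ASN, not the production value
    bgp-local-asn = 64600
    # assumed list syntax: peer with both routers so that rebooting one of them
    # no longer withdraws the LVS service routes
    bgp-peer-address = ['10.132.0.2', '10.132.0.3']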
[13:15:23] we had alerts for both though [13:15:39] (03CR) 10Giuseppe Lavagetto: [C: 03+1] profile::mediawiki::webserver: increase canary keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570613 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [13:16:17] although only v6 for text, maybe [13:17:03] also the icinga checks go through the transports so it can't be due to transit BGP convergence [13:17:37] (03CR) 10Ema: [C: 03+1] ATS: Allow server session sharing by ip on ats-tls in cp4031 [puppet] - 10https://gerrit.wikimedia.org/r/570633 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [13:17:50] (03CR) 10Vgutierrez: [C: 03+2] ATS: Allow server session sharing by ip on ats-tls in cp4031 [puppet] - 10https://gerrit.wikimedia.org/r/570633 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [13:18:39] 10Operations, 10Puppet: confd fails to start after a reimage - https://phabricator.wikimedia.org/T244477 (10jbond) p:05Triage→03Medium [13:18:41] cr2-eqsin is now fully healthy [13:19:03] let's fix the pybal->both thing in eqsin before cr1? [13:19:41] bblack: I'm not doing cr1 today [13:19:43] but yeah :) [13:20:05] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:20:23] going to do cr3-knams now [13:22:04] (03PS1) 10Jbond: sslcert: ensure we run update-ca-certificates managing any services [puppet] - 10https://gerrit.wikimedia.org/r/570637 (https://phabricator.wikimedia.org/T244477) [13:22:19] (03CR) 10Ema: [C: 03+2] profile::mediawiki::webserver: increase canary keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570613 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [13:22:32] !log Enable server session sharing on ats-tls in cp4031 - T244464 [13:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:36] T244464: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 [13:27:29] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:27:35] !log deactivate BGP transits on cr3-knams [13:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:50] !log depool mw1347 to test some mcrouter settings [13:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:38] !log reboot cr3-knams [13:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:52] !log repool mw1347 with mcrouter running with 10 proxy threads (was: 5) [13:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:37] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:35:41] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:35:55] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:36:35] 
PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:38:25] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:38:25] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:39:09] (03PS1) 10Ema: profile::mediawiki::webserver: increase keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570640 (https://phabricator.wikimedia.org/T241145) [13:39:19] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:39:23] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:39:37] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:40:05] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:41:46] 10Operations: Integrate Stretch 9.6 point update - https://phabricator.wikimedia.org/T209260 (10MoritzMuehlenhoff) 05Open→03Resolved This is done for quite a while, closing. [13:43:19] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [13:45:49] !log rollback deactivate BGP transits on cr3-knams [13:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:12] (03CR) 10Ema: [C: 03+1] ncredir: Drop RSA support [puppet] - 10https://gerrit.wikimedia.org/r/570629 (https://phabricator.wikimedia.org/T243391) (owner: 10Vgutierrez) [13:50:09] (03PS1) 10Addshore: Revert "Revert "Enable EntitySourceBasedFederation for group1"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570642 [13:50:25] (03PS2) 10Addshore: Enable EntitySourceBasedFederation for group1 again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570642 [13:50:49] (03PS3) 10Addshore: wmgUseEntitySourceBasedFederation everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570584 (https://phabricator.wikimedia.org/T243395) [13:51:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1019 for onsite maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10321 and previous config saved to /var/cache/conftool/dbconfig/20200206-135157-marostegui.json [13:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:01] T243963: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 [13:53:14] (03PS1) 10Marostegui: es1019: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570643 (https://phabricator.wikimedia.org/T243963) [13:53:48] !log depool eqiad eventgate-analytics for testing purposes. Requests will flow to codfw, monitoring https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-30m&to=now for issues. 
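The depool announced just above shows up moments later in the SAL as a conftool action against the eventgate-analytics discovery record. A hedged sketch of the matching confctl invocation; the exact flags are not in the log, so treat the command shape as an assumption:

    # depool the eqiad side of the discovery record so resolvers hand out codfw
    sudo confctl --object-type discovery select 'dnsdisc=eventgate-analytics,name=eqiad' set/pooled=false
    # and the corresponding repool once the experiment is over
    sudo confctl --object-type discovery select 'dnsdisc=eventgate-analytics,name=eqiad' set/pooled=true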
[13:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:51] _joe_: ^ [13:54:09] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics [13:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:15] <_joe_> akosiaris: I'm at lunch though :P [13:54:21] (03CR) 10Marostegui: [C: 03+2] es1019: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570643 (https://phabricator.wikimedia.org/T243963) (owner: 10Marostegui) [13:54:28] I 'll revert immediately if everything breaks, no worries [13:54:46] and restart php-fpms in a rolling fashion but I hope it doesn't come to it [13:55:11] !log Stop MySQL on es1019, upgrade and poweroff for on-site maintenance - T243963 [13:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:33] (03Abandoned) 10Filippo Giunchedi: elasticsearch: cirrus logs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/570374 (https://phabricator.wikimedia.org/T225125) (owner: 10Filippo Giunchedi) [13:58:48] <_joe_> i see a small bump in p95, which is expected [13:59:24] it's within normal parameters yet though [13:59:37] I mean we had more 5m before that [13:59:40] <_joe_> sure, that's what I am saying [13:59:42] also, weren't you at lunch ? [13:59:48] go have your lunch in peace [13:59:52] <_joe_> :D sure ttyl [14:01:12] akosiaris: is the current increase in api_appserver avg response time related to your tests? [14:01:25] I'm looking at https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-m [14:02:23] ema: probably not [14:02:29] it was like that before I started [14:03:11] appservers should also be affected btw, but I see nothing there [14:03:26] akosiaris: ack so probably an api issue? [14:03:44] looks like it? [14:04:06] I don't see elevated request rates though, neither memcached rates being elevated [14:04:24] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Marostegui) @Cmjohnson es1019 is off. Once you are done, just start it back and I will it from there. Thank you! [14:04:34] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) @Jclark-ctr @wiki_willy @Cmjohnson any rough estimation on when we can expect these hosts to be online? As I said, I am... 
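The es1019 maintenance above (depool via a dbctl commit, then stop MySQL and power off for the on-site work) follows the usual depool-then-commit pattern. A sketch assuming the standard dbctl subcommands were used; the log only shows the resulting commit message:

    # mark the instance as depooled in the dbctl object store
    sudo dbctl instance es1019 depool
    # write the change out to the MediaWiki database configuration
    sudo dbctl config commit -m 'Depool es1019 for onsite maintenance T243963'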
[14:04:41] I am watching both RED appserver cluster and api_appserver cluster [14:04:45] <_joe_> no, it's the canary api [14:04:53] <_joe_> https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=All [14:05:02] <_joe_> ema: I think it was related to your change somehow [14:05:12] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [14:05:39] it does look unrelated, but I think I am gonna revert, just to be on the safe side [14:05:52] <_joe_> akosiaris: you change took effect way too late to be the cause [14:05:56] <_joe_> and only on the canaries [14:05:59] _joe_: interesting, reverting [14:06:13] ok, if ema is going to revert, probably better we don't both revert. [14:06:16] <_joe_> ema: wait [14:06:21] * ema waits [14:06:24] <_joe_> it's apparently recovering [14:06:35] <_joe_> or not [14:06:36] (03CR) 10Jhedden: Keystone: rotate and sync fernet tokens (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [14:06:43] also interesting that appserver canaries aren't affected, only api [14:06:54] <_joe_> yeah no, better to revert :/ [14:07:26] _joe_: ok to revert only for api canaries, or do you want me to revert both? [14:07:32] (03CR) 10Elukey: WIP - Introduce profile::mariadb::misc::analytics (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [14:07:37] <_joe_> just for the apis for now [14:07:44] alright, on it [14:07:48] <_joe_> we can try to find out what's going on afterwards [14:07:56] sorry for interrupting your lunch [14:08:35] (03PS2) 10Elukey: WIP - Introduce profile::mariadb::misc::analytics [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) [14:11:08] (03PS1) 10Ema: profile::mediawiki::webserver: revert api canary keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570655 (https://phabricator.wikimedia.org/T241145) [14:12:50] (03Abandoned) 10Jbond: wmflib::end_with: create String.end_with function [puppet] - 10https://gerrit.wikimedia.org/r/570330 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [14:13:03] (03PS5) 10Jbond: realm global: make the realm variable a global in labs [puppet] - 10https://gerrit.wikimedia.org/r/570331 (https://phabricator.wikimedia.org/T244222) [14:13:18] (03CR) 10Ema: [C: 03+2] profile::mediawiki::webserver: revert api canary keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570655 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [14:14:33] !log run puppet on mw-api-canary to revert nginx keepalive_requests bump T241145 [14:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:36] T241145: Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 [14:16:06] !log 20mins in with eventgate-analytics/eqiad depooled from discovery, no issues yet. 
[14:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:34] if ema's revert fixes the api latency as well, I 'll be happy. ema won't on the other hand [14:16:36] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:16:48] here we go [14:17:11] fascinating [14:17:39] ema: it's like immediate: Look at https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-15m&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-m&refresh=1m&fullscreen&panelId=9 [14:17:43] I mean... wow [14:18:12] what on earth? I wouldn't expect keepalive_requests to change that so dramatically [14:18:50] me neither, and indeed bumping the setting didn't cause any trouble on non-api appservers [14:20:00] * akosiaris wonders what else can I depool from eqiad [14:20:16] akosiaris: context? [14:20:49] indeed, sorry. So since the incident with eventgate-analytics being turned over to https and everything collapsing [14:21:07] we 've been wondering why. One of the possible explanations is/was latency increase [14:21:14] (03PS1) 10Vgutierrez: install_server: Remove already decommissioned cp40[09,10,17,18] [puppet] - 10https://gerrit.wikimedia.org/r/570659 [14:21:27] so we were fearing moving eventgate-analytics over to codfw would cause a total collapse of the heart [14:21:58] so I 've been experimenting in small batches and mw servers to verify the hypothesis [14:22:17] turns out that the 40ms RTT that eqiad -> codfw introduces don't trigger an issue [14:22:21] (03PS1) 10Vgutierrez: install_server: Reimage upload@ulsfo as buster [puppet] - 10https://gerrit.wikimedia.org/r/570660 (https://phabricator.wikimedia.org/T242093) [14:22:33] 10Operations, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10hnowlan) [14:23:07] now, I wouldn't want to test with say 300ms, but at least this is a relief. If this did cause a big issue, we would have severe problems with the switchover [14:23:13] 10Operations, 10Elasticsearch, 10Wikimedia-Logstash, 10observability, and 2 others: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (10fgiunchedi) Status update: out of the box json logging support has been introduced i... 
[14:23:13] (03CR) 10Vgutierrez: [C: 03+2] ncredir: Drop RSA support [puppet] - 10https://gerrit.wikimedia.org/r/570629 (https://phabricator.wikimedia.org/T243391) (owner: 10Vgutierrez) [14:25:10] (03PS6) 10Muehlenhoff: Pass down MAC address of the installing system via BOOTIF on Buster [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) [14:26:09] (03CR) 10Ema: [C: 03+1] install_server: Reimage upload@ulsfo as buster [puppet] - 10https://gerrit.wikimedia.org/r/570660 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [14:26:24] (03CR) 10Ayounsi: "INFO:homer:Committing config for query mr1-ulsfo* with message: test" [software/homer] - 10https://gerrit.wikimedia.org/r/570510 (https://phabricator.wikimedia.org/T244363) (owner: 10Volans) [14:27:07] (03PS2) 10Vgutierrez: install_server: Remove already decommissioned cp40[09,10,17,18] [puppet] - 10https://gerrit.wikimedia.org/r/570659 (https://phabricator.wikimedia.org/T178815) [14:27:10] (03PS2) 10Vgutierrez: install_server: Reimage upload@ulsfo as buster [puppet] - 10https://gerrit.wikimedia.org/r/570660 (https://phabricator.wikimedia.org/T242093) [14:27:54] (03CR) 10Ema: [C: 03+1] install_server: Remove already decommissioned cp40[09,10,17,18] [puppet] - 10https://gerrit.wikimedia.org/r/570659 (https://phabricator.wikimedia.org/T178815) (owner: 10Vgutierrez) [14:31:02] PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:33:10] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 114984152 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:33:25] (03CR) 10Vgutierrez: [C: 03+2] install_server: Remove already decommissioned cp40[09,10,17,18] [puppet] - 10https://gerrit.wikimedia.org/r/570659 (https://phabricator.wikimedia.org/T178815) (owner: 10Vgutierrez) [14:34:03] (03CR) 10Muehlenhoff: [C: 03+2] Pass down MAC address of the installing system via BOOTIF on Buster [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) (owner: 10Muehlenhoff) [14:35:02] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 104544 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:37:06] <_joe_> akosiaris: try eventgate-main next :) [14:37:17] (03CR) 10Vgutierrez: [C: 03+2] install_server: Reimage upload@ulsfo as buster [puppet] - 10https://gerrit.wikimedia.org/r/570660 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [14:37:19] I am thinking about it [14:37:31] (03PS3) 10Vgutierrez: install_server: Reimage upload@ulsfo as buster [puppet] - 10https://gerrit.wikimedia.org/r/570660 (https://phabricator.wikimedia.org/T242093) [14:39:02] (03CR) 10Ayounsi: [C: 03+1] "Overall LGTM at the condition of testing it manually with at least 2 parallel "attacks"." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) (owner: 10CDanis) [14:39:21] (03PS2) 10Jbond: sslcert: ensure we run update-ca-certificates managing any services [puppet] - 10https://gerrit.wikimedia.org/r/570637 (https://phabricator.wikimedia.org/T244477) [14:39:36] _joe_: that being said, eventgate-main in codfw produces to a different topic [14:39:43] (03CR) 10Andrew Bogott: Keystone: rotate and sync fernet tokens (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [14:39:47] topics* more correctly [14:39:57] so I 'd like andrew's buy in before doing that [14:40:37] <_joe_> akosiaris: sure but it's kind-of the point [14:40:52] jouncebot: next [14:40:52] In 2 hour(s) and 19 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T1700) [14:40:53] it is indeed, but I remember andrew having to switch something around [14:40:55] <_joe_> I think eventstreams needed to follow it [14:41:05] <_joe_> but that should not be the case anymore [14:41:12] why ? [14:41:23] (03CR) 10jerkins-bot: [V: 04-1] sslcert: ensure we run update-ca-certificates managing any services [puppet] - 10https://gerrit.wikimedia.org/r/570637 (https://phabricator.wikimedia.org/T244477) (owner: 10Jbond) [14:41:30] <_joe_> because andrew fixed things IIRC [14:41:42] I guess we will wait and see [14:41:44] <_joe_> but sure let's wait for him [14:43:01] (03PS3) 10Jbond: sslcert: ensure we run update-ca-certificates managing any services [puppet] - 10https://gerrit.wikimedia.org/r/570637 (https://phabricator.wikimedia.org/T244477) [14:43:18] _joe_: so it turns out that restbase origins have great connection reuse given that they're behind envoy, hence they don't close connections after 100 reqs [14:44:23] _joe_: a possible explanation of why api canaries were negatively affected by the keepalive_requests bump is that api has decent connection reuse, so it happens often that connections are reused for > 100 reqs [14:45:14] <_joe_> ema: so the TLDR is we need envoy? [14:45:21] <_joe_> did you see what made the cpu explode? [14:45:36] <_joe_> gimme 5 mins and I'm with you [14:45:43] this is not the case for appservers; on cp3050 I've counted the number of times we reached 99 requests over one single connection for different origins for a few seconds; it happened 321 times for api and 6 times for appservers [14:46:00] (03CR) 10Giuseppe Lavagetto: "Overall LGTM, a few small corrections and it should be good to merge." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) (owner: 10Hnowlan) [14:46:12] !log depool & reimage cp4025 as buster - T242093 [14:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:16] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [14:46:58] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4025.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima... 
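Background for the 100-versus-200 discussion above: keepalive_requests is a stock nginx directive that caps how many requests one keep-alive connection may serve before nginx closes it, and every forced close makes ATS open (and TLS-handshake) a fresh backend connection. A minimal illustrative vhost, not the production puppet template; the names and values are placeholders:

    server {
        listen 443 ssl;
        server_name api.svc.example.wmnet;   # placeholder vhost name

        # let ATS reuse each connection for more requests before nginx closes it
        # (the nginx default at the time was 100)
        keepalive_requests 200;
        keepalive_timeout  60s;

        location / {
            # hand the request off to the local application layer (illustrative)
            proxy_pass http://127.0.0.1:80;
        }
    }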
[14:47:02] 10Operations, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T240684 (10elukey) Just updated https://grafana.wikimedia.org/d/000000317/memcache-slabs adding a new row at the bottom '1.5.x metrics' with all the new m... [14:47:12] _joe_: so my (conspiracy) theory is that nginx doesn't behave well with heavy connection reuse, and that's the actual reason why keepalive_requests is 100 by default [14:47:45] (03PS1) 10Vgutierrez: install_server: Reimage ncredir@ulsfo as buster [puppet] - 10https://gerrit.wikimedia.org/r/570665 (https://phabricator.wikimedia.org/T243391) [14:47:47] (03PS1) 10Jdlrobson: Set wgLogoHD before adding wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570666 (https://phabricator.wikimedia.org/T244405) [14:48:09] <_joe_> ema: did you dig deeper in the stats of those servers? [14:48:55] _joe_: I've seen cpu usage, load, and temperature going up. Memory not affected [14:49:04] ema: there are some memory cleanup routines/functions on nginx triggered only on connection close... [14:49:07] <_joe_> ok lemme check something [14:49:14] (03CR) 10Jdlrobson: "its late and i am unable to swat this but this should fix the issue with the unexpected wordmark key in wgLogoHD" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570666 (https://phabricator.wikimedia.org/T244405) (owner: 10Jdlrobson) [14:49:19] (03CR) 10Andrew Bogott: Keystone: rotate and sync fernet tokens (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [14:49:23] at least those involving TLS session cache [14:50:08] <_joe_> ema: I think what happened is [14:50:21] <_joe_> ats funneled through those servers a ton of requests [14:50:43] <_joe_> https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-node=mw1277&fullscreen&panelId=52 [14:50:53] <_joe_> from 150 rps to 300 rps [14:50:55] <_joe_> lol [14:51:23] <_joe_> so we might want to either bump up to like 200 everywhere first [14:53:18] _joe_: when you say "everywhere", do you mean both canaries and non-canaries? [14:53:48] (03CR) 10Vgutierrez: [C: 03+2] install_server: Reimage ncredir@ulsfo as buster [puppet] - 10https://gerrit.wikimedia.org/r/570665 (https://phabricator.wikimedia.org/T243391) (owner: 10Vgutierrez) [14:54:40] PROBLEM - nova-compute proc minimum on cloudvirt1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:54:53] BTW, now that we are talking nginx @ applayer.. if it's using the same TLS session cache settings as the old edge TLS termination.. maybe it's worth some tuning.. 
cause right now it would be just wasting memory on those servers [14:56:42] !log depool and reimage ncredir4002 as buster - T243391 [14:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:45] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 [14:59:32] !log extend graphite1004 / graphite2003 fs +200G [14:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:13] <_joe_> vgutierrez: I want to move to use envoy soon [15:00:20] <_joe_> ema: yes that's what I mean [15:00:27] _joe_: ack, on it [15:00:27] _joe_: that's going to reveal new issues [15:00:44] <_joe_> vgutierrez: possibly [15:00:53] I'm not opposing to that BTW [15:00:57] RECOVERY - nova-compute proc minimum on cloudvirt1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:01:00] _joe_: another q. If eventgate-analytics caused such an issue moving to https [15:01:06] just saying that having 3 different HTTP implementations on the chain is always tricky [15:01:13] <_joe_> vgutierrez: but we use envoy exensively for other services [15:01:18] what's going to stop sessionstore from creating the same issue? [15:01:22] <_joe_> so I'm not sure what new issues you expect [15:01:26] <_joe_> akosiaris: sheer volume [15:01:31] even better, what's stopping echostore? [15:01:35] just the volume? [15:01:37] <_joe_> ^^ [15:01:39] <_joe_> yes [15:02:17] ok, but let's keep an eye out on those. I have fear [15:02:28] <_joe_> akosiaris: we can discuss later [15:02:33] <_joe_> in the ops meeting [15:02:42] https://grafana.wikimedia.org/d/IfJykaTZk/echostore?orgId=1 [15:02:43] <_joe_> serviceops [15:02:54] is already at 1.1k [15:04:02] https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1 is also at 4k [15:04:37] and https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-analytics&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All is at 10k [15:05:03] so, it isn't that far... [15:06:19] (03CR) 10Alexandros Kosiaris: profile::services_proxy: add temporarily entries for k8s services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570306 (owner: 10Giuseppe Lavagetto) [15:07:45] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [15:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:29] (03PS3) 10Jhedden: openstack: switch cloudvirt101[56] to ceph storage [puppet] - 10https://gerrit.wikimedia.org/r/570363 (https://phabricator.wikimedia.org/T243327) [15:08:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] "I think that's in the correct direction. It interferes indeed with the sharing we want to do of _helpers.tpl between the charts. 
Let's rev" [deployment-charts] - 10https://gerrit.wikimedia.org/r/570290 (owner: 10Giuseppe Lavagetto) [15:09:01] (03PS2) 10Alexandros Kosiaris: Revert "Update scaffold template names to use chart name" [deployment-charts] - 10https://gerrit.wikimedia.org/r/570290 (owner: 10Giuseppe Lavagetto) [15:09:33] (03PS1) 10Ema: profile::mediawiki::webserver: set api keepalive_requests to 200 [puppet] - 10https://gerrit.wikimedia.org/r/570670 (https://phabricator.wikimedia.org/T241145) [15:09:58] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:05] (03CR) 10Jhedden: [C: 03+2] "PCC results https://puppet-compiler.wmflabs.org/compiler1002/20661/" [puppet] - 10https://gerrit.wikimedia.org/r/570363 (https://phabricator.wikimedia.org/T243327) (owner: 10Jhedden) [15:10:29] _joe_: like this? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570670/ [15:10:48] (03PS3) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [15:11:00] (03CR) 10Giuseppe Lavagetto: [C: 03+1] profile::mediawiki::webserver: set api keepalive_requests to 200 [puppet] - 10https://gerrit.wikimedia.org/r/570670 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [15:11:04] <_joe_> ema: let's try [15:11:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] "@Jeena, fyi, we had unfortunately to revert this as we want to share _helpers.tpl across all charts and enforce their consistency. I 'll a" [deployment-charts] - 10https://gerrit.wikimedia.org/r/570290 (owner: 10Giuseppe Lavagetto) [15:12:35] _joe_: merging, perhaps we should speed up things a bit by forcing a puppet run to get a more uniform distribution of reqs? [15:13:02] 10Operations, 10LDAP-Access-Requests: Get access to Superset - https://phabricator.wikimedia.org/T244490 (10alexhollender) [15:13:06] <_joe_> yes [15:13:07] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [15:13:11] <_joe_> but do like -b 25 [15:13:14] ack [15:13:20] (03CR) 10Ema: [C: 03+2] profile::mediawiki::webserver: set api keepalive_requests to 200 [puppet] - 10https://gerrit.wikimedia.org/r/570670 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [15:14:17] (03PS2) 10Alexandros Kosiaris: helpers: Move most charts to common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/570064 [15:14:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] "https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/570064 fwiw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/570290 (owner: 10Giuseppe Lavagetto) [15:14:52] !log A:mw-api: force puppet run to increase keepalive_requests from 100 to 200 https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570670/ T241145 [15:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:55] T241145: Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 [15:17:37] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4025.ulsfo.wmnet'] ` and were **ALL** successful.
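The forced puppet run above ('A:mw-api', batches of 25 as suggested by _joe_) would be driven from a cluster management host with cumin. A sketch of the command shape; the alias and batch size come from the conversation, the rest is assumed:

    # roll the keepalive_requests change out evenly across the mw API servers,
    # 25 hosts at a time
    sudo cumin -b 25 'A:mw-api' 'run-puppet-agent'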
[15:17:45] (03PS4) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [15:18:01] (03PS8) 10Hnowlan: mediawiki: check mw versions match those on the deploy server [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) [15:18:58] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [15:20:54] (03CR) 10Hnowlan: mediawiki: check mw versions match those on the deploy server (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) (owner: 10Hnowlan) [15:21:06] (03PS5) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [15:23:03] !log pooling cp4025 with buster - T242093 [15:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:06] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [15:23:29] _joe_: change applied to all api servers, there's been a response time spike but it seems to be recovering now [15:23:29] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [15:24:12] I'm looking at https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-m&refresh=1m&fullscreen&panelId=9 [15:25:21] (03PS3) 10Mholloway: WIP: Proton charts first draft [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [15:27:26] !log installing sudo security updates on jessie [15:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:22] !log pooling ncredir4002 running buster - T243391 [15:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:24] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 [15:29:43] !log depool & reimage cp4024 as buster - T242093 [15:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:46] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [15:30:21] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4024.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [15:30:30] !log depool & reimage ncredir4001 as buster - T243391 [15:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:39] (03CR) 10Mholloway: "I'm keeping this moving while Mateus is out on vacation. 
Some comments (specifically the ones about using internal and not external endpoi" (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [15:32:09] (03CR) 10Krinkle: [C: 03+1] Set wgLogoHD before adding wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570666 (https://phabricator.wikimedia.org/T244405) (owner: 10Jdlrobson) [15:32:19] (03CR) 10Krinkle: [C: 03+1] "I can verify it today in beta and prod." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570666 (https://phabricator.wikimedia.org/T244405) (owner: 10Jdlrobson) [15:35:28] (03CR) 10Mholloway: "> Please add TLS termination, see cxserver or termbox as examples." [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [15:36:04] 10Operations, 10Traffic: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 (10Vgutierrez) [15:38:46] RECOVERY - Host mw2311 is UP: PING OK - Packet loss = 0%, RTA = 36.27 ms [15:39:06] PROBLEM - configured eth on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:41:01] !log installing jsoup security updates [15:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:36] PROBLEM - Check systemd state on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:50] PROBLEM - dhclient process on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [15:43:18] PROBLEM - DPKG on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:43:50] (03PS1) 10BBlack: pybal to both routers for eqsin [puppet] - 10https://gerrit.wikimedia.org/r/570672 (https://phabricator.wikimedia.org/T165765) [15:43:52] (03PS1) 10BBlack: pybal to both routers for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/570673 (https://phabricator.wikimedia.org/T165765) [15:43:55] (03PS1) 10BBlack: pybal to both routers for codfw primary [puppet] - 10https://gerrit.wikimedia.org/r/570674 (https://phabricator.wikimedia.org/T165765) [15:43:58] (03PS1) 10BBlack: pybal to both routers for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/570675 (https://phabricator.wikimedia.org/T165765) [15:45:19] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [15:45:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] "couple of typos, but otherwise LGTM" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570331 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [15:46:01] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [15:47:16] o/ akosiaris [15:47:26] 10Operations, 10ops-codfw, 10serviceops-radar: codfw: new mw servers not getting an IP when default to Stretch - https://phabricator.wikimedia.org/T244438 (10Papaul) Did the re-install on mw2311 it works . 
Thanks [15:47:50] (03PS2) 10Mholloway: Add chart for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/570162 (https://phabricator.wikimedia.org/T218733) [15:47:52] I'm ready to do a rollback as elukey suggested [15:48:14] Wanted to check if you are around since I'm out of the deploy window and wanted to make sure I'd have someone to help out if something goes weirdly. [15:48:15] (03PS6) 10Jbond: realm global: make the realm variable a global in labs [puppet] - 10https://gerrit.wikimedia.org/r/570331 (https://phabricator.wikimedia.org/T244222) [15:48:16] 10Operations, 10ops-codfw, 10serviceops-radar: codfw: new mw servers not getting an IP when default to Stretch - https://phabricator.wikimedia.org/T244438 (10MoritzMuehlenhoff) The ethernet adapter is slightly different than the BCM5720 we otherwise already run on stretch. E.g. on ms-be2050 it reports as... [15:48:29] (03CR) 10Jbond: "thanks updated" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570331 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [15:48:31] I don't expect it to but you never know. [15:48:38] PROBLEM - Check systemd state on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:45] PROBLEM - configured eth on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:48:51] PROBLEM - dhclient process on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [15:49:02] (03CR) 10BBlack: [C: 03+2] pybal to both routers for eqsin [puppet] - 10https://gerrit.wikimedia.org/r/570672 (https://phabricator.wikimedia.org/T165765) (owner: 10BBlack) [15:49:35] PROBLEM - Disk space on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw2311&var-datasource=codfw+prometheus/ops [15:49:43] PROBLEM - puppet last run on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:50:06] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [15:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:19] !log installing python-ecdsa security updates [15:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "+1, aside from 2 typos" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570343 (owner: 10Jbond) [15:50:53] halfak: here to help if needed [15:51:02] Thanks elukey [15:51:08] Did you create a task by any chance? [15:51:17] Oh I see it in the email [15:51:18] :) [15:51:40] (03CR) 10Alexandros Kosiaris: [C: 03+1] wmflib::require_domains: use require_domains instead of require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570348 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [15:51:45] PROBLEM - MD RAID on mw2311 is CRITICAL: connect to address 10.192.16.158 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:51:46] Oh that task is super old and probably unrelated. 
[15:51:47] https://phabricator.wikimedia.org/T242705 [15:52:10] * akosiaris around as well [15:52:16] !log lvs5003 - restart pybal for dual bgp session config - T180069 [15:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:19] T180069: Pybal should be able to advertise to multiple routers - https://phabricator.wikimedia.org/T180069 [15:52:24] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:04] !log halfak@deploy1001 Started deploy [ores/deploy@50a101a]: T242705 [15:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:07] T242705: Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 [15:53:14] !log lvs5002 - restart pybal for dual bgp session config - T180069 [15:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:54] (03CR) 10Vgutierrez: [C: 03+1] sslcert: ensure we run update-ca-certificates managing any services [puppet] - 10https://gerrit.wikimedia.org/r/570637 (https://phabricator.wikimedia.org/T244477) (owner: 10Jbond) [15:54:01] halfak: fwiw, that deploy is just exacerbating the issue. The trigger is uwsgi going a bit haywire on every reload/restart. Not sure why [15:54:16] RECOVERY - configured eth on mw2311 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:54:16] Gotcha. [15:54:17] hmm I wonder if it's on shutdown or startup [15:54:22] RECOVERY - dhclient process on mw2311 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [15:54:32] !log lvs5001 - restart pybal for dual bgp session config - T180069 [15:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:39] lemme know when you are doing with the deploy, I 'd like to verify that, but better not step on your toes [15:54:44] s/doing/done/ [15:55:00] RECOVERY - Disk space on mw2311 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw2311&var-datasource=codfw+prometheus/ops [15:55:05] RECOVERY - MD RAID on mw2311 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:55:08] RECOVERY - Check systemd state on mw2311 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:32] !log installing qemu security updates [15:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:30] !log pooling ncredir4001 running buster - T243391 [15:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:32] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 [15:56:55] (03PS1) 10Ema: profile::mediawiki::webserver: set keepalive_requests to 200 [puppet] - 10https://gerrit.wikimedia.org/r/570677 (https://phabricator.wikimedia.org/T241145) [15:57:25] _joe_: as things look stable on the api hosts, we should be fine setting keepalive_requests to 200 elsewhere too I hope? 
:) [15:57:39] !log halfak@deploy1001 Finished deploy [ores/deploy@50a101a]: T242705 (duration: 04m 35s) [15:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:46] <_joe_> sure [15:57:52] (03PS11) 10Jbond: wmflib::require_domains: add new function to to replace require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570343 [15:58:22] see this for the beneficial impact on ats-be new connections rate: https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&from=now-3h&to=now&fullscreen&panelId=6 [15:58:22] (03CR) 10BBlack: [C: 03+2] pybal to both routers for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/570673 (https://phabricator.wikimedia.org/T165765) (owner: 10BBlack) [15:58:36] you have to click on 'api-rw.discovery.wmnet' to see the interesting part [15:58:42] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventgate-analytics [15:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:56] (03CR) 10Jhedden: [C: 03+1] "LGTM, non-blocking thought: The host specific bits in openstack::keystone::fernet_tokens makes me think it should be a profile. I'm not su" [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [15:59:25] !log repool eventgate-analytics/eqiad. Experiment proved the failover wouldn't cause (on it's own) a problem. Experiment done. [15:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:32] OK looks like we're in the clear with ORES. This memory usage is really absurd. I'll investigate today. [15:59:37] _joe_: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570677/ [15:59:41] elukey, ^ [15:59:52] halfak: thanks! [16:00:29] (03PS1) 10Vgutierrez: install_server: Reimage ncredir@eqsin as buster [puppet] - 10https://gerrit.wikimedia.org/r/570678 (https://phabricator.wikimedia.org/T243391) [16:00:30] (03CR) 10CDanis: [C: 03+1] sslcert: ensure we run update-ca-certificates managing any services [puppet] - 10https://gerrit.wikimedia.org/r/570637 (https://phabricator.wikimedia.org/T244477) (owner: 10Jbond) [16:00:50] (03Abandoned) 10Ema: profile::mediawiki::webserver: increase keepalive_requests [puppet] - 10https://gerrit.wikimedia.org/r/570640 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [16:00:55] 10Operations, 10netops: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (10ayounsi) p:05Triage→03High [16:00:57] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4024.ulsfo.wmnet'] ` and were **ALL** successful. 
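[Editor's note: ema's Grafana link above shows the drop in new ats-be connections toward the MediaWiki origins after keepalive_requests was raised to 200. As a rough back-of-the-envelope check of why that helps, the sketch below uses an invented request volume and assumes connections are recycled because they hit the per-connection request limit rather than an idle timeout; only the ratio matters.]

```python
import math

def new_connections(requests: int, keepalive_requests: int) -> int:
    """Connections a proxy must open if the origin closes each persistent
    connection after serving `keepalive_requests` requests."""
    return math.ceil(requests / keepalive_requests)

requests = 10_000  # illustrative volume only; the real rates are in the dashboard above
print(new_connections(requests, 100))  # 100 new connections
print(new_connections(requests, 200))  # 50 new connections -- churn roughly halves
```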
[16:01:25] ACKNOWLEDGEMENT - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T244497 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:01:26] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [16:01:33] (03CR) 10Jbond: "thanks" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570343 (owner: 10Jbond) [16:02:05] (03CR) 10Jbond: [C: 03+2] sslcert: ensure we run update-ca-certificates managing any services [puppet] - 10https://gerrit.wikimedia.org/r/570637 (https://phabricator.wikimedia.org/T244477) (owner: 10Jbond) [16:03:03] halfak: thanks! [16:03:05] Memory usage in eqiad was way worse than in codfw this time. What's yp with that? [16:03:06] !log pooling cp4024 with buster - T242093 [16:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:08] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [16:03:11] https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&refresh=1m&from=now-7d&to=now-1m [16:03:28] I'm looking at memory usage under "Saturation" [16:03:36] !log depool & reimage cp4023 as buster - T242093 [16:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:59] 10Operations, 10netops: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (10ayounsi) Other side doesn't receive the light though: ` ayounsi@asw2-esams> show interfaces diagnostics optics xe-6/0/4 Physical interface: xe-6/0/4 Laser output power : 1.3540 mW / 1.3... [16:04:03] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4023.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [16:04:38] (03CR) 10Vgutierrez: [C: 03+2] install_server: Reimage ncredir@eqsin as buster [puppet] - 10https://gerrit.wikimedia.org/r/570678 (https://phabricator.wikimedia.org/T243391) (owner: 10Vgutierrez) [16:06:12] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 (10ema) >>! In T241145#5856431, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://too... 
[16:06:14] !log lvs4007 - restart pybal for dual bgp session config - T180069 [16:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:17] T180069: Pybal should be able to advertise to multiple routers - https://phabricator.wikimedia.org/T180069 [16:06:42] !log lvs4006 - restart pybal for dual bgp session config - T180069 [16:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:11] !log lvs4005 - restart pybal for dual bgp session config - T180069 [16:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:40] !log depool and reimage ncredir5002 as buster - T243391 [16:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:42] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 [16:09:26] (03CR) 10BBlack: [C: 03+2] pybal to both routers for codfw primary [puppet] - 10https://gerrit.wikimedia.org/r/570674 (https://phabricator.wikimedia.org/T165765) (owner: 10BBlack) [16:10:01] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10Halfak) OK rolled back. Looking at what happened, it seems like memory pressure was *way worse* in Eqiad than in Codfw * Eqiad: https://grafana.wikimedia.org/d/HIRrxQ6mk/o... [16:10:50] (03PS1) 10Superzerocool: Throttle rule for National Gallery of Canada Library and Archives edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570680 (https://phabricator.wikimedia.org/T244488) [16:15:20] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [16:17:49] RECOVERY - DPKG on mw2311 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [16:17:55] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10Halfak) When I start up the deployment ORES config locally with 4 workers, I can see that we are using ~2516000 bytes of RES for two processes. It looks like my available RA... [16:19:15] !log lvs2003 - restart pybal for dual bgp session config - T180069 [16:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:18] T180069: Pybal should be able to advertise to multiple routers - https://phabricator.wikimedia.org/T180069 [16:19:34] !log lvs2002 - restart pybal for dual bgp session config - T180069 [16:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:31] !log lvs2001 - restart pybal for dual bgp session config - T180069 [16:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:34] !log installing cyrus-sasl2 security updates on jessie [16:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:45] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [16:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:19] (03CR) 10BBlack: [C: 03+2] pybal to both routers for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/570675 (https://phabricator.wikimedia.org/T165765) (owner: 10BBlack) [16:23:34] (03PS2) 10BBlack: pybal to both routers for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/570675 (https://phabricator.wikimedia.org/T165765) [16:24:12] (03CR) 10Herron: "Is the commit message accurate about this creating a mixed raid level LVM layout? 
It looks like it would result in a non-LVM config (whic" [puppet] - 10https://gerrit.wikimedia.org/r/570600 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [16:24:58] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:34] (03CR) 10BBlack: [V: 03+2 C: 03+2] pybal to both routers for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/570675 (https://phabricator.wikimedia.org/T165765) (owner: 10BBlack) [16:25:39] 10Operations, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10Dzahn) Aww, thanks for conforming and thanks Moritz for fixing it. This is exactly why i wanted to test it. The capitalization c... [16:28:39] !log restarting apache on bromine to pick up SASL security updates [16:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:09] !log lvs1016 - restart pybal for dual bgp session config - T180069 [16:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:12] T180069: Pybal should be able to advertise to multiple routers - https://phabricator.wikimedia.org/T180069 [16:29:25] PROBLEM - DPKG on install1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [16:30:03] !log lvs1015 - restart pybal for dual bgp session config - T180069 [16:30:05] 10Operations, 10ops-codfw, 10ops-eqiad, 10ops-eqsin, and 3 others: Audit & update spares part tracking for all sites - https://phabricator.wikimedia.org/T243450 (10RobH) I'm adding in each site's project. Once an on-site engineer has audited and updated the spares tracking sheet for hardware, this task sh... [16:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:39] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 (10Gilles) therockapplauds We will keep an eye on the trend in coming days to check how much of a dent it made in the perf regres... [16:30:46] !log lvs1014 - restart pybal for dual bgp session config - T180069 [16:30:47] Aha! Our code for dropping those assets from memory isn't working. [16:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:00] Fixing this'll definitely help! [16:31:37] !log lvs1013 - restart pybal for dual bgp session config - T180069 [16:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:30] !log remove AS prepending in esams/knams [16:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:42] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@524be2b]: airflow: Update ores data transfer from drafttopic -> articletopic [16:35:49] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4023.ulsfo.wmnet'] ` and were **ALL** successful. 
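[Editor's note: halfak's "our code for dropping those assets from memory isn't working" above refers to releasing large model artifacts once the in-memory scorers have been built, so each worker's resident set stays small (see the T242705 comments earlier in this log). The sketch below is a generic illustration of that pattern, not the actual ORES code; the names and sizes are invented.]

```python
import gc

def load_model_assets():
    """Stand-in for reading large scoring-model artifacts from disk (invented)."""
    return {"embeddings": bytearray(50 * 1024 * 1024)}  # pretend multi-hundred-MB blob

def build_scorer(assets):
    """Pretend only a small derived structure is needed at request time."""
    return {"vocab_size": len(assets["embeddings"])}

assets = load_model_assets()
scorer = build_scorer(assets)

# The step under discussion: drop the raw assets once the derived structures
# exist. If a stray reference keeps the blob alive, every celery/uwsgi worker
# that forks from (or reloads within) this process carries the full asset in
# its resident set, which is how hosts end up under memory pressure and OOM.
del assets
gc.collect()

print(scorer["vocab_size"])
```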
[16:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:02] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@524be2b]: airflow: Update ores data transfer from drafttopic -> articletopic (duration: 00m 19s) [16:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:52] (03PS3) 10Clarakosi: Add restbase202[123] to hiera [puppet] - 10https://gerrit.wikimedia.org/r/570094 (https://phabricator.wikimedia.org/T244178) [16:38:15] !log pooling cp4023 with buster - T242093 [16:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:18] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [16:38:39] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765 (10BBlack) Status update: what's missing here is codfw, which will happen when we finish its hardware upgrade switch to lvs2007-10 [16:38:44] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [16:39:23] (03PS6) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [16:41:10] RECOVERY - puppet last run on mw2311 is OK: OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:41:41] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [16:43:35] (03CR) 10Jhedden: [C: 03+1] "should use lookup instead of hiera, but overall looks great!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [16:47:04] (03CR) 10Arturo Borrero Gonzalez: Keystone: rotate and sync fernet tokens (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [16:49:12] !log pooling ncredir5002 running buster - T243391 [16:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:15] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 [16:49:33] 10Operations, 10Traffic: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 (10Vgutierrez) [16:52:23] (03PS7) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [16:54:44] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [16:56:49] (03PS8) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [16:59:08] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [17:00:04] godog and _joe_: Time to snap out of that daydream and deploy Puppet SWAT(Max 6 patches). Get on with it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:00:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [17:01:00] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2001 is OK: SSL OK - Certificate centrallog2001.codfw.wmnet valid until 2024-11-16 16:04:24 +0000 (expires in 1744 days) https://wikitech.wikimedia.org/wiki/Logs [17:03:53] (03CR) 10Effie Mouzeli: [C: 03+1] "worth a shot" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570255 (owner: 10Giuseppe Lavagetto) [17:06:23] (03PS2) 10Andrew Bogott: Keystone: set max_active_keys for fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570507 (https://phabricator.wikimedia.org/T243418) [17:06:25] (03PS9) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [17:08:37] 10Operations, 10serviceops: Test and deploy mcrouter 0.41 - https://phabricator.wikimedia.org/T244476 (10jijiki) [17:08:55] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [17:09:06] (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/569564 (https://phabricator.wikimedia.org/T242585) (owner: 10Filippo Giunchedi) [17:11:26] (03PS6) 10Filippo Giunchedi: cassandra: restbase-dev logs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/569564 (https://phabricator.wikimedia.org/T242585) [17:18:37] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10Krinkle) In an IRC conversation with @volans we considered whether it's possible that the 50ms regression com... 
[17:19:02] (03PS3) 10Mholloway: Add chart for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/570162 (https://phabricator.wikimedia.org/T218733) [17:19:35] (03PS1) 10Elukey: profile::cdh::apt: add bigtop repository [puppet] - 10https://gerrit.wikimedia.org/r/570685 (https://phabricator.wikimedia.org/T244499) [17:23:23] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10RobH) p:05Triage→03Medium [17:23:34] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10RobH) [17:25:52] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10wiki_willy) test [17:26:16] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 78522104 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:26:57] (03PS4) 10Mholloway: WIP: Proton charts first draft [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [17:27:55] (03PS4) 10Mholloway: Add chart for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/570162 (https://phabricator.wikimedia.org/T218733) [17:28:08] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 29776 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:30:33] (03CR) 10Mholloway: "> Please add TLS termination, see cxserver or termbox as examples." [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [17:32:15] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/20663/" [puppet] - 10https://gerrit.wikimedia.org/r/570685 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [17:32:17] !log set performance cpu scaling governor on maps* [17:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:06] PROBLEM - Host es2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:10] PROBLEM - Host ganeti2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:29] uh? [17:44:56] Looks like only mgmt [17:45:04] PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:45:06] PROBLEM - Host db2077.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:06] PROBLEM - Host db2087.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:12] papaul: you working on mgmt stuff? 
^ [17:45:20] PROBLEM - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:22] XioNoX: ^ [17:45:26] PROBLEM - Host elastic2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:26] PROBLEM - Host ores2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:34] PROBLEM - Host db2112.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:44] PROBLEM - Host scs-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:46:18] PROBLEM - Host ganeti2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:46:54] (03CR) 10Ppchelko: "Per Joe adding it to conftool-data can be done at any point, so let's come back to PS2" [puppet] - 10https://gerrit.wikimedia.org/r/570094 (https://phabricator.wikimedia.org/T244178) (owner: 10Clarakosi) [17:46:54] PROBLEM - Host es2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:47:20] PROBLEM - Host restbase2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:47:27] PROBLEM - Host mc2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:47:27] PROBLEM - Host mc2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:47:38] PROBLEM - Host restbase2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:47:38] PROBLEM - Host restbase2022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:48:16] PROBLEM - Host pc2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:49:01] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2312.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [17:49:30] PROBLEM - Host cloudservices2002-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:07] 10Operations, 10Gerrit-Privilege-Requests, 10SRE-Access-Requests: Request for +2 access to mediawiki-config - https://phabricator.wikimedia.org/T244508 (10Eevans) [17:53:36] looks like rack C1 [17:54:00] yea.. all of those are in the same rack i think [17:54:11] papaul, wiki_willy ^ [18:00:04] cscott, arlolra, subbu, halfak, and accraze: That opportune time is upon us again. Time for a Services – Graphoid / Parsoid / Citoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T1800). 
[18:01:01] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10elukey) These hosts need to be in 10G racks :) [18:03:52] RECOVERY - Host elastic2031.mgmt is UP: PING WARNING - Packet loss = 64%, RTA = 357.42 ms [18:03:52] RECOVERY - Host ores2005.mgmt is UP: PING WARNING - Packet loss = 64%, RTA = 357.34 ms [18:03:52] RECOVERY - Host db2112.mgmt is UP: PING WARNING - Packet loss = 37%, RTA = 37.05 ms [18:03:58] RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.60 ms [18:04:02] RECOVERY - Host scs-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.79 ms [18:04:04] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [18:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:34] RECOVERY - Host ganeti2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.00 ms [18:04:54] RECOVERY - Host db2077.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.78 ms [18:05:10] RECOVERY - Host es2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.80 ms [18:05:30] RECOVERY - Host mc2027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms [18:05:38] RECOVERY - Host restbase2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.45 ms [18:05:44] RECOVERY - Host mc2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.62 ms [18:05:54] RECOVERY - Host restbase2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.62 ms [18:05:54] RECOVERY - Host restbase2022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.30 ms [18:06:19] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:07] RECOVERY - Host cloudservices2002-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.92 ms [18:07:27] RECOVERY - Host pc2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.98 ms [18:07:40] RECOVERY - Host es2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.34 ms [18:07:40] RECOVERY - Host ganeti2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.00 ms [18:08:35] RECOVERY - Host db2087.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.00 ms [18:08:49] RECOVERY - Host db2125.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.56 ms [18:11:13] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2312.codfw.wmnet'] ` and were **ALL** successful. [18:19:31] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2313.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[18:28:49] (03PS1) 10Jforrester: [trwiki] Enable the WikidataPageBanner extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570691 (https://phabricator.wikimedia.org/T244369) [18:30:05] (03PS10) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [18:30:58] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [18:34:29] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [18:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:45] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:56] (03PS11) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [18:39:20] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [18:40:27] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2313.codfw.wmnet'] ` and were **ALL** successful. [18:41:40] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2314.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [18:41:46] (03PS12) 10Andrew Bogott: Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) [18:46:56] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@8e15868]: Update mobileapps to ceeb950 [18:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:00] 10Operations, 10ops-codfw: codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` wdqs2007.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020020... [18:50:05] 10Operations, 10ops-codfw: codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wdqs2007.codfw.wmnet'] ` Of which those **FAILED**: ` ['wdqs2007.codfw.wmnet'] ` [18:51:54] (03CR) 10Jhedden: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [18:52:15] 10Operations, 10ops-codfw: codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` wdqs2007.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020020... 
[18:52:18] 10Operations, 10ops-codfw: codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wdqs2007.codfw.wmnet'] ` Of which those **FAILED**: ` ['wdqs2007.codfw.wmnet'] ` [18:52:56] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: set max_active_keys for fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570507 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [18:53:23] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@8e15868]: Update mobileapps to ceeb950 (duration: 06m 27s) [18:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:29] 10Operations, 10ops-codfw, 10fundraising-tech-ops: codfw: rack/setup/install 3 new payments server for frack - https://phabricator.wikimedia.org/T244169 (10Papaul) [18:55:58] !log restarting apache on tungsten/dbmonitor to pick up cyrus-sasl security updates [18:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:41] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [18:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:39] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: rotate and sync fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/570521 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [18:58:56] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T1900). [19:00:04] Addshore: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:30] o/ [19:00:59] Pchelolo: you're up first :) [19:01:24] addshore: oh gosh sorry, I thought I removed it.. [19:01:26] !log restarting exim on mendelevium to pick up cyrus-sasl security updates [19:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:42] Pchelolo: no worries :D If you dont have anything I'll get started right away! [19:02:04] go for it. I've took my change off [19:02:10] (03CR) 10Addshore: [C: 03+2] Enable EntitySourceBasedFederation for group1 again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570642 (owner: 10Addshore) [19:03:10] (03Merged) 10jenkins-bot: Enable EntitySourceBasedFederation for group1 again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570642 (owner: 10Addshore) [19:03:40] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2314.codfw.wmnet'] ` and were **ALL** successful. [19:05:12] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2315.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[19:05:19] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation for group1 T243395 (duration: 01m 10s) [19:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:22] T243395: wmgUseEntitySourceBasedFederation true for Wikimedia clients (all sites) - https://phabricator.wikimedia.org/T243395 [19:06:43] (03CR) 10Jforrester: "Someone from the Web team should sign this off before deployment, in case there are any special requirements." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570691 (https://phabricator.wikimedia.org/T244369) (owner: 10Jforrester) [19:07:06] addshore: oh, can i add one instead? some no-op cleanup [19:07:11] yupp [19:07:26] (03CR) 10Addshore: [C: 03+2] wmgUseEntitySourceBasedFederation everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570584 (https://phabricator.wikimedia.org/T243395) (owner: 10Addshore) [19:07:39] We should sling out the Jdlrobson UBN one. [19:07:45] But I should review it first. [19:07:47] oooh, whtas that one? [19:07:55] wmf.18 i/r/ResourceLoaderSkinModule.php:264 PHP Notice: Undefined offset: 1 [19:08:07] aaah, thats all i see in logstash right now ;) [19:08:20] T244405 -> T244405 [19:08:21] T244405: PHP Notice: Undefined offset: 1 in ResourceLoaderSkinModule.php - https://phabricator.wikimedia.org/T244405 [19:08:22] Err. [19:08:24] (03Merged) 10jenkins-bot: wmgUseEntitySourceBasedFederation everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570584 (https://phabricator.wikimedia.org/T243395) (owner: 10Addshore) [19:08:28] T244405 -> https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/570666 [19:09:14] It looks good to me. addshore, want to deploy? [19:09:22] (When you have time. ;-)) [19:09:23] James_F: [19:09:27] ughh. [19:09:29] Yo. [19:09:33] that doesn't look like it fixes the issue? [19:10:06] It moves the creation of wgLogos['wordmark'] to /after/ it's copied into $wgLogosHD. [19:10:07] i was able to repro that bug locally without using $wgLogoHD [19:10:25] So the old code processing $wgLogosHD won't find a key it doesn't understand. [19:10:31] hmm, or maybe [19:10:33] Hmm. [19:10:37] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation everywhere T243395 (duration: 01m 07s) [19:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:43] i did not try having them both set. maybe that works [19:10:44] T243395: wmgUseEntitySourceBasedFederation true for Wikimedia clients (all sites) - https://phabricator.wikimedia.org/T243395 [19:10:59] we should try deploying that, at worst it's harmless [19:11:04] MatmaRex: Were you testing with master or with wmf.18? [19:11:17] Yeah. Current error log is grim: [19:11:27] master [19:11:34] https://www.irccloud.com/pastebin/B00H27uE/ [19:11:48] 233+12 = sad James. [19:12:34] also, is the whole syncing IS.php thing all sorted? or still something that might happen? [19:12:42] (sync but not actually get loaded) [19:12:43] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕑☕ sudo cumin A:netflow "disable-puppet 'rollout of I60692f0e8 T237587 cdanis'" [19:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:48] addshore: If you're unsure, sync it twice. [19:12:50] T237587: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 [19:12:52] I will then :D [19:13:29] addshore: Still occasionally happening. 
Roan was masterfully debugging the last time we noticed it in production, but we didn't determine the cause, just eliminated some potentials. [19:14:09] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation everywhere T243395, sync again for luck (duration: 01m 06s) [19:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:23] (03CR) 10CDanis: [C: 03+2] "> Patch Set 7: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570509 (https://phabricator.wikimedia.org/T237587) (owner: 10CDanis) [19:14:24] cool, my config changes are done, just waiting for backports to merge [19:14:38] addshore: Did you touch IS before the second sync? [19:14:45] Not sure if that'd be necessary, but… [19:14:57] no (I didnt have to when I first encountered this issue) [19:15:03] Hmm, OK. [19:15:30] RECOVERY - DPKG on install1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:16:40] MatmaRex: shall i do yours? [19:16:51] (03PS2) 10Addshore: Fix incorrect spellings of "RESTBase" in config variables (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570409 (owner: 10Bartosz Dziewoński) [19:16:54] addshore: please do [19:17:01] (03PS2) 10Addshore: Fix incorrect spellings of "RESTBase" in config variables (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570410 (owner: 10Bartosz Dziewoński) [19:17:03] (03CR) 10Jforrester: [C: 03+1] Fix incorrect spellings of "RESTBase" in config variables (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570409 (owner: 10Bartosz Dziewoński) [19:17:08] addshore: have to sync them separately [19:17:11] ack [19:17:16] (03CR) 10Addshore: [C: 03+2] Fix incorrect spellings of "RESTBase" in config variables (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570409 (owner: 10Bartosz Dziewoński) [19:17:17] IS/CS/IS. [19:17:26] Otherwise prod will be sad. [19:17:38] yupp [19:17:43] (03CR) 10jerkins-bot: [V: 04-1] Fix incorrect spellings of "RESTBase" in config variables (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570409 (owner: 10Bartosz Dziewoński) [19:17:49] i dont enjoy making prod sad [19:18:08] (03CR) 10Addshore: [C: 03+2] Fix incorrect spellings of "RESTBase" in config variables (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570409 (owner: 10Bartosz Dziewoński) [19:18:20] jenkins had a little sneeze there [19:18:31] Theoretically scap will stop you making prod sad. [19:18:40] "remote: fatal: Not a git repository". sounds ominous [19:18:49] But that's like "theoretically, Wikipedia doesn't work". 
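[Editor's note: the "IS/CS/IS ... Otherwise prod will be sad" exchange above concerns sync order when a change touches both InitialiseSettings.php (definitions) and CommonSettings.php (consumers): the files are synced separately, so every intermediate state has to stay consistent. A minimal sketch of the failure mode, using hypothetical setting names and Python dicts in place of the PHP config files:]

```python
initialise_old = {"wmgExampleRestbaseUrl": "https://restbase.example"}   # old name only
initialise_new = {"wmgExampleRESTBaseUrl": "https://restbase.example"}   # new name only

def common_settings(initialise, reader_uses_new_name):
    """Stand-in for CommonSettings.php reading a value that
    InitialiseSettings.php is expected to define."""
    key = "wmgExampleRESTBaseUrl" if reader_uses_new_name else "wmgExampleRestbaseUrl"
    return initialise[key]  # KeyError here is the "prod will be sad" case

# If the consumer file is switched to the new name while some appserver still
# has only the old definition, requests fail during that window:
try:
    common_settings(initialise_old, reader_uses_new_name=True)
except KeyError:
    print("inconsistent intermediate state between the two syncs")

# A safe sequence keeps every intermediate state valid: define the new name
# alongside the old and sync, switch the reader and sync, then drop the old
# name and sync again -- presumably why the rename above shipped as two patches.
both_names = {**initialise_old, **initialise_new}
print(common_settings(both_names, reader_uses_new_name=True))
```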
[19:19:01] (03Merged) 10jenkins-bot: Fix incorrect spellings of "RESTBase" in config variables (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570409 (owner: 10Bartosz Dziewoński) [19:20:16] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [19:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:40] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Fix incorrect spellings of "RESTBase" in config variables (1/2) (duration: 01m 06s) [19:20:41] (03CR) 10Addshore: [C: 03+2] Fix incorrect spellings of "RESTBase" in config variables (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570410 (owner: 10Bartosz Dziewoński) [19:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:35] (03Merged) 10jenkins-bot: Fix incorrect spellings of "RESTBase" in config variables (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570410 (owner: 10Bartosz Dziewoński) [19:22:32] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:31] !log manual puppet run on netflow1001 looked good; ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕑☕ sudo cumin A:netflow "run-puppet-agent --enable 'rollout of I60692f0e8 T237587 cdanis'" [19:23:32] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: Fix incorrect spellings of "RESTBase" in config variables (2/2) 1.CS (duration: 01m 07s) [19:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:33] T237587: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 [19:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:01] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Fix incorrect spellings of "RESTBase" in config variables (2/2) 2.IS (duration: 01m 06s) [19:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:06] MatmaRex: done [19:25:21] thanks addshore [19:27:16] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2315.codfw.wmnet'] ` and were **ALL** successful. [19:28:40] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.18/extensions/Babel/includes/Babel.php: T243713 Timeout for meta api call from 10 to 2 seconds. (duration: 01m 07s) [19:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:43] T243713: Time-out error; Babel/WikibaseRepo being somehow uncached, overloading the API, and causing general outage - https://phabricator.wikimedia.org/T243713 [19:29:58] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.16/extensions/Babel/includes/Babel.php: T243713 Timeout for meta api call from 10 to 2 seconds. 
(duration: 01m 07s) [19:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:42] !log depool cp1075 (eqiad text) for minor experimentation [19:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:58] 10Operations, 10ops-codfw: codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (10Papaul) [19:32:17] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.18/extensions/WikibaseLexemeCirrusSearch: T244479 Update namespace for PrefetchingTermLookup & fix tests (duration: 01m 06s) [19:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:19] T244479: Argument 5 passed to Wikibase\Lexeme\Search\Elastic\LexemeSearchEntity::__construct() must be an instance of Wikibase\Lib\Store\PrefetchingTermLookup, instance of Wikibase\DataAccess\ByTypeDispatchingPrefetchingTermLookup given, called in /srv/mediawiki/php-1.35.0-wmf.18/extensions/WikibaseLexemeCirrusSearch/WikibaseSearch.entitytypes.repo.php on line 41 - https://phabricator.wikimedia.org/T244479 [19:33:00] 10Operations, 10ops-codfw: codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (10Papaul) a:05Papaul→03Gehel @Gehel All yours [19:33:46] papaul: thanks ! [19:33:56] !log SWAT done! [19:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:02] !log re-pool cp1075 (eqiad text) [19:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:38] oh it went live, yay [19:39:13] addshore: Were you not going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/570666 ? :-) [19:39:19] jouncebot: next [19:39:19] In 0 hour(s) and 20 minute(s): Mediawiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T2000) [19:39:27] (I'll do it.) [19:39:34] (03PS2) 10Jforrester: Set wgLogoHD before adding wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570666 (https://phabricator.wikimedia.org/T244405) (owner: 10Jdlrobson) [19:39:42] (03CR) 10Jforrester: [C: 03+2] Set wgLogoHD before adding wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570666 (https://phabricator.wikimedia.org/T244405) (owner: 10Jdlrobson) [19:40:37] (03Merged) 10jenkins-bot: Set wgLogoHD before adding wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570666 (https://phabricator.wikimedia.org/T244405) (owner: 10Jdlrobson) [19:42:19] James_F: sorry, I didnt see the consensus about if it was right or not while i was deploying the other things! :) [19:42:35] No worries. :-) [19:42:47] Want to get it out before the train, to make the dashboard more readable. [19:43:03] yupp, very understandable [19:45:13] oh the logo fix... that would be nice, it's still flooding my logs last I looked [19:45:19] Yeah, sorry about that [19:45:24] eh things happen [19:45:28] I deployed the fix for the previous bug. [19:45:30] (03PS1) 10BryanDavis: Add "migrate" action for 2020 Kubernetes migration [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570702 (https://phabricator.wikimedia.org/T244293) [19:45:51] bit by bit [19:45:55] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T244405 Set wgLogoHD before adding wordmark (duration: 01m 06s) [19:45:57] Which replaced a frequent error with one that's just common. 
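[Editor's note: the wgLogoHD/wordmark fix that just went out was explained earlier in this log: creating wgLogos['wordmark'] was moved to after wgLogos is copied into $wgLogosHD, so older code walking the HD copy never sees a key it does not understand. A toy Python rendering of that ordering, with dicts standing in for the PHP config arrays and invented values:]

```python
def make_logos():
    return {"1x": "logo.png", "2x": "logo@2x.png"}

# Buggy order: the wordmark entry is added before the copy, so legacy code
# iterating the HD copy trips over a key shaped unlike the others (T244405).
logos = make_logos()
logos["wordmark"] = {"src": "wordmark.svg", "width": 120}
logos_hd_buggy = dict(logos)

# Fixed order (as the deployed change was described above): copy first, then
# add the key that only newer code knows how to handle.
logos = make_logos()
logos_hd_fixed = dict(logos)
logos["wordmark"] = {"src": "wordmark.svg", "width": 120}

assert "wordmark" in logos_hd_buggy
assert "wordmark" not in logos_hd_fixed
print("HD copy no longer carries the wordmark entry")
```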
[19:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:59] T244405: PHP Notice: Undefined offset: 1 in ResourceLoaderSkinModule.php - https://phabricator.wikimedia.org/T244405 [19:46:36] heh [19:48:44] Fixed. Good. [19:49:09] 10Puppet, 10Release-Engineering-Team-TODO, 10User-brennen: logspam-watch: Some exceptions may be missing from logspam - https://phabricator.wikimedia.org/T244528 (10brennen) [19:49:14] (03PS2) 10BryanDavis: Add "migrate" action for 2020 Kubernetes migration [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570702 (https://phabricator.wikimedia.org/T244293) [19:50:21] 10Puppet, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), 10User-brennen: logspam-watch: Add interactive sorting / filtering - https://phabricator.wikimedia.org/T242882 (10brennen) [19:50:23] yay! [19:51:22] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10RobH) p:05Triage→03Medium [19:51:35] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10RobH) [19:52:05] 10Operations, 10hardware-requests: Expand Eqiad Ganeti row_A capacity - https://phabricator.wikimedia.org/T242885 (10RobH) 05Open→03Resolved memory ordered on T243442 and implementation tracking on T244530. resolving this task [19:52:27] (Prod clear) [19:54:41] 10Puppet, 10Release-Engineering-Team, 10Release-Engineering-Team-TODO, 10User-brennen: logspam-watch: Some exceptions may be missing from logspam - https://phabricator.wikimedia.org/T244528 (10brennen) p:05Triage→03Low [19:56:31] 10Puppet, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Deployment services), 10User-brennen: logspam-watch: Some exceptions may be missing from logspam - https://phabricator.wikimedia.org/T244528 (10brennen) [19:56:56] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10Dzahn) As of today these hosts are in site.pp with spare::system role and not in production yet. So while standard Icinga alerts should be downtimed, no actual Ganeti service would be affected... [19:59:44] (03PS1) 10Jforrester: [cswikisource] Enable VisualEditor in the 'Edice' (102) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570706 (https://phabricator.wikimedia.org/T244133) [20:00:04] twentyafterfour and marxarelli: May I have your attention please! Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200206T2000) [20:00:58] (03CR) 10BryanDavis: "Completely untested at this point. As all the changes stayed within the webservice script iteself I should be able to test this out with s" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570702 (https://phabricator.wikimedia.org/T244293) (owner: 10BryanDavis) [20:04:13] (03PS2) 10Dzahn: site: define 2 codfw appservers as canary_appservers [puppet] - 10https://gerrit.wikimedia.org/r/570405 (https://phabricator.wikimedia.org/T242606) [20:08:57] twentyafterfour: o/ [20:10:12] (03CR) 10Dzahn: [C: 04-2] "ugh https://puppet-compiler.wmflabs.org/compiler1002/20670/mw2163.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/570405 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [20:12:16] (03CR) 10Dzahn: [C: 04-2] "so.. 
"canary API" vs "API" roles was no actual difference in puppet but "canary app" vs "app" are making a difference and a bunch of extra" [puppet] - 10https://gerrit.wikimedia.org/r/570405 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [20:13:35] marxarelli: hey, everything looks clear for the train [20:13:57] I'm about to deploy to all wikis shortly [20:14:43] right on. i'll keep an eye on errors [20:18:58] (03PS1) 1020after4: all wikis to 1.35.0-wmf.18 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570710 [20:19:00] (03CR) 1020after4: [C: 03+2] all wikis to 1.35.0-wmf.18 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570710 (owner: 1020after4) [20:20:17] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.18 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570710 (owner: 1020after4) [20:21:48] Whee. [20:24:02] (03PS1) 10EBernhardson: Give NS_HELP same weight as NS_MAIN in search on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570712 [20:24:53] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [20:24:57] uh oh [20:25:17] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [20:25:17] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [20:25:17] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:25:18] reverting to .16 [20:25:24] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.18 refs T233866 [20:25:25] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be [20:25:25] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:27] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: 
/en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [20:25:27] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:28] T233866: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 [20:25:31] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [20:25:31] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:31] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain} [20:25:31] ections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was [20:25:31] n}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:25:32] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be [20:25:32] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:33] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain} [20:25:33] ections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was [20:25:34] n}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was 
received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:25:34] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: [20:25:35] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{ [20:25:35] eatured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [20:25:36] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [20:25:36] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [20:25:37] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:25:37] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [20:25:38] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [20:25:38] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:25:39] PROBLEM - PHP7 rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:25:39] PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: 
/{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:25:40] PROBLEM - PHP7 rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:25:40] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be [20:25:41] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:41] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [20:25:42] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [20:25:42] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:25:43] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [20:25:43] PROBLEM - Apache HTTP on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:25:44] PROBLEM - Apache HTTP on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:25:44] PROBLEM - PHP7 rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:25:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:25:45] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: 
/en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed ou [20:25:46] se was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:46] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /en.wikipedia.org/v1/pag [20:25:47] Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:25:47] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [20:25:48] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:48] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most [20:25:49] January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/ra [20:25:49] eve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [20:25:50] PROBLEM - Nginx local proxy to apache on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:25:50] PROBLEM - PHP7 rendering on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:25:51] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:25:51] PROBLEM - PHP7 rendering on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:25:54] (03CR) 10jerkins-bot: [V: 04-1] Give NS_HELP same weight as NS_MAIN in search on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570712 (owner: 10EBernhardson) [20:26:44] https://meta.wikimedia.org/w/api.php?action=sitematrix&format=json&smtype=language&smlangprop=dir%7Ccode%7Csite&smsiteprop=dbname&formatversion=2 seems to be broken [20:26:47] 
(03PS1) 1020after4: group2 wikis to 1.35.0-wmf.16 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570713 [20:26:49] (03CR) 1020after4: [C: 03+2] group2 wikis to 1.35.0-wmf.16 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570713 (owner: 1020after4) [20:27:30] (03CR) 10EBernhardson: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570712 (owner: 10EBernhardson) [20:27:38] Reedy: wmf.18 seems to have broken a lot of things, rolling back [20:27:41] that's a lot of pages [20:27:42] just got a pile of pages [20:27:42] revert? [20:27:44] sadface [20:27:50] TFW interrupted from writing an incident report by another page 🙃 [20:27:52] im here [20:27:52] timeouts [20:27:54] ack [20:28:11] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:12] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:13] jynus: yes reverting [20:28:14] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:16] let's revert? [20:28:20] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:24] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:26] (03CR) 10jerkins-bot: [V: 04-1] group2 wikis to 1.35.0-wmf.16 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570713 (owner: 1020after4) [20:28:26] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:26] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:31] mukunda already is. 
[13:26:49] (CR) 20after4: [C: +2] group2 wikis to 1.35.0-wmf.16 refs T233866 [mediawiki-config] - https://gerrit.wikimedia.org/r/570713 (owner: 20after4) [20:28:34] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:37] (03CR) 1020after4: [V: 03+2 C: 03+2] group2 wikis to 1.35.0-wmf.16 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570713 (owner: 1020after4) [20:28:38] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:38] PROBLEM - Apache HTTP on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:28:38] PROBLEM - Apache HTTP on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:28:38] PROBLEM - PHP7 rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:28:38] PROBLEM - Apache HTTP on mw1243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:28:40] wow [20:28:41] PROBLEM - Apache HTTP on mw1255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:28:42] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:42] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:28:42] PROBLEM - Nginx local proxy to apache on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:28:42] PROBLEM - Apache HTTP on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:28:42] PROBLEM - PHP7 rendering on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:28:42] PROBLEM - Apache HTTP on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:28:42] PROBLEM - Apache HTTP on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:28:44] PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:28:52] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ReadTimeoutError(HTTPConnectionPool(host=wikifeeds.svc.codfw.wmnet, port=8889): Read timed out. 
(read timeout=15),): /?spec https://wikitech.wikimedia.org/wiki/Wikifeeds [20:28:56] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:29:01] :( [20:29:14] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:29:14] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:29:14] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:29:22] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: [20:29:22] g/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:29:22] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [mostread] https://wikit [20:29:22] /wiki/Services/Monitoring/restbase [20:29:24] PROBLEM - graphoid endpoints health on scb2003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:29:25] <_joe_> grafana is not responding [20:29:25] #rip [20:29:41] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [20:29:44] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:29:47] no idea what went wrong with wmf.18 - I'm syncing the revert back to wmf.16 right now [20:29:50] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1344.eqiad.wmnet, mw1272.eqiad.wmnet, mw1320.eqiad.wmnet, mw1250.eqiad.wmnet, mw1266.eqiad.wmnet, mw1223.eqiad.wmnet, mw1282.eqiad.wmnet, mw1333.eqiad.wmnet, mw1241.eqiad.wmnet, mw1221.eqiad.wmnet, mw1317.eqiad.wmnet, mw1224.eqiad.wmnet, mw1316.eqiad.wmnet, mw1325.eqiad.wmnet, mw1312.eqiad.wmnet, mw1347.eqiad.wmnet, mw1342 [20:29:51] 270.eqiad.wmnet, mw1341.eqiad.wmnet, mw1332.eqiad.wmnet, mw1313.eqiad.wmnet, mw1346.eqiad.wmnet, mw1246.eqiad.wmnet, mw1322.eqiad.wmnet, mw1288.eqiad.wmnet, 
mw1281.eqiad.wmnet, mw1314.eqiad.wmnet, mw1323.eqiad.wmnet, mw1227.eqiad.wmnet, mw1233.eqiad.wmnet, mw1327.eqiad.wmnet, mw1245.eqiad.wmnet, mw1340.eqiad.wmnet, mw1258.eqiad.wmnet, mw1225.eqiad.wmnet, mw1264.eqiad.wmnet, mw1255.eqiad.wmnet, mw1257.eqiad.wmnet, mw1244.eqiad.wmn [20:29:51] wmnet, mw1234.eqiad.wmnet, mw1235.eqiad.wmnet, mw1231.eqiad.wmnet, mw1315.eqiad.wmnet, mw1 https://wikitech.wikimedia.org/wiki/PyBal [20:29:54] PROBLEM - LVS HTTP IPv4 on wikifeeds.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:30:07] !log twentyafterfour@deploy1001 Scap failed!: 9/11 canaries failed their endpoint checks(http://en.wikipedia.org) [20:30:08] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:10] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/media/math/check/{ty [20:30:10] eck test formula) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out befor [20:30:10] received: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response wa [20:30:10] rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:30:12] doh! 
[20:30:14] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:17] --force [20:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:22] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:22] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:22] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:23] <_joe_> twentyafterfour: use --force [20:30:27] <_joe_> sigh [20:30:34] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:40] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3052 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 9.958 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:30:42] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp3052 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 8.475 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:30:48] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 9.298 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:30:49] !log sync-wikiversions --force [20:30:50] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:54] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3062 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 8.225 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:30:56] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:56] PROBLEM - Nginx local proxy to apache on mw1263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:30:56] PROBLEM - Nginx local proxy to apache on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:30:56] PROBLEM - Nginx local proxy to apache on mw1271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:30:56] PROBLEM - Nginx local proxy to apache on mw1275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:30:56] PROBLEM - PHP7 rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:30:56] PROBLEM - Nginx local proxy to apache on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:30:57] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:30:57] PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:30:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:58] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:30:58] PROBLEM - Nginx local proxy to apache on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:31:00] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:31:02] _joe_: yeah, should have used that the first time [20:31:04] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:31:06] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:31:10] we really need to make a rollback-wikiversions that always does --force [20:31:15] Yeah. [20:31:16] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:31:20] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3064 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 6.330 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:21] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 6.529 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:24] RECOVERY - PHP7 rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 8.895 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:24] RECOVERY - Apache HTTP on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.740 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:24] RECOVERY - Nginx local proxy to apache on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.836 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:26] RECOVERY - PHP7 rendering on mw1344 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 8.486 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:26] RECOVERY - PHP7 rendering on mw1341 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 7.447 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:26] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.155 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:26] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 7.175 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:28] RECOVERY - Apache HTTP on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.783 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:28] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 2.176 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:30] RECOVERY - Apache HTTP on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.504 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [20:31:30] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.709 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:30] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp3064 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 1.634 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:31] RECOVERY - Apache HTTP on mw1340 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.161 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:31] RECOVERY - PHP7 rendering on mw1324 is OK: HTTP OK: HTTP/1.1 200 OK - 75267 bytes in 8.202 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:31] RECOVERY - Nginx local proxy to apache on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.231 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:32] RECOVERY - PHP7 rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 8.812 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:32] RECOVERY - Apache HTTP on mw1255 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.994 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:34] RECOVERY - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15057 bytes in 9.499 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:31:34] RECOVERY - Apache HTTP on mw1326 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.571 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:34] RECOVERY - Apache HTTP on mw1339 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.943 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:34] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain} [20:31:35] ections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was [20:31:35] n}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:31:35] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not [20:31:35] xistent 
title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [20:31:36] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:31:38] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15036 bytes in 9.619 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:31:38] RECOVERY - Apache HTTP on mw1247 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.289 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:38] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:38] RECOVERY - PHP7 rendering on mw1345 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 8.529 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:38] RECOVERY - PHP7 rendering on mw1274 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 8.219 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:38] RECOVERY - Nginx local proxy to apache on mw1320 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.656 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:42] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:43] marxarelli: yeah a fast rollback command would be nice [20:31:44] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/availability (Retrieve feed content availability from \wikipedia.org\) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve featured article info for [20:31:44] (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a response was received: /{domain}/v1/page/news (get In the News content for unsupported language (with aggregated=true)) timed out before a response was received: / (spec from root) timed out b [20:31:44] was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before [20:31:44] eceived: /{domain}/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [20:31:46] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3064 
is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:46] RECOVERY - PHP7 rendering on mw1239 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 5.071 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:46] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:46] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.240 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:48] RECOVERY - Nginx local proxy to apache on mw1249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.562 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:48] RECOVERY - Nginx local proxy to apache on mw1247 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.940 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:48] RECOVERY - PHP7 rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 7.723 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:48] RECOVERY - Nginx local proxy to apache on mw1251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.911 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:49] that recovery sure was quick [20:31:50] RECOVERY - Apache HTTP on mw1297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.981 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:50] RECOVERY - Apache HTTP on mw1253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.593 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:52] RECOVERY - Apache HTTP on mw1242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.622 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:54] RECOVERY - PHP7 rendering on mw1256 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 4.186 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:55] RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.451 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:55] RECOVERY - PHP7 rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 7.830 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:55] RECOVERY - PHP7 rendering on mw1339 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 8.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:55] RECOVERY - Apache HTTP on mw1320 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.110 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:55] RECOVERY - Apache HTTP on mw1327 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.240 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:55] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3052 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:31:56] RECOVERY - PHP7 rendering on mw1223 is OK: HTTP 
OK: HTTP/1.1 200 OK - 75328 bytes in 8.717 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:56] RECOVERY - Nginx local proxy to apache on mw1268 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:56] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.068 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:56] RECOVERY - Apache HTTP on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.182 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:57] RECOVERY - PHP7 rendering on mw1258 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 6.816 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:58] RECOVERY - LVS HTTP IPv4 #page on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 14502 bytes in 9.632 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:31:58] RECOVERY - Nginx local proxy to apache on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.743 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:58] RECOVERY - PHP7 rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 7.086 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:59] that is _definitely_ a broken branch :-/ [20:31:59] RECOVERY - Apache HTTP on mw1248 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 3.477 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:31:59] RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.930 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:01] RECOVERY - Apache HTTP on mw1324 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 5.927 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:01] RECOVERY - Apache HTTP on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.612 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:01] RECOVERY - Apache HTTP on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.780 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:01] RECOVERY - PHP7 rendering on mw1320 is OK: HTTP OK: HTTP/1.1 200 OK - 75335 bytes in 7.137 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:32:02] RECOVERY - Nginx local proxy to apache on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.278 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:02] RECOVERY - Apache HTTP on mw1245 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.346 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:03] RECOVERY - Nginx local proxy to apache on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.962 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:03] RECOVERY - Apache HTTP on mw1249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.293 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:04] 
RECOVERY - PHP7 rendering on mw1246 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 3.847 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:32:04] RECOVERY - PHP7 rendering on mw1240 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 3.896 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:32:05] RECOVERY - PHP7 rendering on mw1254 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 4.156 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:32:05] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.219 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:06] RECOVERY - Nginx local proxy to apache on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.562 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:06] RECOVERY - PHP7 rendering on mw1252 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 5.916 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:32:07] RECOVERY - Nginx local proxy to apache on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 5.922 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:07] RECOVERY - PHP7 rendering on mw1329 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 6.361 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:32:08] RECOVERY - Apache HTTP on mw1251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 5.416 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:08] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3062 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:32:09] RECOVERY - Nginx local proxy to apache on mw1342 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.890 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:09] RECOVERY - Apache HTTP on mw1231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.872 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:10] RECOVERY - PHP7 rendering on mw1287 is OK: HTTP OK: HTTP/1.1 200 OK - 75328 bytes in 7.007 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:32:10] RECOVERY - Apache HTTP on mw1271 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.143 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:11] RECOVERY - Nginx local proxy to apache on mw1339 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.587 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:11] RECOVERY - Nginx local proxy to apache on mw1274 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.742 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:28] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: (no justification provided) [20:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:42] sorry everyone, the error logs were perfectly clean on group1 so I have no idea how that failed so spectacularly on group2 [20:33:16] Yeah. 
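
A minimal sketch, assuming only what is visible in this log, of the "rollback-wikiversions that always does --force" helper floated above: it simply wraps the scap sync-wikiversions --force invocation the deployer ran by hand. The wrapper name, the Python packaging, and the default justification message are illustrative assumptions, not an existing scap subcommand.

#!/usr/bin/env python3
# Hypothetical "rollback-wikiversions" wrapper (illustrative only, not a real scap
# subcommand). It shells out to the same command used manually above, always passing
# --force so a rollback is not blocked by failing canary endpoint checks. The trailing
# argument is the justification message that ends up in the SAL entries seen in this log.
import subprocess
import sys

def rollback_wikiversions(message):
    cmd = ["scap", "sync-wikiversions", "--force", message]
    return subprocess.call(cmd)  # returns scap's exit code

if __name__ == "__main__":
    msg = " ".join(sys.argv[1:]) or "rollback: revert wikiversions to previous branch"
    sys.exit(rollback_wikiversions(msg))

The --force flag is the whole point here: as seen at 20:30:07, a plain sync aborts when the canary endpoint checks fail, which is exactly the state a rollback is trying to recover from.
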
[20:33:44] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:33:44] RECOVERY - graphoid endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:33:45] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:33:45] RECOVERY - Nginx local proxy to apache on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:33:46] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:33:50] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:33:52] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3052 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:33:52] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:33:52] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:33:54] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:33:58] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:33:58] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:33:58] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:33:58] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:02] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 71.93 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:34:04] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3064 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:10] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time 
https://wikitech.wikimedia.org/wiki/Varnish [20:34:18] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:18] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:34:24] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3064 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:24] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:26] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [20:34:26] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3064 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:28] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:34:30] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:32] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:34:36] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:38] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:44] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:34:46] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:34:53] (03PS2) 10EBernhardson: Give NS_HELP same weight as NS_MAIN in search on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570712 [20:35:00] RECOVERY - ATS TLS has reduced HTTP availability #page on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [20:35:00] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:35:04] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:35:18] RECOVERY - phpfpm_up reduced availability on icinga1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:35:28] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:35:30] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [20:35:32] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:35:34] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:36:12] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.8875 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:37:06] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:37:50] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:38:42] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:42:50] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:52] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:43] !log restart restbase on restbase1027 [20:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:34] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2316.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [20:48:49] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [20:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:16] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikifeeds_8889: Servers kubernetes1001.eqiad.wmnet, kubernetes1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:49:22] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:49:32] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /{domain}/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /{domain}/v1/feed/availability (Retrieve feed content availability from \wikipedia.org\) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: / [20:49:32] /image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a response was received: /{domain}/v1/page/news (get In the News content for unsupported language (with aggregated=true)) timed out [20:49:32] e was received: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve featured article info for unsupported site (with aggregated=true)) timed out [20:49:32] was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve most-read articles for date with no data (with [20:49:32] ) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [20:51:06] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:52:00] !log restart all wikifeeds pods [20:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:20] RECOVERY - 
restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:56] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:52:56] RECOVERY - LVS HTTP IPv4 on wikifeeds.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 945 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:53:00] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [20:53:01] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:01] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:01] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:01] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:53:01] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:02] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:02] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:02] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:03] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:03] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:04] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:04] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:05] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:05] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:06] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:53:46] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:55:45] 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Release-Engineering-Team-TODO, and 2 others: Horizon hiera UI: investigate data type handling - https://phabricator.wikimedia.org/T243422 (10Andrew) I've confirmed that the behavior with the yaml-based UI is correct. For the guided interface, st... 
[20:59:16] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [21:03:23] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [21:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:40] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:23] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2316.codfw.wmnet'] ` and were **ALL** successful. [21:12:43] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2317.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [21:17:29] (03PS1) 10Alexandros Kosiaris: wikifeeds: Redefine CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/570726 [21:18:27] (03PS2) 10Alexandros Kosiaris: wikifeeds: Redefine CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/570726 (https://phabricator.wikimedia.org/T244535) [21:22:46] (03CR) 10Jforrester: wikifeeds: Redefine CPU limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/570726 (https://phabricator.wikimedia.org/T244535) (owner: 10Alexandros Kosiaris) [21:23:09] 10Operations, 10Scap, 10serviceops-radar, 10User-brennen, 10User-jijiki: Introduce state to Scap - https://phabricator.wikimedia.org/T209881 (10brennen) [21:27:46] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [21:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:03] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:56] Is the train moving tonight or blocked? Not seen anything on phab [21:34:45] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2317.codfw.wmnet'] ` and were **ALL** successful. [21:35:13] (03CR) 10Bstorm: [C: 03+1] "It definitely *looks* like it would work. It may even be useful for end users. Not merging to wait for test results :)" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570702 (https://phabricator.wikimedia.org/T244293) (owner: 10BryanDavis) [21:37:02] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2318.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [21:38:31] twentyafterfour: see my Q above [21:39:18] RhinosF1: blocked due to a massive outage after deploying wmf.18. The root cause is still under investigation [21:40:15] twentyafterfour: is there going to be a phab update on the task? Cause it’s looking like we’ll miss tonight’s window [21:40:33] !log train blocked due to serious incident related to deploying the latest branch. 
Incident documentation: https://wikitech.wikimedia.org/wiki/Incident_documentation/20200206-mediawiki refs T233866 [21:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:37] T233866: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 [21:42:21] Thanks twentyafterfour [21:43:10] (03PS4) 10Clarakosi: Add restbase202[123] to hiera [puppet] - 10https://gerrit.wikimedia.org/r/570094 (https://phabricator.wikimedia.org/T244178) [21:52:03] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [21:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:19] 10Operations, 10Traffic: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (10Vgutierrez) [21:52:59] 10Operations, 10Traffic: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (10Vgutierrez) p:05Triage→03High [21:54:17] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:00] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2318.codfw.wmnet'] ` and were **ALL** successful. [21:58:59] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2319.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [22:03:24] 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Release-Engineering-Team-TODO, and 2 others: Horizon hiera UI: investigate data type handling - https://phabricator.wikimedia.org/T243422 (10Andrew) This is happening because yaml.safe_dump() (and yaml.dump()) does some weird arbitrary quoting of... [22:08:37] (03PS1) 10Jforrester: Don't trying to assign $wgLogos to $wgLogoHD if it's unset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570729 (https://phabricator.wikimedia.org/T244405) [22:08:46] (03CR) 10Dzahn: [C: 03+1] Give NS_HELP same weight as NS_MAIN in search on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570712 (owner: 10EBernhardson) [22:10:07] (03CR) 10Dzahn: [C: 03+1] "the difference is normal since a) mediawiki-testers user group gets added b) scap sql scripts get removed intentionally" [puppet] - 10https://gerrit.wikimedia.org/r/570405 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [22:11:27] twentyafterfour: that’s window over so train resumes Monday assuming unblocked right? [22:12:01] (03CR) 10Dzahn: [C: 03+2] site: define 2 codfw appservers as canary_appservers [puppet] - 10https://gerrit.wikimedia.org/r/570405 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [22:12:18] RhinosF1: yes I believe you are correct, unless a miraculous discovery determines the root cause of the issues somewhat soon [22:12:57] twentyafterfour: cool. 
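
For context on the yaml.safe_dump() quoting Andrew describes at 22:03 (T243422): a minimal standalone demonstration of that PyYAML behaviour, assuming PyYAML is installed; this is illustrative only and not the Horizon/hiera code. PyYAML quotes strings that a plain YAML scalar would otherwise reload as another type (booleans, numbers) and leaves other strings unquoted, which looks arbitrary when the dumped text is shown back to users.

```python
# Standalone demonstration of PyYAML's context-dependent quoting (cf. T243422).
# Not the Horizon/hiera code itself, just the library behaviour mentioned above.
import yaml

data = {
    "plain": "some text",   # an ordinary string: dumped without quotes
    "looks_bool": "yes",    # a plain `yes` would reload as a YAML 1.1 boolean...
    "looks_int": "123",     # ...and a plain `123` as an integer, so both get quoted
    "real_bool": True,
    "real_int": 123,
}

print(yaml.safe_dump(data, default_flow_style=False))
# looks_bool: 'yes'      <- quoted to keep it a string
# looks_int: '123'       <- quoted to keep it a string
# plain: some text       <- no quotes added
# real_bool: true
# real_int: 123
```
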
[22:13:40] !log turning mw2271 and mw2163 into canary appservers for codfw, this adds mediawiki-testers shell users and removes scap sql scripts, rest stays as is (T242606) [22:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:44] T242606: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 [22:15:04] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [22:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:50] (03CR) 10Dzahn: "things done by puppet:" [puppet] - 10https://gerrit.wikimedia.org/r/570405 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [22:18:55] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:30] 10Operations, 10serviceops, 10Patch-For-Review: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10Dzahn) [22:23:03] 10Operations, 10serviceops, 10Patch-For-Review: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10Dzahn) mw2163 and mw2271 have been turned into canary appservers now. As opposed to canary API appservers this means actual puppet changes which are: - mediawiki-testers shell access... [22:23:14] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2320.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [22:23:37] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2319.codfw.wmnet'] ` and were **ALL** successful. [22:24:02] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2321.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [22:24:05] 10Operations, 10serviceops, 10Patch-For-Review: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10Dzahn) @jijiki What do you think ? Is this good now? 4 of each type and in different rows/racks. [22:27:10] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) @Muehlenhoff Added them with private IPs, created VMs with the cookbook, then attempted OS install but on both of them it failed at the very end with GRUB install. I don't think this happen... 
[22:38:15] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [22:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:04] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [22:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:33] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:46] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:48] (03PS1) 10Bstorm: kubernetes: resource requests should be proportional to limits [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570734 (https://phabricator.wikimedia.org/T244289) [22:46:21] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2320.codfw.wmnet'] ` and were **ALL** successful. [22:47:10] I'm live on mwdebug1001. [22:47:28] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2321.codfw.wmnet'] ` and were **ALL** successful. [22:50:37] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.18/extensions/VisualEditor: T242184 Change tags method so anon edits will go through (duration: 01m 08s) [22:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:42] T242184: Create a change tag for edits made using DiscussionTools - https://phabricator.wikimedia.org/T242184 [22:51:37] (03CR) 10Jforrester: [C: 03+2] Don't trying to assign $wgLogos to $wgLogoHD if it's unset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570729 (https://phabricator.wikimedia.org/T244405) (owner: 10Jforrester) [22:52:52] (03Merged) 10jenkins-bot: Don't trying to assign $wgLogos to $wgLogoHD if it's unset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570729 (https://phabricator.wikimedia.org/T244405) (owner: 10Jforrester) [22:58:08] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2322.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [22:58:12] (03PS1) 10Dzahn: httpd: add x-request-id to apache httpd logs [puppet] - 10https://gerrit.wikimedia.org/r/570735 (https://phabricator.wikimedia.org/T244545) [22:58:25] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2323.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[22:58:30] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2323.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2323.codfw.wmnet'] ` [23:02:42] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2320.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [23:03:16] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2320.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2320.codfw.wmnet'] ` [23:04:38] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2323.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [23:05:17] (03PS3) 10Dzahn: switch webproxy CNAMEs to new install servers [dns] - 10https://gerrit.wikimedia.org/r/569680 (https://phabricator.wikimedia.org/T224576) [23:08:20] (03CR) 10Dzahn: switch apt.wikimedia.org from install1002 to install1003 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/569682 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [23:10:04] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T244405 Don't trying to assign to if it's unset (duration: 01m 07s) [23:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:08] T244405: PHP Notice: Undefined offset: 1 in ResourceLoaderSkinModule.php - https://phabricator.wikimedia.org/T244405 [23:10:26] (03CR) 10Dzahn: [C: 04-2] "needed instead from old servers to new separate VM for apt repo" [puppet] - 10https://gerrit.wikimedia.org/r/569691 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [23:11:13] (03PS1) 10BBlack: depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/570736 [23:11:39] * Reedy hands James_F a \ [23:13:11] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [23:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:23] Oh, oops, yeah. [23:13:28] * James_F coughs. 
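
The exchange at 23:10-23:13 (the sync log line reading "Don't trying to assign to if it's unset", Reedy handing James_F a backslash, James_F coughing) most likely comes down to shell expansion: an unescaped $wgLogos or $wgLogoHD inside a double-quoted command-line message is expanded by the shell (to an empty string when unset) before scap ever logs it, so the variable names vanish. A small reproduction, assuming a bash shell is available; the message text here is illustrative, not the exact command that was run:

```python
# Reproduce how unescaped $variables disappear from a double-quoted shell message.
import subprocess

unescaped = 'echo "T244405 Don\'t try to assign $wgLogos to $wgLogoHD if it\'s unset"'
escaped = 'echo "T244405 Don\'t try to assign \\$wgLogos to \\$wgLogoHD if it\'s unset"'

print(subprocess.run(["bash", "-c", unescaped], capture_output=True, text=True).stdout)
# T244405 Don't try to assign  to  if it's unset   (variables silently expanded to nothing)
print(subprocess.run(["bash", "-c", escaped], capture_output=True, text=True).stdout)
# T244405 Don't try to assign $wgLogos to $wgLogoHD if it's unset
```
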
[23:15:27] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:39] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10Jclark-ctr) [23:17:47] (03PS3) 10Dzahn: install_server: switch tftp server in DHCP to new install servers [puppet] - 10https://gerrit.wikimedia.org/r/569684 (https://phabricator.wikimedia.org/T224576) [23:18:09] (03CR) 10BryanDavis: [C: 03+1] "Yes, please :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570712 (owner: 10EBernhardson) [23:19:36] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [23:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:41] (03PS2) 10Jforrester: [nlwiki] Enable VisualEditor in the Project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570419 (https://phabricator.wikimedia.org/T159711) [23:19:53] (03CR) 10Jforrester: [C: 03+2] [nlwiki] Enable VisualEditor in the Project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570419 (https://phabricator.wikimedia.org/T159711) (owner: 10Jforrester) [23:20:10] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2322.codfw.wmnet'] ` and were **ALL** successful. [23:20:13] (03PS2) 10Jforrester: [cswikisource] Enable VisualEditor in the 'Edice' (102) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570706 (https://phabricator.wikimedia.org/T244133) [23:20:18] (03CR) 10Jforrester: [C: 03+2] [cswikisource] Enable VisualEditor in the 'Edice' (102) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570706 (https://phabricator.wikimedia.org/T244133) (owner: 10Jforrester) [23:20:33] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2324.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[23:20:57] (03Merged) 10jenkins-bot: [nlwiki] Enable VisualEditor in the Project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570419 (https://phabricator.wikimedia.org/T159711) (owner: 10Jforrester) [23:21:47] (03Merged) 10jenkins-bot: [cswikisource] Enable VisualEditor in the 'Edice' (102) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570706 (https://phabricator.wikimedia.org/T244133) (owner: 10Jforrester) [23:21:53] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:42] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T159711 T161365 T164435 [nlwiki] Enable VisualEditor in the Project namespace (duration: 01m 08s) [23:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:46] T164435: Enable visual editor for the Wikipedia namespace on nl.wikipedia.org - https://phabricator.wikimedia.org/T164435 [23:22:47] T161365: Enable VisualEditor by default for all users of the Dutch Wikipedia - https://phabricator.wikimedia.org/T161365 [23:22:47] T159711: Enable visual editor on Dutch Wikipedia in User namespace and possibly Wikipedia namespace - https://phabricator.wikimedia.org/T159711 [23:25:33] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T244133 [cswikisource] Enable VisualEditor in the Edice namespace (duration: 01m 07s) [23:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:36] T244133: Enable VIsualEditor for ns:102 on cs.wikisource - https://phabricator.wikimedia.org/T244133 [23:26:37] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 37 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:26:37] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2323.codfw.wmnet'] ` and were **ALL** successful. [23:27:35] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2325.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [23:31:57] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:35:37] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [23:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:56] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:41] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2324.codfw.wmnet'] ` and were **ALL** successful. 
[23:42:36] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [23:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:53] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:23] (03CR) 10BryanDavis: [C: 03+2] kubernetes: resource requests should be proportional to limits (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570734 (https://phabricator.wikimedia.org/T244289) (owner: 10Bstorm) [23:48:06] (03Merged) 10jenkins-bot: kubernetes: resource requests should be proportional to limits [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570734 (https://phabricator.wikimedia.org/T244289) (owner: 10Bstorm) [23:48:35] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2325.codfw.wmnet'] ` and were **ALL** successful. [23:49:22] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:53:57] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2326.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [23:54:20] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2327.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [23:54:34] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:56:44] PROBLEM - Host mw2327 is DOWN: PING CRITICAL - Packet loss = 100%
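
Regarding the tools-webservice change merged at 23:48 ("kubernetes: resource requests should be proportional to limits", T244289): the commit subject suggests deriving each container's Kubernetes resource requests as a fixed fraction of its configured limits rather than hard-coding them. A rough sketch of that idea follows; the ratio, helper names and unit handling are assumptions for illustration and not the actual patch.

```python
# Illustrative sketch only: derive Kubernetes resource *requests* as a fixed
# fraction of the configured *limits* (the idea named in the commit subject).

REQUEST_RATIO = 0.5  # hypothetical fraction of the limit to request up front


def parse_cpu(value: str) -> float:
    """Return CPU in cores, accepting plain cores ('1') or millicores ('500m')."""
    return float(value[:-1]) / 1000 if value.endswith("m") else float(value)


def parse_memory(value: str) -> int:
    """Return memory in bytes for a small subset of Kubernetes suffixes."""
    units = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}
    for suffix, factor in units.items():
        if value.endswith(suffix):
            return int(float(value[: -len(suffix)]) * factor)
    return int(value)


def requests_for(limits: dict) -> dict:
    """Compute proportional requests for a {'cpu': ..., 'memory': ...} limits dict."""
    cpu_millicores = int(parse_cpu(limits["cpu"]) * REQUEST_RATIO * 1000)
    memory_mib = int(parse_memory(limits["memory"]) * REQUEST_RATIO / 1024 ** 2)
    return {"cpu": f"{cpu_millicores}m", "memory": f"{memory_mib}Mi"}


if __name__ == "__main__":
    print(requests_for({"cpu": "500m", "memory": "2Gi"}))
    # {'cpu': '250m', 'memory': '1024Mi'}
```
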