[01:51:29] <icinga-wm>	 PROBLEM - Host cp5010.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:37] <icinga-wm>	 PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:39] <icinga-wm>	 PROBLEM - Host cp5006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:39] <icinga-wm>	 PROBLEM - Host mr1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:39] <icinga-wm>	 PROBLEM - Host cp5009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:39] <icinga-wm>	 PROBLEM - Host cp5005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:39] <icinga-wm>	 PROBLEM - Host cp5004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:39] <icinga-wm>	 PROBLEM - Host cp5012.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:45] <icinga-wm>	 PROBLEM - Host cp5011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:49] <icinga-wm>	 PROBLEM - Host dns5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:49] <icinga-wm>	 PROBLEM - Host lvs5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:49] <icinga-wm>	 PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:49] <icinga-wm>	 PROBLEM - Host lvs5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:49] <icinga-wm>	 PROBLEM - Host dns5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:54:13] <icinga-wm>	 PROBLEM - Host cp5008.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:54:13] <icinga-wm>	 PROBLEM - Host cp5007.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:54:15] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:54:17] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:54:47] <icinga-wm>	 PROBLEM - Host bast5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:54:47] <icinga-wm>	 PROBLEM - Host lvs5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:56:17] <icinga-wm>	 PROBLEM - Host cp5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:56:17] <icinga-wm>	 PROBLEM - Host cp5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:56:17] <icinga-wm>	 PROBLEM - Host cp5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:49:39] <wikibugs>	 (Abandoned) Tulsi Bhagat: Add Namespaces translation for zh.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/527862 (https://phabricator.wikimedia.org/T229743) (owner: Tulsi Bhagat)
[03:16:37] <icinga-wm>	 PROBLEM - puppet last run on mc1033 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:44:41] <icinga-wm>	 RECOVERY - puppet last run on mc1033 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:50:26] <wikibugs>	 Operations, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (CDanis)
[03:51:24] <icinga-wm>	 ACKNOWLEDGEMENT - SSH cp5001.mgmt on cp5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:24] <icinga-wm>	 ACKNOWLEDGEMENT - SSH cp5011.mgmt on cp5011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:24] <icinga-wm>	 ACKNOWLEDGEMENT - SSH dns5002.mgmt on dns5002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:24] <icinga-wm>	 ACKNOWLEDGEMENT - SSH bast5001.mgmt on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:24] <icinga-wm>	 ACKNOWLEDGEMENT - SSH cp5006.mgmt on cp5006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:25] <icinga-wm>	 ACKNOWLEDGEMENT - SSH cp5005.mgmt on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:25] <icinga-wm>	 ACKNOWLEDGEMENT - SSH cp5012.mgmt on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:26] <icinga-wm>	 ACKNOWLEDGEMENT - SSH cp5009.mgmt on cp5009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:26] <icinga-wm>	 ACKNOWLEDGEMENT - SSH lvs5003.mgmt on lvs5003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:27] <icinga-wm>	 ACKNOWLEDGEMENT - SSH lvs5001.mgmt on lvs5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:27] <icinga-wm>	 ACKNOWLEDGEMENT - SSH lvs5002.mgmt on lvs5002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:28] <icinga-wm>	 ACKNOWLEDGEMENT - SSH cp5008.mgmt on cp5008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:28] <icinga-wm>	 ACKNOWLEDGEMENT - SSH cp5003.mgmt on cp5003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:29] <icinga-wm>	 ACKNOWLEDGEMENT - SSH cp5007.mgmt on cp5007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:29] <icinga-wm>	 ACKNOWLEDGEMENT - SSH cp5010.mgmt on cp5010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:30] <icinga-wm>	 ACKNOWLEDGEMENT - SSH cp5004.mgmt on cp5004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:30] <icinga-wm>	 ACKNOWLEDGEMENT - SSH cp5002.mgmt on cp5002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:31] <icinga-wm>	 ACKNOWLEDGEMENT - SSH dns5001.mgmt on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:31] <icinga-wm>	 ACKNOWLEDGEMENT - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:51:32] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:51:32] <icinga-wm>	 ACKNOWLEDGEMENT - OSPF status on cr1-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:51:33] <icinga-wm>	 ACKNOWLEDGEMENT - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[03:51:33] <icinga-wm>	 ACKNOWLEDGEMENT - Juniper alarms on asw1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.132.128.4 CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[03:58:33] <wikibugs>	 (PS3) Viztor: Update HD logo for en.ws and mul.ws [mediawiki-config] - https://gerrit.wikimedia.org/r/527922 (https://phabricator.wikimedia.org/T229769)
[03:59:06] <wikibugs>	 Operations, ops-eqsin: remote hands setups for ganeti500[123] - https://phabricator.wikimedia.org/T229243 (wiki_willy) Completed by Jin from DreamICC today.  The missing IPV4 IP addresses used are the following, with the gateway set to 10.132.129.1 accordingly (instead of 10.132.128.1):  ganeti5001 10.13...
[04:00:03] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:00:21] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:00:39] <wikibugs>	 Operations, ops-eqsin: msw1-eqsin/msw2-eqsin missing serial number - https://phabricator.wikimedia.org/T227911 (wiki_willy) Info gathered by Jin from DreamICC today.  Here's the info below (also sent out via email):  msw1-eqsin (msw-0603) WMF7189 2W026C5B012A2 msw2-eqsin (msw-0604) WMF7190 2W026C5E012B3...
[04:02:52] <wikibugs>	 Operations, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (CDanis) @wiki_willy Any chance this is somehow related to {T229243}?
[04:07:44] <wikibugs>	 (PS4) Viztor: Update HD logo for en.ws and mul.ws [mediawiki-config] - https://gerrit.wikimedia.org/r/527922 (https://phabricator.wikimedia.org/T229769)
[04:09:02] <wikibugs>	 Operations, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (wiki_willy) @CDanis - I just checked with our 3rd party contractor and he says it shouldn't have been affected from the work he was doing.  Although, he was working in the racks from 1:45-4:00 UTC, and If it...
[04:09:35] <wikibugs>	 Operations, ops-eqsin: update PDUs for eqsin (asset tag and other info) - https://phabricator.wikimedia.org/T211368 (wiki_willy) Asset tags applied by Jin from DreamICC today as follows (also emailed out via a spreadsheet):  ps1-603-eqsin WMF7196 ps2-603-eqsin WMF7197 ps1-604-eqsin WMF7198 ps2-604-eqsin...
[04:14:26] <wikibugs>	 Operations, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (CDanis) Still alerting, unfortunately.
[04:17:31] <icinga-wm>	 PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[04:19:01] <icinga-wm>	 RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 81922 bytes in 1.007 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[04:22:37] <wikibugs>	 Operations, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (wiki_willy) Alright, I'm asking him to go back to the datacenter to check all the connections on mr1-eqsin.
[04:24:17] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:24:35] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:37:18] <wikibugs>	 (PS5) Viztor: Update HD logo for en.ws and mul.ws [mediawiki-config] - https://gerrit.wikimedia.org/r/527922 (https://phabricator.wikimedia.org/T229769)
[04:38:15] <icinga-wm>	 RECOVERY - PHP opcache health on mwdebug2001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[04:52:50] <wikibugs>	 Operations, ops-codfw, DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (Marostegui) Open→Resolved No OS or idrac errors since the memory was replaced, so I am closing this as resolved. If it happens again, I will re-open  Thanks @Papaul!
[05:23:15] <icinga-wm>	 RECOVERY - Host cp5005.mgmt is UP: PING OK - Packet loss = 16%, RTA = 231.87 ms
[05:23:17] <icinga-wm>	 RECOVERY - Host cp5010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.23 ms
[05:23:21] <icinga-wm>	 RECOVERY - Host dns5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 249.35 ms
[05:23:27] <icinga-wm>	 RECOVERY - Host mr1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 231.89 ms
[05:23:27] <icinga-wm>	 RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 232.13 ms
[05:23:29] <icinga-wm>	 RECOVERY - Host cp5009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.04 ms
[05:23:29] <icinga-wm>	 RECOVERY - Host cp5006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.18 ms
[05:23:29] <icinga-wm>	 RECOVERY - Host cp5012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.09 ms
[05:23:29] <icinga-wm>	 RECOVERY - Host cp5004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.11 ms
[05:23:43] <icinga-wm>	 RECOVERY - Host lvs5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.10 ms
[05:23:43] <icinga-wm>	 RECOVERY - Host lvs5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.22 ms
[05:23:43] <icinga-wm>	 RECOVERY - Host dns5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.82 ms
[05:23:43] <icinga-wm>	 RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 233.61 ms
[05:23:43] <icinga-wm>	 RECOVERY - Host cp5011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 316.17 ms
[05:24:11] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:24:11] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:24:13] <icinga-wm>	 RECOVERY - Host lvs5003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.79 ms
[05:24:39] <icinga-wm>	 RECOVERY - Host cp5003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.94 ms
[05:25:05] <wikibugs>	 (PS1) Marostegui: db2124: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/527949 (https://phabricator.wikimedia.org/T228969)
[05:25:05] <marostegui>	 wooot
[05:25:16] <wikibugs>	 (PS1) Marostegui: db-eqiad,db-codfw.php: Pool db2124 into s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527950 (https://phabricator.wikimedia.org/T228969)
[05:25:43] <icinga-wm>	 RECOVERY - Host cp5007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.84 ms
[05:25:59] <icinga-wm>	 RECOVERY - Host cp5008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.85 ms
[05:26:34] <wikibugs>	 (CR) Vgutierrez: [C: +1] db-eqiad,db-codfw.php: Pool db2124 into s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527950 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[05:26:39] <icinga-wm>	 RECOVERY - Host bast5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.84 ms
[05:27:38] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Pool db2124 into s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527950 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[05:28:05] <icinga-wm>	 RECOVERY - Host cp5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.86 ms
[05:28:05] <icinga-wm>	 RECOVERY - Host cp5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.74 ms
[05:28:30] <wikibugs>	 (Merged) jenkins-bot: db-eqiad,db-codfw.php: Pool db2124 into s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527950 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[05:28:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit of MediaWiki config (dc=all), diff saved to 'https://phabricator.wikimedia.org/P8858', previous config saved to /var/cache/conftool/dbconfig/20190805-052839-marostegui.json
[05:28:47] <wikibugs>	 (CR) jenkins-bot: db-eqiad,db-codfw.php: Pool db2124 into s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527950 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[05:28:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:57] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Provision db2124 into s6 T228969 (duration: 00m 49s)
[05:30:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:05] <stashbot>	 T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969
[05:30:07] <wikibugs>	 (CR) Marostegui: [C: +2] db2124: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/527949 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[05:30:49] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Provision db2124 into s6 T228969 (duration: 00m 46s)
[05:30:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:08] <wikibugs>	 Operations, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (wiki_willy) a:wiki_willy
[05:37:28] <wikibugs>	 Operations, ops-eqsin, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (wiki_willy) Open→Resolved Cable between mr1-eqsin p4 <---> asw-0603-eqsin p23 looks like it accidentally got bumped by the contractor during the server install.  Called him back and he...
[05:37:47] <wikibugs>	 Operations, ops-eqsin, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (Marostegui) We just got all the recoveries: ` [07:23:15]  <+icinga-wm> RECOVERY - Host cp5005.mgmt is UP: PING OK - Packet loss = 16%, RTA = 231.87 ms [07:23:17]  <+icinga-wm> RECOVERY - Host c...
[05:38:29] <wikibugs>	 Operations, ops-eqsin, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (Marostegui) Ha! @wiki_willy was faster!
[05:40:14] <wikibugs>	 Operations, ops-eqsin, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (wiki_willy) @Marostegui - Ha, we tied. =)
[05:58:14] <marostegui>	 !log Update rack column on zarcillo.servers for the new servers T229683
[05:58:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:22] <stashbot>	 T229683: Update rack information on zarcillo.servers  - https://phabricator.wikimedia.org/T229683
[06:32:27] <wikibugs>	 Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[06:38:44] <wikibugs>	 (PS1) Elukey: profile::analytics::refinery::job::test::refine: fix refine regex [puppet] - https://gerrit.wikimedia.org/r/527979 (https://phabricator.wikimedia.org/T226698)
[06:40:09] <wikibugs>	 (CR) Elukey: [C: +2] profile::analytics::refinery::job::test::refine: fix refine regex [puppet] - https://gerrit.wikimedia.org/r/527979 (https://phabricator.wikimedia.org/T226698) (owner: Elukey)
[06:53:38] <wikibugs>	 Operations, cloud-services-team: Migrate remaining cloudvirt hosts to Stretch/Mitaka - https://phabricator.wikimedia.org/T224561 (MoritzMuehlenhoff)
[06:55:38] <wikibugs>	 Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[07:05:35] <wikibugs>	 Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[07:11:01] <wikibugs>	 Operations, serviceops, Performance-Team (Radar), User-Elukey: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (elukey) We deployed all the changes for T225642, so async settings for codfw replication was not the culprit.  After T225059 we have per-shard...
[07:15:29] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/527430 (https://phabricator.wikimedia.org/T204450) (owner: Giuseppe Lavagetto)
[07:20:28] <wikibugs>	 (PS1) Vgutierrez: Release fifo-log-demux 0.5 [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527980
[07:21:32] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good (and needs SRE meeting approval)" [puppet] - https://gerrit.wikimedia.org/r/525847 (https://phabricator.wikimedia.org/T229124) (owner: RobH)
[07:23:14] <marostegui>	 !log Move db2095:3312 from db2063 to db2126 - T228969
[07:23:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:24] <stashbot>	 T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969
[07:35:49] <wikibugs>	 Operations, serviceops: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (MoritzMuehlenhoff) The patch seems sane, but I'm wondering whether we actually need to pursue this further? tmpreaper is dead upstream (the Debian maintainer keeps it alive a little for security fixes,...
[07:37:22] <wikibugs>	 (CR) Muehlenhoff: "Ack, I'll update the commit message and merge later. The one known corner case has been fixed." [puppet] - https://gerrit.wikimedia.org/r/525220 (https://phabricator.wikimedia.org/T46722) (owner: Muehlenhoff)
[07:43:17] <wikibugs>	 (PS1) Marostegui: db-codfw.php: Reorganize s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/528069 (https://phabricator.wikimedia.org/T228969)
[07:43:47] <moritzm>	 !log removed orespoolcounter[12]00[12] from debmonitor T227640
[07:43:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:57] <stashbot>	 T227640: Migrate ORES pool counters to Buster - https://phabricator.wikimedia.org/T227640
[07:43:59] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:44:45] <vgutierrez>	 marostegui: ^^  :?
[07:45:04] <marostegui>	 that's me yep
[07:45:08] <marostegui>	 I am preparing a big commit :)
[07:45:48] <moritzm>	 !log installing unzip regression DLA for jessie
[07:45:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:53] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:47:03] <marostegui>	 ^ Almost done
[07:49:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit of MediaWiki config (dc=all), diff saved to 'https://phabricator.wikimedia.org/P8859', previous config saved to /var/cache/conftool/dbconfig/20190805-074930-marostegui.json
[07:49:33] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:49:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:55] <wikibugs>	 (CR) Marostegui: [C: +2] db-codfw.php: Reorganize s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/528069 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:51:59] <wikibugs>	 (Merged) jenkins-bot: db-codfw.php: Reorganize s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/528069 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:52:14] <wikibugs>	 (CR) jenkins-bot: db-codfw.php: Reorganize s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/528069 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:52:31] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:52:38] <logmsgbot>	 !log marostegui@deploy1001 sync-file aborted: Reorganize s2 T228969 (duration: 00m 06s)
[07:52:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:46] <stashbot>	 T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969
[07:53:20] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Reorganize s2 T228969 (duration: 00m 48s)
[07:53:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:49] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Reorganize s2 T228969 (duration: 00m 47s)
[07:53:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:18] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db2107 to codfw s2 master [puppet] - https://gerrit.wikimedia.org/r/528070 (https://phabricator.wikimedia.org/T220170)
[08:09:09] <wikibugs>	 Operations, Goal: SRE firefighting improvements - 2019-20 Q1 Goal - https://phabricator.wikimedia.org/T229782 (Volans)
[08:09:17] <wikibugs>	 Operations, Goal: SRE firefighting improvements - 2019-20 Q1 Goal - https://phabricator.wikimedia.org/T229782 (Volans) p:Triage→Normal
[08:11:20] <wikibugs>	 (PS1) Elukey: Enable spark.authenticate in yarn-site.xml on the Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/528071 (https://phabricator.wikimedia.org/T226698)
[08:11:52] <wikibugs>	 Operations, serviceops: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (elukey) >>! In T151304#5391310, @MoritzMuehlenhoff wrote: > The patch seems sane, but I'm wondering whether we actually need to pursue this further? tmpreaper is dead upstream (the Debian maintainer kee...
[08:12:12] <wikibugs>	 (CR) Elukey: [C: +2] Enable spark.authenticate in yarn-site.xml on the Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/528071 (https://phabricator.wikimedia.org/T226698) (owner: Elukey)
[08:19:33] <wikibugs>	 (PS2) Marostegui: mariadb: Promote db2107 to codfw s2 master [puppet] - https://gerrit.wikimedia.org/r/528070 (https://phabricator.wikimedia.org/T220170)
[08:21:27] <marostegui>	 !log Switchover s2 codfw master from db2035 to db2107 - T221533 T220170
[08:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:37] <stashbot>	 T220170: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170
[08:21:37] <stashbot>	 T221533: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533
[08:27:53] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db2107 to codfw s2 master [puppet] - https://gerrit.wikimedia.org/r/528070 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[08:32:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit of MediaWiki config (dc=all), diff saved to 'https://phabricator.wikimedia.org/P8861', previous config saved to /var/cache/conftool/dbconfig/20190805-083254-marostegui.json
[08:33:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:13] <wikibugs>	 (PS2) Alexandros Kosiaris: restrouter: Fix typo in suffixes in admin [deployment-charts] - https://gerrit.wikimedia.org/r/527480
[08:36:15] <wikibugs>	 (PS1) Alexandros Kosiaris: Realign restrouter limitranges [deployment-charts] - https://gerrit.wikimedia.org/r/528074
[08:39:13] <wikibugs>	 Operations, DBA, decommission: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui)
[08:39:34] <wikibugs>	 (Abandoned) Alexandros Kosiaris: restrouter: Fix typo in suffixes in admin [deployment-charts] - https://gerrit.wikimedia.org/r/527480 (owner: Alexandros Kosiaris)
[08:39:48] <wikibugs>	 Operations, DBA, decommission: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui) p:Triage→Normal
[08:40:03] <wikibugs>	 (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db2035 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/528075 (https://phabricator.wikimedia.org/T229784)
[08:40:06] <wikibugs>	 (PS2) Alexandros Kosiaris: Realign restrouter limitranges [deployment-charts] - https://gerrit.wikimedia.org/r/528074
[08:40:09] <wikibugs>	 Operations, ops-codfw, DBA, DC-Ops, and 2 others: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui)
[08:41:37] <wikibugs>	 (CR) Volans: "Question inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: CRusnov)
[08:41:57] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +2 C: +2] Realign restrouter limitranges [deployment-charts] - https://gerrit.wikimedia.org/r/528074 (owner: Alexandros Kosiaris)
[08:43:05] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' .
[08:43:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:45] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db2035 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/528075 (https://phabricator.wikimedia.org/T229784) (owner: Marostegui)
[08:44:09] <wikibugs>	 Operations, ops-codfw, DBA, DC-Ops, and 2 others: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui)
[08:44:11] <wikibugs>	 (PS4) Filippo Giunchedi: monitoring: tweak description for paging alerts [puppet] - https://gerrit.wikimedia.org/r/525536 (https://phabricator.wikimedia.org/T228878)
[08:44:39] <wikibugs>	 (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db2035 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/528075 (https://phabricator.wikimedia.org/T229784) (owner: Marostegui)
[08:45:36] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] monitoring: tweak description for paging alerts [puppet] - https://gerrit.wikimedia.org/r/525536 (https://phabricator.wikimedia.org/T228878) (owner: Filippo Giunchedi)
[08:45:48] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2035 from config T229784 (duration: 00m 47s)
[08:45:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:56] <stashbot>	 T229784: Decommission db2035 - https://phabricator.wikimedia.org/T229784
[08:46:24] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2035 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528075 (https://phabricator.wikimedia.org/T229784) (owner: 10Marostegui)
[08:46:41] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2035 from config T229784 (duration: 00m 46s)
[08:46:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:11] <wikibugs>	 10Operations, 10ops-eqiad: helium.mgmt down - https://phabricator.wikimedia.org/T229706 (10Volans) 05Open→03Resolved p:05Triage→03Normal @Dzahn have you tried to follow https://wikitech.wikimedia.org/wiki/Management_Interfaces#Troubleshooting_Commands, in particular [[ https://wikitech.wikimedia.org/wi...
[08:52:13] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (10Volans)
[08:56:02] <moritzm>	 !log installing vim security updates for jessie (stretch/buster already fixed)
[08:56:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:25] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Restrouter: Specify the correct image [deployment-charts] - 10https://gerrit.wikimedia.org/r/528077
[09:02:44] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Restrouter: Specify the correct image [deployment-charts] - 10https://gerrit.wikimedia.org/r/528077 (owner: 10Alexandros Kosiaris)
[09:02:50] <icinga-wm>	 PROBLEM - toolschecker: gridengine webservice running #page on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/gridengine - 177 bytes in 9.766 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[09:03:52] <volans>	 arturo: around?
[09:03:56] <arturo>	 yup
[09:04:15] <volans>	 let us know if we can help
[09:04:47] <icinga-wm>	 PROBLEM - toolschecker: kubernetes webservice running #page on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 177 bytes in 9.879 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[09:04:51] <arturo>	 I'm currently fighting the icinga autocompleter -_-
[09:05:03] <vgutierrez>	 so fast and useful :)
[09:05:15] <vgutierrez>	 (the icinga autocompleter)
[09:05:17] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' .
[09:05:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:48] <icinga-wm>	 ACKNOWLEDGEMENT - toolschecker: gridengine webservice running #page on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/gridengine - 177 bytes in 9.766 second response time Arturo Borrero Gonzalez investigating https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[09:05:50] <icinga-wm>	 ACKNOWLEDGEMENT - toolschecker: kubernetes webservice running #page on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 177 bytes in 9.879 second response time Arturo Borrero Gonzalez investigating https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[09:06:23] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 #page on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.84 and port 8643: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:06:45] <_joe_>	 err sorry, for the services, what's going on?
[09:07:02] <godog>	 I believe arturo's on it
[09:07:03] <arturo>	 !log downtime toolschecker for 5hours
[09:07:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:52] <arturo>	 the cloudelastic one I'm not so sure. Is that a new service that the traffic service is working on? I believe I saw brandon downtiming it the other day. Maybe the downtime period expired
[09:08:07] <gehel>	 I'll check cloudelastic
[09:08:18] <apergos>	 https://phabricator.wikimedia.org/T229621
[09:08:39] <gehel>	 No critical clients on it yet, so no need to panic on that one
[09:09:17] <vgutierrez>	 yeah.. as apergos pointed out, it's an issue with the check itself
[09:11:12] <gehel>	 onimisionipe: ^^^ I think you had a similar issue with the elastic checks already, would you have time to have a look?
[09:11:44] <wikibugs>	 (03PS1) 10Fsero: k8s: deploy calico, rbac, psp, coredns and ns via helmfile in codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/528078 (https://phabricator.wikimedia.org/T228837)
[09:12:42] <onimisionipe>	 gehel: sure!
[09:12:48] <gehel>	 onimisionipe: thanks!
[09:13:06] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (10Marostegui) @Cmjohnson are we still good for tomorrow at 14:00 UTC? I will have the host depooled and off for you before 14:00 UTC
[09:13:59] <vgutierrez>	 gehel: I ran into the same issue with ncredir service a few weeks ago
[09:14:40] <gehel>	 vgutierrez: and you replaced $ADDRESS with $HOSTNAME in the check definition?
[09:14:47] <vgutierrez>	 nope
[09:14:51] <vgutierrez>	 not at all
[09:15:02] * gehel is just guessing, I haven't checked what the actual problem is here
[09:16:49] <vgutierrez>	 so for text/upload/ncredir what we do is to configure in configuration.yaml an icinga check only on the service in port 80
[09:16:55] <vgutierrez>	 so text, upload and ncredir
[09:17:07] <vgutierrez>	 and text-https, upload-https and ncredir-https don't get icinga check configuration
[09:17:20] <wikibugs>	 10Operations, 10Maps: postgresql replication issues on maps1001 - https://phabricator.wikimedia.org/T229788 (10Gehel)
[09:18:16] <vgutierrez>	 and we have an icinga check that gets that configuration and generates two checks for ports 80 and 443
[09:18:19] <vgutierrez>	 kinda hacky
[09:18:21] <vgutierrez>	 but it works
[09:18:25] <gehel>	 vgutierrez: that implies that the different services are closely related and fail together?
[09:18:37] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' .
[09:18:43] <vgutierrez>	 well.. we do check both ports
[09:18:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:18] <gehel>	 vgutierrez: not sure I understand, let me look at the code
[09:19:24] <icinga-wm>	 PROBLEM - puppet last run on mc1031 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[vim] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:19:26] <vgutierrez>	 I'm trying to find the code
[09:21:06] <gehel>	 Oh, I see! I was assuming a different issue
[09:22:13] <vgutierrez>	 basically if you check how the text LVS service is configured
[09:22:34] <vgutierrez>	 we only provide icinga config in one of the two services (one is text and the other one text-https)
[09:22:54] <vgutierrez>	 but the puppetization of that icinga check is actually deploying two checks, one for text and another for text-https
[09:23:08] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Provision db2127 into s3 [puppet] - 10https://gerrit.wikimedia.org/r/528079 (https://phabricator.wikimedia.org/T228969)
[09:23:20] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor nitpick comment, rest LGTM" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/528078 (https://phabricator.wikimedia.org/T228837) (owner: 10Fsero)
[09:25:12] <vgutierrez>	 and the culprit doing that black magic is lvs::monitor_service_http_https
[09:27:15] <wikibugs>	 (03PS1) 10Marostegui: db2105: Change binlog format [puppet] - 10https://gerrit.wikimedia.org/r/528081 (https://phabricator.wikimedia.org/T220170)
[09:28:05] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2105: Change binlog format [puppet] - 10https://gerrit.wikimedia.org/r/528081 (https://phabricator.wikimedia.org/T220170) (owner: 10Marostegui)
[09:29:27] <icinga-wm>	 RECOVERY - toolschecker: gridengine webservice running #page on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 22.235 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[09:30:10] <marostegui>	 !log Stop MySQL on db2105 to change binlog format
[09:30:13] <wikibugs>	 10Operations, 10Discovery-Search, 10Elasticsearch, 10Traffic: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (10Vgutierrez) I've seen the same behaviour configuring the ncredir LVS service as it's using two ports (80/443). Same happens wi...
[09:30:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:23] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Provision db2127 into s3 [puppet] - 10https://gerrit.wikimedia.org/r/528079 (https://phabricator.wikimedia.org/T228969)
[09:31:54] <wikibugs>	 (03PS2) 10Filippo Giunchedi: WIP: alert on widespread puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/526431
[09:31:56] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: aggregate puppet zero resources reported [puppet] - 10https://gerrit.wikimedia.org/r/528084 (https://phabricator.wikimedia.org/T229262)
[09:32:28] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Provision db2127 into s3 [puppet] - 10https://gerrit.wikimedia.org/r/528079 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui)
[09:41:48] <wikibugs>	 (03PS2) 10Fsero: k8s: deploy calico, rbac, psp, coredns and ns via helmfile in codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/528078 (https://phabricator.wikimedia.org/T228837)
[09:43:09] <icinga-wm>	 RECOVERY - toolschecker: kubernetes webservice running #page on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 58.311 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[09:43:14] <wikibugs>	 (03CR) 10Fsero: "addressed Alex and joe comments" [deployment-charts] - 10https://gerrit.wikimedia.org/r/528078 (https://phabricator.wikimedia.org/T228837) (owner: 10Fsero)
[09:47:46] <wikibugs>	 (03PS2) 10Gergő Tisza: Allow CORS access to publichtml (people.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/522991 (https://phabricator.wikimedia.org/T224068)
[09:52:10] <icinga-wm>	 RECOVERY - puppet last run on mc1031 is OK: OK: Puppet is currently enabled, last run 7 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:53:27] <wikibugs>	 (03CR) 10Ema: [C: 03+1] Release fifo-log-demux 0.5 (031 comment) [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527980 (owner: 10Vgutierrez)
[09:55:53] <wikibugs>	 10Operations, 10serviceops, 10HHVM: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki)
[09:56:52] <wikibugs>	 10Operations, 10serviceops, 10HHVM: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki)
[09:56:55] <wikibugs>	 10Operations, 10serviceops: SRE FY19-20 Q1 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (10jijiki)
[10:01:22] <wikibugs>	 (03PS1) 10Elukey: Add more spark security options to yarn-size in Hadoop Test [puppet] - 10https://gerrit.wikimedia.org/r/528086 (https://phabricator.wikimedia.org/T226698)
[10:01:40] <wikibugs>	 10Operations, 10serviceops, 10HHVM: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki)
[10:02:52] <wikibugs>	 10Operations, 10serviceops: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (10jijiki) @elukey @MoritzMuehlenhoff I have added tmpreapers removal as an actionable in our HHVM removal process (T229792), shall we mark this as resolved or invalid?
[10:03:41] <wikibugs>	 (03PS2) 10Vgutierrez: Release fifo-log-demux 0.5 [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527980
[10:04:33] <wikibugs>	 (03CR) 10Vgutierrez: Release fifo-log-demux 0.5 (031 comment) [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527980 (owner: 10Vgutierrez)
[10:05:47] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: aggregate puppet zero resources reported [puppet] - 10https://gerrit.wikimedia.org/r/528084 (https://phabricator.wikimedia.org/T229262)
[10:05:50] <wikibugs>	 (03PS3) 10Filippo Giunchedi: WIP: alert on widespread puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/526431
[10:05:53] <wikibugs>	 (03PS1) 10Filippo Giunchedi: base: stop per-host puppet critical when master has issues [puppet] - 10https://gerrit.wikimedia.org/r/528087 (https://phabricator.wikimedia.org/T229262)
[10:06:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: aggregate puppet zero resources reported [puppet] - 10https://gerrit.wikimedia.org/r/528084 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi)
[10:06:51] <wikibugs>	 (03PS3) 10Filippo Giunchedi: prometheus: aggregate puppet zero resources reported [puppet] - 10https://gerrit.wikimedia.org/r/528084 (https://phabricator.wikimedia.org/T229262)
[10:07:31] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add more spark security options to yarn-size in Hadoop Test [puppet] - 10https://gerrit.wikimedia.org/r/528086 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[10:09:09] <wikibugs>	 (03PS4) 10Filippo Giunchedi: prometheus: aggregate puppet zero resources reported [puppet] - 10https://gerrit.wikimedia.org/r/528084 (https://phabricator.wikimedia.org/T229262)
[10:12:13] <jbond42>	 !log rolling update of openjdk on maps servers
[10:12:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:17] <wikibugs>	 10Operations, 10serviceops: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (10MoritzMuehlenhoff) Toolforge/Toollabs also uses tmpreaper (but not the puppetised version with the tmpreaper Puppet class). I'm adding @Andrew and @aborrero for comments whether we should keep it open f...
[10:20:23] <wikibugs>	 (03CR) 10Ema: [C: 03+1] Release fifo-log-demux 0.5 [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527980 (owner: 10Vgutierrez)
[10:20:43] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Release fifo-log-demux 0.5 [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527980 (owner: 10Vgutierrez)
[10:22:31] <wikibugs>	 10Operations, 10DBA: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (10Marostegui)
[10:22:49] <wikibugs>	 10Operations, 10DBA: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (10Marostegui) p:05Triage→03Normal
[10:24:04] <wikibugs>	 10Operations, 10DBA: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (10Marostegui)
[10:24:07] <wikibugs>	 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10Marostegui)
[10:24:12] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10Marostegui)
[10:24:22] <wikibugs>	 10Operations, 10DBA: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (10Marostegui)
[10:24:25] <wikibugs>	 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10Marostegui)
[10:24:30] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10Marostegui)
[10:24:36] <wikibugs>	 10Operations, 10DBA: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (10Marostegui)
[10:24:41] <wikibugs>	 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10Marostegui)
[10:24:46] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10Marostegui)
[10:27:55] <ema>	 !log upload fifo-log-demux 0.5 to stretch-wikimedia
[10:28:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:55] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528092 (https://phabricator.wikimedia.org/T128546)
[10:30:04] <jouncebot>	 jan_drewniak: It is that lovely time of the day again! You are hereby commanded to deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T1030).
[10:30:24] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] k8s: deploy calico, rbac, psp, coredns and ns via helmfile in codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/528078 (https://phabricator.wikimedia.org/T228837) (owner: 10Fsero)
[10:30:38] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10MoritzMuehlenhoff) > Please comment and if its ready to start the decom process, check off the boxes and assign to me for followup.  Thanks in advance!  This needs to wait un...
[10:31:05] <wikibugs>	 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10MoritzMuehlenhoff) @RobH This needs to wait until https://phabricator.wikimedia.org/T229796 is complete, I'll reassign the bug to you when that's done.
[10:32:52] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528092 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[10:33:47] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528092 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[10:34:04] <wikibugs>	 (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528092 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[10:36:46] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Yoruba Wikimedians User Group Official Mailing List - https://phabricator.wikimedia.org/T229749 (10MarcoAurelio)
[10:37:25] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Yoruba Wikimedians User Group Official Mailing List - https://phabricator.wikimedia.org/T229749 (10MarcoAurelio) >>! In T229749#5390555, @Wikicology wrote: > *The list should be private and requires list administrators approval for subscription. >  > *The list should ha...
[10:38:24] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "If you’re already touching this line, perhaps you could also look into T203397? (I tried to implement that a while back but couldn’t figur" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526755 (owner: 10Smalyshev)
[10:39:19] <logmsgbot>	 !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:528092| Bumping portals to master (T128546)]] (duration: 00m 49s)
[10:39:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:28] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:40:06] <logmsgbot>	 !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:528092| Bumping portals to master (T128546)]] (duration: 00m 46s)
[10:40:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:19] <jbond42>	 !log update java on sessionstore
[10:40:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:03] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (10Cmjohnson) @marostegui yes, still good for tomorrow at 1400UTC
[10:42:50] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (10Marostegui) Excellent - thank you!
[10:44:16] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 4.001 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops
[10:50:15] <wikibugs>	 (03PS1) 10Ladsgroup: labs: Set half of wikidata to read from the new term store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528103 (https://phabricator.wikimedia.org/T225055)
[10:51:06] <wikibugs>	 (03PS1) 10Ema: ATS: add role::cache::trafficserver::backend [puppet] - 10https://gerrit.wikimedia.org/r/528104 (https://phabricator.wikimedia.org/T227432)
[10:51:44] <wikibugs>	 (03PS3) 10Fsero: k8s: deploy calico, rbac, psp, coredns and ns via helmfile in codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/528078 (https://phabricator.wikimedia.org/T228837)
[10:52:25] <wikibugs>	 (03PS3) 10Vgutierrez: prometheus: Collect ncredir nginx metrics [puppet] - 10https://gerrit.wikimedia.org/r/524409 (https://phabricator.wikimedia.org/T228382)
[10:52:47] <wikibugs>	 (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s: deploy calico, rbac, psp, coredns and ns via helmfile in codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/528078 (https://phabricator.wikimedia.org/T228837) (owner: 10Fsero)
[10:52:49] <wikibugs>	 (03CR) 10Ema: [C: 03+1] sre.hosts.upgrade-varnish: cp1008 decom cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/527173 (https://phabricator.wikimedia.org/T229586) (owner: 10Volans)
[10:52:51] <wikibugs>	 (03PS4) 10Vgutierrez: prometheus: Collect ncredir nginx metrics [puppet] - 10https://gerrit.wikimedia.org/r/524409 (https://phabricator.wikimedia.org/T228382)
[10:53:22] <vgutierrez>	 godog: can I get a sanity check on that prometheus change? https://gerrit.wikimedia.org/r/c/operations/puppet/+/524409
[10:53:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: Collect ncredir nginx metrics [puppet] - 10https://gerrit.wikimedia.org/r/524409 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez)
[10:53:37] <vgutierrez>	 hmmm jerkins is not happy
[10:54:17] <vgutierrez>	 yeah.. merge issues...I've created that chain of changes before my vacations :_(
[10:54:28] <wikibugs>	 (03Abandoned) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512651 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[10:54:36] <wikibugs>	 (03Abandoned) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509785 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[10:55:20] <godog>	 vgutierrez: for sure, please add me to the review once ready to go and I'll take a look
[10:55:27] <vgutierrez>	 thx
[10:55:50] <wikibugs>	 (03CR) 10Ema: [C: 03+1] cumin: remove old scripts converted to cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/527170 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans)
[10:55:56] <wikibugs>	 (03PS5) 10Vgutierrez: fifo_log_demux: Provide pipe creation capabilities [puppet] - 10https://gerrit.wikimedia.org/r/524176 (https://phabricator.wikimedia.org/T228382)
[10:56:17] <wikibugs>	 (03PS2) 10Vgutierrez: fifo_log_demux: Allow to specify a service that requires fifo_log_demux [puppet] - 10https://gerrit.wikimedia.org/r/524496 (https://phabricator.wikimedia.org/T228382)
[10:56:41] <wikibugs>	 (03PS32) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382)
[10:58:48] <wikibugs>	 10Operations, 10Traffic, 10netops, 10IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10jbond) Thanks brandon,  Ill take a look at removing the ones SLAAC addresses from puppet this week.  One of them, at least, was added by me and was what led...
[10:58:59] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] Update HD logo for en.ws and mul.ws (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527922 (https://phabricator.wikimedia.org/T229769) (owner: 10Viztor)
[10:59:06] <wikibugs>	 (03PS6) 10Urbanecm: Update HD logo for en.ws and mul.ws [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527922 (https://phabricator.wikimedia.org/T229769) (owner: 10Viztor)
[10:59:37] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: ingress: nginx-ingress listen on 8082/tcp [puppet] - 10https://gerrit.wikimedia.org/r/527541 (https://phabricator.wikimedia.org/T228500)
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T1100).
[11:00:05] <jouncebot>	 Amir1 and raynor: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Update HD logo for en.ws and mul.ws [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527922 (https://phabricator.wikimedia.org/T229769) (owner: 10Viztor)
[11:00:15] <raynor>	 o/
[11:00:16] <Urbanecm>	 I can SWAT today!
[11:00:23] <Lucas_WMDE>	 o/
[11:00:24] <Urbanecm>	 (and I also have a few things to deploy)
[11:00:28] <raynor>	 Urbanecm, awesome, thx
[11:00:41] <Amir1>	 o/ mine is backport and takes some time to merge, I already +2'ed it
[11:00:51] <Urbanecm>	 thanks Amir1
[11:00:56] <Amir1>	 I'm preet sure one config patch can go before mine
[11:00:58] <Urbanecm>	 I've +2'ed raynor's backport
[11:00:59] <Amir1>	 *pretty
[11:00:59] <raynor>	 btw, I saw there is a patch for Related Articles but it's unclear on how to merge it (requires three sync commands)
[11:01:27] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527726 (https://phabricator.wikimedia.org/T229717) (owner: 10Huji)
[11:01:27] <raynor>	 also I'm not super familiar with that patch and feeling a bit reluctant to merge it without having Jon Robson around, so I decided not to add it to the SWAT window
[11:01:40] <raynor>	 Urbanecm, thx for merging my backport, let me know when it gets to mwdebug
[11:01:47] <Urbanecm>	 will do
[11:02:03] <wikibugs>	 (03PS6) 10Vgutierrez: fifo_log_demux: Provide pipe creation capabilities [puppet] - 10https://gerrit.wikimedia.org/r/524176 (https://phabricator.wikimedia.org/T228382)
[11:02:05] <wikibugs>	 (03PS3) 10Vgutierrez: fifo_log_demux: Allow to specify a service that requires fifo_log_demux [puppet] - 10https://gerrit.wikimedia.org/r/524496 (https://phabricator.wikimedia.org/T228382)
[11:02:07] <wikibugs>	 (03PS33) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382)
[11:02:09] <wikibugs>	 (03PS5) 10Vgutierrez: prometheus: Collect ncredir nginx metrics [puppet] - 10https://gerrit.wikimedia.org/r/524409 (https://phabricator.wikimedia.org/T228382)
[11:02:58] <wikibugs>	 (03Merged) 10jenkins-bot: Define import sources for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527726 (https://phabricator.wikimedia.org/T229717) (owner: 10Huji)
[11:04:41] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 9eb74c2: Define import sources for fawiki (T229717) (duration: 00m 48s)
[11:04:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:49] <stashbot>	 T229717: Define import sources for fawiki - https://phabricator.wikimedia.org/T229717
[11:05:23] <Urbanecm>	 raynor: If you mean https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/527632/, I'm pretty sure it's IS.php, CS.php, list. If you know how to test what it does, I'd be comfortable to merge&sync.
[11:06:30] <raynor>	 yeah, exactly this one, but I'm not fully comfortable with testing it, I think it can wait till later today, I'll be swatting another change during morning swat
[11:06:49] <Urbanecm>	 ok then
[11:06:50] <icinga-wm>	 PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[linux-image-4.9.0-9-amd64-dbg],Package[linux-headers-4.9.0-9-amd64] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:06:51] <raynor>	 so most probably I'll pick also this one once Jon is around
[11:07:13] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "IS > CS > dblist looks correct to me as well, yeah." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra)
[11:08:01] <wikibugs>	 (03CR) 10jenkins-bot: Define import sources for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527726 (https://phabricator.wikimedia.org/T229717) (owner: 10Huji)
[11:16:54] <Amir1>	 Urbanecm: do you have any more config changes?
[11:17:21] <Urbanecm>	 Amir1, no, just a backport <https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/528108>
[11:17:34] <Urbanecm>	 currently waiting on CI
[11:17:47] <Urbanecm>	 feel free to push your things if you have any :-)
[11:18:22] <Amir1>	 that's a pretty big backport :D
[11:18:32] <Amir1>	 Mine is waiting for CI still
[11:18:36] <hauskatze>	 jouncebot: now
[11:18:36] <jouncebot>	 For the next 0 hour(s) and 41 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T1100)
[11:18:57] <Amir1>	 Urbanecm: can you deploy mine? It's not testable (you can just sync the whole extension, it should be fine)
[11:19:09] <Daimona>	 Urbanecm: I'm here to test :)
[11:19:09] <Urbanecm>	 Amir1, the backport? Sure
[11:19:17] <Urbanecm>	 wonderful, Daimona!
[11:19:37] <Amir1>	 Yup: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/528072
[11:19:56] <Urbanecm>	 ok, will do
[11:19:58] <Urbanecm>	 once it's merged
[11:20:35] <Amir1>	 Thanks. It's almost done (5 minutes left...)
[11:21:00] <Urbanecm>	 I see
[11:23:32] <wikibugs>	 (03PS4) 10Urbanecm: Enable Page Previews as default on hewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508088 (https://phabricator.wikimedia.org/T222017) (owner: 10Ammarpad)
[11:24:18] * Urbanecm is going to do ^ in the meanwhile as well
[11:24:24] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508088 (https://phabricator.wikimedia.org/T222017) (owner: 10Ammarpad)
[11:25:50] <Amir1>	 Mine is about to finish
[11:26:25] <Urbanecm>	 well, both are all success, but none is merged yet
[11:26:33] <Urbanecm>	 Wikibase done
[11:26:36] <Urbanecm>	 Amir1, fetching
[11:26:41] <Amir1>	 cool
[11:27:05] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Page Previews as default on hewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508088 (https://phabricator.wikimedia.org/T222017) (owner: 10Ammarpad)
[11:27:44] <Urbanecm>	 raynor: Your patch should be on mwdebug1002
[11:28:23] <raynor>	 thx, checking it
[11:28:31] <Urbanecm>	 Amir1, syncing
[11:28:42] * Amir1 looks at graphs
[11:28:43] <Amir1>	 thanks
[11:29:12] <icinga-wm>	 RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:29:31] <wikibugs>	 (03CR) 10jenkins-bot: Enable Page Previews as default on hewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508088 (https://phabricator.wikimedia.org/T222017) (owner: 10Ammarpad)
[11:29:40] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/Wikibase: SWAT: 3ecaa57: Add only needed entity usages in AddUsagesForPageJob (T226818, T205045) (duration: 01m 12s)
[11:29:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:50] <stashbot>	 T226818: Diff when updating wbc_entity_usage - https://phabricator.wikimedia.org/T226818
[11:29:51] <stashbot>	 T205045: Exception from LinksUpdate: Deadlock found in database query  (from Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages) - https://phabricator.wikimedia.org/T205045
[11:31:59] <hauskatze>	 Urbanecm: is 527852 being deployed?
[11:32:35] <Urbanecm>	 hauskatze, yes, currently on mwdebug1002, waiting for raynor's tests
[11:33:09] <raynor>	 I'm almost done, one sec
[11:33:19] <hauskatze>	 sure :)
[11:33:39] <marostegui>	 !log Upgrade MySQL on db2074 db2057 db2050 db2035 db2098
[11:33:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:48] <Urbanecm>	 raynor, I see "Expectation (masterConns <= 0) by ApiMain::setRequestExpectations not met (actual: 1):
[11:34:02] <Urbanecm>	 might that be related?
[11:34:16] <Urbanecm>	 https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.08.05/mediawiki/?id=AWxhjXUOZKA7Rpir5F7n is the full log
[11:34:37] <raynor>	 Urbanecm, done testing, looks good
[11:34:39] <raynor>	 let me check that logstash now
[11:34:44] <Urbanecm>	 thanks
[11:35:39] <raynor>	 Urbanecm - that's definitely related
[11:35:41] <raynor>	  	/w/api.php?action=query&format=json&formatversion=2&prop=revisions%7Cinfo&rvprop=content%7Ctimestamp&titles=User%3APMiazga%20(WMF)%2Fsandbox&intestactions=edit&intestactionsdetail=full&rvsection=0
[11:36:01] <Urbanecm>	 what should we do with that entry then raynor?
[11:36:02] <raynor>	 sorry, let me put that in different words -> that's definitely caused by me
[11:36:04] <raynor>	 but not related
[11:36:18] <Urbanecm>	 so caused by your tests, but not by your patch, right?
[11:36:19] <raynor>	 I was using Visual Editor, most edits went via API
[11:36:36] <raynor>	 yes - that's correct
[11:36:43] <raynor>	 caused by my tests, but not patch
[11:36:50] <Urbanecm>	 I'm looking for other occurrences then, to be sure
[11:36:52] <raynor>	 I'll do some more tests and submit phab ticket for that
[11:37:22] <Urbanecm>	 ~1100 occurrences in the last 15 minutes on real servers, syncing then
[11:37:44] <raynor>	 Urbanecm, thx
[11:37:47] <Daimona>	 Those kinds of notices usually happen when checking a block from master in a GET request
[11:37:58] <Daimona>	 And they're pretty common
[11:38:00] <raynor>	 I'll try to find out what's wrong and find phab ticket
[11:38:15] <Urbanecm>	 thank you
[11:38:18] <raynor>	 ah, Daimona thx, yes - it tries to check ` User->getBlockedStatus()`
[11:38:24] <Daimona>	 Indeed
[11:38:39] <Daimona>	 The idea being - for GET requests it's probably enough to check the replica
[11:38:52] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/MobileFrontend/: SWAT: b7ae4fb: Revert "[AMC] [desktop] [mobile] use AMC by default for desktop users" (T229722) (duration: 00m 49s)
[11:38:59] <Urbanecm>	 thanks Daimona, just wanted to be sure nothing's wrong with the patch, those notices are scary :-)
[11:39:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:00] <stashbot>	 T229722: All edits are tagged as "advanced mobile edit" when wgMFAdvancedMobileContributions is true - https://phabricator.wikimedia.org/T229722
[11:39:03] <raynor>	 makes sense. Daimona do we have a phab ticket for that?
[11:39:14] <raynor>	 or can I just ignore that error?
[11:39:28] <Daimona>	 I don't think that specific patch is related
[11:39:35] <Daimona>	 TBH I don't know if we already have a task
[11:39:37] <Urbanecm>	 [XUgLxApAAE0AAFtFgIAAAABC] /rpc/RunSingleJob.php ArgumentCountError from line 45 of /srv/mediawiki/php-1.34.0-wmf.16/extensions/MobileFrontend/includes/amc/UserMode.php: Too few arguments to function MobileFrontend\AMC\UserMode::__construct(), 2 passed in /srv/mediawiki/php-1.34.0-wmf.16/extensions/MobileFrontend/includes/ServiceWiring.php on line 
[11:39:38] <Urbanecm>	 53 and exactly 3 expected
[11:39:40] <Urbanecm>	 a lot of entries like that
[11:40:24] <Urbanecm>	 raynor ^^
[11:40:41] <raynor>	 damn
[11:41:05] <raynor>	 Urbanecm - yes, that's me, let me quickly check that
[11:41:25] <Daimona>	 I see there are some extension-specific tasks for that problem, which is not so serious anyway, especially as long as it doesn't happen too often.
[11:41:29] <Urbanecm>	 seems to have stopped 
[11:41:51] <Urbanecm>	 [XUgVJwpAAEwAAIJAQVEAAACS] /w/api.php   ErrorException from line 57 of /srv/mediawiki/php-1.34.0-wmf.16/extensions/MobileFrontend/includes/amc/Manager.php: PHP Warning: __construct() expects exactly 3 parameters, 2 given
[11:41:55] <raynor>	 might be a problem during the switch, the argument count has changed
[11:42:07] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.elasticsearch.rolling-restart
[11:42:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:22] <Urbanecm>	 true
[11:42:24] <raynor>	 I hope it's only php cache
[11:42:59] <logmsgbot>	 !log jbond@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-restart (exit_code=97)
[11:43:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:14] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.elasticsearch.rolling-restart
[11:43:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:33] <Urbanecm>	 things seem to be back to normal
[11:43:55] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-restart (exit_code=99)
[11:44:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:53] <raynor>	 it should be good
[11:45:23] <Urbanecm>	 agreed
[11:45:43] <raynor>	 the only place where we initialize the Manager/UserMode is ServiceWiring
[11:46:05] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 0032b0a: Enable Page Previews as default on hewikivoyage (T222017) (duration: 00m 47s)
[11:46:11] <raynor>	 UserMode also has a named constructor (UserMode::newForUser()) but this one is also fixed
[11:46:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:13] <stashbot>	 T222017: Enable Page Previews as default on hewikivoyage - https://phabricator.wikimedia.org/T222017
[11:46:15] <raynor>	 it had to be php cache ;/
[11:46:51] <Urbanecm>	 hmm
[11:46:59] <Urbanecm>	 I'm pretty sure I ran git submodule update extensions/MobileFrontend
[11:47:08] <Urbanecm>	 but extensions/MobileFrontend is dirty?
[11:47:18] <raynor>	 is it?
[11:47:27] <Urbanecm>	 yes
[11:47:34] <raynor>	 sorry, I'm not logged into the deployment server
[11:47:43] <Urbanecm>	 np
[11:48:19] <raynor>	 wmf.16 ?
[11:48:37] <Urbanecm>	 yes
[11:48:46] <Urbanecm>	 ok, I know what that is
[11:48:51] <Urbanecm>	 a security patch causes that :/
[11:48:57] <Urbanecm>	 should be fine
[11:49:24] <raynor>	 yup, security
[11:49:57] <wikibugs>	 10Operations, 10Discovery-Search, 10Elasticsearch, 10SRE-tools: cookbook sre.elasticsearch.rolling-restart failed with cluster relforge - https://phabricator.wikimedia.org/T229807 (10jbond)
[11:51:32] <Urbanecm>	 hmm
[11:51:34] <Urbanecm>	 the last one is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/528108
[11:51:40] <Urbanecm>	 but I can't git fetch it
[11:51:54] <Urbanecm>	 the Update git submodules commit doesn't seem to appear
[11:52:39] <raynor>	 Urbanecm, check for security patches
[11:53:07] <Urbanecm>	 no security patches for AbuseFilter
[11:53:12] <Urbanecm>	 (in /srv/patches)
[11:54:13] <Urbanecm>	 jouncebot, next
[11:54:14] <jouncebot>	 In 5 hour(s) and 5 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T1700)
[11:54:15] <raynor>	 then I have no idea ;(
[11:55:38] <Lucas_WMDE>	 the dirty submodules are apparently expected and you’re just supposed to leave them alone
[11:55:46] <Lucas_WMDE>	 (`git rebase` seems to do the right thing)
[11:55:49] <Lucas_WMDE>	 see https://phabricator.wikimedia.org/T229285
[11:56:29] <Urbanecm>	 yup
[11:59:02] <Urbanecm>	 Daimona, I can't fetch the "merged" backport, so I can't deploy it
[11:59:09] <Urbanecm>	 since we're out of time, I'm going to revert it
[11:59:30] <wikibugs>	 10Operations, 10Discovery-Search, 10Elasticsearch, 10SRE-tools: cookbook sre.elasticsearch.rolling-restart failed with cluster relforge - https://phabricator.wikimedia.org/T229807 (10Mathew.onipe) p:05Triage→03Normal
[11:59:47] <Daimona>	 Oh well, fine
[12:00:52] <Urbanecm>	 I'll try it later :-)
[12:01:02] <Urbanecm>	 !log EU SWAT done
[12:01:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:40] <Daimona>	 I'm afraid I won't be there... Either way it should be enough to check https://logstash.wikimedia.org/app/kibana#/discover/38e78a30-1b51-11e9-b106-b154ee768b0a?_g=(refreshInterval%3A(display%3AOff%2Cpause%3A!f%2Cvalue%3A0)%2Ctime%3A(from%3Anow-30m%2Cmode%3Aquick%2Cto%3Anow)) for new errors, and ensure that we get a reduction in "Abusefilter parser 
[12:02:41] <Daimona>	 error" messages
[12:03:48] <wikibugs>	 10Operations, 10Discovery-Search, 10Elasticsearch, 10SRE-tools: cookbook sre.elasticsearch.rolling-restart failed with cluster relforge - https://phabricator.wikimedia.org/T229807 (10Mathew.onipe) Connection from cumin hosts to relforge via elastic ports (9[24]00) failed due to firewall I guess. This is a...
[12:05:42] <Urbanecm>	 yup, will do Daimona
[12:05:48] <Daimona>	 Thank you, bb :)
[12:05:56] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528112
[12:06:08] <wikibugs>	 (03PS2) 10Ladsgroup: Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528112
[12:06:08] <icinga-wm>	 PROBLEM - puppet last run on phab1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:06:14] <wikibugs>	 (03PS2) 10ArielGlenn: add more public tables for xml/sql dumps [puppet] - 10https://gerrit.wikimedia.org/r/527505 (https://phabricator.wikimedia.org/T226167)
[12:06:35] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528112 (owner: 10Ladsgroup)
[12:09:07] <Amir1>	 I have a quick deploy
[12:11:50] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528112 (owner: 10Ladsgroup)
[12:12:05] <wikibugs>	 (03CR) 10jenkins-bot: Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528112 (owner: 10Ladsgroup)
[12:12:13] * Lucas_WMDE counts “revert”s
[12:12:24] <Lucas_WMDE>	 this… switches to WRITE_NEW again, I think? :D
[12:12:51] <Amir1>	 yes, it's setting it to new
[12:12:57] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"""" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528112 (owner: 10Ladsgroup)
[12:13:05] * Amir1 keeps on putting new reverts until someone gets angry
[12:13:23] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:526657|Switch property terms migration to WRITE_NEW on production wikidata (T225053)]] (duration: 00m 48s)
[12:13:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:32] <stashbot>	 T225053: Switch `tmpPropertyTermsMigrationStage` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225053
[12:16:13] <wikibugs>	 (03CR) 10Ladsgroup: Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"""" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528112 (owner: 10Ladsgroup)
[12:26:35] <Krinkle>	 !log mwscript deleteEqualMessages.php --wiki fywiktionary (requested at [[m:Steward_requests/Miscellaneous]])
[12:26:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:18] <wikibugs>	 (03PS1) 10Jbond: relforge: allow cumin to access elastic search ports [puppet] - 10https://gerrit.wikimedia.org/r/528116 (https://phabricator.wikimedia.org/T229807)
[12:29:59] <wikibugs>	 (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/528116 (https://phabricator.wikimedia.org/T229807) (owner: 10Jbond)
[12:33:26] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] "labs, noop for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528103 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup)
[12:33:31] <moritzm>	 !log uploaded openjdk-8 u222 for jessie-wikimedia
[12:33:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:04] <icinga-wm>	 RECOVERY - puppet last run on phab1001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:34:23] <wikibugs>	 (03Merged) 10jenkins-bot: labs: Set half of wikidata to read from the new term store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528103 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup)
[12:35:17] <wikibugs>	 10Operations, 10Discovery-Search, 10Elasticsearch, 10Traffic: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (10Mathew.onipe) About cloudelastic resolving to icinga1001, I had jbond help me do see where it cloudelastic.wikimedia.org resol...
[12:35:30] <Amir1>	 Rebased on deploy node ^
[12:36:31] <wikibugs>	 (03CR) 10jenkins-bot: labs: Set half of wikidata to read from the new term store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528103 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup)
[12:40:49] <wikibugs>	 (03PS1) 10Gehel: elasticsearch: correct ports for relforge cluster [software/spicerack] - 10https://gerrit.wikimedia.org/r/528118 (https://phabricator.wikimedia.org/T229807)
[12:41:56] <wikibugs>	 (03PS2) 10Jbond: relforge: allow cumin to access elastic search ports [puppet] - 10https://gerrit.wikimedia.org/r/528116 (https://phabricator.wikimedia.org/T229807)
[12:42:18] <wikibugs>	 (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/528116 (https://phabricator.wikimedia.org/T229807) (owner: 10Jbond)
[12:42:37] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/528116 (https://phabricator.wikimedia.org/T229807) (owner: 10Jbond)
[12:44:00] <moritzm>	 !log restarting cassandra on restbase-dev1040
[12:44:02] <moritzm>	 !log restarting cassandra on restbase-dev1004
[12:44:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:16] <wikibugs>	 10Operations, 10Release Pipeline, 10serviceops, 10Goal, 10Release-Engineering-Team (Pipeline): Self-service Deployment Pipeline - https://phabricator.wikimedia.org/T228676 (10akosiaris)
[12:53:20] <wikibugs>	 10Operations, 10Release Pipeline, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 3 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10akosiaris)
[12:53:23] <wikibugs>	 10Operations, 10Discovery-Search, 10Elasticsearch, 10Traffic: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (10BBlack) So, yes, cloudelastic is correct in DNS for normal lookups.  The issue is that the icinga check defines the virtual ho...
[12:55:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] relforge: allow cumin to access elastic search ports [puppet] - 10https://gerrit.wikimedia.org/r/528116 (https://phabricator.wikimedia.org/T229807) (owner: 10Jbond)
[12:56:17] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/528118 (https://phabricator.wikimedia.org/T229807) (owner: 10Gehel)
[12:56:33] <wikibugs>	 10Operations, 10Discovery-Search, 10Elasticsearch, 10Traffic: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (10Mathew.onipe) @BBlack yea yea.. I've missed your musing on complex system. Thanks. I will make a patch
[12:56:53] <wikibugs>	 10Operations, 10Discovery-Search, 10Elasticsearch, 10Traffic: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (10Mathew.onipe) p:05Triage→03Normal
[12:57:18] <wikibugs>	 10Operations, 10Release Pipeline, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 3 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10akosiaris) restrouter was temporarily deployed in the staging cluster today. Deployment wa...
[13:01:21] <jbond42>	 !log rolling update of openjdk-8 on restbase
[13:01:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:10] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elasticsearch: correct ports for relforge cluster [software/spicerack] - 10https://gerrit.wikimedia.org/r/528118 (https://phabricator.wikimedia.org/T229807) (owner: 10Gehel)
[13:09:32] <wikibugs>	 (03CR) 10jenkins-bot: elasticsearch: correct ports for relforge cluster [software/spicerack] - 10https://gerrit.wikimedia.org/r/528118 (https://phabricator.wikimedia.org/T229807) (owner: 10Gehel)
[13:09:49] <wikibugs>	 (03PS1) 10Elukey: Add PartOf configuration in the Kafka mirror systemd units [puppet] - 10https://gerrit.wikimedia.org/r/528127 (https://phabricator.wikimedia.org/T229003)
[13:11:26] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add PartOf configuration in the Kafka mirror systemd units [puppet] - 10https://gerrit.wikimedia.org/r/528127 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey)
[13:12:56] <wikibugs>	 (03CR) 10Ema: [C: 03+1] fifo_log_demux: Allow to specify a service that requires fifo_log_demux [puppet] - 10https://gerrit.wikimedia.org/r/524496 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez)
[13:14:32] <wikibugs>	 (03CR) 10Ema: [C: 03+1] ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez)
[13:15:54] <wikibugs>	 (03PS2) 10Filippo Giunchedi: base: stop per-host puppet critical when master has issues [puppet] - 10https://gerrit.wikimedia.org/r/528087 (https://phabricator.wikimedia.org/T229262)
[13:15:56] <wikibugs>	 (03PS4) 10Filippo Giunchedi: prometheus: alert on widespread puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/526431 (https://phabricator.wikimedia.org/T229262)
[13:16:19] <wikibugs>	 (03CR) 10Ema: [C: 03+1] prometheus: Collect ncredir nginx metrics [puppet] - 10https://gerrit.wikimedia.org/r/524409 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez)
[13:16:28] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker
[13:16:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:21] <wikibugs>	 (03PS1) 10BPirkle: Switch testwiki to use kask (only) for sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528130 (https://phabricator.wikimedia.org/T222099)
[13:21:39] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.upgrade-varnish: cp1008 decom cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/527173 (https://phabricator.wikimedia.org/T229586) (owner: 10Volans)
[13:22:28] <volans>	 ema: oops... I just noticed a bug, sending a fixing patch
[13:23:18] <hauskatze>	 I've always wondered why it's called 'cookbook' :)
[13:25:05] <volans>	 hauskatze: it's used in the industry together with runbook. We ended up using runbook for documentation on wikitech and cookbook for automation scripts.
[13:26:04] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] fifo_log_demux: Provide pipe creation capabilities [puppet] - 10https://gerrit.wikimedia.org/r/524176 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez)
[13:26:08] <wikibugs>	 10Operations, 10Discovery-Search, 10Elasticsearch, 10Traffic: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (10Mathew.onipe) Sadly, I don't think this will work as the host param will not be unique and icinga does not seem to handle that...
[13:26:14] <wikibugs>	 (03PS7) 10Vgutierrez: fifo_log_demux: Provide pipe creation capabilities [puppet] - 10https://gerrit.wikimedia.org/r/524176 (https://phabricator.wikimedia.org/T228382)
[13:26:16] <wikibugs>	 (03CR) 10Muehlenhoff: "Patch looks good, but best to wait until  https://phabricator.wikimedia.org/T229796 is resolved." [puppet] - 10https://gerrit.wikimedia.org/r/527043 (https://phabricator.wikimedia.org/T220503) (owner: 10Jbond)
[13:26:51] <wikibugs>	 (03PS2) 10Volans: sre.hosts.upgrade-varnish: cp1008 decom cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/527173 (https://phabricator.wikimedia.org/T229586)
[13:26:55] <volans>	 ema: ^^^ with the fix
[13:27:18] <ema>	 volans: looking
[13:28:10] <wikibugs>	 (03CR) 10Ema: [C: 03+1] sre.hosts.upgrade-varnish: cp1008 decom cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/527173 (https://phabricator.wikimedia.org/T229586) (owner: 10Volans)
[13:28:17] <ema>	 volans: wooops, looks good!
[13:28:36] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0)
[13:28:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:42] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.upgrade-varnish: cp1008 decom cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/527173 (https://phabricator.wikimedia.org/T229586) (owner: 10Volans)
[13:29:46] <volans>	 thx, sorry for the trouble :)
[13:30:11] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] fifo_log_demux: Allow to specify a service that requires fifo_log_demux [puppet] - 10https://gerrit.wikimedia.org/r/524496 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez)
[13:30:23] <wikibugs>	 (03PS4) 10Vgutierrez: fifo_log_demux: Allow to specify a service that requires fifo_log_demux [puppet] - 10https://gerrit.wikimedia.org/r/524496 (https://phabricator.wikimedia.org/T228382)
[13:30:51] <wikibugs>	 (03PS2) 10Volans: cumin: remove old scripts converted to cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/527170 (https://phabricator.wikimedia.org/T205886)
[13:31:41] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.upgrade-varnish: cp1008 decom cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/527173 (https://phabricator.wikimedia.org/T229586) (owner: 10Volans)
[13:33:32] <wikibugs>	 (03PS3) 10Volans: cumin: remove old scripts converted to cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/527170 (https://phabricator.wikimedia.org/T205886)
[13:33:55] * volans picks the next number to puppet-merge :D
[13:34:41] <wikibugs>	 (03PS2) 10Muehlenhoff: Switch eqiad labsldapconfig to the read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/525220 (https://phabricator.wikimedia.org/T46722)
[13:34:51] <wikibugs>	 (03CR) 10Volans: [C: 03+2] cumin: remove old scripts converted to cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/527170 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans)
[13:35:12] <wikibugs>	 (03PS1) 10Jbond: cassandra: rolling restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/528133
[13:35:30] <volans>	 vgutierrez: there is your change too to puppet-merge, what should I do?
[13:35:48] <vgutierrez>	 don't be shy, merge it
[13:35:54] <volans>	 will do :)
[13:35:58] <vgutierrez>	 thx <3
[13:36:46] <volans>	 vgutierrez: {done}
[13:36:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cassandra: rolling restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/528133 (owner: 10Jbond)
[13:37:02] <wikibugs>	 (03PS11) 10Mathew.onipe: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072)
[13:37:29] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] Switch testwiki to use kask (only) for sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528130 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle)
[13:37:47] <volans>	 !log run cumin 'A:cumin' 'rm -v /usr/local/sbin/{wmf-upgrade-varnish,wmf-upgrade-and-reboot,wmf-downtime-host,wmf-decommission-host}' T205886
[13:37:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:56] <stashbot>	 T205886: Cookbooks: convert remaining wmf-* scripts - https://phabricator.wikimedia.org/T205886
[13:38:20] <wikibugs>	 (03PS2) 10Jbond: cassandra: rolling restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/528133
[13:38:25] <wikibugs>	 10Operations, 10SRE-tools, 10Goal: Expand Spicerack library and SRE Cookbooks - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205867 (10Volans)
[13:39:22] <wikibugs>	 (03PS3) 10Eevans: cassandra: rolling restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/528133 (owner: 10Jbond)
[13:40:14] <wikibugs>	 (03CR) 10Muehlenhoff: cassandra: rolling restart cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/528133 (owner: 10Jbond)
[13:40:15] <wikibugs>	 (03PS4) 10Mathew.onipe: cloudelastic: remove ocsp_proxy [puppet] - 10https://gerrit.wikimedia.org/r/511381 (https://phabricator.wikimedia.org/T223519)
[13:41:08] <wikibugs>	 (03CR) 10Mathew.onipe: Add maps reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe)
[13:42:05] <fsero>	 !log deploying tiller in kube-system for helmfile changes - T228837
[13:42:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:14] <stashbot>	 T228837: recreate codfw cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228837
[13:44:09] <wikibugs>	 (03CR) 10Eevans: cassandra: rolling restart cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/528133 (owner: 10Jbond)
[13:45:36] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] "lg!" [puppet] - 10https://gerrit.wikimedia.org/r/526431 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi)
[13:46:49] <wikibugs>	 (03CR) 10CDanis: base: stop per-host puppet critical when master has issues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528087 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi)
[13:51:25] <wikibugs>	 10Operations, 10Machine vision, 10Reading-Infrastructure-Team-Backlog, 10serviceops, and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Mholloway)
[13:52:16] <wikibugs>	 (03PS1) 10Fsero: k8s: bug: fixing initialize_cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/528137
[13:52:20] <wikibugs>	 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Traffic: Rename gerrit-slave to gerrit-replica - https://phabricator.wikimedia.org/T229822 (10Paladox)
[13:52:31] <wikibugs>	 (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s: bug: fixing initialize_cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/528137 (owner: 10Fsero)
[13:54:43] <wikibugs>	 (03CR) 10Muehlenhoff: cassandra: rolling restart cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/528133 (owner: 10Jbond)
[13:56:07] <wikibugs>	 (03PS1) 10Vgutierrez: acme_chief: Add gerrit-replica.wm.o to the gerrit certificate [puppet] - 10https://gerrit.wikimedia.org/r/528138
[13:56:55] <fsero>	 !log deploying calico controller in codfw via helmfile - T228837
[13:57:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:04] <stashbot>	 T228837: recreate codfw cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228837
[13:57:50] <wikibugs>	 (03PS2) 10Vgutierrez: acme_chief: Add gerrit-replica.wm.o to the gerrit certificate [puppet] - 10https://gerrit.wikimedia.org/r/528138 (https://phabricator.wikimedia.org/T229822)
[13:57:50] <logmsgbot>	 !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
[13:57:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:23] <wikibugs>	 (03CR) 10Paladox: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/528138 (https://phabricator.wikimedia.org/T229822) (owner: 10Vgutierrez)
[13:59:36] <wikibugs>	 (03PS3) 10Paladox: Add gerrit-replica [dns] - 10https://gerrit.wikimedia.org/r/527657
[14:00:02] <logmsgbot>	 !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
[14:00:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:39] <wikibugs>	 (03PS1) 10Fsero: bug: incorrect values path [deployment-charts] - 10https://gerrit.wikimedia.org/r/528139
[14:01:42] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[14:01:48] <wikibugs>	 (PS4) Paladox: Add gerrit-replica [dns] - https://gerrit.wikimedia.org/r/527657
[14:01:54] <wikibugs>	 (CR) Fsero: [V: +2 C: +2] bug: incorrect values path [deployment-charts] - https://gerrit.wikimedia.org/r/528139 (owner: Fsero)
[14:02:26] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add gerrit-replica [dns] - https://gerrit.wikimedia.org/r/527657 (owner: Paladox)
[14:03:02] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 40.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[14:04:02] <wikibugs>	 (PS5) Paladox: Add gerrit-replica [dns] - https://gerrit.wikimedia.org/r/527657
[14:04:13] <logmsgbot>	 !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[14:04:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:33] <logmsgbot>	 !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'coredns' .
[14:05:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:45] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: stop polling varnish on upload backend [puppet] - https://gerrit.wikimedia.org/r/528142
[14:05:47] <wikibugs>	 (PS1) Filippo Giunchedi: kubernetes: expand alert description [puppet] - https://gerrit.wikimedia.org/r/528143 (https://phabricator.wikimedia.org/T229262)
[14:06:16] <jijiki>	 !log Depool and restart restbase2009 for maint - T227408
[14:06:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:25] <stashbot>	 T227408: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408
[14:07:48] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[14:07:48] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:07:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:11] <wikibugs>	 (CR) Vgutierrez: [C: +2] acme_chief: Add gerrit-replica.wm.o to the gerrit certificate [puppet] - https://gerrit.wikimedia.org/r/528138 (https://phabricator.wikimedia.org/T229822) (owner: Vgutierrez)
[14:08:11] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[14:08:12] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[14:08:17] <wikibugs>	 (PS6) Paladox: Add gerrit-replica [dns] - https://gerrit.wikimedia.org/r/527657 (https://phabricator.wikimedia.org/T229822)
[14:08:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:48] <wikibugs>	 (CR) Herron: [C: +1] "Yes!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/526431 (https://phabricator.wikimedia.org/T229262) (owner: Filippo Giunchedi)
[14:08:50] <wikibugs>	 (PS4) Jbond: cassandra: rolling restart cookbook [cookbooks] - https://gerrit.wikimedia.org/r/528133
[14:09:04] <wikibugs>	 (PS1) Fsero: adding limitranges for codfw [deployment-charts] - https://gerrit.wikimedia.org/r/528145
[14:09:09] <wikibugs>	 (CR) Giuseppe Lavagetto: Add the mediawiki.restart_appservers cookbook (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/527487 (owner: Giuseppe Lavagetto)
[14:09:14] <wikibugs>	 (CR) Jbond: cassandra: rolling restart cookbook (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/528133 (owner: Jbond)
[14:10:13] <wikibugs>	 Operations, ops-codfw, serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (Papaul) @jijiki I need this server powered down, thanks
[14:10:15] <wikibugs>	 (PS2) Fsero: adding limitranges for codfw [deployment-charts] - https://gerrit.wikimedia.org/r/528145
[14:10:44] <wikibugs>	 Operations, ops-codfw, serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (jijiki) @papaul Server is depooled, ping me when to pool it back, many thanks!
[14:10:55] <wikibugs>	 (CR) Fsero: [V: +2 C: +2] adding limitranges for codfw [deployment-charts] - https://gerrit.wikimedia.org/r/528145 (owner: Fsero)
[14:12:57] <logmsgbot>	 !log fsero@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero,name=codfw
[14:13:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:18] <wikibugs>	 (PS3) Paladox: Gerrit: Rename gerrit-slave to gerrit-replica [puppet] - https://gerrit.wikimedia.org/r/527656
[14:13:25] <wikibugs>	 (PS4) Paladox: Gerrit: Rename gerrit-slave to gerrit-replica [puppet] - https://gerrit.wikimedia.org/r/527656
[14:13:38] <wikibugs>	 (Abandoned) Paladox: gerrit: Add gerrit-replica to acme [puppet] - https://gerrit.wikimedia.org/r/527664 (owner: Paladox)
[14:14:21] <wikibugs>	 (CR) Vgutierrez: [C: +2] Add gerrit-replica [dns] - https://gerrit.wikimedia.org/r/527657 (https://phabricator.wikimedia.org/T229822) (owner: Paladox)
[14:19:22] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: alert on widespread puppet failures [puppet] - https://gerrit.wikimedia.org/r/526431 (https://phabricator.wikimedia.org/T229262) (owner: Filippo Giunchedi)
[14:19:27] <wikibugs>	 (PS5) Filippo Giunchedi: prometheus: alert on widespread puppet failures [puppet] - https://gerrit.wikimedia.org/r/526431 (https://phabricator.wikimedia.org/T229262)
[14:19:46] <wikibugs>	 Operations, serviceops, PHP 7.2 support: Socket Errors on PHP7 - https://phabricator.wikimedia.org/T224538 (jijiki) For connection pooling purposes, when we want to access `search.svc.eqiad.wmnet` from php-fpm, we are doing so via nginx. This nginx is installed on each mw* server listening on localho...
[14:21:05] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: alert on widespread puppet failures (1 comment) [puppet] - https://gerrit.wikimedia.org/r/526431 (https://phabricator.wikimedia.org/T229262) (owner: Filippo Giunchedi)
[14:24:31] <papaul>	 !log shut down restbase2009 for battery replacement
[14:24:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:25:38] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[14:25:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:25:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:25:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:25:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:25:49] <wikibugs>	 (PS1) Marostegui: dbproxy1010: Depool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/528152 (https://phabricator.wikimedia.org/T222978)
[14:25:59] <jbond42>	 ^^ this is me, should recover shortly
[14:26:37] <jbond42>	 ^^ possibly not, actually; I was working on restbase1019
[14:26:58] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[14:27:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:27:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:27:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:27:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:27:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:27:21] <wikibugs>	 (PS2) Marostegui: dbproxy1011: Depool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/528152 (https://phabricator.wikimedia.org/T222978)
[14:29:06] <wikibugs>	 Operations, serviceops, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review, and 3 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki) Removing HHVM and any leftovers are now part of T229792, we mark this as resolved 💃
[14:29:22] <wikibugs>	 Operations, serviceops, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review, and 3 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki) Open→Resolved a:jijiki
[14:29:25] <wikibugs>	 Operations, serviceops: SRE FY19-20 Q1 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (jijiki)
[14:31:19] <wikibugs>	 (CR) BBlack: prometheus: stop polling varnish on upload backend (1 comment) [puppet] - https://gerrit.wikimedia.org/r/528142 (owner: Filippo Giunchedi)
[14:31:23] <wikibugs>	 (CR) Marostegui: [C: +2] dbproxy1011: Depool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/528152 (https://phabricator.wikimedia.org/T222978) (owner: Marostegui)
[14:31:59] <marostegui>	 !log Reload haproxy on dbproxy1011 to depool labsdb1010 T222978
[14:32:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:07] <stashbot>	 T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978
[14:33:37] <wikibugs>	 (CR) Herron: [C: +1] kubernetes: expand alert description [puppet] - https://gerrit.wikimedia.org/r/528143 (https://phabricator.wikimedia.org/T229262) (owner: Filippo Giunchedi)
[14:36:15] <wikibugs>	 (CR) Herron: [C: +1] "I have mixed feeling about this (the general complexity of this check) but overall think it's a good step to reduce alert noise.  Maybe we" [puppet] - https://gerrit.wikimedia.org/r/528087 (https://phabricator.wikimedia.org/T229262) (owner: Filippo Giunchedi)
[14:40:00] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] systemd::tmpfile: apply changes when we change the files. [puppet] - https://gerrit.wikimedia.org/r/527430 (https://phabricator.wikimedia.org/T204450) (owner: Giuseppe Lavagetto)
[14:40:11] <wikibugs>	 (PS2) Giuseppe Lavagetto: systemd::tmpfile: apply changes when we change the files. [puppet] - https://gerrit.wikimedia.org/r/527430 (https://phabricator.wikimedia.org/T204450)
[14:41:45] <_joe_>	 uhm
[14:43:24] <wikibugs>	 (PS1) Fsero: k8s, cache: disabling codfw services for k8s cluster recreation [puppet] - https://gerrit.wikimedia.org/r/528164 (https://phabricator.wikimedia.org/T228837)
[14:44:06] <wikibugs>	 (CR) CDanis: [C: +1] base: stop per-host puppet critical when master has issues (1 comment) [puppet] - https://gerrit.wikimedia.org/r/528087 (https://phabricator.wikimedia.org/T229262) (owner: Filippo Giunchedi)
[14:48:17] <wikibugs>	 (PS5) Paladox: Gerrit: Rename gerrit-slave to gerrit-replica [puppet] - https://gerrit.wikimedia.org/r/527656 (https://phabricator.wikimedia.org/T229822)
[14:54:32] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] k8s, cache: disabling codfw services for k8s cluster recreation [puppet] - https://gerrit.wikimedia.org/r/528164 (https://phabricator.wikimedia.org/T228837) (owner: Fsero)
[14:55:42] <wikibugs>	 (CR) Bstorm: [C: +1] locales-extended: Add support for Korean [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/527653 (https://phabricator.wikimedia.org/T130532) (owner: BryanDavis)
[14:56:30] <wikibugs>	 (PS1) Fsero: bug: deleted unused envs [deployment-charts] - https://gerrit.wikimedia.org/r/528168
[14:56:59] <wikibugs>	 (CR) Fsero: [V: +2 C: +2] bug: deleted unused envs [deployment-charts] - https://gerrit.wikimedia.org/r/528168 (owner: Fsero)
[14:57:58] <wikibugs>	 (PS1) RobH: ganeti500[123] mgmt dns [dns] - https://gerrit.wikimedia.org/r/528169 (https://phabricator.wikimedia.org/T228099)
[14:58:13] <wikibugs>	 (PS2) RobH: ganeti500[123] mgmt dns [dns] - https://gerrit.wikimedia.org/r/528169 (https://phabricator.wikimedia.org/T228099)
[14:59:15] <wikibugs>	 (PS3) RobH: ganeti500[123] mgmt dns [dns] - https://gerrit.wikimedia.org/r/528169 (https://phabricator.wikimedia.org/T228099)
[14:59:24] <wikibugs>	 (PS4) RobH: ganeti500[123] mgmt dns [dns] - https://gerrit.wikimedia.org/r/528169 (https://phabricator.wikimedia.org/T228099)
[15:03:18] <wikibugs>	 (PS1) Fsero: helmfile: ammending envs files [deployment-charts] - https://gerrit.wikimedia.org/r/528170
[15:06:38] <wikibugs>	 (CR) Fsero: [V: +2 C: +2] helmfile: ammending envs files [deployment-charts] - https://gerrit.wikimedia.org/r/528170 (owner: Fsero)
[15:07:01] <wikibugs>	 (CR) RobH: [C: +2] ganeti500[123] mgmt dns [dns] - https://gerrit.wikimedia.org/r/528169 (https://phabricator.wikimedia.org/T228099) (owner: RobH)
[15:07:49] <icinga-wm>	 PROBLEM - Host restbase2009 is DOWN: PING CRITICAL - Packet loss = 100%
[15:08:05] <jijiki>	 ^ downtime expired
[15:08:09] <jijiki>	 fixing 
[15:09:42] <icinga-wm>	 ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 6.001 ge 4 Effie Mouzeli We know, host will be retired soon https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops
[15:10:47] <icinga-wm>	 ACKNOWLEDGEMENT - Host restbase2009 is DOWN: PING CRITICAL - Packet loss = 100% Effie Mouzeli Host is down for maint - T227408
[15:10:53] <wikibugs>	 (PS1) Fsero: ammending eqiad env too [deployment-charts] - https://gerrit.wikimedia.org/r/528172
[15:11:08] <wikibugs>	 (CR) Fsero: [V: +2 C: +2] ammending eqiad env too [deployment-charts] - https://gerrit.wikimedia.org/r/528172 (owner: Fsero)
[15:12:43] <wikibugs>	 (CR) CRusnov: "Long explanation inline." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: CRusnov)
[15:15:57] <wikibugs>	 (PS1) Elukey: camus: allow to choose the CamusChecker's alert email [puppet] - https://gerrit.wikimedia.org/r/528174
[15:16:05] <icinga-wm>	 PROBLEM - Host restbase2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:16:49] <wikibugs>	 (CR) CRusnov: "> Patch Set 4:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: CRusnov)
[15:17:37] <wikibugs>	 (PS1) Pmiazga: Enable editor gender surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/528175 (https://phabricator.wikimedia.org/T227793)
[15:18:11] <wikibugs>	 (PS2) Pmiazga: Enable editor gender surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/528175 (https://phabricator.wikimedia.org/T227793)
[15:18:58] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=ncredir2002.codfw.wmnet
[15:19:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:21] <wikibugs>	 (PS3) Isarra: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644)
[15:19:41] <wikibugs>	 (PS4) Isarra: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644)
[15:19:53] <wikibugs>	 (CR) Isarra: "Good point, too much json." [mediawiki-config] - https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: Isarra)
[15:21:02] <wikibugs>	 (CR) Vgutierrez: [C: +2] ncredir: Use pipes instead of files for the access_log [puppet] - https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) (owner: Vgutierrez)
[15:21:16] <wikibugs>	 (PS34) Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382)
[15:21:21] <marostegui>	 !log Add db2127 to tendril and zarcillo (s3) - T228969
[15:21:25] <wikibugs>	 (PS1) Giuseppe Lavagetto: mediawiki::php: tune after an appserver 100% in production [puppet] - https://gerrit.wikimedia.org/r/528176
[15:21:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:29] <stashbot>	 T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969
[15:24:11] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] k8s, cache: disabling codfw services for k8s cluster recreation [puppet] - https://gerrit.wikimedia.org/r/528164 (https://phabricator.wikimedia.org/T228837) (owner: Fsero)
[15:26:09] <fsero>	 !log recreating zotero and termbox from helmfile codfw - T228837
[15:26:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:17] <stashbot>	 T228837: recreate codfw cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228837
[15:26:43] <wikibugs>	 Operations, ops-eqiad, Discovery-Search (Current work): elastic1031 - PSU status critical - https://phabricator.wikimedia.org/T229453 (Gehel) >>! In T229453#5381207, @wiki_willy wrote: > If it's actually a bad PSU, I think we can leave it, since it's due to be refreshed via T221636.  Confirmed, we ca...
[15:27:10] <fsero>	 !log recreating zotero and termbox  namespaces and services from helmfile codfw - T228837
[15:27:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:33] <icinga-wm>	 PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:30:47] <icinga-wm>	 PROBLEM - Check systemd state on ncredir2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:32:03] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - termbox_3030: Servers kubernetes2004.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: zotero_1969: Servers kubernetes2004.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:32:13] <logmsgbot>	 !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'termbox' for release 'production' .
[15:32:17] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - termbox_3030: Servers kubernetes2004.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: zotero_1969: Servers kubernetes2002.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:32:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:31] <icinga-wm>	 PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 193.5 ge 130 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[15:36:00] <logmsgbot>	 !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'zotero' for release 'production' .
[15:36:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:39] <icinga-wm>	 PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=DELETE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[15:37:11] <wikibugs>	 Operations, SRE-Access-Requests, cloud-services-team (Kanban): SRE: root access for Hieu Pham, SRE @ WMCS - https://phabricator.wikimedia.org/T229833 (aborrero)
[15:37:25] <wikibugs>	 Operations, SRE-Access-Requests, cloud-services-team (Kanban): SRE: root access for Hieu Pham, SRE @ WMCS - https://phabricator.wikimedia.org/T229833 (aborrero) p:Triage→High
[15:37:33] <wikibugs>	 (CR) Thcipriani: [C: +1] Gerrit: Rename gerrit-slave to gerrit-replica [puppet] - https://gerrit.wikimedia.org/r/527656 (https://phabricator.wikimedia.org/T229822) (owner: Paladox)
[15:38:15] <icinga-wm>	 RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[15:38:27] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:38:39] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:39:21] <wikibugs>	 Operations, SRE-Access-Requests, cloud-services-team (Kanban): SRE: root access for Hieu Pham, SRE @ WMCS - https://phabricator.wikimedia.org/T229833 (bd808) +1 as Hieu's manager. Getting all his accounts setup may take a day or two, but we wanted to get the group approval done as soon as we can.
[15:41:10] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ncredir2002.codfw.wmnet
[15:41:14] <wikibugs>	 (PS1) Bstorm: Apply black formatting [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/528177
[15:41:16] <wikibugs>	 (PS1) Bstorm: docker: add support for "stable" and "testing" tags in addition to latest [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T229058)
[15:41:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:33] <wikibugs>	 (CR) Ema: [C: +1] k8s, cache: disabling codfw services for k8s cluster recreation [puppet] - https://gerrit.wikimedia.org/r/528164 (https://phabricator.wikimedia.org/T228837) (owner: Fsero)
[15:41:40] <wikibugs>	 (CR) jerkins-bot: [V: -1] docker: add support for "stable" and "testing" tags in addition to latest [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T229058) (owner: Bstorm)
[15:41:57] <wikibugs>	 (PS1) Fsero: updating namespace creation in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/528179
[15:42:54] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: Isarra)
[15:45:37] <wikibugs>	 (CR) Bstorm: "There's tests!  I'm overjoyed and fixing" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T229058) (owner: Bstorm)
[15:47:25] <wikibugs>	 (PS1) Thcipriani: gerrit: replication replicateOnStartup [puppet] - https://gerrit.wikimedia.org/r/528181 (https://phabricator.wikimedia.org/T229756)
[15:48:36] <wikibugs>	 (PS2) Elukey: camus: allow to choose the CamusChecker's alert email [puppet] - https://gerrit.wikimedia.org/r/528174
[15:48:40] <wikibugs>	 (CR) Paladox: [C: +1] gerrit: replication replicateOnStartup [puppet] - https://gerrit.wikimedia.org/r/528181 (https://phabricator.wikimedia.org/T229756) (owner: Thcipriani)
[15:49:13] <wikibugs>	 Operations, SRE-Access-Requests, cloud-services-team (Kanban): SRE: root access for Hieu Pham, SRE @ WMCS - https://phabricator.wikimedia.org/T229833 (bd808)
[15:49:15] <wikibugs>	 (PS2) Bstorm: docker: add support for "stable" and "testing" tags [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T229058)
[15:51:07] <icinga-wm>	 RECOVERY - Host restbase2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.77 ms
[15:51:31] <wikibugs>	 Operations, Beta-Cluster-Infrastructure, Mathoid, Core Platform Team Legacy (Watching / External), and 2 others: remove mathoid from scb - https://phabricator.wikimedia.org/T200832 (Pchelolo) Is this done and can be resolved? There doesn't seem to be a mathoid installation on scb any longer
[15:51:43] <icinga-wm>	 RECOVERY - Host restbase2009 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms
[15:52:03] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM overall, see inline for further restricting proxypass, possibly in a followup review" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: CRusnov)
[15:54:22] <wikibugs>	 (CR) Bstorm: "This is just to restrict to 80 chars, which I think we generally agree we like better than the black default of 110." [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/528177 (owner: Bstorm)
[15:54:58] <wikibugs>	 (CR) Fsero: [V: +2 C: +2] updating namespace creation in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/528179 (owner: Fsero)
[15:55:27] <wikibugs>	 (PS2) Fsero: k8s, cache: disabling codfw services for k8s cluster recreation [puppet] - https://gerrit.wikimedia.org/r/528164 (https://phabricator.wikimedia.org/T228837)
[15:55:38] <wikibugs>	 Operations, ops-codfw, serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (Papaul) @jijiki please repool the server when you have a minute. We will have to order a new Storage battery for the server since all the decom HP servers are GEN8 and this one is a GEN9 so diffe...
[15:55:57] <wikibugs>	 Operations, Beta-Cluster-Infrastructure, Mathoid, Core Platform Team Legacy (Watching / External), and 2 others: remove mathoid from scb - https://phabricator.wikimedia.org/T200832 (akosiaris) Open→Resolved a:akosiaris I see `'mathoid' => 'http://deployment-docker-mathoid01.eqiad.wmfl...
[15:56:31] <icinga-wm>	 RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:57:35] <wikibugs>	 (CR) Fsero: [C: +2] k8s, cache: disabling codfw services for k8s cluster recreation [puppet] - https://gerrit.wikimedia.org/r/528164 (https://phabricator.wikimedia.org/T228837) (owner: Fsero)
[15:57:58] <Urbanecm>	 jouncebot: now
[15:57:59] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 2 minute(s)
[15:58:03] <Urbanecm>	 jouncebot: next
[15:58:03] <jouncebot>	 In 1 hour(s) and 1 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T1700)
[15:58:28] <Urbanecm>	 !log Deploy patch for T200104
[15:58:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:01] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - termbox_3030: Servers kubernetes2004.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: zotero_1969: Servers kubernetes2002.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:02:27] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - termbox_3030: Servers kubernetes2006.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: zotero_1969: Servers kubernetes2006.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:02:37] <logmsgbot>	 !log crusnov@ helmfile [CODFW] Ran 'apply' command on namespace 'termbox' for release 'production' .
[16:02:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:19] <wikibugs>	 (PS2) Krinkle: mediawiki: Use HTTPS for /nl-portal and /be-portal redirects [puppet] - https://gerrit.wikimedia.org/r/518099
[16:04:27] <logmsgbot>	 !log crusnov@ helmfile [CODFW] Ran 'apply' command on namespace 'termbox' for release 'production' .
[16:04:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:31] <logmsgbot>	 !log crusnov@ helmfile [CODFW] Ran 'apply' command on namespace 'zotero' for release 'production' .
[16:06:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:48] <wikibugs>	 Operations, ops-eqiad, DC-Ops: b6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227541 (akosiaris)
[16:08:51] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:09:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:10:11] <wikibugs>	 (PS3) Elukey: camus: allow to choose the CamusChecker's alert email [puppet] - https://gerrit.wikimedia.org/r/528174
[16:10:18] <wikibugs>	 Operations, ops-codfw, serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (Papaul) @jijiki I made a procurement task for the storage battery at T229847
[16:10:46] <fsero>	 !log recreating citoid eventgate-analytics eventgate-main mathoid sessionstore namespaces and redeploying from helmfile T228837
[16:10:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:54] <stashbot>	 T228837: recreate codfw cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228837
[16:11:03] <wikibugs>	 (PS1) Vgutierrez: ncredir: Ensure that mtail service doesn't get enabled [puppet] - https://gerrit.wikimedia.org/r/528188 (https://phabricator.wikimedia.org/T228382)
[16:12:11] <wikibugs>	 (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/17729/" [puppet] - https://gerrit.wikimedia.org/r/528174 (owner: Elukey)
[16:13:04] <wikibugs>	 Operations, ops-eqiad, DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (Marostegui)
[16:14:19] <wikibugs>	 (CR) Vgutierrez: [C: +2] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1001/17730/" [puppet] - https://gerrit.wikimedia.org/r/528188 (https://phabricator.wikimedia.org/T228382) (owner: Vgutierrez)
[16:14:30] <wikibugs>	 (PS2) Vgutierrez: ncredir: Ensure that mtail service doesn't get enabled [puppet] - https://gerrit.wikimedia.org/r/528188 (https://phabricator.wikimedia.org/T228382)
[16:14:59] <wikibugs>	 (PS1) EBernhardson: Change mjolnir_bulk_daemon kafka topics [puppet] - https://gerrit.wikimedia.org/r/528190 (https://phabricator.wikimedia.org/T227364)
[16:16:44] <logmsgbot>	 !log crusnov@ helmfile [CODFW] Ran 'apply' command on namespace 'citoid' for release 'production' .
[16:16:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:53] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-analytics_31192: Servers kubernetes2006.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: citoid_1970: Servers kubernetes2002.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:17:05] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-analytics_31192: Servers kubernetes2006.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: citoid_1970: Servers kubernetes2002.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:18:19] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt1007 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:18:58] <logmsgbot>	 !log crusnov@ helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' .
[16:19:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:15] <logmsgbot>	 !log crusnov@ helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'main' .
[16:22:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:22] <logmsgbot>	 !log crusnov@ helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'main' .
[16:22:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:38] <wikibugs>	 Operations, DBA, wikitech.wikimedia.org, cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (Marostegui) As per the sync on the SRE meeting, @JHedden will be online from WMCS. I will handle the announcement for wikitech, could...
[16:25:09] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 #page on mathoid.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.20 and port 10042: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:25:09] <icinga-wm>	 PROBLEM - puppet last run on ncredir2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Service[mtail] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:26:47] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:26:55] <wikibugs>	 Operations, ops-codfw, DBA: (2019-08-31)rack/setup/install db2131.codfw.wmnet - https://phabricator.wikimedia.org/T229251 (Papaul)
[16:27:33] <icinga-wm>	 ACKNOWLEDGEMENT - LVS HTTP IPv4 #page on mathoid.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.20 and port 10042: Connection refused Fsero T228837 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:28:25] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:28:51] <wikibugs>	 Operations, DBA, wikitech.wikimedia.org, cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (JHedden) >>! In T229657#5393428, @Marostegui wrote: > As per the sync on the SRE meeting, @JHedden will be online from WMCS. > I will...
[16:28:53] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:29:14] <wikibugs>	 Operations, DBA, wikitech.wikimedia.org, cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (Marostegui) Thanks!
[16:29:17] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:29:56] <wikibugs>	 (PS1) Vgutierrez: ncredir: Ensure that the default mtail service gets masked [puppet] - https://gerrit.wikimedia.org/r/528193 (https://phabricator.wikimedia.org/T228382)
[16:30:12] <icinga-wm>	 ACKNOWLEDGEMENT - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet Fsero T228837 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:30:12] <icinga-wm>	 ACKNOWLEDGEMENT - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet Fsero T228837 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:30:45] <icinga-wm>	 RECOVERY - puppet last run on ncredir2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:32:07] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:32:23] <logmsgbot>	 !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'mathoid' for release 'production' .
[16:32:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:31] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:33:15] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 #page on mathoid.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 925 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:33:15] <wikibugs>	 (CR) Vgutierrez: [C: +2] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/17731/" [puppet] - https://gerrit.wikimedia.org/r/528193 (https://phabricator.wikimedia.org/T228382) (owner: Vgutierrez)
[16:33:27] <wikibugs>	 (PS2) Vgutierrez: ncredir: Ensure that the default mtail service gets masked [puppet] - https://gerrit.wikimedia.org/r/528193 (https://phabricator.wikimedia.org/T228382)
[16:34:07] <wikibugs>	 (CR) Jforrester: [C: +1] gerrit: replication replicateOnStartup [puppet] - https://gerrit.wikimedia.org/r/528181 (https://phabricator.wikimedia.org/T229756) (owner: Thcipriani)
[16:37:13] <logmsgbot>	 !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'sessionstore' for release 'production' .
[16:37:19] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 #page on sessionstore.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.29 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:37:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:53] <bblack>	 are we trying to set a new record for number of paging alerts during an SRE meeting? :)
[16:37:59] <volans>	 lol
[16:38:03] <godog>	 hold my pager
[16:38:09] <cdanis>	 6 seconds from the log to the 📟 was impressive
[16:38:15] <icinga-wm>	 ACKNOWLEDGEMENT - LVS HTTP IPv4 #page on sessionstore.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.29 and port 8081: Connection refused Fsero expected page from cluster recreation https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:38:58] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 #page on sessionstore.svc.codfw.wmnet is OK: HTTP OK: Status line output matched 200 - 258 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:39:10] <apergos>	 what the
[16:39:47] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:40:02] <fsero>	 it's the cluster recreation, sorry, this shouldn't have paged (it was downtimed)
[16:40:03] <fsero>	 no more noise
[16:40:05] <fsero>	 it ended
[16:41:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:41:27] <icinga-wm>	 RECOVERY - Check systemd state on ncredir2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:43:14] <wikibugs>	 (CR) Ottomata: [C: +1] camus: allow to choose the CamusChecker's alert email [puppet] - https://gerrit.wikimedia.org/r/528174 (owner: Elukey)
[16:43:45] <wikibugs>	 (PS4) Elukey: camus: allow to choose the CamusChecker's alert email [puppet] - https://gerrit.wikimedia.org/r/528174
[16:44:08] <wikibugs>	 (CR) Ottomata: [C: +1] "Hm, I think refine calls this emails_to, and accepts a list of emails.  $check_emails_to?  I don't care that much, target is ok too" [puppet] - https://gerrit.wikimedia.org/r/528174 (owner: Elukey)
[16:45:24] <wikibugs>	 (CR) Elukey: "> Hm, I think refine calls this emails_to, and accepts a list of" [puppet] - https://gerrit.wikimedia.org/r/528174 (owner: Elukey)
[16:45:44] <wikibugs>	 (CR) Ottomata: [C: +1] "Probably don't need list." [puppet] - https://gerrit.wikimedia.org/r/528174 (owner: Elukey)
[16:46:17] <icinga-wm>	 RECOVERY - puppet last run on cloudvirt1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:47:21] <wikibugs>	 (PS1) Alexandros Kosiaris: Increase mathoid resourcequotas [deployment-charts] - https://gerrit.wikimedia.org/r/528196 (https://phabricator.wikimedia.org/T228837)
[16:48:11] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Seems fine as a stopgap until fixed upstream." [puppet] - https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406) (owner: Filippo Giunchedi)
[16:49:32] <wikibugs>	 (CR) Fsero: Increase mathoid resourcequotas (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/528196 (https://phabricator.wikimedia.org/T228837) (owner: Alexandros Kosiaris)
[16:52:11] <logmsgbot>	 !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'blubberoid' for release 'production' .
[16:52:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:27] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes2002.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: cxserver_8080: Servers kubernetes2004.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:52:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes2006.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: cxserver_8080: Servers kubernetes2006.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:53:35] <logmsgbot>	 !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
[16:53:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:44] <logmsgbot>	 !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
[16:53:45] <wikibugs>	 (PS2) Alexandros Kosiaris: Increase mathoid resourcequotas [deployment-charts] - https://gerrit.wikimedia.org/r/528196 (https://phabricator.wikimedia.org/T228837)
[16:53:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:12] <wikibugs>	 (CR) Alexandros Kosiaris: Increase mathoid resourcequotas (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/528196 (https://phabricator.wikimedia.org/T228837) (owner: Alexandros Kosiaris)
[16:54:48] <wikibugs>	 (CR) Fsero: [V: +2 C: +2] Increase mathoid resourcequotas [deployment-charts] - https://gerrit.wikimedia.org/r/528196 (https://phabricator.wikimedia.org/T228837) (owner: Alexandros Kosiaris)
[16:55:41] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:55:51] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:00:04] <jouncebot>	 gehel and onimisionipe: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T1700).
[17:00:16] <onimisionipe>	 here here
[17:00:34] <onimisionipe>	 jouncebot: no deployment for wdqs
[17:04:16] <wikibugs>	 (CR) Elukey: [C: +2] camus: allow to choose the CamusChecker's alert email [puppet] - https://gerrit.wikimedia.org/r/528174 (owner: Elukey)
[17:10:57] <wikibugs>	 (PS1) Fsero: k8s, codfw: disabling quotas on eventgate, cxserver and mathoid as they need more work [deployment-charts] - https://gerrit.wikimedia.org/r/528202 (https://phabricator.wikimedia.org/T228837)
[17:11:45] <icinga-wm>	 PROBLEM - toolschecker: kubernetes webservice running #page on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 177 bytes in 9.828 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[17:12:13] <apergos>	 arturo: ?
[17:12:25] <bd808>	 argh. I'll put that back into long term downtime
[17:12:31] <apergos>	 ah. thank you
[17:12:37] <bd808>	 that test is crappy
[17:12:46] <godog>	 thanks bd808 !
[17:12:48] <wikibugs>	 (PS2) Fsero: k8s, codfw: disabling quotas on eventgate, zotero, cxserver and mathoid as they need more work [deployment-charts] - https://gerrit.wikimedia.org/r/528202 (https://phabricator.wikimedia.org/T228837)
[17:18:01] <bd808>	 Downtimed until 2019-09-02. Hopefully we will either fix the test or decide to turn them off entirely by then.
[17:18:25] <wikibugs>	 (PS3) Fsero: k8s, codfw: disabling quotas on eventgate, zotero, cxserver and mathoid as they need more work [deployment-charts] - https://gerrit.wikimedia.org/r/528202 (https://phabricator.wikimedia.org/T228837)
[17:19:17] <wikibugs>	 (PS4) Fsero: k8s, codfw: disabling quotas on some namespaces. [deployment-charts] - https://gerrit.wikimedia.org/r/528202 (https://phabricator.wikimedia.org/T228837)
[17:19:36] <wikibugs>	 (CR) Fsero: [V: +2 C: +2] k8s, codfw: disabling quotas on some namespaces. [deployment-charts] - https://gerrit.wikimedia.org/r/528202 (https://phabricator.wikimedia.org/T228837) (owner: Fsero)
[17:24:11] <cdanis>	 robh: can you update the topic in here re: clinic duty?
[17:24:24] <robh>	 is it you?
[17:24:34] <robh>	 i assume yes since you asked ;D
[17:24:42] <cdanis>	 no
[17:24:46] <cdanis>	 it is shdubsh
[17:24:50] <cdanis>	 I'm next week
[17:24:58] <herron>	 I think it was shdubsh 
[17:25:30] <wikibugs>	 (PS1) Fsero: bug: it expects and empty map [deployment-charts] - https://gerrit.wikimedia.org/r/528203
[17:25:39] <cdanis>	 robh: ^
[17:25:47] <wikibugs>	 (CR) Fsero: [V: +2 C: +2] bug: it expects and empty map [deployment-charts] - https://gerrit.wikimedia.org/r/528203 (owner: Fsero)
[17:25:51] <robh>	 =]
[17:26:12] <shdubsh>	 thanks robh!
[17:28:06] <logmsgbot>	 !log fsero@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero,name=codfw
[17:28:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:50] <wikibugs>	 (CR) RobH: [C: +2] adding jclark to shell and dc ops group [puppet] - https://gerrit.wikimedia.org/r/525847 (https://phabricator.wikimedia.org/T229124) (owner: RobH)
[17:30:59] <wikibugs>	 (PS3) RobH: adding jclark to shell and dc ops group [puppet] - https://gerrit.wikimedia.org/r/525847 (https://phabricator.wikimedia.org/T229124)
[17:31:41] <wikibugs>	 (PS1) Bstorm: toolschecker: Remove a lot of paging of all SRE [puppet] - https://gerrit.wikimedia.org/r/528204
[17:33:42] <jijiki>	 !log Pool restbase2009 - T227408
[17:33:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:51] <stashbot>	 T227408: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408
[17:36:07] <icinga-wm>	 RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 96.25 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[17:42:19] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "lgtm, anything to wake up fewer SREs (although this might wake up wmcs more... we'll see!)" [puppet] - https://gerrit.wikimedia.org/r/528204 (owner: Bstorm)
[17:46:02] <wikibugs>	 (PS1) RobH: updating notation for dc operations group [puppet] - https://gerrit.wikimedia.org/r/528206 (https://phabricator.wikimedia.org/T229124)
[17:58:08] <wikibugs>	 (CR) RobH: [C: +2] updating notation for dc operations group [puppet] - https://gerrit.wikimedia.org/r/528206 (https://phabricator.wikimedia.org/T229124) (owner: RobH)
[18:00:04] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T1800).
[18:00:04] <jouncebot>	 raynor, bpirkle, jdlrobson, and ebernhardson: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:09] <bpirkle>	 I'm here
[18:00:14] <Urbanecm>	 I can SWAT today!
[18:00:20] <raynor>	 o/
[18:00:55] <wikibugs>	 (CR) Urbanecm: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/528175 (https://phabricator.wikimedia.org/T227793) (owner: Pmiazga)
[18:01:34] <jdlrobson>	 \o
[18:01:40] <Urbanecm>	 hi jdlrobson
[18:01:54] <wikibugs>	 (Merged) jenkins-bot: Enable editor gender surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/528175 (https://phabricator.wikimedia.org/T227793) (owner: Pmiazga)
[18:02:24] <Urbanecm>	 raynor: Your patch is on mwdebug1002
[18:02:44] <raynor>	 Urbanecm, thx
[18:02:45] <wikibugs>	 (CR) jenkins-bot: Enable editor gender surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/528175 (https://phabricator.wikimedia.org/T227793) (owner: Pmiazga)
[18:03:18] <raynor>	 testing
[18:03:19] <wikibugs>	 (PS2) Urbanecm: Switch testwiki to use kask (only) for sessions [mediawiki-config] - https://gerrit.wikimedia.org/r/528130 (https://phabricator.wikimedia.org/T222099) (owner: BPirkle)
[18:03:25] <wikibugs>	 (CR) Urbanecm: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/528130 (https://phabricator.wikimedia.org/T222099) (owner: BPirkle)
[18:03:34] <Urbanecm>	 thanks raynor
[18:03:44] <Urbanecm>	 bpirkle: +2'ed your patch, will ping you once it's on mwdebug1002
[18:04:40] <wikibugs>	 (Merged) jenkins-bot: Switch testwiki to use kask (only) for sessions [mediawiki-config] - https://gerrit.wikimedia.org/r/528130 (https://phabricator.wikimedia.org/T222099) (owner: BPirkle)
[18:04:44] <Urbanecm>	 ebernhardson: Around?
[18:04:57] <wikibugs>	 (CR) jenkins-bot: Switch testwiki to use kask (only) for sessions [mediawiki-config] - https://gerrit.wikimedia.org/r/528130 (https://phabricator.wikimedia.org/T222099) (owner: BPirkle)
[18:05:04] <Urbanecm>	 bpirkle: your patch is on mwdebug1002
[18:05:49] <bpirkle>	 Urbanecm: thank you. Good to go.
[18:05:52] <wikibugs>	 (PS5) Urbanecm: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: Isarra)
[18:05:57] <raynor>	 Urbanecm - it works preprerply, please sync prod
[18:06:01] <raynor>	 properly*
[18:06:18] <Urbanecm>	 syncing
[18:06:34] <onimisionipe>	 !log reinit postgres on maps1001 - T229788
[18:06:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:43] <stashbot>	 T229788: postgresql replication issues on maps1001 - https://phabricator.wikimedia.org/T229788
[18:07:31] <ebernhardson>	 Urbanecm: ya
[18:07:32] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: e44a6e6: Enable editor gender surveys (T227793) (duration: 00m 48s)
[18:07:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:41] <stashbot>	 T227793: First round editor gender surveys - https://phabricator.wikimedia.org/T227793
[18:07:54] <Urbanecm>	 ebernhardson: Hi, I've +2'ed your backport, will ping you once it's ready to be tested
[18:08:00] <ebernhardson>	 Urbanecm: k
[18:09:10] <wikibugs>	 (CR) Urbanecm: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: Isarra)
[18:09:39] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 254ecc1: Switch testwiki to use kask (only) for sessions (T222099) (duration: 00m 48s)
[18:09:46] <Urbanecm>	 bpirkle, synced
[18:09:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:48] <stashbot>	 T222099: Staging release of RESTBagOStuff using Kask - https://phabricator.wikimedia.org/T222099
[18:10:39] <bpirkle>	 Urbanecm: thank you
[18:10:45] <Urbanecm>	 happy to help!
[18:11:14] <wikibugs>	 (Merged) jenkins-bot: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: Isarra)
[18:11:29] <wikibugs>	 (CR) jenkins-bot: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: Isarra)
[18:11:58] <Urbanecm>	 jdlrobson: your patch is on mwdebug1002, please test
[18:12:55] <jdlrobson>	 on it
[18:13:26] <Isarra>	 <3
[18:13:26] <Urbanecm>	 ebernhardson: Your patch is on mwdebug1002, please test
[18:17:09] <Isarra>	 Seems to work! Although why are we getting a flash of extra footer at first what the crap.
[18:17:15] <Isarra>	 Totally unrelated, though.
[18:17:40] <jdlrobson>	 Urbanecm +1 to @isarra that works! Please sync :)
[18:17:54] <Urbanecm>	 syncing!
[18:18:29] <raynor>	 Urbanecm, thx for deploying the genders survey and related-articles patches \o/
[18:18:43] <Urbanecm>	 happy to help!
[18:19:27] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: a9e4ed8: Remove related-articles-footer-blacklisted-skins.dblist (T229644, 1/3) (duration: 00m 49s)
[18:19:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:35] <stashbot>	 T229644: RelatedArticles showing on all German and Russian Wikipedia due to incorrect configuration settings - https://phabricator.wikimedia.org/T229644
[18:20:29] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: a9e4ed8: Remove related-articles-footer-blacklisted-skins.dblist (T229644, 2/3) (duration: 00m 47s)
[18:20:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:28] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized dblists/: SWAT: a9e4ed8: Remove related-articles-footer-blacklisted-skins.dblist (T229644, 3/3) (duration: 00m 46s)
[18:21:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:36] <Urbanecm>	 jdlrobson: synced
[18:21:42] <wikibugs>	 (PS1) Ppchelko: Switch updateBetaFeaturesUserCounts job to eventgate. [mediawiki-config] - https://gerrit.wikimedia.org/r/528209 (https://phabricator.wikimedia.org/T228705)
[18:21:42] <Urbanecm>	 ebernhardson: Ping?
[18:22:37] <Isarra>	 jdlrobson, Urbanecm: Thanks for handling this!
[18:22:44] <Urbanecm>	 happy to help!
[18:24:26] <jdlrobson>	 yeh thanks Urbanecm :)
[18:24:34] <Urbanecm>	 yw
[18:25:39] <Urbanecm>	 !log Deployed patch for T207094
[18:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:29] <ebernhardson>	 Urbanecm: sorry, debugging things in multiple places :)
[18:27:40] <ebernhardson>	 Urbanecm: it generally looks to work
[18:27:46] <Urbanecm>	 np, thanks
[18:27:48] <Urbanecm>	 syncing
[18:29:20] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/WikimediaEvents/: SWAT: 3ee0e84: Temporarily log search to two schemas (duration: 00m 47s)
[18:29:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:28] <Urbanecm>	 ebernhardson: Synced!
[18:32:22] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/AbuseFilter/: SWAT: 936a462: Better handling of DNONE (T214674, T228677) (duration: 00m 47s)
[18:32:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:32] <stashbot>	 T228677: use of get_matches function returns "Requesting array item of non-array" - https://phabricator.wikimedia.org/T228677
[18:32:33] <stashbot>	 T214674: Short circuit fails with assignments - https://phabricator.wikimedia.org/T214674
[18:35:26] <ebernhardson>	 Urbanecm: thanks
[18:35:34] <Urbanecm>	 yw ebernhardson
[18:39:44] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/AbuseFilter/: SWAT: d358f17: Revert "Better handling of DNONE" (T214674, T228677) (duration: 00m 47s)
[18:39:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:54] <stashbot>	 T228677: use of get_matches function returns "Requesting array item of non-array" - https://phabricator.wikimedia.org/T228677
[18:39:54] <stashbot>	 T214674: Short circuit fails with assignments - https://phabricator.wikimedia.org/T214674
[18:41:03] <wikibugs>	 (PS1) BBlack: cloudelastic: Fix LVS IPv6 address [puppet] - https://gerrit.wikimedia.org/r/528215 (https://phabricator.wikimedia.org/T224324)
[18:41:06] <wikibugs>	 (PS1) BBlack: cloudelastic: Fix LVS IPv6 address [dns] - https://gerrit.wikimedia.org/r/528216 (https://phabricator.wikimedia.org/T224324)
[18:44:26] <Urbanecm>	 !log Morning SWAT done
[18:44:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:49] <icinga-wm>	 PROBLEM - Disk space on wdqs1005 is CRITICAL: DISK CRITICAL - free space: /srv 53079 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs1005&var-datasource=eqiad+prometheus/ops
[18:47:34] <gehel>	 SMalyshev: ∆ disk space above, any idea ?
[18:58:31] <icinga-wm>	 RECOVERY - toolschecker: kubernetes webservice running #page on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 32.216 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[19:10:10] <wikibugs>	 (Abandoned) Dzahn: mediawiki::php::restarts: try to avoid including LVS but still get pools [puppet] - https://gerrit.wikimedia.org/r/527285 (owner: Dzahn)
[19:16:35] <wikibugs>	 (CR) Dzahn: [C: +2] gerrit: replication replicateOnStartup [puppet] - https://gerrit.wikimedia.org/r/528181 (https://phabricator.wikimedia.org/T229756) (owner: Thcipriani)
[19:16:42] <wikibugs>	 (PS2) Dzahn: gerrit: replication replicateOnStartup [puppet] - https://gerrit.wikimedia.org/r/528181 (https://phabricator.wikimedia.org/T229756) (owner: Thcipriani)
[19:17:44] <wikibugs>	 (PS2) Paladox: gerrit: Re-enable the use of HTTP auth tokens [puppet] - https://gerrit.wikimedia.org/r/527596 (https://phabricator.wikimedia.org/T225308)
[19:20:42] <thcipriani>	 mutante: thanks for the merge!
[19:21:05] <wikibugs>	 (CR) Cwhite: [C: -1] "Comments inline." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/528142 (owner: Filippo Giunchedi)
[19:24:18] <mutante>	 thcipriani: yw. now actually applied on cobalt but did not restart
[19:26:25] <thcipriani>	 mutante: great, thanks, I'll give gerrit a restart (on gerrit2001 now as well :))
[19:26:40] <wikibugs>	 (PS6) Dzahn: Gerrit: Rename gerrit-slave to gerrit-replica [puppet] - https://gerrit.wikimedia.org/r/527656 (https://phabricator.wikimedia.org/T229822) (owner: Paladox)
[19:26:56] <mutante>	 thcipriani: we can add this ^
[19:27:07] <mutante>	 looks like DNS and acme is already done
[19:27:17] <thcipriani>	 mutante: sure, let's get that in as well
[19:27:44] <wikibugs>	 (CR) Dzahn: [C: +2] Gerrit: Rename gerrit-slave to gerrit-replica [puppet] - https://gerrit.wikimedia.org/r/527656 (https://phabricator.wikimedia.org/T229822) (owner: Paladox)
[19:28:49] <wikibugs>	 (PS1) Andrew Bogott: nova: update scheduler pool [puppet] - https://gerrit.wikimedia.org/r/528231 (https://phabricator.wikimedia.org/T216195)
[19:29:29] <paladox>	 thcipriani we can now use gerrit2001 to accurately test config changes/plugin updates :D (since we will now be able to verify if it'll take gerrit down)
[19:29:32] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova: update scheduler pool [puppet] - https://gerrit.wikimedia.org/r/528231 (https://phabricator.wikimedia.org/T216195) (owner: Andrew Bogott)
[19:29:36] <mutante>	 paladox: https://gerrit-replica.wikimedia.org/r/
[19:29:43] <paladox>	 yup
[19:29:43] <mutante>	 thcipriani: +    ServerAlias gerrit-replica.wikimedia.org
[19:29:51] <paladox>	 \o/
[19:31:01] <mutante>	  git clone https://gerrit-replica.wikimedia.org/r/operations/puppet Cloning into 'puppet'...
[19:31:11] <wikibugs>	 (CR) BBlack: [C: +2] cloudelastic: Fix LVS IPv6 address [dns] - https://gerrit.wikimedia.org/r/528216 (https://phabricator.wikimedia.org/T224324) (owner: BBlack)
[19:31:44] <wikibugs>	 (PS2) BBlack: cloudelastic: Fix LVS IPv6 address [puppet] - https://gerrit.wikimedia.org/r/528215 (https://phabricator.wikimedia.org/T224324)
[19:31:53] <thcipriani>	 mutante: nice :)
[19:32:07] <wikibugs>	 (CR) BBlack: [C: +2] cloudelastic: Fix LVS IPv6 address [puppet] - https://gerrit.wikimedia.org/r/528215 (https://phabricator.wikimedia.org/T224324) (owner: BBlack)
[19:32:44] <thcipriani>	 mutante: ok, so are we ready for a restart?
[19:32:54] <mutante>	 thcipriani: yes, we are
[19:33:25] * thcipriani does
[19:33:42] <thcipriani>	 !log gerrit restart for gerrit-replica on gerrit2001
[19:33:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:34:04] <bblack>	 !log fixing up cloudelastic LVS IPv6 stuff on lvs1014, lvs1016, cloudelastic* - possible monitoring noise
[19:34:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:31] <thcipriani>	 alright, gerrit2001 looks ok, cobalt incoming
[19:35:49] <thcipriani>	 !log gerrit restart on cobalt for configuration updates
[19:35:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:14] <mutante>	 paladox / thcipriani: cloning puppet repo from gerrit: real time: 24 seconds. cloning from gerrit-replica: real time: 34s .. oops?
[19:37:27] <icinga-wm>	 PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[19:37:45] <bblack>	 gerrit's unavailable, expected I assume
[19:37:56] <bblack>	 (getting 503s on code review pages in gerrit.wikimedia.org)
[19:38:08] <bblack>	 seems back now!
[19:38:12] <mutante>	 bblack: yes, expected. it's restarting as we speak
[19:38:23] <mutante>	 already back for me
[19:38:26] <thcipriani>	 yep, should be back now
[19:39:22] <wikibugs>	 (CR) CDanis: [C: +1] gerrit: Re-enable the use of HTTP auth tokens [puppet] - https://gerrit.wikimedia.org/r/527596 (https://phabricator.wikimedia.org/T225308) (owner: Paladox)
[19:39:50] <mutante>	 paladox: can't reproduce the same time .. next attempt it's just 24s 
[19:40:01] <icinga-wm>	 PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2],Exec[git_pull_analytics.wikimedia.org] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[19:40:07] <paladox>	 mutante it's very slow for me over the Atlantic :P
[19:40:12] <cdanis>	 silly question
[19:40:16] <cdanis>	 gerrit2001 has more RAM, right?
[19:40:19] <mutante>	 paladox: gerrit-replica.esams.wikimedia.org :P
[19:40:24] <paladox>	 cdanis yup
[19:40:26] <mutante>	 yea
[19:40:29] <paladox>	 and cpu power
[19:40:32] <cdanis>	 any thought to making it the primary? 🙃
[19:40:48] <paladox>	 the db is read only cdanis 
[19:40:55] <paladox>	 (the proxy in codfw)
[19:41:01] <mutante>	 i'd just try to unblock the "switch eqiad to a 64GB machine"
[19:41:06] <cdanis>	 fair enough
[19:41:20] <mutante>	 actually, i will look at just that now.. hmm
[19:41:41] <thcipriani>	 paladox: and now we see how long updating all repos on the replica takes to clear the queue :)
[19:42:11] <paladox>	 :D
[19:43:17] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[19:43:19] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:44:02] <bblack>	 that looks real
[19:44:07] <bblack>	 (esams problems)
[19:44:14] <paladox>	 mutante it took me 6m5.376s
[19:46:35] <mutante>	 paladox: i will answer your question from another channel here:   i just confirmed cobalt and gerrit2001 have same NIC speed. 1000MB/s
[19:46:39] <mutante>	 Mb/s
[19:46:43] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[19:46:45] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:46:50] <paladox>	 ok, thanks
[19:47:57] <bblack>	 it's a link flap, appears over now - https://phabricator.wikimedia.org/T205609
[19:49:00] <bblack>	 heh wrong task
[19:49:54] <bblack>	 that one: https://phabricator.wikimedia.org/T228827
[19:49:54] <paladox>	 show-queue shows github && gerrit2001 are being updated
[19:49:55] <mutante>	 paladox: i wonder how long it takes you to use the mirror in wmflabs "OK, I set up a Gerrit mirror today. Clone URLs are https://ggmirror.wmflabs.org/git/<gerrit name>.git. And https://ggmirror.wmflabs.org/cgit/ as a web view/debugger."
[19:50:02] <paladox>	 oh
[19:50:04] * paladox tries
[19:50:39] <paladox>	 that's much faster in the cloud
[19:50:40] <paladox>	 0m11.568s
[19:51:12] <gehel>	 !log depool wdqs1005 - T229876
[19:51:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:22] <stashbot>	 T229876: blazegraph journal on wdqs1005 has doubled in space - https://phabricator.wikimedia.org/T229876
[19:51:51] <icinga-wm>	 ACKNOWLEDGEMENT - Disk space on wdqs1005 is CRITICAL: DISK CRITICAL - free space: /srv 46842 MB (3% inode=99%): Gehel tracked in https://phabricator.wikimedia.org/T229876 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs1005&var-datasource=eqiad+prometheus/ops
[19:52:24] <paladox>	 cobalt is going at ~6MiB/s and gerrit2001 is at ~200 KiB/s for me
[19:53:59] <mutante>	 bblack: checked whether CenturyLink had maintenance announced .. nope.. not this time
[19:54:03] <icinga-wm>	 PROBLEM - puppet last run on dns5001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[19:54:17] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[19:54:41] <mutante>	 paladox: lol, that's such a huge difference. can you repeat that a couple times? is the time different each time? 
[19:54:59] <mutante>	 and i mean after deleting everything ..so always from scratch
[19:55:19] <paladox>	 mutante it's the same each time from what i can tell (gerrit2001 being in the KiB/s range)
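The from-scratch comparison mutante asks for above can be sketched roughly as follows. This is only an illustration, not something anyone in the channel ran: the helper names `git_clone` and `time_fresh_clones` are hypothetical, and the only assumption about the environment is that a `git` binary is on the PATH. Each run clones into a brand-new temporary directory that is deleted afterwards, so no run can reuse objects from an earlier one.

```python
import shutil
import subprocess
import tempfile
import time
from pathlib import Path


def git_clone(url, dest):
    """Clone url into dest with the git CLI (hypothetical helper)."""
    subprocess.run(["git", "clone", "--quiet", url, str(dest)], check=True)


def time_fresh_clones(url, runs=3, clone=git_clone):
    """Return the wall-clock seconds of each clone into a new, empty directory."""
    timings = []
    for _ in range(runs):
        # A fresh scratch directory per run, so every clone starts from nothing.
        workdir = Path(tempfile.mkdtemp(prefix="clone-timing-"))
        try:
            start = time.perf_counter()
            clone(url, workdir / "repo")
            timings.append(time.perf_counter() - start)
        finally:
            # Delete everything before the next attempt.
            shutil.rmtree(workdir, ignore_errors=True)
    return timings
```

Calling `time_fresh_clones("https://gerrit-replica.wikimedia.org/r/operations/puppet")` and then the same for the primary would give one list of timings per host; the `clone` parameter exists only so the timing loop can be exercised without network access.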
[19:55:21] <icinga-wm>	 PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[19:55:29] <mutante>	 first time i see a single "Widespread puppet agent failures" by itself, nice
[19:55:45] <herron>	 yeah!
[19:55:57] <mutante>	 but also we are trained already to just be alarmed if it scrolls :p
[19:56:01] <mutante>	 hehe
[19:56:08] <herron>	 I think it’s the first time it’s alerted.  old ones haven’t been squelched yet though
[19:57:06] <mutante>	 herron: i wonder if it should link to puppetboard instead of grafana
[19:57:21] <mutante>	 when this happens we want the server names rather than a number, right
[19:57:31] <mutante>	 also does https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1  look like widespread to you?
[19:57:45] <bd808>	 the forced perspective of 0%-100% on those graphs is ... not very helpful
[19:57:56] <mutante>	 swift .. hmm.. i see
[19:58:27] <herron>	 yes although does puppet board have details when compilation fails?  iirc data is sent to puppetdb after the catalog is compiled, so you might not see it in puppetboard?
[19:58:28] <mutante>	 i want to click something on that dashboard to get to the list of actual servers
[19:59:17] <mutante>	 herron: https://puppetboard.wikimedia.org/nodes?status=failed
[19:59:39] <mutante>	 i think i would like that as the link target
[20:00:04] <jouncebot>	 cscott, arlolra, subbu, bearND, halfak, and accraze: That opportune time is upon us again. Time for a Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T2000).
[20:00:04] <herron>	 nice, yes!
[20:00:20] <herron>	 probably good to include both the dashboard and the puppetboard
[20:00:59] <mutante>	 yea, and that should be possible to add multiple links with grafana checks
[20:01:19] <mutante>	 swift = 1 host = 1 disk always dies because ..scale
[20:01:32] <logmsgbot>	 !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@e774a05]: Update mobileapps to c713c2e
[20:01:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:08] <mutante>	 herron: more things we can improve. if a check is already "handled" (ACKed) from Icinga's point of view then it should ignore that in the overall "widespread issues" check
[20:04:23] <mutante>	 like ms-be1040 in this case.. was already known that it has hardware issue
[20:04:30] <mutante>	 yet it pushed the check over the edge
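The improvement mutante describes above, leaving already-acknowledged hosts out of the aggregate count, can be illustrated with a small sketch. It is not the actual Icinga or Grafana check: `HostState`, `widespread_failure`, the threshold value and the host list in the usage note are all hypothetical, chosen only to show how one ACKed host can push a count over the edge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostState:
    name: str
    failed: bool
    acknowledged: bool = False  # already "handled" (ACKed) by an operator


def widespread_failure(hosts, threshold, ignore_acked=True):
    """Return (alerting, counted_host_names) for the aggregate check.

    Failed hosts that are acknowledged are skipped when ignore_acked is True.
    """
    counted = [
        h.name
        for h in hosts
        if h.failed and not (ignore_acked and h.acknowledged)
    ]
    return len(counted) >= threshold, counted
```

With a hypothetical threshold of 4 and four failing hosts of which one (say ms-be1040) is acknowledged, `ignore_acked=True` counts three hosts and stays quiet, while `ignore_acked=False` counts four and alerts.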
[20:05:01] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: m3 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[20:06:07] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 354.70 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[20:06:23] <logmsgbot>	 !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@e774a05]: Update mobileapps to c713c2e (duration: 04m 51s)
[20:06:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:30] <wikibugs>	 (PS2) Bstorm: toolschecker: Remove a lot of paging of all SRE [puppet] - https://gerrit.wikimedia.org/r/528204
[20:09:27] <wikibugs>	 (PS3) Bstorm: toolschecker: Remove a lot of paging of all SRE [puppet] - https://gerrit.wikimedia.org/r/528204
[20:09:31] <icinga-wm>	 RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:10:01] <wikibugs>	 (PS1) CDanis: Release 1.1.4 [software/conftool] - https://gerrit.wikimedia.org/r/528251
[20:10:44] <cdanis>	 whoa nice widespread puppet failure alert
[20:10:47] <cdanis>	 that's awesome
[20:11:29] <herron>	 mutante: I added an iframe for the puppetboard failed list to https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[20:11:35] <wikibugs>	 (CR) Jhedden: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/528204 (owner: Bstorm)
[20:11:36] <herron>	 should save the need to have two links in the alert
[20:12:02] <wikibugs>	 (PS4) Bstorm: toolschecker: Remove a lot of paging of all SRE [puppet] - https://gerrit.wikimedia.org/r/528204
[20:12:35] <icinga-wm>	 RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:15:08] <mutante>	 herron: wow, nice. i did not expect i would like iframes that much
[20:15:28] <herron>	 haha yeah me either
[20:15:59] <wikibugs>	 (PS5) Bstorm: toolschecker: Remove a lot of paging of all SRE [puppet] - https://gerrit.wikimedia.org/r/528204
[20:16:53] <wikibugs>	 (CR) Bstorm: "Ok, I'm done now.  I've pinned everything to wmcs/paws groups.  When done, I'd like to test it by doing something like stopping one of the" [puppet] - https://gerrit.wikimedia.org/r/528204 (owner: Bstorm)
[20:17:01] <wikibugs>	 (PS2) CDanis: Release 1.1.4 [software/conftool] - https://gerrit.wikimedia.org/r/528251
[20:17:03] <wikibugs>	 (PS1) CDanis: debian: Release 1.1.4-1 [software/conftool] - https://gerrit.wikimedia.org/r/528257
[20:19:43] <logmsgbot>	 !log arlolra@deploy1001 Started deploy [parsoid/deploy@d3a2937]: Updating Parsoid to 7232dff
[20:19:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:03] <wikibugs>	 (CR) CDanis: [C: +2] Release 1.1.4 [software/conftool] - https://gerrit.wikimedia.org/r/528251 (owner: CDanis)
[20:21:36] <wikibugs>	 (PS6) Bstorm: toolschecker: Remove a lot of paging of all SRE [puppet] - https://gerrit.wikimedia.org/r/528204
[20:21:49] <icinga-wm>	 RECOVERY - puppet last run on dns5001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:23:05] <icinga-wm>	 RECOVERY - puppet last run on mw1258 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:23:21] <wikibugs>	 (CR) Bstorm: [C: +2] toolschecker: Remove a lot of paging of all SRE [puppet] - https://gerrit.wikimedia.org/r/528204 (owner: Bstorm)
[20:23:49] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[20:25:00] <wikibugs>	 (Merged) jenkins-bot: Release 1.1.4 [software/conftool] - https://gerrit.wikimedia.org/r/528251 (owner: CDanis)
[20:25:13] <wikibugs>	 (CR) CDanis: "recheck" [software/conftool] - https://gerrit.wikimedia.org/r/528257 (owner: CDanis)
[20:25:40] <wikibugs>	 (PS2) MSantos: First version of the wikifeeds chart [deployment-charts] - https://gerrit.wikimedia.org/r/526679 (https://phabricator.wikimedia.org/T229287)
[20:27:13] <wikibugs>	 (PS1) Thcipriani: gerrit: do not treat github as a mirror [puppet] - https://gerrit.wikimedia.org/r/528259
[20:27:51] <wikibugs>	 (CR) CDanis: [C: +2] debian: Release 1.1.4-1 [software/conftool] - https://gerrit.wikimedia.org/r/528257 (owner: CDanis)
[20:28:46] <logmsgbot>	 !log arlolra@deploy1001 Finished deploy [parsoid/deploy@d3a2937]: Updating Parsoid to 7232dff (duration: 09m 02s)
[20:28:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:33] <wikibugs>	 (Merged) jenkins-bot: debian: Release 1.1.4-1 [software/conftool] - https://gerrit.wikimedia.org/r/528257 (owner: CDanis)
[20:32:14] <wikibugs>	 (CR) Paladox: [C: +1] "Wouldn't this mean branches that are deleted won't be deleted on the mirror?" [puppet] - https://gerrit.wikimedia.org/r/528259 (owner: Thcipriani)
[20:34:08] <arlolra>	 !log Updated Parsoid to 7232dff (T228223)
[20:34:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:16] <stashbot>	 T228223: tokensToString being called on KV->V looks like a signature mismatch - https://phabricator.wikimedia.org/T228223
[20:34:54] <wikibugs>	 (PS1) EBernhardson: Repoint cloudelastic at LB dns [mediawiki-config] - https://gerrit.wikimedia.org/r/528260 (https://phabricator.wikimedia.org/T220625)
[20:39:17] <wikibugs>	 (CR) Dzahn: [C: +2] gerrit: do not treat github as a mirror [puppet] - https://gerrit.wikimedia.org/r/528259 (owner: Thcipriani)
[20:41:27] <wikibugs>	 (PS1) Paladox: Gerrit: Set replicatePermissions to false for GitHub remote [puppet] - https://gerrit.wikimedia.org/r/528262
[20:41:46] <wikibugs>	 (PS2) Paladox: Gerrit: Set replicatePermissions to false for GitHub remote [puppet] - https://gerrit.wikimedia.org/r/528262
[20:42:41] <wikibugs>	 (PS3) Dzahn: Gerrit: Set replicatePermissions to false for GitHub remote [puppet] - https://gerrit.wikimedia.org/r/528262 (owner: Paladox)
[20:44:01] <wikibugs>	 (PS2) EBernhardson: Repoint cloudelastic at LB dns [mediawiki-config] - https://gerrit.wikimedia.org/r/528260 (https://phabricator.wikimedia.org/T220625)
[20:44:03] <wikibugs>	 (PS1) EBernhardson: Temporarily stop writing to cloudelastic [mediawiki-config] - https://gerrit.wikimedia.org/r/528263 (https://phabricator.wikimedia.org/T220625)
[20:44:57] <wikibugs>	 (CR) jerkins-bot: [V: -1] Temporarily stop writing to cloudelastic [mediawiki-config] - https://gerrit.wikimedia.org/r/528263 (https://phabricator.wikimedia.org/T220625) (owner: EBernhardson)
[20:45:04] <wikibugs>	 (CR) jerkins-bot: [V: -1] Repoint cloudelastic at LB dns [mediawiki-config] - https://gerrit.wikimedia.org/r/528260 (https://phabricator.wikimedia.org/T220625) (owner: EBernhardson)
[20:45:39] <icinga-wm>	 PROBLEM - puppet last run on an-worker1091 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:47:27] <wikibugs>	 (CR) Dzahn: [C: +1] Gerrit: Set replicatePermissions to false for GitHub remote [puppet] - https://gerrit.wikimedia.org/r/528262 (owner: Paladox)
[20:49:14] <ebernhardson>	 !log nuke all search indices on cloudelastic preparing for fresh imports and live updates T220625
[20:49:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:23] <stashbot>	 T220625: Initialize CirrusSearch on cloudelastic - https://phabricator.wikimedia.org/T220625
[20:56:00] <wikibugs>	 (CR) Thcipriani: [C: +1] Gerrit: Set replicatePermissions to false for GitHub remote [puppet] - https://gerrit.wikimedia.org/r/528262 (owner: Paladox)
[20:56:59] <wikibugs>	 (CR) Dzahn: [C: +2] Gerrit: Set replicatePermissions to false for GitHub remote [puppet] - https://gerrit.wikimedia.org/r/528262 (owner: Paladox)
[21:00:04] <jouncebot>	 Reedy and sbassett: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T2100).
[21:12:19] <icinga-wm>	 RECOVERY - puppet last run on an-worker1091 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:13:47] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 163 threshold =0.15 breach: timed_out: False, initializing_shards: 0, number_of_data_nodes: 4, active_primary_shards: 179, cluster_name: cloudelastic-chi-eqiad, number_of_pending_tasks: 0, status: red, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, unassigned_shards: 163, task_max_waiting_
[21:13:47] <icinga-wm>	 0, active_shards_percent_as_number: 73.36601307189542, relocating_shards: 0, number_of_nodes: 4, active_shards: 449 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:47] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 163 threshold =0.15 breach: active_primary_shards: 179, delayed_unassigned_shards: 0, number_of_data_nodes: 4, timed_out: False, unassigned_shards: 163, number_of_pending_tasks: 0, relocating_shards: 0, task_max_waiting_in_queue_millis: 0, initializing_shards: 0, number_of_nodes: 4, cluster_name: cloud
[21:13:47] <icinga-wm>	 , active_shards_percent_as_number: 73.36601307189542, status: red, active_shards: 449, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:14:19] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 165 threshold =0.15 breach: delayed_unassigned_shards: 0, number_of_nodes: 4, unassigned_shards: 165, number_of_pending_tasks: 0, active_shards_percent_as_number: 73.5576923076923, timed_out: False, number_of_in_flight_fetch: 0, initializing_shards: 0, active_primary_shards: 183, status: red, task_max_
[21:14:19] <icinga-wm>	 millis: 0, cluster_name: cloudelastic-chi-eqiad, number_of_data_nodes: 4, relocating_shards: 0, active_shards: 459 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:14:19] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 165 threshold =0.15 breach: delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, number_of_pending_tasks: 0, status: red, active_shards: 459, unassigned_shards: 165, number_of_data_nodes: 4, timed_out: False, cluster_name: cloudelastic-chi-eqiad, relocating_shards: 0, number_of_in_flight_
[21:14:20] <icinga-wm>	 primary_shards: 183, number_of_nodes: 4, initializing_shards: 0, active_shards_percent_as_number: 73.5576923076923 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:15:35] <ebernhardson>	 cloudelastic problems expected, i'm reinitializing it
[21:16:40] <wikibugs>	 (CR) BryanDavis: docker: add support for "stable" and "testing" tags (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T229058) (owner: Bstorm)
[21:18:02] <wikibugs>	 (Abandoned) EBernhardson: Temporarily stop writing to cloudelastic [mediawiki-config] - https://gerrit.wikimedia.org/r/528263 (https://phabricator.wikimedia.org/T220625) (owner: EBernhardson)
[21:18:34] <wikibugs>	 (PS3) EBernhardson: Repoint cloudelastic at LB dns [mediawiki-config] - https://gerrit.wikimedia.org/r/528260 (https://phabricator.wikimedia.org/T220625)
[21:22:25] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1004 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, active_shards_percent_as_number: 85.38913362701909, number_of_pending_tasks: 0, active_shards: 1163, unassigned_shards: 199, status: red, initializing_shards: 0, active_primary_shards: 423, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, number_of_d
[21:22:25] <icinga-wm>	 k_max_waiting_in_queue_millis: 0, number_of_nodes: 4, timed_out: False, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:22:25] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: active_shards_percent_as_number: 85.38913362701909, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_in_flight_fetch: 0, unassigned_shards: 199, number_of_data_nodes: 4, initializing_shards: 0, active_shards: 11
[21:22:25] <icinga-wm>	 ing_in_queue_millis: 0, active_primary_shards: 423, number_of_nodes: 4, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:22:41] <ebernhardson>	 !log start importing group0 to cloudelastic from mwmaint1002
[21:22:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:24:37] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, delayed_unassigned_shards: 0, unassigned_shards: 216, initializing_shards: 0, number_of_in_flight_fetch: 0, relocating_shards: 0, timed_out: False, number_of_nodes: 4, active_shards: 1260, active_shards_percent_as_number: 85.36585365853658, number_of_data_nodes
[21:24:37] <icinga-wm>	 ting_in_queue_millis: 0, active_primary_shards: 458, status: red, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:24:39] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1003 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: task_max_waiting_in_queue_millis: 0, number_of_in_flight_fetch: 0, delayed_unassigned_shards: 0, number_of_data_nodes: 4, active_shards: 1260, cluster_name: cloudelastic-chi-eqiad, initializing_shards: 0, relocating_shards: 0, number_of_nodes: 4, number_of_pending_tasks: 0, active_shards_percent_as
[21:24:39] <icinga-wm>	 365853658, unassigned_shards: 216, timed_out: False, status: red, active_primary_shards: 458 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:24:45] <cdanis>	 !log ✔️ cdanis@install1002.wikimedia.org ~ 🕠 sudo -E reprepro -C main include stretch-wikimedia conftool-1.1.4-1/conftool_1.1.4-1_amd64.changes 
[21:24:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:06] <cdanis>	 !log ✔️ cdanis@install1002.wikimedia.org ~ 🕠🍺 sudo -E reprepro -C main include buster-wikimedia conftool-1.1.4-1/conftool_1.1.4-1+deb10u1_amd64.changes 
[21:25:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:17] <cdanis>	 !log ✔️ cdanis@install1002.wikimedia.org ~ 🕠🍺 sudo -E reprepro -C main include jessie-wikimedia conftool-1.1.4-1/conftool_1.1.4-1+deb8u1_amd64.changes
[21:25:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:35] <wikibugs>	 (PS1) Herron: kafka-main: replace kafka1001 hardware with kafka-main1001 [puppet] - https://gerrit.wikimedia.org/r/528271 (https://phabricator.wikimedia.org/T225005)
[21:27:08] <cdanis>	 !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕠🍺 sudo debdeploy deploy -u 2019-08-05-conftool.yaml -s mw-canary 
[21:27:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:12] <mutante>	 !log 
[21:28:12] <stashbot>	 mutante: Message missing. Nothing logged.
[21:28:43] <mutante>	 !log 🔔 scandium - re-enabled icinga notifications for various services
[21:28:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:05] <cdanis>	 !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕠🍺 sudo cumin -p99 -b100 'A:all' 'apt-get update'                
[21:29:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:52] <wikibugs>	 (CR) Bstorm: docker: add support for "stable" and "testing" tags (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T229058) (owner: Bstorm)
[21:35:34] <cdanis>	 !log ❌cdanis@cumin1001.eqiad.wmnet ~ 🕠🍺 sudo debdeploy deploy -u 2019-08-05-conftool.yaml -s eqsin    
[21:35:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:09] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: m3 on db2065 is OK: OK slave_sql_lag Replication lag: 23.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[21:39:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 20.87 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[21:39:25] <cdanis>	 !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕠🍺 sudo debdeploy deploy -u 2019-08-05-conftool.yaml -s all  
[21:39:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:35] <wikibugs>	 (PS1) Herron: calico: add all kafka-main hosts to k8s eventgate policy [puppet] - https://gerrit.wikimedia.org/r/528275 (https://phabricator.wikimedia.org/T225005)
[21:40:45] <icinga-wm>	 PROBLEM - puppet last run on cloudstore1009 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:42:31] <wikibugs>	 (PS1) Thcipriani: gerrit: replication: exclude some projects [puppet] - https://gerrit.wikimedia.org/r/528276
[21:44:31] <wikibugs>	 (CR) Cwhite: [C: +1] kubernetes: expand alert description [puppet] - https://gerrit.wikimedia.org/r/528143 (https://phabricator.wikimedia.org/T229262) (owner: Filippo Giunchedi)
[21:44:59] <wikibugs>	 (CR) Paladox: [C: +1] "looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/528276 (owner: Thcipriani)
[21:55:28] <papaul>	 !log powering down wtp2011 for BIOS upgrade
[21:55:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:23] <icinga-wm>	 PROBLEM - Host wtp2011 is DOWN: PING CRITICAL - Packet loss = 100%
[22:06:19] <icinga-wm>	 RECOVERY - Host wtp2011 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms
[22:08:39] <icinga-wm>	 RECOVERY - puppet last run on cloudstore1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[22:37:02] <wikibugs>	 (PS1) Viztor: Add fonts-noto-cjk-extra, replace fonts-noto-cjk [puppet] - https://gerrit.wikimedia.org/r/528279 (https://phabricator.wikimedia.org/T226633)
[22:45:05] <wikibugs>	 (PS1) Papaul: DNS: Add mgmt and production DNS for db2131 [dns] - https://gerrit.wikimedia.org/r/528281
[22:45:58] <wikibugs>	 (PS5) CRusnov: netbox: Fix additional swift parameters [puppet] - https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182)
[22:47:27] <wikibugs>	 (CR) CRusnov: netbox: Fix additional swift parameters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: CRusnov)
[22:47:34] <wikibugs>	 (CR) CRusnov: [C: +2] netbox: Fix additional swift parameters [puppet] - https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: CRusnov)
[22:49:40] <wikibugs>	 (PS2) Dzahn: DNS: Add mgmt and production DNS for db2131 [dns] - https://gerrit.wikimedia.org/r/528281 (owner: Papaul)
[22:51:17] <wikibugs>	 (CR) Dzahn: [C: +2] DNS: Add mgmt and production DNS for db2131 [dns] - https://gerrit.wikimedia.org/r/528281 (owner: Papaul)
[22:53:58] <wikibugs>	 (PS1) Dzahn: mediawiki:maintenance: switch readinglists cron to PHP 7.2 [puppet] - https://gerrit.wikimedia.org/r/528282 (https://phabricator.wikimedia.org/T195392)
[22:55:15] <icinga-wm>	 PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[22:55:25] <chaomodus>	 ^ yah i'm looking at it
[22:55:29] <chaomodus>	 looks like a merge weirdness or something
[22:56:51] <wikibugs>	 (PS1) Papaul: DHCP: Add MAC address for db2131 [puppet] - https://gerrit.wikimedia.org/r/528283 (https://phabricator.wikimedia.org/T229251)
[22:58:32] <wikibugs>	 (CR) Dzahn: [C: +2] DHCP: Add MAC address for db2131 [puppet] - https://gerrit.wikimedia.org/r/528283 (https://phabricator.wikimedia.org/T229251) (owner: Papaul)
[23:00:04] <jouncebot>	 MaxSem, RoanKattouw, and Niharika: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T2300).
[23:00:04] <jouncebot>	 ebernhardson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:13] <wikibugs>	 (CR) Dzahn: [C: +2] mediawiki:maintenance: switch readinglists cron to PHP 7.2 [puppet] - https://gerrit.wikimedia.org/r/528282 (https://phabricator.wikimedia.org/T195392) (owner: Dzahn)
[23:00:17] <wikibugs>	 (PS2) Dzahn: mediawiki:maintenance: switch readinglists cron to PHP 7.2 [puppet] - https://gerrit.wikimedia.org/r/528282 (https://phabricator.wikimedia.org/T195392)
[23:00:19] <Urbanecm>	 I can SWAT today!
[23:00:21] <Urbanecm>	 ebernhardson, around?
[23:00:22] <ebernhardson>	 Urbanecm: :)
[23:00:43] <ebernhardson>	 Urbanecm: this one isn't really noticeable, it only affects job runners
[23:00:46] <wikibugs>	 (CR) Dzahn: "applied on install server, you can start install" [puppet] - https://gerrit.wikimedia.org/r/528283 (https://phabricator.wikimedia.org/T229251) (owner: Papaul)
[23:00:48] <wikibugs>	 (CR) Urbanecm: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/528260 (https://phabricator.wikimedia.org/T220625) (owner: EBernhardson)
[23:00:50] <ebernhardson>	 and then only group0
[23:01:00] <Urbanecm>	 ok
[23:01:08] <Urbanecm>	 so nothing to test, i guess ebernhardson ?
[23:01:15] <ebernhardson>	 Urbanecm: right
[23:01:17] <Urbanecm>	 ok
[23:01:49] <wikibugs>	 (Merged) jenkins-bot: Repoint cloudelastic at LB dns [mediawiki-config] - https://gerrit.wikimedia.org/r/528260 (https://phabricator.wikimedia.org/T220625) (owner: EBernhardson)
[23:02:09] <wikibugs>	 (CR) jenkins-bot: Repoint cloudelastic at LB dns [mediawiki-config] - https://gerrit.wikimedia.org/r/528260 (https://phabricator.wikimedia.org/T220625) (owner: EBernhardson)
[23:03:09] <wikibugs>	 (PS1) CRusnov: netbox: Fix parameter that didnt get passed [puppet] - https://gerrit.wikimedia.org/r/528285
[23:03:15] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/ProductionServices.php: SWAT: 87b428d: Repoint cloudelastic at LB dns (T220625) (duration: 00m 48s)
[23:03:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:24] <stashbot>	 T220625: Initialize CirrusSearch on cloudelastic - https://phabricator.wikimedia.org/T220625
[23:03:25] <Urbanecm>	 ebernhardson, synced!
[23:03:36] <ebernhardson>	 Urbanecm: thanks, i'll watch the logs if anything silly happens
[23:03:39] <Urbanecm>	 anything else?
[23:03:43] <Urbanecm>	 thanks
[23:03:44] <ebernhardson>	 Urbanecm: nope, that's it
[23:03:47] <wikibugs>	 (CR) Dzahn: "manually tested this on mwmaint1002 - no difference - purges db rows using mwscript" [puppet] - https://gerrit.wikimedia.org/r/528282 (https://phabricator.wikimedia.org/T195392) (owner: Dzahn)
[23:03:50] <Urbanecm>	 !log Evening SWAT done
[23:03:51] <Urbanecm>	 okay then ^^
[23:03:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:56] <wikibugs>	 (PS1) CRusnov: netbox: add dummy swift url key [labs/private] - https://gerrit.wikimedia.org/r/528286
[23:06:27] <mutante>	 The MediaWiki script file "/srv/mediawiki/php-1.34.0-wmf.16/extensions/WikimediaMaintenance/getJobQueueLengths.php" does not exist.
[23:06:36] <mutante>	 well then.. i guess that's an outdated maintenance cron :p
[23:06:54] <wikibugs>	 (PS2) CRusnov: netbox: Fix parameter that didnt get passed [puppet] - https://gerrit.wikimedia.org/r/528285
[23:10:30] <wikibugs>	 (PS1) Dzahn: mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392)
[23:11:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392) (owner: Dzahn)
[23:11:43] <wikibugs>	 (CR) Paladox: "You have to either remove this manually or use the ensure => absent." [puppet] - https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392) (owner: Dzahn)
[23:12:19] <icinga-wm>	 PROBLEM - HHVM rendering on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[23:13:20] <wikibugs>	 (PS2) Dzahn: mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392)
[23:14:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392) (owner: Dzahn)
[23:14:51] <wikibugs>	 (CR) Dzahn: "I am going to remove it manually since it's just 2 hosts." [puppet] - https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392) (owner: Dzahn)
[23:15:25] <icinga-wm>	 RECOVERY - HHVM rendering on mw1341 is OK: HTTP OK: HTTP/1.1 200 OK - 81879 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:15:45] <icinga-wm>	 PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[23:17:48] <wikibugs>	 (CR) CRusnov: [V: +2 C: +2] netbox: add dummy swift url key [labs/private] - https://gerrit.wikimedia.org/r/528286 (owner: CRusnov)
[23:18:37] <Krinkle>	 mutante: the core commit paladox pointed to is not directly related
[23:20:07] <mutante>	 Krinkle: you mean it's not the change that actually removed it but just that it updates a comment? well, i was already surprised positively either way
[23:20:42] <wikibugs>	 (PS3) Dzahn: mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392)
[23:21:25] <wikibugs>	 (PS4) Dzahn: mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392)
[23:22:41] <wikibugs>	 (CR) Krinkle: [C: +1] mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392) (owner: Dzahn)
[23:25:22] <wikibugs>	 (CR) Dzahn: [C: +2] mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392) (owner: Dzahn)
[23:25:30] <wikibugs>	 (PS5) Dzahn: mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392)
[23:25:38] <wikibugs>	 (PS1) Viztor: Add Noto Sans CJK + Noto Mono CJK fonts [mediawiki-config] - https://gerrit.wikimedia.org/r/528290
[23:28:53] <paladox>	 mutante i think he means the script (the script was removed from the Wikimedia extension due to the removal of the class it was using from MW core)
[23:30:07] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt1006 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[23:34:39] <mutante>	 !log mwmaint1002 - remove getJobQueueLengths.php from www-data's crontab (T195392)
[23:34:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:34:48] <stashbot>	 T195392: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392
[23:34:54] <mutante>	 paladox: ok, ack
[23:36:26] <wikibugs>	 (PS1) Jhedden: toolschecker: webservice final process status [puppet] - https://gerrit.wikimedia.org/r/528292 (https://phabricator.wikimedia.org/T221301)
[23:37:54] <wikibugs>	 (PS2) Viztor: Add Noto Sans CJK + Noto Mono CJK fonts [mediawiki-config] - https://gerrit.wikimedia.org/r/528290
[23:43:31] <icinga-wm>	 PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[23:45:47] <wikibugs>	 (PS2) Jhedden: toolschecker: Ensure webservice is fully stopped [puppet] - https://gerrit.wikimedia.org/r/528292 (https://phabricator.wikimedia.org/T221301)
[23:53:39] <icinga-wm>	 RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[23:58:03] <icinga-wm>	 RECOVERY - puppet last run on cloudvirt1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun