[01:51:29] PROBLEM - Host cp5010.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:37] PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:39] PROBLEM - Host cp5006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:39] PROBLEM - Host mr1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:39] PROBLEM - Host cp5009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:39] PROBLEM - Host cp5005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:39] PROBLEM - Host cp5004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:39] PROBLEM - Host cp5012.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:45] PROBLEM - Host cp5011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:49] PROBLEM - Host dns5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:49] PROBLEM - Host lvs5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:49] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:49] PROBLEM - Host lvs5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:49] PROBLEM - Host dns5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:54:13] PROBLEM - Host cp5008.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:54:13] PROBLEM - Host cp5007.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:54:15] PROBLEM - OSPF status on cr1-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:54:17] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:54:47] PROBLEM - Host bast5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:54:47] PROBLEM - Host lvs5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:56:17] PROBLEM - Host cp5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:56:17] PROBLEM - Host cp5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:56:17] PROBLEM - Host cp5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:49:39] (Abandoned) Tulsi Bhagat: Add Namespaces translation for zh.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/527862 (https://phabricator.wikimedia.org/T229743) (owner: Tulsi Bhagat)
[03:16:37] PROBLEM - puppet last run on mc1033 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:44:41] RECOVERY - puppet last run on mc1033 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:50:26] Operations, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (CDanis)
[03:51:24] ACKNOWLEDGEMENT - SSH cp5001.mgmt on cp5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:24] ACKNOWLEDGEMENT - SSH cp5011.mgmt on cp5011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:24] ACKNOWLEDGEMENT - SSH dns5002.mgmt on dns5002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:24] ACKNOWLEDGEMENT - SSH bast5001.mgmt on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:24] ACKNOWLEDGEMENT - SSH cp5006.mgmt on cp5006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:25] ACKNOWLEDGEMENT - SSH cp5005.mgmt on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:25] ACKNOWLEDGEMENT - SSH cp5012.mgmt on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:26] ACKNOWLEDGEMENT - SSH cp5009.mgmt on cp5009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:26] ACKNOWLEDGEMENT - SSH lvs5003.mgmt on lvs5003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:27] ACKNOWLEDGEMENT - SSH lvs5001.mgmt on lvs5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:27] ACKNOWLEDGEMENT - SSH lvs5002.mgmt on lvs5002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:28] ACKNOWLEDGEMENT - SSH cp5008.mgmt on cp5008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:28] ACKNOWLEDGEMENT - SSH cp5003.mgmt on cp5003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:29] ACKNOWLEDGEMENT - SSH cp5007.mgmt on cp5007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:29] ACKNOWLEDGEMENT - SSH cp5010.mgmt on cp5010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:30] ACKNOWLEDGEMENT - SSH cp5004.mgmt on cp5004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:30] ACKNOWLEDGEMENT - SSH cp5002.mgmt on cp5002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:31] ACKNOWLEDGEMENT - SSH dns5001.mgmt on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:31] ACKNOWLEDGEMENT - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:51:32] ACKNOWLEDGEMENT - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:51:32] ACKNOWLEDGEMENT - OSPF status on cr1-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:51:33] ACKNOWLEDGEMENT - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[03:51:33] ACKNOWLEDGEMENT - Juniper alarms on asw1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.132.128.4 CDanis https://phabricator.wikimedia.org/T229778 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[03:58:33] (PS3) Viztor: Update HD logo for en.ws and mul.ws [mediawiki-config] - https://gerrit.wikimedia.org/r/527922 (https://phabricator.wikimedia.org/T229769)
[03:59:06] Operations, ops-eqsin: remote hands setups for ganeti500[123] - https://phabricator.wikimedia.org/T229243 (wiki_willy) Completed by Jin from DreamICC today. The missing IPV4 IP addresses used are the following, with the gateway set to 10.132.129.1 accordingly (instead of 10.132.128.1): ganeti5001 10.13...
[04:00:03] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:00:21] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:00:39] Operations, ops-eqsin: msw1-eqsin/msw2-eqsin missing serial number - https://phabricator.wikimedia.org/T227911 (wiki_willy) Info gathered by Jin from DreamICC today. Here's the info below (also sent out via email): msw1-eqsin (msw-0603) WMF7189 2W026C5B012A2 msw2-eqsin (msw-0604) WMF7190 2W026C5E012B3...
[04:02:52] Operations, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (CDanis) @wiki_willy Any chance this is somehow related to {T229243}?
[04:07:44] (PS4) Viztor: Update HD logo for en.ws and mul.ws [mediawiki-config] - https://gerrit.wikimedia.org/r/527922 (https://phabricator.wikimedia.org/T229769)
[04:09:02] Operations, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (wiki_willy) @CDanis - I just checked with our 3rd party contractor and he says it shouldn't have been affected from the work he was doing. Although, he was working in the racks from 1:45-4:00 UTC, and If it...
[04:09:35] Operations, ops-eqsin: update PDUs for eqsin (asset tag and other info) - https://phabricator.wikimedia.org/T211368 (wiki_willy) Asset tags applied by Jin from DreamICC today as follows (also emailed out via a spreadsheet): ps1-603-eqsin WMF7196 ps2-603-eqsin WMF7197 ps1-604-eqsin WMF7198 ps2-604-eqsin...
[04:14:26] Operations, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (CDanis) Still alerting, unfortunately.
[04:17:31] PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[04:19:01] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 81922 bytes in 1.007 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[04:22:37] Operations, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (wiki_willy) Alright, I'm asking him to go back to the datacenter to check all the connections on mr1-eqsin.
[04:24:17] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:24:35] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:37:18] (PS5) Viztor: Update HD logo for en.ws and mul.ws [mediawiki-config] - https://gerrit.wikimedia.org/r/527922 (https://phabricator.wikimedia.org/T229769)
[04:38:15] RECOVERY - PHP opcache health on mwdebug2001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[04:52:50] Operations, ops-codfw, DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (Marostegui) Open→Resolved No OS or idrac errors since the memory was replaced, so I am closing this as resolved. If it happens again, I will re-open Thanks @Papaul!
[05:23:15] RECOVERY - Host cp5005.mgmt is UP: PING OK - Packet loss = 16%, RTA = 231.87 ms
[05:23:17] RECOVERY - Host cp5010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.23 ms
[05:23:21] RECOVERY - Host dns5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 249.35 ms
[05:23:27] RECOVERY - Host mr1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 231.89 ms
[05:23:27] RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 232.13 ms
[05:23:29] RECOVERY - Host cp5009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.04 ms
[05:23:29] RECOVERY - Host cp5006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.18 ms
[05:23:29] RECOVERY - Host cp5012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.09 ms
[05:23:29] RECOVERY - Host cp5004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.11 ms
[05:23:43] RECOVERY - Host lvs5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.10 ms
[05:23:43] RECOVERY - Host lvs5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.22 ms
[05:23:43] RECOVERY - Host dns5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.82 ms
[05:23:43] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 233.61 ms
[05:23:43] RECOVERY - Host cp5011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 316.17 ms
[05:24:11] RECOVERY - OSPF status on cr1-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:24:11] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:24:13] RECOVERY - Host lvs5003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.79 ms
[05:24:39] RECOVERY - Host cp5003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.94 ms
[05:25:05] (PS1) Marostegui: db2124: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/527949 (https://phabricator.wikimedia.org/T228969)
[05:25:05] wooot
[05:25:16] (PS1) Marostegui: db-eqiad,db-codfw.php: Pool db2124 into s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527950 (https://phabricator.wikimedia.org/T228969)
[05:25:43] RECOVERY - Host cp5007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.84 ms
[05:25:59] RECOVERY - Host cp5008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.85 ms
[05:26:34] (CR) Vgutierrez: [C: +1] db-eqiad,db-codfw.php: Pool db2124 into s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527950 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[05:26:39] RECOVERY - Host bast5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.84 ms
[05:27:38] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Pool db2124 into s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527950 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[05:28:05] RECOVERY - Host cp5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.86 ms
[05:28:05] RECOVERY - Host cp5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.74 ms
[05:28:30] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Pool db2124 into s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527950 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[05:28:40] !log marostegui@cumin1001 dbctl commit of MediaWiki config (dc=all), diff saved to 'https://phabricator.wikimedia.org/P8858', previous config saved to /var/cache/conftool/dbconfig/20190805-052839-marostegui.json
[05:28:47] (CR) jenkins-bot: db-eqiad,db-codfw.php: Pool db2124 into s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527950 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[05:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:57] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Provision db2124 into s6 T228969 (duration: 00m 49s)
[05:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:05] T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969
[05:30:07] (CR) Marostegui: [C: +2] db2124: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/527949 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[05:30:49] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Provision db2124 into s6 T228969 (duration: 00m 46s)
[05:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:08] Operations, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (wiki_willy) a:wiki_willy
[05:37:28] Operations, ops-eqsin, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (wiki_willy) Open→Resolved Cable between mr1-eqsin p4 <---> asw-0603-eqsin p23 looks like it accidentally got bumped by the contractor during the server install. Called him back and he...
[05:37:47] Operations, ops-eqsin, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (Marostegui) We just got all the recoveries: ` [07:23:15] <+icinga-wm> RECOVERY - Host cp5005.mgmt is UP: PING OK - Packet loss = 16%, RTA = 231.87 ms [07:23:17] <+icinga-wm> RECOVERY - Host c...
[05:38:29] Operations, ops-eqsin, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (Marostegui) Ha! @wiki_willy was faster!
[05:40:14] Operations, ops-eqsin, netops: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (wiki_willy) @Marostegui - Ha, we tied. =)
[05:58:14] !log Update rack column on zarcillo.servers for the new servers T229683
[05:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:22] T229683: Update rack information on zarcillo.servers - https://phabricator.wikimedia.org/T229683
[06:32:27] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[06:38:44] (PS1) Elukey: profile::analytics::refinery::job::test::refine: fix refine regex [puppet] - https://gerrit.wikimedia.org/r/527979 (https://phabricator.wikimedia.org/T226698)
[06:40:09] (CR) Elukey: [C: +2] profile::analytics::refinery::job::test::refine: fix refine regex [puppet] - https://gerrit.wikimedia.org/r/527979 (https://phabricator.wikimedia.org/T226698) (owner: Elukey)
[06:53:38] Operations, cloud-services-team: Migrate remaining cloudvirt hosts to Stretch/Mitaka - https://phabricator.wikimedia.org/T224561 (MoritzMuehlenhoff)
[06:55:38] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[07:05:35] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[07:11:01] Operations, serviceops, Performance-Team (Radar), User-Elukey: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (elukey) We deployed all the changes for T225642, so async settings for codfw replication was not the culprit. After T225059 we have per-shard...
[07:15:29] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/527430 (https://phabricator.wikimedia.org/T204450) (owner: Giuseppe Lavagetto)
[07:20:28] (PS1) Vgutierrez: Release fifo-log-demux 0.5 [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527980
[07:21:32] (CR) Muehlenhoff: [C: +1] "Looks good (and needs SRE meeting approval)" [puppet] - https://gerrit.wikimedia.org/r/525847 (https://phabricator.wikimedia.org/T229124) (owner: RobH)
[07:23:14] !log Move db2095:3312 from db2063 to db2126 - T228969
[07:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:24] T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969
[07:35:49] Operations, serviceops: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (MoritzMuehlenhoff) The patch seems sane, but I'm wondering whether we actually need to pursue this further? tmpreaper is dead upstream (the Debian maintainer keeps it alive a little for security fixes,...
[07:37:22] (CR) Muehlenhoff: "Ack, I'll update the commit message and merge later. The one known corner case has been fixed." [puppet] - https://gerrit.wikimedia.org/r/525220 (https://phabricator.wikimedia.org/T46722) (owner: Muehlenhoff)
[07:43:17] (PS1) Marostegui: db-codfw.php: Reorganize s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/528069 (https://phabricator.wikimedia.org/T228969)
[07:43:47] !log removed orespoolcounter[12]00[12] from debmonitor T227640
[07:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:57] T227640: Migrate ORES pool counters to Buster - https://phabricator.wikimedia.org/T227640
[07:43:59] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:44:45] marostegui: ^^ :?
[07:45:04] that's me yep
[07:45:08] I am preparing a big commit :)
[07:45:48] !log installing unzip regression DLA for jessie
[07:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:53] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:47:03] ^ Almost done
[07:49:31] !log marostegui@cumin1001 dbctl commit of MediaWiki config (dc=all), diff saved to 'https://phabricator.wikimedia.org/P8859', previous config saved to /var/cache/conftool/dbconfig/20190805-074930-marostegui.json
[07:49:33] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:55] (CR) Marostegui: [C: +2] db-codfw.php: Reorganize s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/528069 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:51:59] (Merged) jenkins-bot: db-codfw.php: Reorganize s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/528069 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:52:14] (CR) jenkins-bot: db-codfw.php: Reorganize s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/528069 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:52:31] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:52:38] !log marostegui@deploy1001 sync-file aborted: Reorganize s2 T228969 (duration: 00m 06s)
[07:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:46] T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969
[07:53:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Reorganize s2 T228969 (duration: 00m 48s)
[07:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:49] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Reorganize s2 T228969 (duration: 00m 47s)
[07:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:18] (PS1) Marostegui: mariadb: Promote db2107 to codfw s2 master [puppet] - https://gerrit.wikimedia.org/r/528070 (https://phabricator.wikimedia.org/T220170)
[08:09:09] Operations, Goal: SRE firefighting improvements - 2019-20 Q1 Goal - https://phabricator.wikimedia.org/T229782 (Volans)
[08:09:17] Operations, Goal: SRE firefighting improvements - 2019-20 Q1 Goal - https://phabricator.wikimedia.org/T229782 (Volans) p:Triage→Normal
[08:11:20] (PS1) Elukey: Enable spark.authenticate in yarn-site.xml on the Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/528071 (https://phabricator.wikimedia.org/T226698)
[08:11:52] Operations, serviceops: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (elukey) >>! In T151304#5391310, @MoritzMuehlenhoff wrote: > The patch seems sane, but I'm wondering whether we actually need to pursue this further? tmpreaper is dead upstream (the Debian maintainer kee...
[08:12:12] (CR) Elukey: [C: +2] Enable spark.authenticate in yarn-site.xml on the Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/528071 (https://phabricator.wikimedia.org/T226698) (owner: Elukey)
[08:19:33] (PS2) Marostegui: mariadb: Promote db2107 to codfw s2 master [puppet] - https://gerrit.wikimedia.org/r/528070 (https://phabricator.wikimedia.org/T220170)
[08:21:27] !log Switchover s2 codfw master from db2035 to db2107 - T221533 T220170
[08:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:37] T220170: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170
[08:21:37] T221533: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533
[08:27:53] (CR) Marostegui: [C: +2] mariadb: Promote db2107 to codfw s2 master [puppet] - https://gerrit.wikimedia.org/r/528070 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[08:32:55] !log marostegui@cumin1001 dbctl commit of MediaWiki config (dc=all), diff saved to 'https://phabricator.wikimedia.org/P8861', previous config saved to /var/cache/conftool/dbconfig/20190805-083254-marostegui.json
[08:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:13] (PS2) Alexandros Kosiaris: restrouter: Fix typo in suffixes in admin [deployment-charts] - https://gerrit.wikimedia.org/r/527480
[08:36:15] (PS1) Alexandros Kosiaris: Realign restrouter limitranges [deployment-charts] - https://gerrit.wikimedia.org/r/528074
[08:39:13] Operations, DBA, decommission: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui)
[08:39:34] (Abandoned) Alexandros Kosiaris: restrouter: Fix typo in suffixes in admin [deployment-charts] - https://gerrit.wikimedia.org/r/527480 (owner: Alexandros Kosiaris)
[08:39:48] Operations, DBA, decommission: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui) p:Triage→Normal
[08:40:03] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db2035 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/528075 (https://phabricator.wikimedia.org/T229784)
[08:40:06] (PS2) Alexandros Kosiaris: Realign restrouter limitranges [deployment-charts] - https://gerrit.wikimedia.org/r/528074
[08:40:09] Operations, ops-codfw, DBA, DC-Ops, and 2 others: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui)
[08:41:37] (CR) Volans: "Question inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: CRusnov)
[08:41:57] (CR) Alexandros Kosiaris: [V: +2 C: +2] Realign restrouter limitranges [deployment-charts] - https://gerrit.wikimedia.org/r/528074 (owner: Alexandros Kosiaris)
[08:43:05] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' .
[08:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:45] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db2035 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/528075 (https://phabricator.wikimedia.org/T229784) (owner: Marostegui)
[08:44:09] Operations, ops-codfw, DBA, DC-Ops, and 2 others: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui)
[08:44:11] (PS4) Filippo Giunchedi: monitoring: tweak description for paging alerts [puppet] - https://gerrit.wikimedia.org/r/525536 (https://phabricator.wikimedia.org/T228878)
[08:44:39] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db2035 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/528075 (https://phabricator.wikimedia.org/T229784) (owner: Marostegui)
[08:45:36] (CR) Filippo Giunchedi: [C: +2] monitoring: tweak description for paging alerts [puppet] - https://gerrit.wikimedia.org/r/525536 (https://phabricator.wikimedia.org/T228878) (owner: Filippo Giunchedi)
[08:45:48] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2035 from config T229784 (duration: 00m 47s)
[08:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:56] T229784: Decommission db2035 - https://phabricator.wikimedia.org/T229784
[08:46:24] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db2035 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/528075 (https://phabricator.wikimedia.org/T229784) (owner: Marostegui)
[08:46:41] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2035 from config T229784 (duration: 00m 46s)
[08:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:11] Operations, ops-eqiad: helium.mgmt down - https://phabricator.wikimedia.org/T229706 (Volans) Open→Resolved p:Triage→Normal @Dzahn have you tried to follow https://wikitech.wikimedia.org/wiki/Management_Interfaces#Troubleshooting_Commands, in particular [[ https://wikitech.wikimedia.org/wi...
[08:52:13] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (Volans)
[08:56:02] !log installing vim security updates for jessie (stretch/buster already fixed)
[08:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:25] (PS1) Alexandros Kosiaris: Restrouter: Specify the correct image [deployment-charts] - https://gerrit.wikimedia.org/r/528077
[09:02:44] (CR) Alexandros Kosiaris: [V: +2 C: +2] Restrouter: Specify the correct image [deployment-charts] - https://gerrit.wikimedia.org/r/528077 (owner: Alexandros Kosiaris)
[09:02:50] PROBLEM - toolschecker: gridengine webservice running #page on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/gridengine - 177 bytes in 9.766 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[09:03:52] arturo: around?
[09:03:56] yup
[09:04:15] let us know if we can help
[09:04:47] PROBLEM - toolschecker: kubernetes webservice running #page on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 177 bytes in 9.879 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[09:04:51] I'm currently fighting the icinga autocompleter -_-
[09:05:03] so fast and useful :)
[09:05:15] (the icinga autocompleter)
[09:05:17] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' .
[09:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:48] ACKNOWLEDGEMENT - toolschecker: gridengine webservice running #page on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/gridengine - 177 bytes in 9.766 second response time Arturo Borrero Gonzalez investigating https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [09:05:50] ACKNOWLEDGEMENT - toolschecker: kubernetes webservice running #page on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 177 bytes in 9.879 second response time Arturo Borrero Gonzalez investigating https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [09:06:23] PROBLEM - LVS HTTP IPv4 #page on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.84 and port 8643: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:06:45] <_joe_> err sorry per i servizi what's going on? [09:07:02] I believe arturo's on it [09:07:03] !log downtime toolschecker for 5hours [09:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:52] the cloudeslastic one I'm not so sure. Is that a new service that the traffic service is working on? I believe I saw brandon downtiming it the other day. Maybe the downtime period expired [09:08:07] I'll check cloudelastic [09:08:18] https://phabricator.wikimedia.org/T229621 [09:08:39] No critical clients on it yet, so no need to panic on that one [09:09:17] yeah.. as apergos pointed out, it's an issue with the check itself [09:11:12] onimisionipe: ^^^ I think you had a similar issue with the elastic checks already, would you have time to have a look? 
[09:11:44] (03PS1) 10Fsero: k8s: deploy calico, rbac, psp, coredns and ns via helmfile in codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/528078 (https://phabricator.wikimedia.org/T228837) [09:12:42] gehel: sure! [09:12:48] onimisionipe: thanks! [09:13:06] 10Operations, 10ops-eqiad, 10DBA: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (10Marostegui) @Cmjohnson are we still good for tomorrow at 14:00 UTC? I will have the host depooled and off for you before 14:00 UTC [09:13:59] gehel: I ran into the same issue with ncredir service a few weeks ago [09:14:40] vgutierrez: and you replaced $ADDRESS with $HOSTNAME in the check definition? [09:14:47] nope [09:14:51] not at all [09:15:02] * gehel is just guessing, I haven't checked what the actual problem is here [09:16:49] so for text/upload/ncredir what we do is to configure in configuration.yaml an icinga check only on the service in port 80 [09:16:55] so text, upload and ncredir [09:17:07] and text-https, upload-https and ncredir-https don't get icinga check configuration [09:17:20] 10Operations, 10Maps: postgresql replication issues on maps1001 - https://phabricator.wikimedia.org/T229788 (10Gehel) [09:18:16] and we have an icinga check that gets that configuration and generates two checks for ports 80 and 443 [09:18:19] kinda hacky [09:18:21] but it works [09:18:25] vgutierrez: that implies that the different services are closely related and fail together? [09:18:37] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' . [09:18:43] well.. we do check both ports [09:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:18] vgutierrez: not sure I understand, let me look at the code [09:19:24] PROBLEM - puppet last run on mc1031 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[vim] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:19:26] I'm trying to find the code [09:21:06] Oh, I see! I was assuming a different issue [09:22:13] basically if you check how the text LVS service is configured [09:22:34] we only provide icinga config in one of the two services (one is text and the other one text-https) [09:22:54] but the puppetization of that icinga check it's actually deploying two checks, one for text and another for text-https [09:23:08] (03PS1) 10Marostegui: mariadb: Provision db2127 into s3 [puppet] - 10https://gerrit.wikimedia.org/r/528079 (https://phabricator.wikimedia.org/T228969) [09:23:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor nitpick comment, rest LGTM" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/528078 (https://phabricator.wikimedia.org/T228837) (owner: 10Fsero) [09:25:12] and the culprit doing that black magic is lvs::monitor_service_http_https [09:27:15] (03PS1) 10Marostegui: db2105: Change binlog format [puppet] - 10https://gerrit.wikimedia.org/r/528081 (https://phabricator.wikimedia.org/T220170) [09:28:05] (03CR) 10Marostegui: [C: 03+2] db2105: Change binlog format [puppet] - 10https://gerrit.wikimedia.org/r/528081 (https://phabricator.wikimedia.org/T220170) (owner: 10Marostegui) [09:29:27] RECOVERY - toolschecker: gridengine webservice running #page on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 22.235 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [09:30:10] !log Stop MySQL on db2105 to change binlog format [09:30:13] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Traffic: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (10Vgutierrez) I've seen the same behaviour configuring the ncredir LVS service as it's using two ports (80/443). Same happens wi... 
[09:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:23] (03PS2) 10Marostegui: mariadb: Provision db2127 into s3 [puppet] - 10https://gerrit.wikimedia.org/r/528079 (https://phabricator.wikimedia.org/T228969) [09:31:54] (03PS2) 10Filippo Giunchedi: WIP: alert on widespread puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/526431 [09:31:56] (03PS1) 10Filippo Giunchedi: prometheus: aggregate puppet zero resources reported [puppet] - 10https://gerrit.wikimedia.org/r/528084 (https://phabricator.wikimedia.org/T229262) [09:32:28] (03CR) 10Marostegui: [C: 03+2] mariadb: Provision db2127 into s3 [puppet] - 10https://gerrit.wikimedia.org/r/528079 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [09:41:48] (03PS2) 10Fsero: k8s: deploy calico, rbac, psp, coredns and ns via helmfile in codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/528078 (https://phabricator.wikimedia.org/T228837) [09:43:09] RECOVERY - toolschecker: kubernetes webservice running #page on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 58.311 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [09:43:14] (03CR) 10Fsero: "addressed Alex and joe comments" [deployment-charts] - 10https://gerrit.wikimedia.org/r/528078 (https://phabricator.wikimedia.org/T228837) (owner: 10Fsero) [09:47:46] (03PS2) 10Gergő Tisza: Allow CORS access to publichtml (people.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/522991 (https://phabricator.wikimedia.org/T224068) [09:52:10] RECOVERY - puppet last run on mc1031 is OK: OK: Puppet is currently enabled, last run 7 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:53:27] (03CR) 10Ema: [C: 03+1] Release fifo-log-demux 0.5 (031 comment) [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527980 (owner: 10Vgutierrez) [09:55:53] 10Operations, 10serviceops, 10HHVM: Remove 
HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki) [09:56:52] 10Operations, 10serviceops, 10HHVM: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki) [09:56:55] 10Operations, 10serviceops: SRE FY19-20 Q1 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (10jijiki) [10:01:22] (03PS1) 10Elukey: Add more spark security options to yarn-size in Hadoop Test [puppet] - 10https://gerrit.wikimedia.org/r/528086 (https://phabricator.wikimedia.org/T226698) [10:01:40] 10Operations, 10serviceops, 10HHVM: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki) [10:02:52] 10Operations, 10serviceops: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (10jijiki) @elukey @MoritzMuehlenhoff I have added tmpreapers removal as an actionable in our HHVM removal process (T229792), shall we mark this as resolved or invalid? [10:03:41] (03PS2) 10Vgutierrez: Release fifo-log-demux 0.5 [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527980 [10:04:33] (03CR) 10Vgutierrez: Release fifo-log-demux 0.5 (031 comment) [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527980 (owner: 10Vgutierrez) [10:05:47] (03PS2) 10Filippo Giunchedi: prometheus: aggregate puppet zero resources reported [puppet] - 10https://gerrit.wikimedia.org/r/528084 (https://phabricator.wikimedia.org/T229262) [10:05:50] (03PS3) 10Filippo Giunchedi: WIP: alert on widespread puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/526431 [10:05:53] (03PS1) 10Filippo Giunchedi: base: stop per-host puppet critical when master has issues [puppet] - 10https://gerrit.wikimedia.org/r/528087 (https://phabricator.wikimedia.org/T229262) [10:06:44] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: aggregate puppet zero resources reported [puppet] - 10https://gerrit.wikimedia.org/r/528084 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi) [10:06:51] 
(03PS3) 10Filippo Giunchedi: prometheus: aggregate puppet zero resources reported [puppet] - 10https://gerrit.wikimedia.org/r/528084 (https://phabricator.wikimedia.org/T229262) [10:07:31] (03CR) 10Elukey: [C: 03+2] Add more spark security options to yarn-size in Hadoop Test [puppet] - 10https://gerrit.wikimedia.org/r/528086 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [10:09:09] (03PS4) 10Filippo Giunchedi: prometheus: aggregate puppet zero resources reported [puppet] - 10https://gerrit.wikimedia.org/r/528084 (https://phabricator.wikimedia.org/T229262) [10:12:13] !log rolling update of openjdk on maps servers [10:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:17] 10Operations, 10serviceops: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (10MoritzMuehlenhoff) Toolforge/Toollabs also uses tmpreaper (but not the puppetised version with the tmpreaper Puppet class). I'm adding @Andrew and @aborrero for comments whether we should keep it open f... 
[10:20:23] (03CR) 10Ema: [C: 03+1] Release fifo-log-demux 0.5 [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527980 (owner: 10Vgutierrez) [10:20:43] (03CR) 10Vgutierrez: [C: 03+2] Release fifo-log-demux 0.5 [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527980 (owner: 10Vgutierrez) [10:22:31] 10Operations, 10DBA: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (10Marostegui) [10:22:49] 10Operations, 10DBA: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (10Marostegui) p:05Triage→03Normal [10:24:04] 10Operations, 10DBA: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (10Marostegui) [10:24:07] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10Marostegui) [10:24:12] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10Marostegui) [10:24:22] 10Operations, 10DBA: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (10Marostegui) [10:24:25] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10Marostegui) [10:24:30] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10Marostegui) [10:24:36] 10Operations, 10DBA: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (10Marostegui) [10:24:41] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10Marostegui) [10:24:46] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10Marostegui) 
[10:27:55] !log upload fifo-log-demux 0.5 to stretch-wikimedia [10:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:55] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528092 (https://phabricator.wikimedia.org/T128546) [10:30:04] jan_drewniak: It is that lovely time of the day again! You are hereby commanded to deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T1030). [10:30:24] (03CR) 10Alexandros Kosiaris: [C: 03+1] k8s: deploy calico, rbac, psp, coredns and ns via helmfile in codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/528078 (https://phabricator.wikimedia.org/T228837) (owner: 10Fsero) [10:30:38] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10MoritzMuehlenhoff) > Please comment and if its ready to start the decom process, check off the boxes and assign to me for followup. Thanks in advance! This needs to wait un... [10:31:05] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10MoritzMuehlenhoff) @RobH This needs to wait until https://phabricator.wikimedia.org/T229796 is complete, I'll reassign the bug to you when that's done. 
[10:32:52] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528092 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:33:47] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528092 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:34:04] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528092 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:36:46] 10Operations, 10Wikimedia-Mailing-lists: Yoruba Wikimedians User Group Official Mailing List - https://phabricator.wikimedia.org/T229749 (10MarcoAurelio) [10:37:25] 10Operations, 10Wikimedia-Mailing-lists: Yoruba Wikimedians User Group Official Mailing List - https://phabricator.wikimedia.org/T229749 (10MarcoAurelio) >>! In T229749#5390555, @Wikicology wrote: > *The list should be private and requires list administrators approval for subscription. > > *The list should ha... [10:38:24] (03CR) 10Lucas Werkmeister (WMDE): "If you’re already touching this line, perhaps you could also look into T203397? 
(I tried to implement that a while back but couldn’t figur" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526755 (owner: 10Smalyshev) [10:39:19] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:528092| Bumping portals to master (T128546)]] (duration: 00m 49s) [10:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:28] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:40:06] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:528092| Bumping portals to master (T128546)]] (duration: 00m 46s) [10:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:19] !log update java on sessionstore [10:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:03] 10Operations, 10ops-eqiad, 10DBA: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (10Cmjohnson) @marostegui yes, still good fit tomorrow at 1400UTC [10:42:50] 10Operations, 10ops-eqiad, 10DBA: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (10Marostegui) Excellent - thank you! 
[10:44:16] PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 4.001 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [10:50:15] (03PS1) 10Ladsgroup: labs: Set half of wikidata to read from the new term store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528103 (https://phabricator.wikimedia.org/T225055) [10:51:06] (03PS1) 10Ema: ATS: add role::cache::trafficserver::backend [puppet] - 10https://gerrit.wikimedia.org/r/528104 (https://phabricator.wikimedia.org/T227432) [10:51:44] (03PS3) 10Fsero: k8s: deploy calico, rbac, psp, coredns and ns via helmfile in codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/528078 (https://phabricator.wikimedia.org/T228837) [10:52:25] (03PS3) 10Vgutierrez: prometheus: Collect ncredir nginx metrics [puppet] - 10https://gerrit.wikimedia.org/r/524409 (https://phabricator.wikimedia.org/T228382) [10:52:47] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s: deploy calico, rbac, psp, coredns and ns via helmfile in codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/528078 (https://phabricator.wikimedia.org/T228837) (owner: 10Fsero) [10:52:49] (03CR) 10Ema: [C: 03+1] sre.hosts.upgrade-varnish: cp1008 decom cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/527173 (https://phabricator.wikimedia.org/T229586) (owner: 10Volans) [10:52:51] (03PS4) 10Vgutierrez: prometheus: Collect ncredir nginx metrics [puppet] - 10https://gerrit.wikimedia.org/r/524409 (https://phabricator.wikimedia.org/T228382) [10:53:22] godog: can I get a sanity check on that prometheus change? 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/524409 [10:53:30] (03CR) 10jerkins-bot: [V: 04-1] prometheus: Collect ncredir nginx metrics [puppet] - 10https://gerrit.wikimedia.org/r/524409 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez) [10:53:37] hmmm jerkins is not happy [10:54:17] yeah.. merge issues...I've created that chain of changes before my vacations :_( [10:54:28] (03Abandoned) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512651 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:54:36] (03Abandoned) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509785 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:55:20] vgutierrez: for sure, please add me to the review once ready to go and I'll take a look [10:55:27] thx [10:55:50] (03CR) 10Ema: [C: 03+1] cumin: remove old scripts converted to cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/527170 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [10:55:56] (03PS5) 10Vgutierrez: fifo_log_demux: Provide pipe creation capabilities [puppet] - 10https://gerrit.wikimedia.org/r/524176 (https://phabricator.wikimedia.org/T228382) [10:56:17] (03PS2) 10Vgutierrez: fifo_log_demux: Allow to specify a service that requires fifo_log_demux [puppet] - 10https://gerrit.wikimedia.org/r/524496 (https://phabricator.wikimedia.org/T228382) [10:56:41] (03PS32) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [10:58:48] 10Operations, 10Traffic, 10netops, 10IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10jbond) Thanks brandon, Ill take a look at removing the ones SLAAC addresses from puppet this week. One of them, at least, was added by me and was what led... 
[10:58:59] (03CR) 10Urbanecm: [C: 04-1] Update HD logo for en.ws and mul.ws (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527922 (https://phabricator.wikimedia.org/T229769) (owner: 10Viztor) [10:59:06] (03PS6) 10Urbanecm: Update HD logo for en.ws and mul.ws [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527922 (https://phabricator.wikimedia.org/T229769) (owner: 10Viztor) [10:59:37] (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: ingress: nginx-ingress listen on 8082/tcp [puppet] - 10https://gerrit.wikimedia.org/r/527541 (https://phabricator.wikimedia.org/T228500) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T1100). [11:00:05] Amir1 and raynor: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] (03CR) 10jerkins-bot: [V: 04-1] Update HD logo for en.ws and mul.ws [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527922 (https://phabricator.wikimedia.org/T229769) (owner: 10Viztor) [11:00:15] o/ [11:00:16] I can SWAT today! 
[11:00:23] o/ [11:00:24] (and also has a few things to deploy) [11:00:28] Urbanecm, awesome, thx [11:00:41] o/ mine is backport and takes some time to merge, I already +2'ed it [11:00:51] thanks Amir1 [11:00:56] I'm preet sure one config patch can go before mind [11:00:58] I've +2'ed raynor's backport [11:00:59] *pretty [11:00:59] btw, I saw there is a patch for Related Articles but it's unclear on how to merge it (requires three sync commands) [11:01:27] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527726 (https://phabricator.wikimedia.org/T229717) (owner: 10Huji) [11:01:27] also I'm not super familiar with that patch and feeling bit resistant to merge it without having Jon Robson around, so decided not to add it SWAT window [11:01:40] Urbanecm, thx for merging my backport, let me know when it gets to mwdebug [11:01:47] will do [11:02:03] (03PS6) 10Vgutierrez: fifo_log_demux: Provide pipe creation capabilities [puppet] - 10https://gerrit.wikimedia.org/r/524176 (https://phabricator.wikimedia.org/T228382) [11:02:05] (03PS3) 10Vgutierrez: fifo_log_demux: Allow to specify a service that requires fifo_log_demux [puppet] - 10https://gerrit.wikimedia.org/r/524496 (https://phabricator.wikimedia.org/T228382) [11:02:07] (03PS33) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [11:02:09] (03PS5) 10Vgutierrez: prometheus: Collect ncredir nginx metrics [puppet] - 10https://gerrit.wikimedia.org/r/524409 (https://phabricator.wikimedia.org/T228382) [11:02:58] (03Merged) 10jenkins-bot: Define import sources for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527726 (https://phabricator.wikimedia.org/T229717) (owner: 10Huji) [11:04:41] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 9eb74c2: Define import sources for fawiki (T229717) (duration: 00m 48s) [11:04:49] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:49] T229717: Define import sources for fawiki - https://phabricator.wikimedia.org/T229717 [11:05:23] raynor: If you mean https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/527632/, I'm pretty sure it's IS.php, CS.php, list. If you know how to test what it does, I'd be comfortable to merge&sync. [11:06:30] yeah, exactly this one, but I'm not fully comfortable with testing it, I think it can wait till later today, I'll be swatting another change during morning swat [11:06:49] ok then [11:06:50] PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[linux-image-4.9.0-9-amd64-dbg],Package[linux-headers-4.9.0-9-amd64] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:06:51] so most probably I'll pick also this one once Jon is around [11:07:13] (03CR) 10Lucas Werkmeister (WMDE): "IS > CS > dblist looks correct to me as well, yeah." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra) [11:08:01] (03CR) 10jenkins-bot: Define import sources for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527726 (https://phabricator.wikimedia.org/T229717) (owner: 10Huji) [11:16:54] Urbanecm: do you have any more config changes? [11:17:21] Amir1, no, just a backport [11:17:34] currently waiting on CI [11:17:47] feel free to push your things if you have any :-) [11:18:22] that's a pretty big backport :D [11:18:32] Mine is waiting for CI still [11:18:36] jouncebot: now [11:18:36] For the next 0 hour(s) and 41 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T1100) [11:18:57] Urbanecm: can you deploy mine? 
It's not testable (you can just sync the whole extension, it should be fine) [11:19:09] Urbanecm: I'm here to test :) [11:19:09] Amir1, the backport? Sure [11:19:17] wonderful, Daimona! [11:19:37] Yup: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/528072 [11:19:56] ok, will do [11:19:58] once it's merged [11:20:35] Thanks. It's almost done (5 minutes left...) [11:21:00] I see [11:23:32] (03PS4) 10Urbanecm: Enable Page Previews as default on hewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508088 (https://phabricator.wikimedia.org/T222017) (owner: 10Ammarpad) [11:24:18] * Urbanecm is going to do ^ in the meanwhile as well [11:24:24] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508088 (https://phabricator.wikimedia.org/T222017) (owner: 10Ammarpad) [11:25:50] Mine is about to finish [11:26:25] well, both are all success, but none is merged yet [11:26:33] WIkibase done [11:26:36] Amir1, fetching [11:26:41] cool [11:27:05] (03Merged) 10jenkins-bot: Enable Page Previews as default on hewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508088 (https://phabricator.wikimedia.org/T222017) (owner: 10Ammarpad) [11:27:44] raynor: Your patch should be on mwdebug1002 [11:28:23] thx, checking it [11:28:31] Amir1, syncing [11:28:42] * Amir1 looks at graphs [11:28:43] thanks [11:29:12] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:29:31] (03CR) 10jenkins-bot: Enable Page Previews as default on hewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508088 (https://phabricator.wikimedia.org/T222017) (owner: 10Ammarpad) [11:29:40] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/Wikibase: SWAT: 3ecaa57: Add only needed entity usages in AddUsagesForPageJob (T226818, T205045) (duration: 01m 12s) [11:29:50] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:50] T226818: Diff when updating wbc_entity_usage - https://phabricator.wikimedia.org/T226818 [11:29:51] T205045: Exception from LinksUpdate: Deadlock found in database query (from Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages) - https://phabricator.wikimedia.org/T205045 [11:31:59] Urbanecm: is 527852 being deployed? [11:32:35] hauskatze, yes, currently on mwdebug1002, waiting for raynor's tests [11:33:09] I'm almost done, one sec [11:33:19] sure :) [11:33:39] !log Upgrade MySQL on db2074 db2057 db2050 db2035 db2098 [11:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:48] raynor, I see "Expectation (masterConns <= 0) by ApiMain::setRequestExpectations not met (actual: 1): [11:34:02] might that be related? [11:34:16] https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.08.05/mediawiki/?id=AWxhjXUOZKA7Rpir5F7n is the full log [11:34:37] Urbanecm, done testing, looks good [11:34:39] let me check that logstash now [11:34:44] thanks [11:35:39] Urbanecm - that's definitely related [11:35:41] /w/api.php?action=query&format=json&formatversion=2&prop=revisions%7Cinfo&rvprop=content%7Ctimestamp&titles=User%3APMiazga%20(WMF)%2Fsandbox&intestactions=edit&intestactionsdetail=full&rvsection=0 [11:36:01] what should we do with that entry then raynor? [11:36:02] sorry, let me put that in different words -> that's definitely caused by me [11:36:04] but not related [11:36:18] so caused by your tests, but not by your patch, right? 
[11:36:19] I was using Visual Editor, most edits went via the API [11:36:36] yes - that's correct [11:36:43] caused by my tests, but not patch [11:36:50] I'm looking for other occurrences then, to be sure [11:36:52] I'll do some more tests and submit phab ticket for that [11:37:22] ~1100 occurrences in last 15 minutes on real servers, syncing then [11:37:44] Urbanecm, thx [11:37:47] Those kinds of notices usually happen when checking a block from master in a GET request [11:37:58] And they're pretty common [11:38:00] I'll try to find out what's wrong and find phab ticket [11:38:15] thank you [11:38:18] ah, Daimona thx, yes - it tries to check `User->getBlockedStatus()` [11:38:24] Indeed [11:38:39] The idea being - for a GET request it's probably enough to check the replica [11:38:52] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/MobileFrontend/: SWAT: b7ae4fb: Revert "[AMC] [desktop] [mobile] use AMC by default for desktop users" (T229722) (duration: 00m 49s) [11:38:59] thanks Daimona, just wanted to be sure nothing's wrong with the patch, those notices are scary :-) [11:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:00] T229722: All edits are tagged as "advanced mobile edit" when wgMFAdvancedMobileContributions is true - https://phabricator.wikimedia.org/T229722 [11:39:03] makes sense. Daimona do we have a phab ticket for that? [11:39:14] or can I just ignore that error? 
[11:39:28] I don't think that specific patch is related [11:39:35] TBH I don't know if we already have a task [11:39:37] [XUgLxApAAE0AAFtFgIAAAABC] /rpc/RunSingleJob.php ArgumentCountError from line 45 of /srv/mediawiki/php-1.34.0-wmf.16/extensions/MobileFrontend/includes/amc/UserMode.php: Too few arguments to function MobileFrontend\AMC\UserMode::__construct(), 2 passed in /srv/mediawiki/php-1.34.0-wmf.16/extensions/MobileFrontend/includes/ServiceWiring.php on line [11:39:38] 53 and exactly 3 expected [11:39:40] a lot of entries like that [11:40:24] raynor ^^ [11:40:41] damn [11:41:05] Urbanecm - yes, that's me, let me quickly check that [11:41:25] I see there are some extension-specific tasks for that problem, which is not so serious anyway, especially as long as it doesn't happen too often. [11:41:29] seems to have stopped [11:41:51] [XUgVJwpAAEwAAIJAQVEAAACS] /w/api.php ErrorException from line 57 of /srv/mediawiki/php-1.34.0-wmf.16/extensions/MobileFrontend/includes/amc/Manager.php: PHP Warning: __construct() expects exactly 3 parameters, 2 given [11:41:55] might be a problem during the switch, the argument count has changed [11:42:07] !log jbond@cumin1001 START - Cookbook sre.elasticsearch.rolling-restart [11:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:22] true [11:42:24] I hope it's only php cache [11:42:59] !log jbond@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-restart (exit_code=97) [11:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:14] !log jbond@cumin1001 START - Cookbook sre.elasticsearch.rolling-restart [11:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:33] things seem to be back to normal [11:43:55] !log jbond@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-restart (exit_code=99) [11:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:53] it should be good [11:45:23] 
agreed [11:45:43] the only place where we initialize the Manager/UserMode is ServiceWiring [11:46:05] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 0032b0a: Enable Page Previews as default on hewikivoyage (T222017) (duration: 00m 47s) [11:46:11] UserMode also has a named constructor (UserMode::newForUser()) but this one is also fixed [11:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:13] T222017: Enable Page Previews as default on hewikivoyage - https://phabricator.wikimedia.org/T222017 [11:46:15] it had to be php cache ;/ [11:46:51] hmm [11:46:59] I'm pretty sure I ran git submodule update extensions/MobileFrontend [11:47:08] but extensions/MobileFrontend is dirty? [11:47:18] is it? [11:47:27] yes [11:47:34] sorry, I'm not logged into the deployment server [11:47:43] np [11:48:19] wmf.16 ? [11:48:37] yes [11:48:46] ok, I know what that is [11:48:51] a security patch causes that :/ [11:48:57] should be fine [11:49:24] yup, security [11:49:57] 10Operations, 10Discovery-Search, 10Elasticsearch, 10SRE-tools: cookbook sre.elasticsearch.rolling-restart failed with cluster relforge - https://phabricator.wikimedia.org/T229807 (10jbond) [11:51:32] hmm [11:51:34] the last one is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/528108 [11:51:40] but I can't git fetch it [11:51:54] the Update git submodules commit doesn't seem to appear [11:52:39] Urbanecm, check for security patches [11:53:07] no security patches for AbuseFilter [11:53:12] (in /srv/patches) [11:54:13] jouncebot, next [11:54:14] In 5 hour(s) and 5 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T1700) [11:54:15] then I have no idea ;( [11:55:38] the dirty submodules are apparently expected and you’re just supposed to leave them alone [11:55:46] (`git rebase` seems to do the right thing) [11:55:49] see https://phabricator.wikimedia.org/T229285 
[11:56:29] yup [11:59:02] Daimona, I can't fetch the "merged" backport, so I can't deploy it [11:59:09] since we're out of time, I'm going to revert it [11:59:30] 10Operations, 10Discovery-Search, 10Elasticsearch, 10SRE-tools: cookbook sre.elasticsearch.rolling-restart failed with cluster relforge - https://phabricator.wikimedia.org/T229807 (10Mathew.onipe) p:05Triage→03Normal [11:59:47] Oh well, fine [12:00:52] I'll try it later :-) [12:01:02] !log EU SWAT done [12:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:40] I'm afraid I won't be there... Either way it should be enough to check https://logstash.wikimedia.org/app/kibana#/discover/38e78a30-1b51-11e9-b106-b154ee768b0a?_g=(refreshInterval%3A(display%3AOff%2Cpause%3A!f%2Cvalue%3A0)%2Ctime%3A(from%3Anow-30m%2Cmode%3Aquick%2Cto%3Anow)) for new errors, and ensure that we get a reduction in "Abusefilter parser [12:02:41] error" messages [12:03:48] 10Operations, 10Discovery-Search, 10Elasticsearch, 10SRE-tools: cookbook sre.elasticsearch.rolling-restart failed with cluster relforge - https://phabricator.wikimedia.org/T229807 (10Mathew.onipe) Connection from cumin hosts to relforge via elastic ports (9[24]00) failed due to firewall I guess. This is a... [12:05:42] yup, will do Daimona [12:05:48] Thank you, bb :) [12:05:56] (03PS1) 10Ladsgroup: Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528112 [12:06:08] (03PS2) 10Ladsgroup: Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528112 [12:06:08] PROBLEM - puppet last run on phab1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:06:14] (03PS2) 10ArielGlenn: add more public tables for xml/sql dumps [puppet] - 10https://gerrit.wikimedia.org/r/527505 (https://phabricator.wikimedia.org/T226167) [12:06:35] (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528112 (owner: 10Ladsgroup) [12:09:07] I have a quick deploy [12:11:50] (03Merged) 10jenkins-bot: Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528112 (owner: 10Ladsgroup) [12:12:05] (03CR) 10jenkins-bot: Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528112 (owner: 10Ladsgroup) [12:12:13] * Lucas_WMDE counts “revert”s [12:12:24] this… switches to WRITE_NEW again, I think? 
:D [12:12:51] yes, it's setting it to new [12:12:57] (03CR) 10Lucas Werkmeister (WMDE): Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"""" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528112 (owner: 10Ladsgroup) [12:13:05] * Amir1 keeps on putting new reverts until someone gets angry [12:13:23] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:526657|Switch property terms migration to WRITE_NEW on production wikidata (T225053)]] (duration: 00m 48s) [12:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:32] T225053: Switch `tmpPropertyTermsMigrationStage` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225053 [12:16:13] (03CR) 10Ladsgroup: Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"""" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528112 (owner: 10Ladsgroup) [12:26:35] !log mwscript deleteEqualMessages.php --wiki fywiktionary (requested at [[m:Steward_requests/Miscellaneous]]) [12:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:18] (03PS1) 10Jbond: relforge: allow cumin to access elastic search ports [puppet] - 10https://gerrit.wikimedia.org/r/528116 (https://phabricator.wikimedia.org/T229807) [12:29:59] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/528116 (https://phabricator.wikimedia.org/T229807) (owner: 10Jbond) [12:33:26] (03CR) 10Ladsgroup: [C: 03+2] "labs, noop for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528103 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [12:33:31] !log uploaded openjdk-8 u222 for jessie-wikimedia [12:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:04] RECOVERY - puppet last run on phab1001 is OK: OK: Puppet is currently enabled, last run 42 
seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:34:23] (03Merged) 10jenkins-bot: labs: Set half of wikidata to read from the new term store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528103 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [12:35:17] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Traffic: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (10Mathew.onipe) About cloudelastic resolving to icinga1001, I had jbond help me do see where it cloudelastic.wikimedia.org resol... [12:35:30] Rebased on deploy node ^ [12:36:31] (03CR) 10jenkins-bot: labs: Set half of wikidata to read from the new term store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528103 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [12:40:49] (03PS1) 10Gehel: elasticsearch: correct ports for relforge cluster [software/spicerack] - 10https://gerrit.wikimedia.org/r/528118 (https://phabricator.wikimedia.org/T229807) [12:41:56] (03PS2) 10Jbond: relforge: allow cumin to access elastic search ports [puppet] - 10https://gerrit.wikimedia.org/r/528116 (https://phabricator.wikimedia.org/T229807) [12:42:18] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/528116 (https://phabricator.wikimedia.org/T229807) (owner: 10Jbond) [12:42:37] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/528116 (https://phabricator.wikimedia.org/T229807) (owner: 10Jbond) [12:44:00] !log restarting cassandra on restbase-dev1040 [12:44:02] !log restarting cassandra on restbase-dev1004 [12:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:16] 10Operations, 10Release Pipeline, 10serviceops, 10Goal, 10Release-Engineering-Team (Pipeline): Self-service Deployment Pipeline - 
https://phabricator.wikimedia.org/T228676 (10akosiaris) [12:53:20] 10Operations, 10Release Pipeline, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 3 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10akosiaris) [12:53:23] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Traffic: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (10BBlack) So, yes, cloudelastic is correct in DNS for normal lookups. The issue is that the icinga check defines the virtual ho... [12:55:04] (03CR) 10Jbond: [C: 03+2] relforge: allow cumin to access elastic search ports [puppet] - 10https://gerrit.wikimedia.org/r/528116 (https://phabricator.wikimedia.org/T229807) (owner: 10Jbond) [12:56:17] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/528118 (https://phabricator.wikimedia.org/T229807) (owner: 10Gehel) [12:56:33] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Traffic: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (10Mathew.onipe) @BBlack yea yea.. I've missed your musing on complex system. Thanks. I will make a patch [12:56:53] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Traffic: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (10Mathew.onipe) p:05Triage→03Normal [12:57:18] 10Operations, 10Release Pipeline, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 3 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10akosiaris) restrouter was temporarily deployed in the staging cluster today. Deployment wa... 
[13:01:21] !log rolling update of openjdk-8 on restbase [13:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:10] (03CR) 10Gehel: [C: 03+2] elasticsearch: correct ports for relforge cluster [software/spicerack] - 10https://gerrit.wikimedia.org/r/528118 (https://phabricator.wikimedia.org/T229807) (owner: 10Gehel) [13:09:32] (03CR) 10jenkins-bot: elasticsearch: correct ports for relforge cluster [software/spicerack] - 10https://gerrit.wikimedia.org/r/528118 (https://phabricator.wikimedia.org/T229807) (owner: 10Gehel) [13:09:49] (03PS1) 10Elukey: Add PartOf configuration in the Kafka mirror systemd units [puppet] - 10https://gerrit.wikimedia.org/r/528127 (https://phabricator.wikimedia.org/T229003) [13:11:26] (03CR) 10Elukey: [C: 03+2] Add PartOf configuration in the Kafka mirror systemd units [puppet] - 10https://gerrit.wikimedia.org/r/528127 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [13:12:56] (03CR) 10Ema: [C: 03+1] fifo_log_demux: Allow to specify a service that requires fifo_log_demux [puppet] - 10https://gerrit.wikimedia.org/r/524496 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez) [13:14:32] (03CR) 10Ema: [C: 03+1] ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez) [13:15:54] (03PS2) 10Filippo Giunchedi: base: stop per-host puppet critical when master has issues [puppet] - 10https://gerrit.wikimedia.org/r/528087 (https://phabricator.wikimedia.org/T229262) [13:15:56] (03PS4) 10Filippo Giunchedi: prometheus: alert on widespread puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/526431 (https://phabricator.wikimedia.org/T229262) [13:16:19] (03CR) 10Ema: [C: 03+1] prometheus: Collect ncredir nginx metrics [puppet] - 10https://gerrit.wikimedia.org/r/524409 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez) [13:16:28] !log elukey@cumin1001 START - 
Cookbook sre.kafka.roll-restart-mirror-maker [13:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:21] (03PS1) 10BPirkle: Switch testwiki to use kask (only) for sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528130 (https://phabricator.wikimedia.org/T222099) [13:21:39] (03CR) 10Volans: [C: 03+2] sre.hosts.upgrade-varnish: cp1008 decom cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/527173 (https://phabricator.wikimedia.org/T229586) (owner: 10Volans) [13:22:28] ema: oops... I just noticed a bug, sending a fixing patch [13:23:18] I've always wondered why it's called 'cookbook' :) [13:25:05] hauskatze: it's used in the industry together with runbook. We ended up using runbook for documentation on wikitech and cookbook for automation scripts. [13:26:04] (03CR) 10Vgutierrez: [C: 03+2] fifo_log_demux: Provide pipe creation capabilities [puppet] - 10https://gerrit.wikimedia.org/r/524176 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez) [13:26:08] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Traffic: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (10Mathew.onipe) Sadly, I don't think this will work as the host param will not be unique and icinga does not seem to handle that... [13:26:14] (03PS7) 10Vgutierrez: fifo_log_demux: Provide pipe creation capabilities [puppet] - 10https://gerrit.wikimedia.org/r/524176 (https://phabricator.wikimedia.org/T228382) [13:26:16] (03CR) 10Muehlenhoff: "Patch looks good, but best to wait until https://phabricator.wikimedia.org/T229796 is resolved." 
[puppet] - 10https://gerrit.wikimedia.org/r/527043 (https://phabricator.wikimedia.org/T220503) (owner: 10Jbond) [13:26:51] (03PS2) 10Volans: sre.hosts.upgrade-varnish: cp1008 decom cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/527173 (https://phabricator.wikimedia.org/T229586) [13:26:55] ema: ^^^ with the fix [13:27:18] volans: looking [13:28:10] (03CR) 10Ema: [C: 03+1] sre.hosts.upgrade-varnish: cp1008 decom cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/527173 (https://phabricator.wikimedia.org/T229586) (owner: 10Volans) [13:28:17] volans: wooops, looks good! [13:28:36] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) [13:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:42] (03CR) 10Volans: [C: 03+2] sre.hosts.upgrade-varnish: cp1008 decom cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/527173 (https://phabricator.wikimedia.org/T229586) (owner: 10Volans) [13:29:46] thx, sorry for the trouble :) [13:30:11] (03CR) 10Vgutierrez: [C: 03+2] fifo_log_demux: Allow to specify a service that requires fifo_log_demux [puppet] - 10https://gerrit.wikimedia.org/r/524496 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez) [13:30:23] (03PS4) 10Vgutierrez: fifo_log_demux: Allow to specify a service that requires fifo_log_demux [puppet] - 10https://gerrit.wikimedia.org/r/524496 (https://phabricator.wikimedia.org/T228382) [13:30:51] (03PS2) 10Volans: cumin: remove old scripts converted to cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/527170 (https://phabricator.wikimedia.org/T205886) [13:31:41] (03Merged) 10jenkins-bot: sre.hosts.upgrade-varnish: cp1008 decom cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/527173 (https://phabricator.wikimedia.org/T229586) (owner: 10Volans) [13:33:32] (03PS3) 10Volans: cumin: remove old scripts converted to cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/527170 (https://phabricator.wikimedia.org/T205886) 
[13:33:55] * volans picks the next number to puppet-merge :D [13:34:41] (03PS2) 10Muehlenhoff: Switch eqiad labsldapconfig to the read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/525220 (https://phabricator.wikimedia.org/T46722) [13:34:51] (03CR) 10Volans: [C: 03+2] cumin: remove old scripts converted to cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/527170 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [13:35:12] (03PS1) 10Jbond: cassandra: rolling restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/528133 [13:35:30] vgutierrez: there is your change too to puppet-merge, what should I do? [13:35:48] don't be shy, merge it [13:35:54] will do :) [13:35:58] thx <3 [13:36:46] vgutierrez: {done} [13:36:56] (03CR) 10jerkins-bot: [V: 04-1] cassandra: rolling restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/528133 (owner: 10Jbond) [13:37:02] (03PS11) 10Mathew.onipe: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) [13:37:29] (03CR) 10Eevans: [C: 03+1] Switch testwiki to use kask (only) for sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528130 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle) [13:37:47] !log run cumin 'A:cumin' 'rm -v /usr/local/sbin/{wmf-upgrade-varnish,wmf-upgrade-and-reboot,wmf-downtime-host,wmf-decommission-host}' T205886 [13:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:56] T205886: Cookbooks: convert remaining wmf-* scripts - https://phabricator.wikimedia.org/T205886 [13:38:20] (03PS2) 10Jbond: cassandra: rolling restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/528133 [13:38:25] 10Operations, 10SRE-tools, 10Goal: Expand Spicerack library and SRE Cookbooks - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205867 (10Volans) [13:39:22] (03PS3) 10Eevans: cassandra: rolling restart cookbook [cookbooks] - 
10https://gerrit.wikimedia.org/r/528133 (owner: 10Jbond) [13:40:14] (03CR) 10Muehlenhoff: cassandra: rolling restart cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/528133 (owner: 10Jbond) [13:40:15] (03PS4) 10Mathew.onipe: cloudelastic: remove ocsp_proxy [puppet] - 10https://gerrit.wikimedia.org/r/511381 (https://phabricator.wikimedia.org/T223519) [13:41:08] (03CR) 10Mathew.onipe: Add maps reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [13:42:05] !log deploying tiller in kube-system for helmfile changes - T228837 [13:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:14] T228837: recreate codfw cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228837 [13:44:09] (03CR) 10Eevans: cassandra: rolling restart cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/528133 (owner: 10Jbond) [13:45:36] (03CR) 10CDanis: [C: 03+1] "lg!" 
[puppet] - 10https://gerrit.wikimedia.org/r/526431 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi) [13:46:49] (03CR) 10CDanis: base: stop per-host puppet critical when master has issues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528087 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi) [13:51:25] 10Operations, 10Machine vision, 10Reading-Infrastructure-Team-Backlog, 10serviceops, and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Mholloway) [13:52:16] (03PS1) 10Fsero: k8s: bug: fixing initialize_cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/528137 [13:52:20] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Traffic: Rename gerrit-slave to gerrit-replica - https://phabricator.wikimedia.org/T229822 (10Paladox) [13:52:31] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s: bug: fixing initialize_cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/528137 (owner: 10Fsero) [13:54:43] (03CR) 10Muehlenhoff: cassandra: rolling restart cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/528133 (owner: 10Jbond) [13:56:07] (03PS1) 10Vgutierrez: acme_chief: Add gerrit-replica.wm.o to the gerrit certificate [puppet] - 10https://gerrit.wikimedia.org/r/528138 [13:56:55] !log deploying calico controller in codfw via helmfile - T228837 [13:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:04] T228837: recreate codfw cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228837 [13:57:50] (03PS2) 10Vgutierrez: acme_chief: Add gerrit-replica.wm.o to the gerrit certificate [puppet] - 10https://gerrit.wikimedia.org/r/528138 (https://phabricator.wikimedia.org/T229822) [13:57:50] !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[13:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:23] (03CR) 10Paladox: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/528138 (https://phabricator.wikimedia.org/T229822) (owner: 10Vgutierrez) [13:59:36] (03PS3) 10Paladox: Add gerrit-replica [dns] - 10https://gerrit.wikimedia.org/r/527657 [14:00:02] !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:39] (03PS1) 10Fsero: bug: incorrect values path [deployment-charts] - 10https://gerrit.wikimedia.org/r/528139 [14:01:42] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:01:48] (03PS4) 10Paladox: Add gerrit-replica [dns] - 10https://gerrit.wikimedia.org/r/527657 [14:01:54] (03CR) 10Fsero: [V: 03+2 C: 03+2] bug: incorrect values path [deployment-charts] - 10https://gerrit.wikimedia.org/r/528139 (owner: 10Fsero) [14:02:26] (03CR) 10jerkins-bot: [V: 04-1] Add gerrit-replica [dns] - 10https://gerrit.wikimedia.org/r/527657 (owner: 10Paladox) [14:03:02] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 40.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:04:02] (03PS5) 10Paladox: Add gerrit-replica [dns] - 10https://gerrit.wikimedia.org/r/527657 [14:04:13] !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . 
[14:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:33] !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'coredns' . [14:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:45] (03PS1) 10Filippo Giunchedi: prometheus: stop polling varnish on upload backend [puppet] - 10https://gerrit.wikimedia.org/r/528142 [14:05:47] (03PS1) 10Filippo Giunchedi: kubernetes: expand alert description [puppet] - 10https://gerrit.wikimedia.org/r/528143 (https://phabricator.wikimedia.org/T229262) [14:06:16] !log Depool and restart restbase2009 for maint - T227408 [14:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:25] T227408: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 [14:07:48] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [14:07:48] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:11] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Add gerrit-replica.wm.o to the gerrit certificate [puppet] - 10https://gerrit.wikimedia.org/r/528138 (https://phabricator.wikimedia.org/T229822) (owner: 10Vgutierrez) [14:08:11] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [14:08:12] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:08:17] (03PS6) 10Paladox: Add gerrit-replica [dns] - 10https://gerrit.wikimedia.org/r/527657 (https://phabricator.wikimedia.org/T229822) [14:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:48] (03CR) 10Herron: [C: 03+1] "Yes!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526431 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi) [14:08:50] (03PS4) 10Jbond: cassandra: rolling restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/528133 [14:09:04] (03PS1) 10Fsero: adding limitranges for codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/528145 [14:09:09] (03CR) 10Giuseppe Lavagetto: Add the mediawiki.restart_appservers cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 (owner: 10Giuseppe Lavagetto) [14:09:14] (03CR) 10Jbond: cassandra: rolling restart cookbook (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/528133 (owner: 10Jbond) [14:10:13] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10Papaul) @jijiki I need this server powered down, thanks [14:10:15] (03PS2) 10Fsero: adding limitranges for codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/528145 [14:10:44] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10jijiki) @papaul Server is depooled, ping me when to pool it back, many thanks! 
[14:10:55] (03CR) 10Fsero: [V: 03+2 C: 03+2] adding limitranges for codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/528145 (owner: 10Fsero) [14:12:57] !log fsero@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero,name=codfw [14:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:18] (03PS3) 10Paladox: Gerrit: Rename gerrit-slave to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/527656 [14:13:25] (03PS4) 10Paladox: Gerrit: Rename gerrit-slave to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/527656 [14:13:38] (03Abandoned) 10Paladox: gerrit: Add gerrit-replica to acme [puppet] - 10https://gerrit.wikimedia.org/r/527664 (owner: 10Paladox) [14:14:21] (03CR) 10Vgutierrez: [C: 03+2] Add gerrit-replica [dns] - 10https://gerrit.wikimedia.org/r/527657 (https://phabricator.wikimedia.org/T229822) (owner: 10Paladox) [14:19:22] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: alert on widespread puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/526431 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi) [14:19:27] (03PS5) 10Filippo Giunchedi: prometheus: alert on widespread puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/526431 (https://phabricator.wikimedia.org/T229262) [14:19:46] 10Operations, 10serviceops, 10PHP 7.2 support: Socket Errors on PHP7 - https://phabricator.wikimedia.org/T224538 (10jijiki) For connection pooling purposes, when we want to access `search.svc.eqiad.wmnet` from php-fpm, we are doing so via nginx. This nginx is installed on each mw* server listening on localho... 
[14:21:05] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: alert on widespread puppet failures (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526431 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi) [14:24:31] !log shut down restbase2009 for battery replacement [14:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:34] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:25:38] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:25:42] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:25:42] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:25:44] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:25:44] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 
2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:25:49] (03PS1) 10Marostegui: dbproxy1010: Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/528152 (https://phabricator.wikimedia.org/T222978) [14:25:59] ^^ this is me, should recover shortly [14:26:37] ^^ possibly not, actually I was working on restbase1019 [14:26:58] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:27:04] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:27:12] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:27:14] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:27:14] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:27:16] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:27:21] (03PS2) 10Marostegui: dbproxy1011: Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/528152 (https://phabricator.wikimedia.org/T222978) [14:29:06] 10Operations, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review, and 3 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) Removing HHVM and any leftovers are now part of T229792, we mark this as resolved 💃 [14:29:22] 10Operations, 10serviceops, 10Core Platform Team 
Workboards (Clinic Duty Team), 10Patch-For-Review, and 3 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) 05Open→03Resolved a:03jijiki [14:29:25] 10Operations, 10serviceops: SRE FY19-20 Q1 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (10jijiki) [14:31:19] (03CR) 10BBlack: prometheus: stop polling varnish on upload backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528142 (owner: 10Filippo Giunchedi) [14:31:23] (03CR) 10Marostegui: [C: 03+2] dbproxy1011: Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/528152 (https://phabricator.wikimedia.org/T222978) (owner: 10Marostegui) [14:31:59] !log Reload haproxy on dbproxy1011 to depool labsdb1010 T222978 [14:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:07] T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 [14:33:37] (03CR) 10Herron: [C: 03+1] kubernetes: expand alert description [puppet] - 10https://gerrit.wikimedia.org/r/528143 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi) [14:36:15] (03CR) 10Herron: [C: 03+1] "I have mixed feeling about this (the general complexity of this check) but overall think it's a good step to reduce alert noise. Maybe we" [puppet] - 10https://gerrit.wikimedia.org/r/528087 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi) [14:40:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] systemd::tmpfile: apply changes when we change the files. [puppet] - 10https://gerrit.wikimedia.org/r/527430 (https://phabricator.wikimedia.org/T204450) (owner: 10Giuseppe Lavagetto) [14:40:11] (03PS2) 10Giuseppe Lavagetto: systemd::tmpfile: apply changes when we change the files. 
[puppet] - 10https://gerrit.wikimedia.org/r/527430 (https://phabricator.wikimedia.org/T204450) [14:41:45] <_joe_> uhm [14:43:24] (03PS1) 10Fsero: k8s, cache: disabling codfw services for k8s cluster recreation [puppet] - 10https://gerrit.wikimedia.org/r/528164 (https://phabricator.wikimedia.org/T228837) [14:44:06] (03CR) 10CDanis: [C: 03+1] base: stop per-host puppet critical when master has issues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528087 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi) [14:48:17] (03PS5) 10Paladox: Gerrit: Rename gerrit-slave to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/527656 (https://phabricator.wikimedia.org/T229822) [14:54:32] (03CR) 10Giuseppe Lavagetto: [C: 03+1] k8s, cache: disabling codfw services for k8s cluster recreation [puppet] - 10https://gerrit.wikimedia.org/r/528164 (https://phabricator.wikimedia.org/T228837) (owner: 10Fsero) [14:55:42] (03CR) 10Bstorm: [C: 03+1] locales-extended: Add support for Korean [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527653 (https://phabricator.wikimedia.org/T130532) (owner: 10BryanDavis) [14:56:30] (03PS1) 10Fsero: bug: deleted unused envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/528168 [14:56:59] (03CR) 10Fsero: [V: 03+2 C: 03+2] bug: deleted unused envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/528168 (owner: 10Fsero) [14:57:58] (03PS1) 10RobH: ganeti500[123] mgmt dns [dns] - 10https://gerrit.wikimedia.org/r/528169 (https://phabricator.wikimedia.org/T228099) [14:58:13] (03PS2) 10RobH: ganeti500[123] mgmt dns [dns] - 10https://gerrit.wikimedia.org/r/528169 (https://phabricator.wikimedia.org/T228099) [14:59:15] (03PS3) 10RobH: ganeti500[123] mgmt dns [dns] - 10https://gerrit.wikimedia.org/r/528169 (https://phabricator.wikimedia.org/T228099) [14:59:24] (03PS4) 10RobH: ganeti500[123] mgmt dns [dns] - 10https://gerrit.wikimedia.org/r/528169 (https://phabricator.wikimedia.org/T228099) 
[15:03:18] (03PS1) 10Fsero: helmfile: ammending envs files [deployment-charts] - 10https://gerrit.wikimedia.org/r/528170 [15:06:38] (03CR) 10Fsero: [V: 03+2 C: 03+2] helmfile: ammending envs files [deployment-charts] - 10https://gerrit.wikimedia.org/r/528170 (owner: 10Fsero) [15:07:01] (03CR) 10RobH: [C: 03+2] ganeti500[123] mgmt dns [dns] - 10https://gerrit.wikimedia.org/r/528169 (https://phabricator.wikimedia.org/T228099) (owner: 10RobH) [15:07:49] PROBLEM - Host restbase2009 is DOWN: PING CRITICAL - Packet loss = 100% [15:08:05] ^ downtime expired [15:08:09] fixing [15:09:42] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 6.001 ge 4 Effie Mouzeli We know, host will be retired soon https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [15:10:47] ACKNOWLEDGEMENT - Host restbase2009 is DOWN: PING CRITICAL - Packet loss = 100% Effie Mouzeli Host is down for maint - T227408 [15:10:53] (03PS1) 10Fsero: ammending eqiad env too [deployment-charts] - 10https://gerrit.wikimedia.org/r/528172 [15:11:08] (03CR) 10Fsero: [V: 03+2 C: 03+2] ammending eqiad env too [deployment-charts] - 10https://gerrit.wikimedia.org/r/528172 (owner: 10Fsero) [15:12:43] (03CR) 10CRusnov: "Long explanation inline."
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [15:15:57] (03PS1) 10Elukey: camus: allow to choose the CamusChecker's alert email [puppet] - 10https://gerrit.wikimedia.org/r/528174 [15:16:05] PROBLEM - Host restbase2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:16:49] (03CR) 10CRusnov: "> Patch Set 4:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [15:17:37] (03PS1) 10Pmiazga: Enable editor gender surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528175 (https://phabricator.wikimedia.org/T227793) [15:18:11] (03PS2) 10Pmiazga: Enable editor gender surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528175 (https://phabricator.wikimedia.org/T227793) [15:18:58] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=ncredir2002.codfw.wmnet [15:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:21] (03PS3) 10Isarra: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) [15:19:41] (03PS4) 10Isarra: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) [15:19:53] (03CR) 10Isarra: "Good point, too much json." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra) [15:21:02] (03CR) 10Vgutierrez: [C: 03+2] ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez) [15:21:16] (03PS34) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [15:21:21] !log Add db2127 to tendril and zarcillo (s3) - T228969 [15:21:25] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: tune after an appserver 100% in production [puppet] - 10https://gerrit.wikimedia.org/r/528176 [15:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:29] T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969 [15:24:11] (03CR) 10Alexandros Kosiaris: [C: 03+1] k8s, cache: disabling codfw services for k8s cluster recreation [puppet] - 10https://gerrit.wikimedia.org/r/528164 (https://phabricator.wikimedia.org/T228837) (owner: 10Fsero) [15:26:09] !log recreating zotero and termbox from helmfile codfw - T228837 [15:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:17] T228837: recreate codfw cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228837 [15:26:43] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): elastic1031 - PSU status critical - https://phabricator.wikimedia.org/T229453 (10Gehel) >>! In T229453#5381207, @wiki_willy wrote: > If it's actually a bad PSU, I think we can leave it, since it's due to be refreshed via T221636. Confirmed, we ca... 
[15:27:10] !log recreating zotero and termbox namespaces and services from helmfile codfw - T228837 [15:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:33] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:30:47] PROBLEM - Check systemd state on ncredir2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:32:03] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - termbox_3030: Servers kubernetes2004.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: zotero_1969: Servers kubernetes2004.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:32:13] !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'termbox' for release 'production' . 
[15:32:17] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - termbox_3030: Servers kubernetes2004.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: zotero_1969: Servers kubernetes2002.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:31] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 193.5 ge 130 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [15:36:00] !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'zotero' for release 'production' . [15:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:39] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=DELETE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:37:11] 10Operations, 10SRE-Access-Requests, 10cloud-services-team (Kanban): SRE: root access for Hieu Pham, SRE @ WMCS - https://phabricator.wikimedia.org/T229833 (10aborrero) [15:37:25] 10Operations, 10SRE-Access-Requests, 10cloud-services-team (Kanban): SRE: root access for Hieu Pham, SRE @ WMCS - https://phabricator.wikimedia.org/T229833 (10aborrero) p:05Triage→03High [15:37:33] (03CR) 10Thcipriani: [C: 03+1] Gerrit: Rename gerrit-slave to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/527656 (https://phabricator.wikimedia.org/T229822) (owner: 10Paladox) [15:38:15] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:38:27] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:38:39] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:39:21] 10Operations, 10SRE-Access-Requests, 10cloud-services-team (Kanban): SRE: root access for Hieu Pham, SRE @ WMCS - https://phabricator.wikimedia.org/T229833 (10bd808) +1 as Hieu's manager. Getting all his accounts setup may take a day or two, but we wanted to get the group approval done as soon as we can. [15:41:10] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ncredir2002.codfw.wmnet [15:41:14] (03PS1) 10Bstorm: Apply black formatting [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528177 [15:41:16] (03PS1) 10Bstorm: docker: add support for "stable" and "testing" tags in addition to latest [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T229058) [15:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:33] (03CR) 10Ema: [C: 03+1] k8s, cache: disabling codfw services for k8s cluster recreation [puppet] - 10https://gerrit.wikimedia.org/r/528164 (https://phabricator.wikimedia.org/T228837) (owner: 10Fsero) [15:41:40] (03CR) 10jerkins-bot: [V: 04-1] docker: add support for "stable" and "testing" tags in addition to latest [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm) [15:41:57] (03PS1) 10Fsero: updating namespace creation in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/528179 [15:42:54] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra) [15:45:37] (03CR) 10Bstorm: "There's tests! I'm overjoyed and fixing" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm) [15:47:25] (03PS1) 10Thcipriani: gerrit: replication replicateOnStartup [puppet] - 10https://gerrit.wikimedia.org/r/528181 (https://phabricator.wikimedia.org/T229756) [15:48:36] (03PS2) 10Elukey: camus: allow to choose the CamusChecker's alert email [puppet] - 10https://gerrit.wikimedia.org/r/528174 [15:48:40] (03CR) 10Paladox: [C: 03+1] gerrit: replication replicateOnStartup [puppet] - 10https://gerrit.wikimedia.org/r/528181 (https://phabricator.wikimedia.org/T229756) (owner: 10Thcipriani) [15:49:13] 10Operations, 10SRE-Access-Requests, 10cloud-services-team (Kanban): SRE: root access for Hieu Pham, SRE @ WMCS - https://phabricator.wikimedia.org/T229833 (10bd808) [15:49:15] (03PS2) 10Bstorm: docker: add support for "stable" and "testing" tags [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T229058) [15:51:07] RECOVERY - Host restbase2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.77 ms [15:51:31] 10Operations, 10Beta-Cluster-Infrastructure, 10Mathoid, 10Core Platform Team Legacy (Watching / External), and 2 others: remove mathoid from scb - https://phabricator.wikimedia.org/T200832 (10Pchelolo) Is this done and can be resolved? 
There doesn't seem to be a mathoid installation on scb any longer [15:51:43] RECOVERY - Host restbase2009 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [15:52:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall, see inline for further restricting proxypass, possibly in a followup review" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [15:54:22] (03CR) 10Bstorm: "This is just to restrict to 80 chars, which I think we generally agree we like better than the black default of 110." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528177 (owner: 10Bstorm) [15:54:58] (03CR) 10Fsero: [V: 03+2 C: 03+2] updating namespace creation in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/528179 (owner: 10Fsero) [15:55:27] (03PS2) 10Fsero: k8s, cache: disabling codfw services for k8s cluster recreation [puppet] - 10https://gerrit.wikimedia.org/r/528164 (https://phabricator.wikimedia.org/T228837) [15:55:38] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10Papaul) @jijiki please repool the server when you have a minute. We will have to order a new Storage battery for the server since all the decom HP servers are GEN8 and this one is a GEN9 so diffe... [15:55:57] 10Operations, 10Beta-Cluster-Infrastructure, 10Mathoid, 10Core Platform Team Legacy (Watching / External), and 2 others: remove mathoid from scb - https://phabricator.wikimedia.org/T200832 (10akosiaris) 05Open→03Resolved a:03akosiaris I see `'mathoid' => 'http://deployment-docker-mathoid01.eqiad.wmfl... 
[15:56:31] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:57:35] (03CR) 10Fsero: [C: 03+2] k8s, cache: disabling codfw services for k8s cluster recreation [puppet] - 10https://gerrit.wikimedia.org/r/528164 (https://phabricator.wikimedia.org/T228837) (owner: 10Fsero) [15:57:58] jouncebot: now [15:57:59] No deployments scheduled for the next 1 hour(s) and 2 minute(s) [15:58:03] jouncebot: next [15:58:03] In 1 hour(s) and 1 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T1700) [15:58:28] !log Deploy patch for T200104 [15:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:01] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - termbox_3030: Servers kubernetes2004.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: zotero_1969: Servers kubernetes2002.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:02:27] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - termbox_3030: Servers kubernetes2006.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: zotero_1969: Servers kubernetes2006.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:02:37] !log crusnov@ helmfile [CODFW] Ran 'apply' command on namespace 'termbox' for release 'production' . 
[16:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:19] (03PS2) 10Krinkle: mediawiki: Use HTTPS for /nl-portal and /be-portal redirects [puppet] - 10https://gerrit.wikimedia.org/r/518099 [16:04:27] !log crusnov@ helmfile [CODFW] Ran 'apply' command on namespace 'termbox' for release 'production' . [16:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:31] !log crusnov@ helmfile [CODFW] Ran 'apply' command on namespace 'zotero' for release 'production' . [16:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:48] 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227541 (10akosiaris) [16:08:51] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:09:03] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:10:11] (03PS3) 10Elukey: camus: allow to choose the CamusChecker's alert email [puppet] - 10https://gerrit.wikimedia.org/r/528174 [16:10:18] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10Papaul) @jijiki I made a procurement task the the storage battery at T229847 [16:10:46] !log recreating citoid eventgate-analytics eventgate-main mathoid sessionstore namespaces and redeploying from helmfile T228837 [16:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:54] T228837: recreate codfw cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228837 [16:11:03] (03PS1) 10Vgutierrez: ncredir: Ensure that mtail service doesn't get enabled [puppet] - 10https://gerrit.wikimedia.org/r/528188 (https://phabricator.wikimedia.org/T228382) [16:12:11] (03CR) 10Elukey: 
"https://puppet-compiler.wmflabs.org/compiler1002/17729/" [puppet] - 10https://gerrit.wikimedia.org/r/528174 (owner: 10Elukey) [16:13:04] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (10Marostegui) [16:14:19] (03CR) 10Vgutierrez: [C: 03+2] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1001/17730/" [puppet] - 10https://gerrit.wikimedia.org/r/528188 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez) [16:14:30] (03PS2) 10Vgutierrez: ncredir: Ensure that mtail service doesn't get enabled [puppet] - 10https://gerrit.wikimedia.org/r/528188 (https://phabricator.wikimedia.org/T228382) [16:14:59] (03PS1) 10EBernhardson: Change mjolnir_bulk_daemon kafka topics [puppet] - 10https://gerrit.wikimedia.org/r/528190 (https://phabricator.wikimedia.org/T227364) [16:16:44] !log crusnov@ helmfile [CODFW] Ran 'apply' command on namespace 'citoid' for release 'production' . [16:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:53] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-analytics_31192: Servers kubernetes2006.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: citoid_1970: Servers kubernetes2002.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:17:05] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-analytics_31192: Servers kubernetes2006.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: citoid_1970: Servers kubernetes2002.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:18:19] PROBLEM - puppet last run on cloudvirt1007 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources 
tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:18:58] !log crusnov@ helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . [16:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:15] !log crusnov@ helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'main' . [16:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:22] !log crusnov@ helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'main' . [16:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:38] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10Marostegui) As per the sync on the SRE meeting, @JHedden will be online from WMCS. I will handle the announcement for wikitech, could... [16:25:09] PROBLEM - LVS HTTP IPv4 #page on mathoid.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.20 and port 10042: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:25:09] PROBLEM - puppet last run on ncredir2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Service[mtail] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:26:47] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:26:55] 10Operations, 10ops-codfw, 10DBA: (2019-08-31)rack/setup/install db2131.codfw.wmnet - https://phabricator.wikimedia.org/T229251 (10Papaul) [16:27:33] ACKNOWLEDGEMENT - LVS HTTP IPv4 #page on mathoid.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.20 and port 10042: Connection refused Fsero T228837 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:28:25] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:28:51] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10JHedden) >>! In T229657#5393428, @Marostegui wrote: > As per the sync on the SRE meeting, @JHedden will be online from WMCS. > I will... [16:28:53] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:29:14] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10Marostegui) Thanks! 
[16:29:17] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:29:56] (03PS1) 10Vgutierrez: ncredir: Ensure that the default mtail service gets masked [puppet] - 10https://gerrit.wikimedia.org/r/528193 (https://phabricator.wikimedia.org/T228382) [16:30:12] ACKNOWLEDGEMENT - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet Fsero T228837 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:30:12] ACKNOWLEDGEMENT - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet Fsero T228837 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:30:45] RECOVERY - puppet last run on ncredir2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:32:07] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:32:23] !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'mathoid' for release 'production' . [16:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:31] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:33:15] RECOVERY - LVS HTTP IPv4 #page on mathoid.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 925 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:33:15] (03CR) 10Vgutierrez: [C: 03+2] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/17731/" [puppet] - 10https://gerrit.wikimedia.org/r/528193 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez) [16:33:27] (03PS2) 10Vgutierrez: ncredir: Ensure that the default mtail service gets masked [puppet] - 10https://gerrit.wikimedia.org/r/528193 (https://phabricator.wikimedia.org/T228382) [16:34:07] (03CR) 10Jforrester: [C: 03+1] gerrit: replication replicateOnStartup [puppet] - 10https://gerrit.wikimedia.org/r/528181 (https://phabricator.wikimedia.org/T229756) (owner: 10Thcipriani) [16:37:13] !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'sessionstore' for release 'production' . [16:37:19] PROBLEM - LVS HTTP IPv4 #page on sessionstore.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.29 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:53] are we trying to set a new record for number of paging alerts during an SRE meeting? 
:) [16:37:59] lol [16:38:03] hold my pager [16:38:09] 6 seconds from the log to the 📟 was impressive [16:38:15] ACKNOWLEDGEMENT - LVS HTTP IPv4 #page on sessionstore.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.29 and port 8081: Connection refused Fsero expected page from cluster recreation https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:38:58] RECOVERY - LVS HTTP IPv4 #page on sessionstore.svc.codfw.wmnet is OK: HTTP OK: Status line output matched 200 - 258 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:39:10] what the [16:39:47] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:40:02] is the cluster recreation, sorry, this shouldn't have paged (it was downtimed) [16:40:03] no more noise [16:40:05] it ended [16:41:13] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:41:27] RECOVERY - Check systemd state on ncredir2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:14] (03CR) 10Ottomata: [C: 03+1] camus: allow to choose the CamusChecker's alert email [puppet] - 10https://gerrit.wikimedia.org/r/528174 (owner: 10Elukey) [16:43:45] (03PS4) 10Elukey: camus: allow to choose the CamusChecker's alert email [puppet] - 10https://gerrit.wikimedia.org/r/528174 [16:44:08] (03CR) 10Ottomata: [C: 03+1] "Hm, I think refine calls this emails_to, and accepts a list of emails. $check_emails_to? I don't care that much, target is ok too" [puppet] - 10https://gerrit.wikimedia.org/r/528174 (owner: 10Elukey) [16:45:24] (03CR) 10Elukey: "> Hm, I think refine calls this emails_to, and accepts a list of" [puppet] - 10https://gerrit.wikimedia.org/r/528174 (owner: 10Elukey) [16:45:44] (03CR) 10Ottomata: [C: 03+1] "Probably don't need list."
[puppet] - 10https://gerrit.wikimedia.org/r/528174 (owner: 10Elukey) [16:46:17] RECOVERY - puppet last run on cloudvirt1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:47:21] (03PS1) 10Alexandros Kosiaris: Increase mathoid resourcequotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/528196 (https://phabricator.wikimedia.org/T228837) [16:48:11] (03CR) 10Muehlenhoff: [C: 03+1] "Seems fine as a stopgap until fixed upstream." [puppet] - 10https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406) (owner: 10Filippo Giunchedi) [16:49:32] (03CR) 10Fsero: Increase mathoid resourcequotas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/528196 (https://phabricator.wikimedia.org/T228837) (owner: 10Alexandros Kosiaris) [16:52:11] !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'blubberoid' for release 'production' . [16:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:27] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes2002.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: cxserver_8080: Servers kubernetes2004.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:52:39] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes2006.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: cxserver_8080: Servers kubernetes2006.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:53:35] !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 
'production' . [16:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:44] !log fsero@ helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [16:53:45] (03PS2) 10Alexandros Kosiaris: Increase mathoid resourcequotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/528196 (https://phabricator.wikimedia.org/T228837) [16:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:12] (03CR) 10Alexandros Kosiaris: Increase mathoid resourcequotas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/528196 (https://phabricator.wikimedia.org/T228837) (owner: 10Alexandros Kosiaris) [16:54:48] (03CR) 10Fsero: [V: 03+2 C: 03+2] Increase mathoid resourcequotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/528196 (https://phabricator.wikimedia.org/T228837) (owner: 10Alexandros Kosiaris) [16:55:41] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:55:51] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:00:04] gehel and onimisionipe: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T1700). 
[17:00:16] here here [17:00:34] jouncebot: no deployment for wdqs [17:04:16] (03CR) 10Elukey: [C: 03+2] camus: allow to choose the CamusChecker's alert email [puppet] - 10https://gerrit.wikimedia.org/r/528174 (owner: 10Elukey) [17:10:57] (03PS1) 10Fsero: k8s, codfw: disabling quotas on eventgate, cxserver and mathoid as they need more work [deployment-charts] - 10https://gerrit.wikimedia.org/r/528202 (https://phabricator.wikimedia.org/T228837) [17:11:45] PROBLEM - toolschecker: kubernetes webservice running #page on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 177 bytes in 9.828 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [17:12:13] arturo: ? [17:12:25] argh. I'll put that back into long term downtime [17:12:31] ah. thank you [17:12:37] that test is crappy [17:12:46] thanks bd808 ! [17:12:48] (03PS2) 10Fsero: k8s, codfw: disabling quotas on eventgate, zotero, cxserver and mathoid as they need more work [deployment-charts] - 10https://gerrit.wikimedia.org/r/528202 (https://phabricator.wikimedia.org/T228837) [17:18:01] Downtimed until 2019-09-02. Hopefully we will either fix the test or decide to turn them back off entirely by then. [17:18:25] (03PS3) 10Fsero: k8s, codfw: disabling quotas on eventgate, zotero, cxserver and mathoid as they need more work [deployment-charts] - 10https://gerrit.wikimedia.org/r/528202 (https://phabricator.wikimedia.org/T228837) [17:19:17] (03PS4) 10Fsero: k8s, codfw: disabling quotas on some namespaces. [deployment-charts] - 10https://gerrit.wikimedia.org/r/528202 (https://phabricator.wikimedia.org/T228837) [17:19:36] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s, codfw: disabling quotas on some namespaces.
[deployment-charts] - 10https://gerrit.wikimedia.org/r/528202 (https://phabricator.wikimedia.org/T228837) (owner: 10Fsero) [17:24:11] robh: can you update the topic in here re: clinic duty? [17:24:24] is it you? [17:24:34] i assume yes since you asked ;D [17:24:42] no [17:24:46] it is shdubsh [17:24:50] I'm next week [17:24:58] I think it was shdubsh [17:25:30] (03PS1) 10Fsero: bug: it expects and empty map [deployment-charts] - 10https://gerrit.wikimedia.org/r/528203 [17:25:39] robh: ^ [17:25:47] (03CR) 10Fsero: [V: 03+2 C: 03+2] bug: it expects and empty map [deployment-charts] - 10https://gerrit.wikimedia.org/r/528203 (owner: 10Fsero) [17:25:51] =] [17:26:12] thanks robh! [17:28:06] !log fsero@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero,name=codfw [17:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:50] (03CR) 10RobH: [C: 03+2] adding jclark to shell and dc ops group [puppet] - 10https://gerrit.wikimedia.org/r/525847 (https://phabricator.wikimedia.org/T229124) (owner: 10RobH) [17:30:59] (03PS3) 10RobH: adding jclark to shell and dc ops group [puppet] - 10https://gerrit.wikimedia.org/r/525847 (https://phabricator.wikimedia.org/T229124) [17:31:41] (03PS1) 10Bstorm: toolschecker: Remove a lot of paging of all SRE [puppet] - 10https://gerrit.wikimedia.org/r/528204 [17:33:42] !log Pool restbase2009 - T227408 [17:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:51] T227408: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 [17:36:07] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 96.25 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [17:42:19] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm, anything to wake up 
fewer SREs (although this might wake up wmcs more... we'll see!)" [puppet] - 10https://gerrit.wikimedia.org/r/528204 (owner: 10Bstorm) [17:46:02] (03PS1) 10RobH: updating notation for dc operations group [puppet] - 10https://gerrit.wikimedia.org/r/528206 (https://phabricator.wikimedia.org/T229124) [17:58:08] (03CR) 10RobH: [C: 03+2] updating notation for dc operations group [puppet] - 10https://gerrit.wikimedia.org/r/528206 (https://phabricator.wikimedia.org/T229124) (owner: 10RobH) [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T1800). [18:00:04] raynor, bpirkle, jdlrobson, and ebernhardson: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:09] I'm here [18:00:14] I can SWAT today! 
[18:00:20] x`o/ [18:00:55] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528175 (https://phabricator.wikimedia.org/T227793) (owner: 10Pmiazga) [18:01:34] \o [18:01:40] hi jdlrobson [18:01:54] (03Merged) 10jenkins-bot: Enable editor gender surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528175 (https://phabricator.wikimedia.org/T227793) (owner: 10Pmiazga) [18:02:24] raynor: Your patch is on mwdebug1002 [18:02:44] Urbanecm, thx [18:02:45] (03CR) 10jenkins-bot: Enable editor gender surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528175 (https://phabricator.wikimedia.org/T227793) (owner: 10Pmiazga) [18:03:18] testing [18:03:19] (03PS2) 10Urbanecm: Switch testwiki to use kask (only) for sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528130 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle) [18:03:25] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528130 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle) [18:03:34] thanks raynor [18:03:44] bpirkle: +2'ed your patch, will ping you once it's on mwdebug1002 [18:04:40] (03Merged) 10jenkins-bot: Switch testwiki to use kask (only) for sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528130 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle) [18:04:44] ebernhardson: Around? [18:04:57] (03CR) 10jenkins-bot: Switch testwiki to use kask (only) for sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528130 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle) [18:05:04] bpirkle: your patch is on mwdebug1002 [18:05:49] Urbanecm: thank you. Good to go. 
[18:05:52] (03PS5) 10Urbanecm: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra) [18:05:57] Urbanecm - it works preprerply, please sync prod [18:06:01] properly* [18:06:18] syncing [18:06:34] !log reinit postgres on maps1001 - T229788 [18:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:43] T229788: postgresql replication issues on maps1001 - https://phabricator.wikimedia.org/T229788 [18:07:31] Urbanecm: ya [18:07:32] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: e44a6e6: Enable editor gender surveys (T227793) (duration: 00m 48s) [18:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:41] T227793: First round editor gender surveys - https://phabricator.wikimedia.org/T227793 [18:07:54] ebernhardson: Hi, I've +2'ed your backport, will ping you once it's ready to be tested [18:08:00] Urbanecm: k [18:09:10] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra) [18:09:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 254ecc1: Switch testwiki to use kask (only) for sessions (T222099) (duration: 00m 48s) [18:09:46] bpirkle, synced [18:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:48] T222099: Staging release of RESTBagOStuff using Kask - https://phabricator.wikimedia.org/T222099 [18:10:39] Urbanecm: thank you [18:10:45] happy to help! 
[18:11:14] (03Merged) 10jenkins-bot: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra) [18:11:29] (03CR) 10jenkins-bot: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra) [18:11:58] jdlrobson: your patch is on mwdebug1002, please test [18:12:55] on it [18:13:26] <3 [18:13:26] ebernhardson: Your patch is on mwdebug1002, please test [18:17:09] Seems to work! Although why are we getting a flash of extra footer at first what the crap. [18:17:15] Totally unrelated, though. [18:17:40] Urbanecm +1 to @isarra that works! Please sync :) [18:17:54] syncing! [18:18:29] Urbanecm, thx for deploying the genders survey and related-articles patches \o/ [18:18:43] happy to help! [18:19:27] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: a9e4ed8: Remove related-articles-footer-blacklisted-skins.dblist (T229644, 1/3) (duration: 00m 49s) [18:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:35] T229644: RelatedArticles showing on all German and Russian Wikipedia due to incorrect configuration settings - https://phabricator.wikimedia.org/T229644 [18:20:29] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: a9e4ed8: Remove related-articles-footer-blacklisted-skins.dblist (T229644, 2/3) (duration: 00m 47s) [18:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:28] !log urbanecm@deploy1001 Synchronized dblists/: SWAT: a9e4ed8: Remove related-articles-footer-blacklisted-skins.dblist (T229644, 3/3) (duration: 00m 46s) [18:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:36] jdlrobson: synced [18:21:42] (03PS1) 10Ppchelko: Switch updateBetaFeaturesUserCounts job to 
eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528209 (https://phabricator.wikimedia.org/T228705) [18:21:42] ebernhardson: Ping? [18:22:37] jdlrobson, Urbanecm: Thanks for handling this! [18:22:44] happy to help! [18:24:26] yeh thanks Urbanecm :) [18:24:34] yw [18:25:39] !log Deployed patch for T207094 [18:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:29] Urbanecm: sorry, debugging things in multiple places :) [18:27:40] Urbanecm: it generally looks to work [18:27:46] np, thanks [18:27:48] syncing [18:29:20] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/WikimediaEvents/: SWAT: 3ee0e84: Temporarily log search to two schemas (duration: 00m 47s) [18:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:28] ebernhardson: Synced! [18:32:22] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/AbuseFilter/: SWAT: 936a462: Better handling of DNONE (T214674, T228677) (duration: 00m 47s) [18:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:32] T228677: use of get_matches function returns "Requesting array item of non-array" - https://phabricator.wikimedia.org/T228677 [18:32:33] T214674: Short circuit fails with assignments - https://phabricator.wikimedia.org/T214674 [18:35:26] Urbanecm: thanks [18:35:34] yw ebernhardson [18:39:44] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/AbuseFilter/: SWAT: d358f17: Revert "Better handling of DNONE" (T214674, T228677) (duration: 00m 47s) [18:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:54] T228677: use of get_matches function returns "Requesting array item of non-array" - https://phabricator.wikimedia.org/T228677 [18:39:54] T214674: Short circuit fails with assignments - https://phabricator.wikimedia.org/T214674 [18:41:03] (03PS1) 10BBlack: cloudelastic: Fix LVS IPv6 address [puppet] - 
10https://gerrit.wikimedia.org/r/528215 (https://phabricator.wikimedia.org/T224324) [18:41:06] (03PS1) 10BBlack: cloudelastic: Fix LVS IPv6 address [dns] - 10https://gerrit.wikimedia.org/r/528216 (https://phabricator.wikimedia.org/T224324) [18:44:26] !log Morning SWAT done [18:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:49] PROBLEM - Disk space on wdqs1005 is CRITICAL: DISK CRITICAL - free space: /srv 53079 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs1005&var-datasource=eqiad+prometheus/ops [18:47:34] SMalyshev: ∆ disk space above, any idea ? [18:58:31] RECOVERY - toolschecker: kubernetes webservice running #page on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 32.216 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [19:10:10] (03Abandoned) 10Dzahn: mediawiki::php::restarts: try to avoid including LVS but still get pools [puppet] - 10https://gerrit.wikimedia.org/r/527285 (owner: 10Dzahn) [19:16:35] (03CR) 10Dzahn: [C: 03+2] gerrit: replication replicateOnStartup [puppet] - 10https://gerrit.wikimedia.org/r/528181 (https://phabricator.wikimedia.org/T229756) (owner: 10Thcipriani) [19:16:42] (03PS2) 10Dzahn: gerrit: replication replicateOnStartup [puppet] - 10https://gerrit.wikimedia.org/r/528181 (https://phabricator.wikimedia.org/T229756) (owner: 10Thcipriani) [19:17:44] (03PS2) 10Paladox: gerrit: Re-enable the use of HTTP auth tokens [puppet] - 10https://gerrit.wikimedia.org/r/527596 (https://phabricator.wikimedia.org/T225308) [19:20:42] mutante: thanks for the merge! [19:21:05] (03CR) 10Cwhite: [C: 04-1] "Comments inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528142 (owner: 10Filippo Giunchedi) [19:24:18] thcipriani: yw. 
now actually applied on cobalt but did not restart [19:26:25] mutante: great, thanks, I'll give gerrit a restart (on gerrit2001 now as well :)) [19:26:40] (03PS6) 10Dzahn: Gerrit: Rename gerrit-slave to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/527656 (https://phabricator.wikimedia.org/T229822) (owner: 10Paladox) [19:26:56] thcipriani: we can add this ^ [19:27:07] looks like DNS and acme is already done [19:27:17] mutante: sure, let's get that in as well [19:27:44] (03CR) 10Dzahn: [C: 03+2] Gerrit: Rename gerrit-slave to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/527656 (https://phabricator.wikimedia.org/T229822) (owner: 10Paladox) [19:28:49] (03PS1) 10Andrew Bogott: nova: update scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/528231 (https://phabricator.wikimedia.org/T216195) [19:29:29] thcipriani we can now use gerrit2001 to accurately test config changes/plugin updates :D (since we will now be able to verify if it'll take gerrit down) [19:29:32] (03CR) 10Andrew Bogott: [C: 03+2] nova: update scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/528231 (https://phabricator.wikimedia.org/T216195) (owner: 10Andrew Bogott) [19:29:36] paladox: https://gerrit-replica.wikimedia.org/r/ [19:29:43] yup [19:29:43] thcipriani: + ServerAlias gerrit-replica.wikimedia.org [19:29:51] \o/ [19:31:01] git clone https://gerrit-replica.wikimedia.org/r/operations/puppet Cloning into 'puppet'...
[19:31:11] (03CR) 10BBlack: [C: 03+2] cloudelastic: Fix LVS IPv6 address [dns] - 10https://gerrit.wikimedia.org/r/528216 (https://phabricator.wikimedia.org/T224324) (owner: 10BBlack) [19:31:44] (03PS2) 10BBlack: cloudelastic: Fix LVS IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/528215 (https://phabricator.wikimedia.org/T224324) [19:31:53] mutante: nice :) [19:32:07] (03CR) 10BBlack: [C: 03+2] cloudelastic: Fix LVS IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/528215 (https://phabricator.wikimedia.org/T224324) (owner: 10BBlack) [19:32:44] mutante: ok, so are we ready for a restart? [19:32:54] thcipriani: yes, we are [19:33:25] * thcipriani does [19:33:42] !log gerrit restart for gerrit-replica on gerrit2001 [19:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:04] !log fixing up cloudelastic LVS IPv6 stuff on lvs1014, lvs1016, cloudelastic* - possible monitoring noise [19:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:31] alright, gerrit2001 looks ok, cobalt incoming [19:35:49] !log gerrit restart on cobalt for configuration updates [19:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:14] paladox / thcipriani: cloning puppet repo from gerrit: real time: 24 seconds. cloning from gerrit-replica: real time: 34s .. oops? [19:37:27] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:37:45] gerrit's unavailable, expected I assume [19:37:56] (getting 503s on code review pages in gerrit.wikimedia.org) [19:38:08] seems back now! [19:38:12] bblack: yes, expected. 
it's restarting as we speak [19:38:23] already back for me [19:38:26] yep, should be back now [19:39:22] (03CR) 10CDanis: [C: 03+1] gerrit: Re-enable the use of HTTP auth tokens [puppet] - 10https://gerrit.wikimedia.org/r/527596 (https://phabricator.wikimedia.org/T225308) (owner: 10Paladox) [19:39:50] paladox: can't reproduce the same time .. next attempt it's just 24s [19:40:01] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2],Exec[git_pull_analytics.wikimedia.org] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:40:07] mutante it's very slow for me over the Atlantic :P [19:40:12] silly question [19:40:16] gerrit2001 has more RAM, right? [19:40:19] paladox: gerrit-replica.esams.wikimedia.org :P [19:40:24] cdanis yup [19:40:26] yea [19:40:29] and cpu power [19:40:32] any thought to making it the primary? 🙃 [19:40:48] the db is read only cdanis [19:40:55] (the proxy in codfw) [19:41:01] i'd just try to unblock the "switch eqiad to a 64GB machine" [19:41:06] fair enough [19:41:20] actually, i will look at just that now..
hmm [19:41:41] paladox: and now we see how long updating all repos on the replica takes to clear the queue :) [19:42:11] :D [19:43:17] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [19:43:19] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:44:02] that looks real [19:44:07] (esams problems) [19:44:14] mutante it took me 6m5.376s [19:46:35] paladox: i will answer your question from another channel here: i just confirmed cobalt and gerrit2001 have same NIC speed. 1000MB/s [19:46:39] Mb/s [19:46:43] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [19:46:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:46:50] ok, thanks [19:47:57] it's a link flap, appears over now - https://phabricator.wikimedia.org/T205609 [19:49:00] heh wrong task [19:49:54] that one: https://phabricator.wikimedia.org/T228827 [19:49:54] show-queue shows github && gerrit2001 are being updated [19:49:55] paladox: i wonder how long it takes you to use the mirror in wmflabs "OK, I set up a Gerrit mirror today. Clone URLs are https://ggmirror.wmflabs.org/git/.git. And https://ggmirror.wmflabs.org/cgit/ as a web view/debugger." [19:50:02] oh [19:50:04] * paladox tries [19:50:39] that's much faster in the cloud [19:50:40] 0m11.568s [19:51:12] !log depool wdqs1005 - T229876 [19:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:22] T229876: blazegraph journal on wdqs1005 has doubled in space - https://phabricator.wikimedia.org/T229876 [19:51:51] ACKNOWLEDGEMENT - Disk space on wdqs1005 is CRITICAL: DISK CRITICAL - free space: /srv 46842 MB (3% inode=99%): Gehel tracked in https://phabricator.wikimedia.org/T229876 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs1005&var-datasource=eqiad+prometheus/ops [19:52:24] cobalt is going at ~6MiB/s and gerrit2001 is at ~200 KiB/s for me [19:53:59] bblack: checked whether CenturyLink had maintenance announced .. nope.. not this time [19:54:03] PROBLEM - puppet last run on dns5001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:54:17] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:54:41] paladox: lol, that's such a huge difference. can you repeat that a couple times? is the time different each time? [19:54:59] and i mean after deleting everything ..so always from scratch [19:55:19] mutante it's the same each time from what i can tell (gerrit2001 being in the KiB/s range) [19:55:21] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:55:29] first time i see a single "Widespread puppet agent failures" by itself, nice [19:55:45] yeah! [19:55:57] but also we are trained already to just be alarmed if it scrolls :p [19:56:01] hehe [19:56:08] I think it’s the first time it’s alerted. old ones haven’t been squelched yet though [19:57:06] herron: i wonder if it should link to puppetboard instead of grafana [19:57:21] when this happens we want the server names rather than a number, right [19:57:31] also does https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1 look like widespread to you? [19:57:45] the forced perspective of 0%-100% on those graphs is ... not very helpful [19:57:56] swift .. hmm.. i see [19:58:27] yes although does puppet board have details when compilation fails? iirc data is sent to puppetdb after the catalog is compiled, so you might not see it in puppetboard?
[19:58:28] i want to click something on that dashboard to get to the list of actual servers [19:59:17] herron: https://puppetboard.wikimedia.org/nodes?status=failed [19:59:39] i think i would like that as the link target [20:00:04] cscott, arlolra, subbu, bearND, halfak, and accraze: That opportune time is upon us again. Time for a Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T2000). [20:00:04] nice, yes! [20:00:20] probably good to include both the dashboard and the puppetboard [20:00:59] yea, and it should be possible to add multiple links with grafana checks [20:01:19] swift = 1 host = 1 disk always dies because ..scale [20:01:32] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@e774a05]: Update mobileapps to c713c2e [20:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:08] herron: more things we can improve. if a check is already "handled" (ACKed) from Icinga's point of view then it should ignore that in the overall "widespread issues" check [20:04:23] like ms-be1040 in this case..
was already known that it has hardware issue [20:04:30] yet it pushed the check over the edge [20:05:01] PROBLEM - MariaDB Slave Lag: m3 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [20:06:07] PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 354.70 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [20:06:23] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@e774a05]: Update mobileapps to c713c2e (duration: 04m 51s) [20:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:30] (03PS2) 10Bstorm: toolschecker: Remove a lot of paging of all SRE [puppet] - 10https://gerrit.wikimedia.org/r/528204 [20:09:27] (03PS3) 10Bstorm: toolschecker: Remove a lot of paging of all SRE [puppet] - 10https://gerrit.wikimedia.org/r/528204 [20:09:31] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:10:01] (03PS1) 10CDanis: Release 1.1.4 [software/conftool] - 10https://gerrit.wikimedia.org/r/528251 [20:10:44] whoa nice widespread puppet failure alert [20:10:47] that's awesome [20:11:29] mutante: I added an iframe for the puppetboard failed list to https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:11:35] (03CR) 10Jhedden: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/528204 (owner: 10Bstorm) [20:11:36] should save the need to have two links in the alert [20:12:02] (03PS4) 10Bstorm: toolschecker: Remove a lot of paging of all SRE [puppet] - 10https://gerrit.wikimedia.org/r/528204 [20:12:35] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:15:08] 
herron: wow, nice. i did not expect i would like iframes that much [20:15:28] haha yeah me either [20:15:59] (03PS5) 10Bstorm: toolschecker: Remove a lot of paging of all SRE [puppet] - 10https://gerrit.wikimedia.org/r/528204 [20:16:53] (03CR) 10Bstorm: "Ok, I'm done now. I've pinned everything to wmcs/paws groups. When done, I'd like to test it by doing something like stopping one of the" [puppet] - 10https://gerrit.wikimedia.org/r/528204 (owner: 10Bstorm) [20:17:01] (03PS2) 10CDanis: Release 1.1.4 [software/conftool] - 10https://gerrit.wikimedia.org/r/528251 [20:17:03] (03PS1) 10CDanis: debian: Release 1.1.4-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/528257 [20:19:43] !log arlolra@deploy1001 Started deploy [parsoid/deploy@d3a2937]: Updating Parsoid to 7232dff [20:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:03] (03CR) 10CDanis: [C: 03+2] Release 1.1.4 [software/conftool] - 10https://gerrit.wikimedia.org/r/528251 (owner: 10CDanis) [20:21:36] (03PS6) 10Bstorm: toolschecker: Remove a lot of paging of all SRE [puppet] - 10https://gerrit.wikimedia.org/r/528204 [20:21:49] RECOVERY - puppet last run on dns5001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:23:05] RECOVERY - puppet last run on mw1258 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:23:21] (03CR) 10Bstorm: [C: 03+2] toolschecker: Remove a lot of paging of all SRE [puppet] - 10https://gerrit.wikimedia.org/r/528204 (owner: 10Bstorm) [20:23:49] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:25:00] (03Merged) 10jenkins-bot: Release 1.1.4 [software/conftool] - 10https://gerrit.wikimedia.org/r/528251 (owner: 10CDanis) [20:25:13] (03CR) 10CDanis: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/528257 (owner: 10CDanis) [20:25:40] (03PS2) 10MSantos: First version of the wikifeeds chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/526679 (https://phabricator.wikimedia.org/T229287) [20:27:13] (03PS1) 10Thcipriani: gerrit: do not treat github as a mirror [puppet] - 10https://gerrit.wikimedia.org/r/528259 [20:27:51] (03CR) 10CDanis: [C: 03+2] debian: Release 1.1.4-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/528257 (owner: 10CDanis) [20:28:46] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@d3a2937]: Updating Parsoid to 7232dff (duration: 09m 02s) [20:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:33] (03Merged) 10jenkins-bot: debian: Release 1.1.4-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/528257 (owner: 10CDanis) [20:32:14] (03CR) 10Paladox: [C: 03+1] "Wouldn't this mean branches that are deleted won't be deleted on the mirror?" 
[puppet] - 10https://gerrit.wikimedia.org/r/528259 (owner: 10Thcipriani) [20:34:08] !log Updated Parsoid to 7232dff (T228223) [20:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:16] T228223: tokensToString being called on KV->V looks like a signature mismatch - https://phabricator.wikimedia.org/T228223 [20:34:54] (03PS1) 10EBernhardson: Repoint cloudelastic at LB dns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528260 (https://phabricator.wikimedia.org/T220625) [20:39:17] (03CR) 10Dzahn: [C: 03+2] gerrit: do not treat github as a mirror [puppet] - 10https://gerrit.wikimedia.org/r/528259 (owner: 10Thcipriani) [20:41:27] (03PS1) 10Paladox: Gerrit: Set replicatePermissions to false for GitHub remote [puppet] - 10https://gerrit.wikimedia.org/r/528262 [20:41:46] (03PS2) 10Paladox: Gerrit: Set replicatePermissions to false for GitHub remote [puppet] - 10https://gerrit.wikimedia.org/r/528262 [20:42:41] (03PS3) 10Dzahn: Gerrit: Set replicatePermissions to false for GitHub remote [puppet] - 10https://gerrit.wikimedia.org/r/528262 (owner: 10Paladox) [20:44:01] (03PS2) 10EBernhardson: Repoint cloudelastic at LB dns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528260 (https://phabricator.wikimedia.org/T220625) [20:44:03] (03PS1) 10EBernhardson: Temporarily stop writing to cloudelastic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528263 (https://phabricator.wikimedia.org/T220625) [20:44:57] (03CR) 10jerkins-bot: [V: 04-1] Temporarily stop writing to cloudelastic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528263 (https://phabricator.wikimedia.org/T220625) (owner: 10EBernhardson) [20:45:04] (03CR) 10jerkins-bot: [V: 04-1] Repoint cloudelastic at LB dns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528260 (https://phabricator.wikimedia.org/T220625) (owner: 10EBernhardson) [20:45:39] PROBLEM - puppet last run on an-worker1091 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked 
by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:47:27] (03CR) 10Dzahn: [C: 03+1] Gerrit: Set replicatePermissions to false for GitHub remote [puppet] - 10https://gerrit.wikimedia.org/r/528262 (owner: 10Paladox) [20:49:14] !log nuke all search indices on cloudelastic preparing for fresh imports and live updates T220625 [20:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:23] T220625: Initialize CirrusSearch on cloudelastic - https://phabricator.wikimedia.org/T220625 [20:56:00] (03CR) 10Thcipriani: [C: 03+1] Gerrit: Set replicatePermissions to false for GitHub remote [puppet] - 10https://gerrit.wikimedia.org/r/528262 (owner: 10Paladox) [20:56:59] (03CR) 10Dzahn: [C: 03+2] Gerrit: Set replicatePermissions to false for GitHub remote [puppet] - 10https://gerrit.wikimedia.org/r/528262 (owner: 10Paladox) [21:00:04] Reedy and sbassett: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T2100). 
[21:12:19] RECOVERY - puppet last run on an-worker1091 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:13:47] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 163 threshold =0.15 breach: timed_out: False, initializing_shards: 0, number_of_data_nodes: 4, active_primary_shards: 179, cluster_name: cloudelastic-chi-eqiad, number_of_pending_tasks: 0, status: red, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, unassigned_shards: 163, task_max_waiting_ [21:13:47] 0, active_shards_percent_as_number: 73.36601307189542, relocating_shards: 0, number_of_nodes: 4, active_shards: 449 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:47] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 163 threshold =0.15 breach: active_primary_shards: 179, delayed_unassigned_shards: 0, number_of_data_nodes: 4, timed_out: False, unassigned_shards: 163, number_of_pending_tasks: 0, relocating_shards: 0, task_max_waiting_in_queue_millis: 0, initializing_shards: 0, number_of_nodes: 4, cluster_name: cloud [21:13:47] , active_shards_percent_as_number: 73.36601307189542, status: red, active_shards: 449, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:14:19] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 165 threshold =0.15 breach: delayed_unassigned_shards: 0, number_of_nodes: 4, unassigned_shards: 165, number_of_pending_tasks: 0, active_shards_percent_as_number: 73.5576923076923, timed_out: False, number_of_in_flight_fetch: 0, initializing_shards: 0, active_primary_shards: 183, status: red, task_max_ [21:14:19] millis: 0, cluster_name: cloudelastic-chi-eqiad, number_of_data_nodes: 4, 
relocating_shards: 0, active_shards: 459 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:14:19] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 165 threshold =0.15 breach: delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, number_of_pending_tasks: 0, status: red, active_shards: 459, unassigned_shards: 165, number_of_data_nodes: 4, timed_out: False, cluster_name: cloudelastic-chi-eqiad, relocating_shards: 0, number_of_in_flight_ [21:14:20] primary_shards: 183, number_of_nodes: 4, initializing_shards: 0, active_shards_percent_as_number: 73.5576923076923 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:15:35] cloudelastic problems expected, i'm reinitializing it [21:16:40] (03CR) 10BryanDavis: docker: add support for "stable" and "testing" tags (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm) [21:18:02] (03Abandoned) 10EBernhardson: Temporarily stop writing to cloudelastic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528263 (https://phabricator.wikimedia.org/T220625) (owner: 10EBernhardson) [21:18:34] (03PS3) 10EBernhardson: Repoint cloudelastic at LB dns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528260 (https://phabricator.wikimedia.org/T220625) [21:22:25] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1004 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, active_shards_percent_as_number: 85.38913362701909, number_of_pending_tasks: 0, active_shards: 1163, unassigned_shards: 199, status: red, initializing_shards: 0, active_primary_shards: 423, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, number_of_d [21:22:25] k_max_waiting_in_queue_millis: 0, number_of_nodes: 4, timed_out: False, relocating_shards: 0 
https://wikitech.wikimedia.org/wiki/Search%23Administration [21:22:25] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: active_shards_percent_as_number: 85.38913362701909, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_in_flight_fetch: 0, unassigned_shards: 199, number_of_data_nodes: 4, initializing_shards: 0, active_shards: 11 [21:22:25] ing_in_queue_millis: 0, active_primary_shards: 423, number_of_nodes: 4, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:22:41] !log start importing group0 to cloudelastic from mwmaint1002 [21:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:37] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, delayed_unassigned_shards: 0, unassigned_shards: 216, initializing_shards: 0, number_of_in_flight_fetch: 0, relocating_shards: 0, timed_out: False, number_of_nodes: 4, active_shards: 1260, active_shards_percent_as_number: 85.36585365853658, number_of_data_nodes [21:24:37] ting_in_queue_millis: 0, active_primary_shards: 458, status: red, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:39] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1003 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: task_max_waiting_in_queue_millis: 0, number_of_in_flight_fetch: 0, delayed_unassigned_shards: 0, number_of_data_nodes: 4, active_shards: 1260, cluster_name: cloudelastic-chi-eqiad, initializing_shards: 0, relocating_shards: 0, number_of_nodes: 4, number_of_pending_tasks: 0, active_shards_percent_as [21:24:39] 365853658, unassigned_shards: 216, timed_out: False, status: red, active_primary_shards: 458 
https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:45] !log ✔️ cdanis@install1002.wikimedia.org ~ 🕠 sudo -E reprepro -C main include stretch-wikimedia conftool-1.1.4-1/conftool_1.1.4-1_amd64.changes [21:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:06] !log ✔️ cdanis@install1002.wikimedia.org ~ 🕠🍺 sudo -E reprepro -C main include buster-wikimedia conftool-1.1.4-1/conftool_1.1.4-1+deb10u1_amd64.changes [21:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:17] !log ✔️ cdanis@install1002.wikimedia.org ~ 🕠🍺 sudo -E reprepro -C main include jessie-wikimedia conftool-1.1.4-1/conftool_1.1.4-1+deb8u1_amd64.changes [21:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:35] (03PS1) 10Herron: kafka-main: replace kafka1001 hardware with kafka-main1001 [puppet] - 10https://gerrit.wikimedia.org/r/528271 (https://phabricator.wikimedia.org/T225005) [21:27:08] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕠🍺 sudo debdeploy deploy -u 2019-08-05-conftool.yaml -s mw-canary [21:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:12] !log [21:28:12] mutante: Message missing. Nothing logged. 
[21:28:43] !log 🔔 scandium - re-enabled icinga notifications for various services [21:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:05] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕠🍺 sudo cumin -p99 -b100 'A:all' 'apt-get update' [21:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:52] (03CR) 10Bstorm: docker: add support for "stable" and "testing" tags (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm) [21:35:34] !log ❌cdanis@cumin1001.eqiad.wmnet ~ 🕠🍺 sudo debdeploy deploy -u 2019-08-05-conftool.yaml -s eqsin [21:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:09] RECOVERY - MariaDB Slave Lag: m3 on db2065 is OK: OK slave_sql_lag Replication lag: 23.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [21:39:11] RECOVERY - MariaDB Slave Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 20.87 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [21:39:25] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕠🍺 sudo debdeploy deploy -u 2019-08-05-conftool.yaml -s all [21:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:35] (03PS1) 10Herron: calico: add all kafka-main hosts to k8s eventgate policy [puppet] - 10https://gerrit.wikimedia.org/r/528275 (https://phabricator.wikimedia.org/T225005) [21:40:45] PROBLEM - puppet last run on cloudstore1009 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:42:31] (03PS1) 10Thcipriani: gerrit: replication: exclude some projects [puppet] - 10https://gerrit.wikimedia.org/r/528276 [21:44:31] (03CR) 10Cwhite: [C: 03+1] kubernetes: expand alert description [puppet] - 10https://gerrit.wikimedia.org/r/528143 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi) [21:44:59] (03CR) 10Paladox: [C: 03+1] "looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/528276 (owner: 10Thcipriani) [21:55:28] !log powering down wtp2011 for BIOS upgrade [21:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:23] PROBLEM - Host wtp2011 is DOWN: PING CRITICAL - Packet loss = 100% [22:06:19] RECOVERY - Host wtp2011 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [22:08:39] RECOVERY - puppet last run on cloudstore1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:37:02] (03PS1) 10Viztor: Add fonts-noto-cjk-extra, replace fonts-noto-cjk [puppet] - 10https://gerrit.wikimedia.org/r/528279 (https://phabricator.wikimedia.org/T226633) [22:45:05] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for db2131 [dns] - 10https://gerrit.wikimedia.org/r/528281 [22:45:58] (03PS5) 10CRusnov: netbox: Fix additional swift parameters [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) [22:47:27] (03CR) 10CRusnov: netbox: Fix additional swift parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [22:47:34] (03CR) 10CRusnov: [C: 03+2] netbox: Fix additional swift parameters [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [22:49:40] (03PS2) 10Dzahn: DNS: Add mgmt and production DNS for db2131 [dns] - 
10https://gerrit.wikimedia.org/r/528281 (owner: 10Papaul) [22:51:17] (03CR) 10Dzahn: [C: 03+2] DNS: Add mgmt and production DNS for db2131 [dns] - 10https://gerrit.wikimedia.org/r/528281 (owner: 10Papaul) [22:53:58] (03PS1) 10Dzahn: mediawiki:maintenance: switch readinglists cron to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528282 (https://phabricator.wikimedia.org/T195392) [22:55:15] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:55:25] ^ yah i'm looking at it [22:55:29] looks like a merge weirdness or something [22:56:51] (03PS1) 10Papaul: DHCP: Add MAC address for db2131 [puppet] - 10https://gerrit.wikimedia.org/r/528283 (https://phabricator.wikimedia.org/T229251) [22:58:32] (03CR) 10Dzahn: [C: 03+2] DHCP: Add MAC address for db2131 [puppet] - 10https://gerrit.wikimedia.org/r/528283 (https://phabricator.wikimedia.org/T229251) (owner: 10Papaul) [23:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190805T2300). [23:00:04] ebernhardson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:13] (03CR) 10Dzahn: [C: 03+2] mediawiki:maintenance: switch readinglists cron to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528282 (https://phabricator.wikimedia.org/T195392) (owner: 10Dzahn) [23:00:17] (03PS2) 10Dzahn: mediawiki:maintenance: switch readinglists cron to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528282 (https://phabricator.wikimedia.org/T195392) [23:00:19] I can SWAT today! [23:00:21] ebernhardson, around? 
[23:00:22] Urbanecm: :) [23:00:43] Urbanecm: this one isn't really noticeable, it only affects job runners [23:00:46] (03CR) 10Dzahn: "applied on install server, you can start install" [puppet] - 10https://gerrit.wikimedia.org/r/528283 (https://phabricator.wikimedia.org/T229251) (owner: 10Papaul) [23:00:48] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528260 (https://phabricator.wikimedia.org/T220625) (owner: 10EBernhardson) [23:00:50] and then only group0 [23:01:00] ok [23:01:08] so nothing to test, i guess ebernhardson ? [23:01:15] Urbanecm: right [23:01:17] ok [23:01:49] (03Merged) 10jenkins-bot: Repoint cloudelastic at LB dns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528260 (https://phabricator.wikimedia.org/T220625) (owner: 10EBernhardson) [23:02:09] (03CR) 10jenkins-bot: Repoint cloudelastic at LB dns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528260 (https://phabricator.wikimedia.org/T220625) (owner: 10EBernhardson) [23:03:09] (03PS1) 10CRusnov: netbox: Fix parameter that didnt get passed [puppet] - 10https://gerrit.wikimedia.org/r/528285 [23:03:15] !log urbanecm@deploy1001 Synchronized wmf-config/ProductionServices.php: SWAT: 87b428d: Repoint cloudelastic at LB dns (T220625) (duration: 00m 48s) [23:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:24] T220625: Initialize CirrusSearch on cloudelastic - https://phabricator.wikimedia.org/T220625 [23:03:25] ebernhardson, synced! [23:03:36] Urbanecm: thanks, i'll watch the logs if anything silly happens [23:03:39] anything else? 
[23:03:43] thanks [23:03:44] Urbanecm: nope, that's it [23:03:47] (03CR) 10Dzahn: "manually tested this on mwmaint1002 - no difference - purges db rows using mwscript" [puppet] - 10https://gerrit.wikimedia.org/r/528282 (https://phabricator.wikimedia.org/T195392) (owner: 10Dzahn) [23:03:50] !log Evening SWAT done [23:03:51] okay then ^^ [23:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:56] (03PS1) 10CRusnov: netbox: add dummy swift url key [labs/private] - 10https://gerrit.wikimedia.org/r/528286 [23:06:27] The MediaWiki script file "/srv/mediawiki/php-1.34.0-wmf.16/extensions/WikimediaMaintenance/getJobQueueLengths.php" does not exist. [23:06:36] well then.. i guess that's an outdated maintenance cron :p [23:06:54] (03PS2) 10CRusnov: netbox: Fix parameter that didnt get passed [puppet] - 10https://gerrit.wikimedia.org/r/528285 [23:10:30] (03PS1) 10Dzahn: mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - 10https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392) [23:11:23] (03CR) 10jerkins-bot: [V: 04-1] mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - 10https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392) (owner: 10Dzahn) [23:11:43] (03CR) 10Paladox: "You have to either remove this manually or use the ensure => absent." 
[puppet] - 10https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392) (owner: 10Dzahn) [23:12:19] PROBLEM - HHVM rendering on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:13:20] (03PS2) 10Dzahn: mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - 10https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392) [23:14:16] (03CR) 10jerkins-bot: [V: 04-1] mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - 10https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392) (owner: 10Dzahn) [23:14:51] (03CR) 10Dzahn: "I am going to remove it manually since it's just 2 hosts." [puppet] - 10https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392) (owner: 10Dzahn) [23:15:25] RECOVERY - HHVM rendering on mw1341 is OK: HTTP OK: HTTP/1.1 200 OK - 81879 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:15:45] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:17:48] (03CR) 10CRusnov: [V: 03+2 C: 03+2] netbox: add dummy swift url key [labs/private] - 10https://gerrit.wikimedia.org/r/528286 (owner: 10CRusnov) [23:18:37] mutante: the core commit paladox pointed to is not directly related [23:20:07] Krinkle: you mean it's not the change that actually removed it but just that it updates a comment? 
well, i was already surprised positively either way [23:20:42] (03PS3) 10Dzahn: mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - 10https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392) [23:21:25] (03PS4) 10Dzahn: mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - 10https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392) [23:22:41] (03CR) 10Krinkle: [C: 03+1] mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - 10https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392) (owner: 10Dzahn) [23:25:22] (03CR) 10Dzahn: [C: 03+2] mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - 10https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392) (owner: 10Dzahn) [23:25:30] (03PS5) 10Dzahn: mediawiki:maintenance: remove non-working jobqueue_stats cron [puppet] - 10https://gerrit.wikimedia.org/r/528287 (https://phabricator.wikimedia.org/T195392) [23:25:38] (03PS1) 10Viztor: Add Noto Sans CJK + Noto Mono CJK fonts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528290 [23:28:53] mutante i think he means the script (the script was removed from the Wikimedia extension due to the removal of the class it was using from MW core) [23:30:07] PROBLEM - puppet last run on cloudvirt1006 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:34:39] !log mwmaint1002 - remove getJobQueueLengths.php from www-data's crontab (T195392) [23:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:48] T195392: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 [23:34:54] paladox: ok, ack [23:36:26] (03PS1) 10Jhedden: toolschecker: webservice final process status [puppet] - 10https://gerrit.wikimedia.org/r/528292 (https://phabricator.wikimedia.org/T221301) [23:37:54] (03PS2) 10Viztor: Add Noto Sans CJK + Noto Mono CJK fonts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528290 [23:43:31] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:45:47] (03PS2) 10Jhedden: toolschecker: Ensure webservice is fully stopped [puppet] - 10https://gerrit.wikimedia.org/r/528292 (https://phabricator.wikimedia.org/T221301) [23:53:39] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:58:03] RECOVERY - puppet last run on cloudvirt1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun