[00:00:04] <wikibugs>	 (CR) Dzahn: "> Patch Set 10:" [puppet] - https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[00:03:18] <wikibugs>	 (PS4) Bstorm: sssd: Add a whole duplicate hierarchy of sssd images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058)
[00:05:33] <wikibugs>	 (CR) Dzahn: "> Patch Set 10: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[00:05:40] <ebernhardson>	 Krinkle: indeed
[00:05:53] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01087 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:07:19] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.00432 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:09:05] <icinga-wm>	 RECOVERY - puppet last run on cloudnet2002-dev is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:13:05] <wikibugs>	 (PS5) Bstorm: toolforge-kubernetes: restructure pod security policies [puppet] - https://gerrit.wikimedia.org/r/537732 (https://phabricator.wikimedia.org/T227290)
[00:13:59] <wikibugs>	 (PS12) Dzahn: parsoid: introduce parameter to use parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654)
[00:14:06] <wikibugs>	 (PS6) Bstorm: toolforge-k8s: proposed role for all tools [puppet] - https://gerrit.wikimedia.org/r/537755 (https://phabricator.wikimedia.org/T227290)
[00:16:17] <wikibugs>	 (CR) Dzahn: "> - has_lvs removed" [puppet] - https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[00:18:15] <wikibugs>	 (PS1) Dzahn: add fake key for parsoid.svc, delete fake key for wtp2001 [labs/private] - https://gerrit.wikimedia.org/r/540251 (https://phabricator.wikimedia.org/T233654)
[00:18:36] <wikibugs>	 (PS2) Dzahn: add fake key for parsoid.svc, delete fake key for wtp2001 [labs/private] - https://gerrit.wikimedia.org/r/540251 (https://phabricator.wikimedia.org/T233654)
[00:22:46] <wikibugs>	 (CR) Dzahn: [V: +2 C: +2] add fake key for parsoid.svc, delete fake key for wtp2001 [labs/private] - https://gerrit.wikimedia.org/r/540251 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[00:24:51] <wikibugs>	 (CR) Dzahn: "i needed https://gerrit.wikimedia.org/r/c/labs/private/+/540251  to get parsoid.svc.key to make the compiler work. that was the other cert" [puppet] - https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[00:28:44] <wikibugs>	 (CR) Dzahn: "so yea. this looks alright in compiler but isn't modules/secret/secrets/ssl/parsoid.svc.eqiad.wmnet.key missing in the actually private pr" [puppet] - https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[00:30:05] <wikibugs>	 (CR) Dzahn: "> Patch Set 12:" [puppet] - https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[00:30:52] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.25/skins/Vector/: d30064229f9 (duration: 00m 59s)
[00:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:21] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.25/resources/src: 5eb3ae1e888e353 (duration: 01m 00s)
[00:34:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:09] <wikibugs>	 (PS1) Dzahn: add certificate for parsoid.discovery/parsoid.svc [puppet] - https://gerrit.wikimedia.org/r/540252 (https://phabricator.wikimedia.org/T233654)
[00:51:28] <wikibugs>	 (PS2) Dzahn: add certificate for parsoid.discovery/parsoid.svc [puppet] - https://gerrit.wikimedia.org/r/540252 (https://phabricator.wikimedia.org/T233654)
[00:52:32] <wikibugs>	 (CR) Dzahn: "also created puppet CA cert for parsoid.svc / parsoid.discovery following" [puppet] - https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[01:09:10] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:10:42] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[03:11:48] <icinga-wm>	 PROBLEM - ps1-c1-codfw-infeed-load-tower-A-phase-X on ps1-c1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:11:52] <icinga-wm>	 PROBLEM - ps1-c1-codfw-infeed-load-tower-B-phase-Z on ps1-c1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:11:58] <icinga-wm>	 PROBLEM - ps1-c1-codfw-infeed-load-tower-A-phase-Y on ps1-c1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:12:36] <icinga-wm>	 PROBLEM - ps1-c1-codfw-infeed-load-tower-B-phase-Y on ps1-c1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:14:30] <icinga-wm>	 RECOVERY - ps1-c1-codfw-infeed-load-tower-B-phase-Z on ps1-c1-codfw is OK: SNMP OK - ps1-c1-codfw-infeed-load-tower-B-phase-Z 188 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:15:04] <icinga-wm>	 PROBLEM - Host db2077.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:17:50] <icinga-wm>	 RECOVERY - ps1-c1-codfw-infeed-load-tower-A-phase-Y on ps1-c1-codfw is OK: SNMP OK - ps1-c1-codfw-infeed-load-tower-A-phase-Y 405 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:20:56] <icinga-wm>	 RECOVERY - Host db2077.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 36.76 ms
[03:21:14] <icinga-wm>	 PROBLEM - Host elastic2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:21:34] <icinga-wm>	 PROBLEM - ps1-c1-codfw-infeed-load-tower-B-phase-Z on ps1-c1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:22:28] <icinga-wm>	 PROBLEM - Host db2112.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:22:38] <icinga-wm>	 PROBLEM - Host restbase2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:22:54] <icinga-wm>	 PROBLEM - ps1-c1-codfw-infeed-load-tower-A-phase-Z on ps1-c1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:23:18] <icinga-wm>	 PROBLEM - ps1-c1-codfw-infeed-load-tower-A-phase-Y on ps1-c1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:23:44] <icinga-wm>	 PROBLEM - ps1-c1-codfw-infeed-load-tower-B-phase-X on ps1-c1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:24:40] <icinga-wm>	 PROBLEM - Host db2087.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:24:58] <icinga-wm>	 PROBLEM - Host mc2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:25:16] <icinga-wm>	 PROBLEM - Host cloudservices2002-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:25:22] <icinga-wm>	 PROBLEM - Host restbase2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:25:48] <icinga-wm>	 PROBLEM - Host cloudcontrol2003-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:26:18] <icinga-wm>	 PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[03:26:50] <icinga-wm>	 PROBLEM - Host es2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:27:54] <icinga-wm>	 PROBLEM - Host ganeti2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:27:56] <icinga-wm>	 PROBLEM - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:27:58] <icinga-wm>	 PROBLEM - Host es2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:28:02] <icinga-wm>	 PROBLEM - Host ores2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:28:08] <icinga-wm>	 PROBLEM - Host mc2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:28:12] <icinga-wm>	 PROBLEM - Host pc2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:28:30] <icinga-wm>	 RECOVERY - Host restbase2011.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 36.78 ms
[03:28:42] <icinga-wm>	 PROBLEM - Host db2077.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:35:18] <icinga-wm>	 PROBLEM - Host ganeti2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:36:16] <icinga-wm>	 PROBLEM - Host restbase2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[04:03:33] <wikibugs>	 Operations, Acme-chief, Traffic: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 (Vgutierrez) Open→Resolved a:Vgutierrez
[04:07:16] <vgutierrez>	 !log restarting trafficserver-tls on cp5007
[04:07:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:08:40] <icinga-wm>	 RECOVERY - traffic_server tls process restarted on cp5007 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5007&var-layer=tls
[04:46:16] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:51:08] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:53:54] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:58:48] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:14:03] <wikibugs>	 Operations, ops-codfw, netops: msw-c1 down? - https://phabricator.wikimedia.org/T234411 (faidon) p:Triage→High
[05:16:44] <wikibugs>	 (CR) Giuseppe Lavagetto: "> > Patch Set 4: Verified-1" [puppet] - https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: Filippo Giunchedi)
[05:49:18] <icinga-wm>	 RECOVERY - Host mc2027.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 36.78 ms
[05:55:36] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "I tried to think about this, and given what you are trying to do with the prometheus exporter there is no way around it, and you will need" [software/service-checker] - https://gerrit.wikimedia.org/r/538711 (owner: Cwhite)
[05:55:40] <wikibugs>	 (PS4) Giuseppe Lavagetto: change EndpointMetrics from static to instance variable [software/service-checker] - https://gerrit.wikimedia.org/r/538711 (owner: Cwhite)
[05:55:49] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] change EndpointMetrics from static to instance variable [software/service-checker] - https://gerrit.wikimedia.org/r/538711 (owner: Cwhite)
[05:56:55] <wikibugs>	 (Merged) jenkins-bot: change EndpointMetrics from static to instance variable [software/service-checker] - https://gerrit.wikimedia.org/r/538711 (owner: Cwhite)
[05:57:02] <icinga-wm>	 PROBLEM - Host mc2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:04:23] <wikibugs>	 Operations, ops-eqiad, DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (Marostegui) Ping @Cmjohnson @Jclark-ctr was the mainboard replaced? Thanks!
[06:09:01] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[06:09:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:11] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[06:09:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:15] <wikibugs>	 Operations, ops-eqiad, DC-Ops, decommission: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1073.eqiad.wmnet` -  db1073.eqiad.wmnet (**PASS**)   - Downtimed host on Ic...
[06:10:06] <wikibugs>	 (PS1) Marostegui: site.pp: Remove db1073 references. [puppet] - https://gerrit.wikimedia.org/r/540256 (https://phabricator.wikimedia.org/T231892)
[06:10:45] <wikibugs>	 (PS1) Marostegui: wmnet: Remove production DNS entries from db1073 [dns] - https://gerrit.wikimedia.org/r/540257 (https://phabricator.wikimedia.org/T231892)
[06:11:07] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Remove db1073 references. [puppet] - https://gerrit.wikimedia.org/r/540256 (https://phabricator.wikimedia.org/T231892) (owner: Marostegui)
[06:11:20] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Remove production DNS entries from db1073 [dns] - https://gerrit.wikimedia.org/r/540257 (https://phabricator.wikimedia.org/T231892) (owner: Marostegui)
[06:12:25] <wikibugs>	 Operations, ops-eqiad, DC-Ops, decommission: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (Marostegui) a:RobH→Cmjohnson
[06:12:56] <wikibugs>	 Operations, ops-eqiad, DC-Ops, decommission: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (Marostegui) Host ready for on-site steps  + switch port disablement
[06:17:11] <marostegui>	 !log Fix replication on labsdb1011:s1 - T233986
[06:17:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:05] <marostegui>	 !log Fix replication on labsdb1011:s7 - T233986
[06:23:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:50] <wikibugs>	 Operations, ops-eqiad, DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (elukey)
[06:31:47] <wikibugs>	 Operations, netops: Telia IC-314534 (eqord/eqdfw 10Gbps wave) down - https://phabricator.wikimedia.org/T234335 (elukey) p:Triage→Normal
[06:32:26] <wikibugs>	 Operations, netops: Telia IC-314534 (eqord/eqdfw 10Gbps wave) down - https://phabricator.wikimedia.org/T234335 (elukey) Link is down again as far as I can see from icinga and:  ` elukey@re0.cr2-codfw> show interfaces descriptions |match down` xe-5/2/1        up    down Transport: cr2-eqord:xe-0/1/0 (Teli...
[06:32:29] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure: Puppet fail on deployment-mediawiki-07, missing private hiera variable - https://phabricator.wikimedia.org/T210497 (mobrovac) Open→Resolved a:fgiunchedi Puppet is running fine there, closing.
[06:35:03] <wikibugs>	 Operations, Puppet, Beta-Cluster-Infrastructure, Traffic: Puppet fails on deployment-cache-text05 - https://phabricator.wikimedia.org/T234412 (mobrovac)
[06:47:12] <wikibugs>	 Operations, netops: Telia IC-314534 (eqord/eqdfw 10Gbps wave) down - https://phabricator.wikimedia.org/T234335 (elukey) ` elukey@re0.cr2-codfw> show interfaces diagnostics optics xe-5/2/1 Physical interface: xe-5/2/1     Laser bias current                        :  46.512 mA     Laser output power...
[07:06:53] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM, one typo inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/540207 (https://phabricator.wikimedia.org/T233636) (owner: Herron)
[07:07:34] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/539973 (owner: Dzahn)
[07:07:55] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/540161 (owner: Elukey)
[07:09:22] <wikibugs>	 (PS2) Elukey: profile::kerberos::replication: add AAAA ferm rules [puppet] - https://gerrit.wikimedia.org/r/540161
[07:12:47] <wikibugs>	 (CR) jerkins-bot: [V: -1] profile::kerberos::replication: add AAAA ferm rules [puppet] - https://gerrit.wikimedia.org/r/540161 (owner: Elukey)
[07:12:58] <elukey>	 ufff
[07:14:24] <elukey>	 not clear to me why it failed
[07:15:33] <wikibugs>	 (CR) Elukey: "recheck" [puppet] - https://gerrit.wikimedia.org/r/540161 (owner: Elukey)
[07:20:29] <wikibugs>	 (CR) Elukey: [C: +2] profile::kerberos::replication: add AAAA ferm rules [puppet] - https://gerrit.wikimedia.org/r/540161 (owner: Elukey)
[07:23:08] <wikibugs>	 (CR) ArielGlenn: [C: +1] "Fine by me." [puppet] - https://gerrit.wikimedia.org/r/540238 (owner: Bstorm)
[07:39:44] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[07:42:56] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[07:46:04] <moritzm>	 !log upgrading remaining stretch hosts to ferm 2.4.2pre
[07:46:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:37] <wikibugs>	 (CR) Hashar: "recheck" [puppet] - https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) (owner: Herron)
[07:53:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] elasticsearch: add toggle for rsyslog udp logback compat include [puppet] - https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) (owner: Herron)
[07:54:29] <wikibugs>	 Operations, netops: Telia IC-314534 (eqord/eqdfw 10Gbps wave) down - https://phabricator.wikimedia.org/T234335 (elukey) I missed an email from Telia, they are replacing a faulty card that apparently caused flaps and the impact that we saw. Hopefully we'll see recovery soon.
[07:57:16] <wikibugs>	 (CR) Hashar: "I have raised the CI job timeout from 3 minutes to 5 minutes, even though it should usually take less than 2 minutes.  In those specific f" [puppet] - https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) (owner: Herron)
[08:00:22] <icinga-wm>	 RECOVERY - Host cloudservices2002-dev.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 36.72 ms
[08:00:24] <icinga-wm>	 RECOVERY - Host es2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms
[08:00:26] <icinga-wm>	 RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.89 ms
[08:00:26] <icinga-wm>	 RECOVERY - ps1-c1-codfw-infeed-load-tower-A-phase-X on ps1-c1-codfw is OK: SNMP OK - ps1-c1-codfw-infeed-load-tower-A-phase-X 512 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:28] <icinga-wm>	 RECOVERY - ps1-c1-codfw-infeed-load-tower-A-phase-Y on ps1-c1-codfw is OK: SNMP OK - ps1-c1-codfw-infeed-load-tower-A-phase-Y 384 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:28] <icinga-wm>	 RECOVERY - ps1-c1-codfw-infeed-load-tower-B-phase-Z on ps1-c1-codfw is OK: SNMP OK - ps1-c1-codfw-infeed-load-tower-B-phase-Z 177 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:28] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:00:30] <icinga-wm>	 RECOVERY - Host cloudcontrol2003-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.35 ms
[08:00:44] <icinga-wm>	 RECOVERY - ps1-c1-codfw-infeed-load-tower-B-phase-X on ps1-c1-codfw is OK: SNMP OK - ps1-c1-codfw-infeed-load-tower-B-phase-X 272 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:56] <icinga-wm>	 RECOVERY - ps1-c1-codfw-infeed-load-tower-A-phase-Z on ps1-c1-codfw is OK: SNMP OK - ps1-c1-codfw-infeed-load-tower-A-phase-Z 678 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:01:30] <icinga-wm>	 RECOVERY - Host es2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 42.47 ms
[08:01:30] <icinga-wm>	 RECOVERY - Host elastic2031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.88 ms
[08:02:22] <icinga-wm>	 RECOVERY - Host ganeti2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 42.51 ms
[08:02:22] <icinga-wm>	 RECOVERY - Host ores2005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.59 ms
[08:02:26] <icinga-wm>	 RECOVERY - Host db2112.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.99 ms
[08:02:36] <icinga-wm>	 RECOVERY - Host pc2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 43.56 ms
[08:03:06] <icinga-wm>	 RECOVERY - Host db2077.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.68 ms
[08:03:26] <icinga-wm>	 RECOVERY - Host ganeti2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.08 ms
[08:03:26] <icinga-wm>	 RECOVERY - Host db2125.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.01 ms
[08:03:40] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:04:32] <icinga-wm>	 RECOVERY - Host mc2027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.71 ms
[08:04:44] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:04:46] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:04:50] <icinga-wm>	 RECOVERY - Host restbase2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.34 ms
[08:04:54] <icinga-wm>	 RECOVERY - Host db2087.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.81 ms
[08:05:06] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:05:34] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:05:34] <icinga-wm>	 RECOVERY - Host mc2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.76 ms
[08:05:42] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:05:44] <icinga-wm>	 RECOVERY - Host restbase2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.05 ms
[08:05:46] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:06:10] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:06:10] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:06:22] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:06:30] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:06:44] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:06:56] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:07:10] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:07:18] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:07:22] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:07:44] <elukey>	 mmmm api appservers slowness from the red dashboard afaics
[08:07:46] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:07:58] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:08:06] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:08:06] <elukey>	 https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200&from=now-1h&to=now
[08:08:32] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:09:07] <elukey>	 Cc: vgutierrez _joe_ --^
[08:09:21] <elukey>	 should be on its way to recover fully in theory
[08:09:22] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:09:34] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:09:34] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:10:08] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:10:15] <_joe_>	 we had a spike of 5xx on the appserver side a few minutes ago
[08:10:30] <elukey>	 there are fluctuations in latency for apis afaics
[08:10:50] <_joe_>	 yes
[08:10:53] <_joe_>	 pretty severe even
[08:11:06] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:11:10] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:11:23] <_joe_>	 ok
[08:11:34] <_joe_>	 can someone check the logstash for 5xx for finding patterns?
[08:11:44] <paravoid>	 seems correlated with the mgmt issue?
[08:11:46] <_joe_>	 I will try to look at an api server to get what's happening there
[08:11:55] <paravoid>	 so perhaps it wasn't just a management switch then?
[08:11:59] <_joe_>	 paravoid: I doubt it, this is the api being slow in eqiad
[08:12:00] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:12:12] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:12:15] <_joe_>	 unless the problem is much much deeper
[08:12:46] <_joe_>	 we have some latency in the memcached response times
[08:12:46] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:12:48] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:13:13] <_joe_>	 the api servers are not overloaded either
[08:13:34] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:13:48] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:14:10] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:14:13] <_joe_>	 I can't find anything significant in running api processes
[08:14:27] <_joe_>	 but I think we got here after the worst had passed
[08:14:38] <_joe_>	 it was less than 5 minutes
[08:15:34] <paravoid>	 would cross-DC networking issues explain any of that?
[08:15:37] <elukey>	 sorry I was afk, checking
[08:15:48] <_joe_>	 paravoid: only intra-eqiad ones would
[08:16:14] <_joe_>	 as if for instance getting data from memcached took 5 ms instead of 1, that can explain partially this
[08:16:34] <_joe_>	 but if that was due to network errors, I'd see in the mcrouter logs
[08:16:37] <_joe_>	 lemme see
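A back-of-the-envelope sketch of _joe_'s point that slower memcached responses add up over a request. The 1 ms and 5 ms figures come from the message above; the count of 50 sequential lookups per request is a made-up number for illustration only, not a measured value.

```shell
# Hypothetical number of sequential memcached lookups in one API request.
lookups_per_request=50

# Per-lookup latency in milliseconds, as quoted in the channel.
normal_ms=1
degraded_ms=5

# Extra time spent waiting on memcached when every lookup is slower.
extra_ms=$(( lookups_per_request * (degraded_ms - normal_ms) ))
echo "extra latency per request: ${extra_ms} ms"
```

Under that assumption a small per-lookup slowdown becomes a visible per-request delay, which fits the "partially" in _joe_'s remark.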
[08:17:29] <elukey>	 from the memcached dashboard there doesn't seem to be anything big ongoing
[08:18:11] <_joe_>	 no the last tko was tonight at 6 am
[08:18:21] <_joe_>	 sorry yesterday night
[08:18:41] <_joe_>	 elukey: there is something wrong with the mcrouter unit file btw, sigh
[08:19:37] <elukey>	 the issue seems to have subsided right? the appservers latency looks better now
[08:19:41] <_joe_>	 yes
[08:19:56] <_joe_>	 https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&panelId=9&fullscreen&from=now-30m&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200
[08:21:37] <_joe_>	 I'll take some time to dig into this in a few
[08:21:50] <elukey>	 ack
[08:24:45] <paravoid>	 so
[08:24:48] <paravoid>	 Oct  2 08:04:25  asw2-a-eqiad fpc7 sfp-7/0/46 link 46 SFP receive power low  warning set
[08:25:03] <paravoid>	 with xe-7/0/46 being the cr2-eqiad <-> asw2-a-eqiad 10G
[08:25:24] <paravoid>	 so that could explain some networking issues like lost packets
[08:25:25] <paravoid>	 but
[08:25:44] <paravoid>	 this seems to be timestamped at :04 which is after the issues started
[08:29:27] <wikibugs>	 Operations, ops-eqiad, netops: asw2-a-eqiad <-> cr2-eqiad fiber issue - https://phabricator.wikimedia.org/T234416 (faidon) p:Triage→High
[08:29:32] <paravoid>	 (filed above)
[08:29:44] <icinga-wm>	 PROBLEM - Host restbase2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:29:46] <icinga-wm>	 PROBLEM - Host db2087.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:29:48] <paravoid>	 the wtf again
[08:29:57] <paravoid>	 the timing of this is pretty weird
[08:30:02] <icinga-wm>	 PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:26] <icinga-wm>	 PROBLEM - Host mc2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:38] <icinga-wm>	 PROBLEM - Host restbase2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:31:02] <icinga-wm>	 PROBLEM - Host cloudservices2002-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:31:08] <icinga-wm>	 PROBLEM - Host cloudcontrol2003-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:32:00] <icinga-wm>	 PROBLEM - Host es2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:32:00] <icinga-wm>	 PROBLEM - Host elastic2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:32:34] <elukey>	 nice catch for the fiber issue
[08:32:50] <icinga-wm>	 PROBLEM - Host ganeti2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:32:50] <icinga-wm>	 PROBLEM - Host es2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:32:50] <icinga-wm>	 PROBLEM - Host ores2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:32:56] <icinga-wm>	 PROBLEM - Host db2112.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:33:04] <icinga-wm>	 PROBLEM - Host pc2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:33:11] <paravoid>	 yeah I'm looking closer at the C1 mgmt issue, it seems entirely unrelated
[08:33:38] <icinga-wm>	 PROBLEM - Host db2077.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:33:56] <icinga-wm>	 PROBLEM - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:34:38] <icinga-wm>	 PROBLEM - Host ganeti2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:35:20] <icinga-wm>	 PROBLEM - Host mc2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:36:24] <paravoid>	 just one of these days I guess :)
[08:36:59] <wikibugs>	 (CR) Gehel: [C: -1] "See comment inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) (owner: Herron)
[08:38:03] <_joe_>	 yeah :/
[08:48:44] <icinga-wm>	 RECOVERY - Host cloudcontrol2003-dev.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 36.76 ms
[08:48:48] <icinga-wm>	 RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.30 ms
[08:49:26] <icinga-wm>	 RECOVERY - Host es2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 39.04 ms
[08:49:26] <icinga-wm>	 RECOVERY - Host elastic2031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 42.06 ms
[08:50:10] <icinga-wm>	 RECOVERY - Host es2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.38 ms
[08:50:14] <icinga-wm>	 RECOVERY - Host ganeti2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.99 ms
[08:50:14] <icinga-wm>	 RECOVERY - Host ores2005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.73 ms
[08:50:22] <icinga-wm>	 RECOVERY - Host db2112.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.42 ms
[08:50:30] <icinga-wm>	 RECOVERY - Host pc2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.06 ms
[08:50:31] <wikibugs>	 Operations, Beta-Cluster-Infrastructure, DNS, Traffic, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (MoritzMuehlenhoff) Open...
[08:51:02] <icinga-wm>	 RECOVERY - Host db2077.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.42 ms
[08:51:22] <icinga-wm>	 RECOVERY - Host db2125.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.04 ms
[08:52:04] <icinga-wm>	 RECOVERY - Host ganeti2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.80 ms
[08:52:46] <icinga-wm>	 RECOVERY - Host mc2027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.06 ms
[08:53:02] <icinga-wm>	 RECOVERY - Host restbase2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.81 ms
[08:53:04] <icinga-wm>	 RECOVERY - Host db2087.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.79 ms
[08:53:44] <icinga-wm>	 RECOVERY - Host mc2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.70 ms
[08:53:54] <icinga-wm>	 RECOVERY - Host restbase2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.08 ms
[08:54:18] <icinga-wm>	 RECOVERY - Host cloudservices2002-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.69 ms
[08:58:19] <wikibugs>	 (CR) Gehel: [C: -1] "Good first draft! A few issues inline." (10 comments) [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: Mathew.onipe)
[09:17:40] <icinga-wm>	 PROBLEM - Host mc2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:17:52] <icinga-wm>	 PROBLEM - Host restbase2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:17:54] <icinga-wm>	 PROBLEM - Host db2087.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:18:10] <icinga-wm>	 PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[09:18:32] <icinga-wm>	 PROBLEM - Host mc2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:18:42] <icinga-wm>	 PROBLEM - Host restbase2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:19:04] <icinga-wm>	 PROBLEM - Host cloudservices2002-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:19:10] <icinga-wm>	 PROBLEM - Host cloudcontrol2003-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:19:58] <icinga-wm>	 PROBLEM - Host es2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:19:58] <icinga-wm>	 PROBLEM - Host elastic2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:20:40] <icinga-wm>	 PROBLEM - Host es2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:20:44] <icinga-wm>	 PROBLEM - Host ganeti2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:20:44] <icinga-wm>	 PROBLEM - Host ores2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:20:52] <icinga-wm>	 PROBLEM - Host db2112.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:21:00] <icinga-wm>	 PROBLEM - Host pc2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:21:34] <icinga-wm>	 PROBLEM - Host db2077.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:21:54] <icinga-wm>	 PROBLEM - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:22:40] <icinga-wm>	 PROBLEM - Host ganeti2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:50] <wikibugs>	 (PS1) Giuseppe Lavagetto: Bump version number. [software/service-checker] - https://gerrit.wikimedia.org/r/540364
[09:29:49] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Bump version number. [software/service-checker] - https://gerrit.wikimedia.org/r/540364 (owner: Giuseppe Lavagetto)
[09:34:05] <wikibugs>	 (CR) Ema: [C: +1] fifo-log-tailer: Retry on errors (1 comment) [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/539312 (owner: Vgutierrez)
[09:41:29] <moritzm>	 !log rebalancing Ganeti/codfw Row A after rolling reboot of Ganeti nodes
[09:41:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:49] <wikibugs>	 (PS1) Alexandros Kosiaris: restrouter: Revert the initialDelay seconds [deployment-charts] - https://gerrit.wikimedia.org/r/540365 (https://phabricator.wikimedia.org/T223953)
[09:54:58] <wikibugs>	 Operations, DC-Ops, decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (Gehel) a:Gehel→RobH
[09:55:18] <wikibugs>	 (PS1) Alexandros Kosiaris: Add a commit message guide [puppet] - https://gerrit.wikimedia.org/r/540366
[09:55:56] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add a commit message guide [puppet] - https://gerrit.wikimedia.org/r/540366 (owner: Alexandros Kosiaris)
[09:56:51] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] "@mobrovac, I am guessing this is fine to revert now ? Wanna do the honors?" [deployment-charts] - https://gerrit.wikimedia.org/r/540365 (https://phabricator.wikimedia.org/T223953) (owner: Alexandros Kosiaris)
[10:01:43] <wikibugs>	 (PS2) Alexandros Kosiaris: Add a commit message guide [puppet] - https://gerrit.wikimedia.org/r/540366
[10:02:43] <paravoid>	 akosiaris: how is that different from  https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines ?
[10:05:43] <wikibugs>	 Operations, DC-Ops, decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (Gehel) Host is shutdown and has networking issues. As such the role(system::spare) was not applied and it is still in puppetdb. Since that cleanup step is part of the "uninterruptible" steps, it ha...
[10:08:14] <akosiaris>	 paravoid: it can be used as a template
[10:08:26] <akosiaris>	 that's the idea, git config commit.template .gitmessage
[10:08:34] <wikibugs>	 (CR) Giuseppe Lavagetto: Add a commit message guide (4 comments) [puppet] - https://gerrit.wikimedia.org/r/540366 (owner: Alexandros Kosiaris)
[10:08:46] <akosiaris>	 and every time you git commit, you get a ready to edit thing
[10:09:13] <akosiaris>	 I may have diverged a bit, I 'll try to reconcile with https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines
[10:13:39] <wikibugs>	 (PS1) Jbond: puppetmaster1001: migrate esams puppet traffic to codfw [dns] - https://gerrit.wikimedia.org/r/540367 (https://phabricator.wikimedia.org/T234315)
[10:13:41] <wikibugs>	 (PS1) Jbond: puppetmaster1001: move eqaid puppet to codfw [dns] - https://gerrit.wikimedia.org/r/540368 (https://phabricator.wikimedia.org/T234315)
[10:13:54] <akosiaris>	 paravoid: essentially, just trying to give people a way to implement https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines a bit easier
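A minimal sketch of the commit-template setup akosiaris describes above, assuming a template file named .gitmessage at the repository root. The scratch repository and the template text are hypothetical, chosen only for illustration; they are not the contents of the change under review.

```shell
# Create a scratch repository so the example does not touch a real checkout.
repo_dir="$(mktemp -d)"
git init --quiet "$repo_dir"

# Hypothetical template. With git's default cleanup mode, lines starting
# with '#' are dropped from the final message after editing.
cat > "$repo_dir/.gitmessage" <<'EOF'

# <component>: <short summary of the change>
#
# <why the change is needed>
#
# Bug: T<task number>
EOF

# Point git at the template file, as in the command quoted in the channel.
git -C "$repo_dir" config commit.template .gitmessage

# Show the configured value.
git -C "$repo_dir" config --get commit.template
```

With this configured, running git commit without -m opens the editor pre-filled with the template, which is the "ready to edit thing" mentioned above.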
[10:15:06] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/540367 (https://phabricator.wikimedia.org/T234315) (owner: Jbond)
[10:15:54] <wikibugs>	 (CR) Mobrovac: "Yup, but it'll have to go together with a new version of the chart that bumps RR's tag. I'll prepare that and merge both once all is ready" [deployment-charts] - https://gerrit.wikimedia.org/r/540365 (https://phabricator.wikimedia.org/T223953) (owner: Alexandros Kosiaris)
[10:17:11] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.hosts.decommission
[10:17:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:27] <logmsgbot>	 !log gehel@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[10:17:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:29] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/540368 (https://phabricator.wikimedia.org/T234315) (owner: Jbond)
[10:17:31] <wikibugs>	 Operations, DC-Ops, decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by gehel@cumin1001 for hosts: `elastic1017.eqiad.wmnet` -  elastic1017.eqiad.wmnet (**FAIL**)   - Downtimed host on Icinga   - Downtime...
[10:17:53] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] "Isn't this it ?https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/540158/" [deployment-charts] - https://gerrit.wikimedia.org/r/540365 (https://phabricator.wikimedia.org/T223953) (owner: Alexandros Kosiaris)
[10:18:00] <wikibugs>	 (PS1) Giuseppe Lavagetto: Only build the package for python3 [software/service-checker] - https://gerrit.wikimedia.org/r/540369
[10:18:03] <wikibugs>	 (PS1) Giuseppe Lavagetto: Allow properly running tests while using pybuild [software/service-checker] - https://gerrit.wikimedia.org/r/540370
[10:18:32] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] "oh, you mean the image version tag? sorry, my mistake. Yup, fine by me" [deployment-charts] - https://gerrit.wikimedia.org/r/540365 (https://phabricator.wikimedia.org/T223953) (owner: Alexandros Kosiaris)
[10:20:36] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Only build the package for python3 [software/service-checker] - https://gerrit.wikimedia.org/r/540369 (owner: Giuseppe Lavagetto)
[10:20:47] <wikibugs>	 Operations, DC-Ops, decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (Gehel) After discussion with @Volans, I ran the [[ https://wikitech.wikimedia.org/wiki/Decom_script | decom script ]]. Host is now properly removed from icinga and puppetdb.
[10:21:04] <wikibugs>	 Operations, DC-Ops, decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (Gehel)
[10:21:42] <wikibugs>	 (Merged) jenkins-bot: Only build the package for python3 [software/service-checker] - https://gerrit.wikimedia.org/r/540369 (owner: Giuseppe Lavagetto)
[10:29:10] <wikibugs>	 Operations, Puppet, Beta-Cluster-Infrastructure, Traffic: Puppet fails on deployment-cache-text05 - https://phabricator.wikimedia.org/T234412 (ema) p:Triage→Normal
[10:33:33] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Allow properly running tests while using pybuild [software/service-checker] - https://gerrit.wikimedia.org/r/540370 (owner: Giuseppe Lavagetto)
[10:34:13] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "No problem, thanks for fixing it :)" [puppet] - https://gerrit.wikimedia.org/r/513759 (https://phabricator.wikimedia.org/T222244) (owner: Lucas Werkmeister (WMDE))
[10:34:30] <wikibugs>	 (Merged) jenkins-bot: Allow properly running tests while using pybuild [software/service-checker] - https://gerrit.wikimedia.org/r/540370 (owner: Giuseppe Lavagetto)
[10:36:44] <wikibugs>	 Operations, Puppet, Beta-Cluster-Infrastructure, Traffic: Puppet fails on deployment-cache-text05 - https://phabricator.wikimedia.org/T234412 (Vgutierrez) This is caused by adding the ATS-TLS instance to the text cluster. So you need to provide a valid configuration for the ats-tls profile. See:...
[10:44:58] <wikibugs>	 (PS2) Muehlenhoff: Add repo sync for buster/grafana [puppet] - https://gerrit.wikimedia.org/r/540113
[10:46:46] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add repo sync for buster/grafana [puppet] - https://gerrit.wikimedia.org/r/540113 (owner: Muehlenhoff)
[10:49:48] <wikibugs>	 (PS1) Mathew.onipe: elasticsearch: move rsyslog profile to cirrus profile [puppet] - https://gerrit.wikimedia.org/r/540373
[10:59:04] <wikibugs>	 (PS1) Giuseppe Lavagetto: Further fixes to debianization [software/service-checker] - https://gerrit.wikimedia.org/r/540375
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191002T1100).
[11:00:04] <jouncebot>	 raynor: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:17] <wikibugs>	 Operations, Puppet, Beta-Cluster-Infrastructure, Traffic, Core Platform Team Workboards (Clinic Duty Team): Puppet fails on deployment-cache-text05 - https://phabricator.wikimedia.org/T234412 (mobrovac) Open→Resolved a:mobrovac As per @Vgutierrez' instructions, I looked up the ATS...
[11:00:28] <raynor>	 o/
[11:00:53] <raynor>	 so looks like I'm the only one with two changes, I can deploy on my own
[11:00:57] <raynor>	 can I proceed?
[11:01:14] <Urbanecm>	 raynor: No problem from my side
[11:01:35] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:01:36] <wikibugs>	 (CR) Pmiazga: [C: +2] Do not set wgMFNoindexPages config flag in mobile.php [mediawiki-config] - https://gerrit.wikimedia.org/r/540193 (https://phabricator.wikimedia.org/T206497) (owner: Pmiazga)
[11:02:18] <Urbanecm>	 raynor: let me know once you're done, please
[11:03:00] <raynor>	 kk
[11:03:30] <wikibugs>	 (Merged) jenkins-bot: Do not set wgMFNoindexPages config flag in mobile.php [mediawiki-config] - https://gerrit.wikimedia.org/r/540193 (https://phabricator.wikimedia.org/T206497) (owner: Pmiazga)
[11:03:51] <wikibugs>	 (PS1) Urbanecm: Enable partial blocks at ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/540377 (https://phabricator.wikimedia.org/T233754)
[11:06:44] <wikibugs>	 (PS2) Giuseppe Lavagetto: Further fixes to debianization [software/service-checker] - https://gerrit.wikimedia.org/r/540375
[11:09:03] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Further fixes to debianization [software/service-checker] - https://gerrit.wikimedia.org/r/540375 (owner: Giuseppe Lavagetto)
[11:11:43] <wikibugs>	 (PS3) Pmiazga: Set new MFMobileFormatterOptions config using old config [mediawiki-config] - https://gerrit.wikimedia.org/r/540107 (https://phabricator.wikimedia.org/T232690)
[11:12:11] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:12:35] <logmsgbot>	 !log pmiazga@deploy1001 Synchronized wmf-config/mobile.php: SWAT: [[gerrit:540193|Do not set wgMFNoindexPages config flag in mobile.php (T206497)]] (duration: 01m 14s)
[11:12:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:39] <stashbot>	 T206497: Enable $wgMFNoindexPages for: Italian, Dutch, Korean, Arabic, Chinese, and Hindi Wikipedias - https://phabricator.wikimedia.org/T206497
[11:13:51] <wikibugs>	 (CR) Alexandros Kosiaris: Add a commit message guide (4 comments) [puppet] - https://gerrit.wikimedia.org/r/540366 (owner: Alexandros Kosiaris)
[11:14:03] <wikibugs>	 (PS1) Urbanecm: Grant autocreateaccount to everyone on closed wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/540380 (https://phabricator.wikimedia.org/T222117)
[11:14:04] <wikibugs>	 (CR) Pmiazga: [C: +2] Set new MFMobileFormatterOptions config using old config [mediawiki-config] - https://gerrit.wikimedia.org/r/540107 (https://phabricator.wikimedia.org/T232690) (owner: Pmiazga)
[11:14:18] <_joe_>	 !log uploaded service-checker 0.2.0 to stretch-wikimedia
[11:14:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:43] <wikibugs>	 (PS3) Alexandros Kosiaris: Add a commit message guide [puppet] - https://gerrit.wikimedia.org/r/540366
[11:15:05] <wikibugs>	 (CR) Jbond: [C: +2] puppetmaster1001: migrate esams puppet traffic to codfw [dns] - https://gerrit.wikimedia.org/r/540367 (https://phabricator.wikimedia.org/T234315) (owner: Jbond)
[11:15:07] <wikibugs>	 (Merged) jenkins-bot: Set new MFMobileFormatterOptions config using old config [mediawiki-config] - https://gerrit.wikimedia.org/r/540107 (https://phabricator.wikimedia.org/T232690) (owner: Pmiazga)
[11:15:23] <wikibugs>	 (CR) Alexandros Kosiaris: "I guess that this could benefit from a wider audience. While voluntary and not enforced in any way, it would be nice to inform the rest of" [puppet] - https://gerrit.wikimedia.org/r/540366 (owner: Alexandros Kosiaris)
[11:15:28] <_joe_>	 !log testing the package on restbase-dev1006
[11:15:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:14] <_joe_>	 heh as I feared
[11:17:20] <_joe_>	 I need the puppet change ASAP
[11:18:19] <wikibugs>	 (PS2) Urbanecm: Enable partial blocks at ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/540377 (https://phabricator.wikimedia.org/T233754)
[11:20:50] <logmsgbot>	 !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:540107|Set new MFMobileFormatterOptions config using old config (T232690)]] (duration: 01m 01s)
[11:20:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:53] <stashbot>	 T232690: Skip Expensive MobileFormatter Transformations On Pages With Extremely High Image/Heading counts - https://phabricator.wikimedia.org/T232690
[11:22:17] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[11:22:35] <icinga-wm>	 PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[11:23:13] <icinga-wm>	 PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[11:23:15] <raynor>	 Urbanecm, I'm done, do you want to push sth more or can I close the SWAT?
[11:23:33] <Urbanecm>	 raynor: I have one patch
[11:23:36] <_joe_>	 uh what are those ripe alerts?
[11:23:37] <wikibugs>	 (CR) Urbanecm: [C: +2] Enable partial blocks at ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/540377 (https://phabricator.wikimedia.org/T233754) (owner: Urbanecm)
[11:23:45] <raynor>	 kk, Urbanecm so SWAT is yours
[11:23:49] <icinga-wm>	 PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[11:23:55] <Urbanecm>	 thx raynor
[11:24:21] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:24:27] <jbond42>	 !log update puppet.esams.wmnet to puppetmaster2001
[11:24:28] <raynor>	 _joe_, no idea, most probably not related to my patches
[11:24:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:47] <wikibugs>	 (Merged) jenkins-bot: Enable partial blocks at ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/540377 (https://phabricator.wikimedia.org/T233754) (owner: Urbanecm)
[11:24:47] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[11:24:53] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[11:25:23] <wikibugs>	 (CR) Jbond: [C: +2] puppetmaster1001: move eqaid puppet to codfw [dns] - https://gerrit.wikimedia.org/r/540368 (https://phabricator.wikimedia.org/T234315) (owner: Jbond)
[11:25:59] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[11:26:09] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[11:26:36] <jbond42>	 !log update puppet.eqiad.wmnet to puppetmaster2001
[11:26:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:48] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 01711d5: Enable partial blocks at ptwiki (T233754) (duration: 00m 55s)
[11:26:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:51] <stashbot>	 T233754: Enable Partial Blocks on Portuguese Wikipedia - https://phabricator.wikimedia.org/T233754
[11:26:56] <Urbanecm>	 !log EU SWAT done
[11:26:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:03] <moritzm>	 !log installing cryptsetup bugfix from buster 10.1 point release
[11:28:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:19] <_joe_>	 raynor: surely not
[11:41:24] <wikibugs>	 (PS1) Giuseppe Lavagetto: service-checker: bump to python3 version on stretch+ [puppet] - https://gerrit.wikimedia.org/r/540386
[11:44:58] <wikibugs>	 (PS2) Giuseppe Lavagetto: service-checker: bump to python3 version on stretch+ [puppet] - https://gerrit.wikimedia.org/r/540386
[11:45:35] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:47:39] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1001/18714/ LGTM, will merge later." [puppet] - https://gerrit.wikimedia.org/r/540386 (owner: Giuseppe Lavagetto)
[11:50:45] <icinga-wm>	 RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 3 probes of 507 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[11:51:27] <icinga-wm>	 RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 507 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[11:51:59] <icinga-wm>	 RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 2 probes of 507 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[11:52:53] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 21 probes of 459 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[11:53:01] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 2 probes of 507 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[11:54:01] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 25 probes of 459 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[11:54:11] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 21 probes of 459 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[11:55:57] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 24 probes of 459 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[11:59:02] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] service-checker: bump to python3 version on stretch+ [puppet] - 10https://gerrit.wikimedia.org/r/540386 (owner: 10Giuseppe Lavagetto)
[12:11:05] <wikibugs>	 (03PS1) 10Jbond: puppetmaster1001: update config-manager to prepare for reimage [dns] - 10https://gerrit.wikimedia.org/r/540390 (https://phabricator.wikimedia.org/T234315)
[12:11:55] <wikibugs>	 (03PS2) 10Jbond: puppetmaster1001: update config-master to prepare for reimage [dns] - 10https://gerrit.wikimedia.org/r/540390 (https://phabricator.wikimedia.org/T234315)
[12:18:41] <wikibugs>	 (03PS1) 10Jbond: puppetmaster1001: move ca to puppetmaster2001 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/540392 (https://phabricator.wikimedia.org/T234315)
[12:19:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] puppetmaster1001: move ca to puppetmaster2001 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/540392 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond)
[12:22:09] <wikibugs>	 (03PS1) 10Jbond: pybal_config: remove puppetmaster1001 from pybal_config backend [puppet] - 10https://gerrit.wikimedia.org/r/540393 (https://phabricator.wikimedia.org/T234315)
[12:26:49] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:27:11] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:40:47] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10CorinnaHillebrand_WMDE)
[12:53:02] <wikibugs>	 (03PS2) 10Gehel: elasticsearch: move rsyslog profile to cirrus profile [puppet] - 10https://gerrit.wikimedia.org/r/540373 (owner: 10Mathew.onipe)
[12:55:09] <wikibugs>	 (03CR) 10Kwanele22: "free vpn" [puppet] - 10https://gerrit.wikimedia.org/r/509595 (https://phabricator.wikimedia.org/T221654) (owner: 10Alex Monk)
[12:55:28] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add Conflicts: line to debian/control [software/service-checker] - 10https://gerrit.wikimedia.org/r/540399
[12:56:10] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elasticsearch: move rsyslog profile to cirrus profile [puppet] - 10https://gerrit.wikimedia.org/r/540373 (owner: 10Mathew.onipe)
[12:56:19] <gehel>	 onimisionipe: ^
[12:57:20] <wikibugs>	 (03CR) 10Muehlenhoff: puppetmaster1001: move ca to puppetmaster2001 for reimage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540392 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond)
[12:57:57] <wikibugs>	 (03PS1) 10Kwanele22: vpn [puppet] - 10https://gerrit.wikimedia.org/r/540400
[12:58:02] <wikibugs>	 (03PS1) 10Jbond: puppetmaster1001: update puppet.wikimedia.org to point to codfw [dns] - 10https://gerrit.wikimedia.org/r/540401
[12:59:38] <wikibugs>	 (03PS2) 10Jbond: puppetmaster1001: move ca to puppetmaster2001 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/540392 (https://phabricator.wikimedia.org/T234315)
[12:59:50] <wikibugs>	 (03CR) 10Jbond: puppetmaster1001: move ca to puppetmaster2001 for reimage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540392 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond)
[13:00:28] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/540392 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond)
[13:02:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/540393 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond)
[13:03:33] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add Conflicts: line to debian/control [software/service-checker] - 10https://gerrit.wikimedia.org/r/540399 (owner: 10Giuseppe Lavagetto)
[13:06:05] <wikibugs>	 (03PS1) 10Pmiazga: Remove Minerva EventLogging error tracking configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540404 (https://phabricator.wikimedia.org/T233663)
[13:14:19] <wikibugs>	 (03PS3) 10Jbond: puppetmaster1001: move ca to puppetmaster2001 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/540392 (https://phabricator.wikimedia.org/T234315)
[13:15:14] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Sadly those SANs won't be enough. You also need to include in the SAN the same domains that are included in the certs for the appservers a" [puppet] - 10https://gerrit.wikimedia.org/r/540252 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn)
[13:16:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster1001: move ca to puppetmaster2001 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/540392 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond)
[13:16:14] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] puppetmaster1001: update config-master to prepare for reimage [dns] - 10https://gerrit.wikimedia.org/r/540390 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond)
[13:16:39] <moritzm>	 !log installing console-setup bugfix update from buster point release
[13:16:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster1001: update config-master to prepare for reimage [dns] - 10https://gerrit.wikimedia.org/r/540390 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond)
[13:18:47] <moritzm>	 !log installing mariadb-10.3 update from buster point release (just client side libs, tools)
[13:18:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:51] <wikibugs>	 (03PS2) 10Jbond: puppetmaster1001: update puppet.wikimedia.org to point to codfw [dns] - 10https://gerrit.wikimedia.org/r/540401
[13:20:58] <wikibugs>	 10Operations, 10DC-Ops, 10decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (10MoritzMuehlenhoff)
[13:23:29] <icinga-wm>	 PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[13:24:00] <wikibugs>	 (03PS1) 10Muehlenhoff: elastic1017: Remove puppet references [puppet] - 10https://gerrit.wikimedia.org/r/540407 (https://phabricator.wikimedia.org/T234045)
[13:24:45] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.7464 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:25:39] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.02954 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:30:02] <wikibugs>	 (03PS1) 10Muehlenhoff: Cumin alias updates [puppet] - 10https://gerrit.wikimedia.org/r/540408
[13:30:59] <icinga-wm>	 PROBLEM - puppetmaster https on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[13:31:13] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster1001: update puppet.wikimedia.org to point to codfw [dns] - 10https://gerrit.wikimedia.org/r/540401 (owner: 10Jbond)
[13:33:29] <jbond42>	 ^^ looking
[13:35:53] <icinga-wm>	 RECOVERY - puppetmaster https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 398 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[13:39:14] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Cumin alias updates [puppet] - 10https://gerrit.wikimedia.org/r/540408 (owner: 10Muehlenhoff)
[13:40:18] <elukey>	 jbond42: just to avoid issues, where should I puppet-merge?
[13:40:50] <jbond42>	 elukey: you can still merge on 1001
[13:41:19] <wikibugs>	 (03PS1) 10Jbond: Revert "puppetmaster1001: move ca to puppetmaster2001 for reimage" [puppet] - 10https://gerrit.wikimedia.org/r/540409
[13:42:44] <elukey>	 ack
[13:44:15] <wikibugs>	 (03PS2) 10Jbond: Revert "puppetmaster1001: move ca to puppetmaster2001 for reimage" [puppet] - 10https://gerrit.wikimedia.org/r/540409
[13:44:41] <_joe_>	 jbond42: what is going wrong?
[13:45:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Revert "puppetmaster1001: move ca to puppetmaster2001 for reimage" [puppet] - 10https://gerrit.wikimedia.org/r/540409 (owner: 10Jbond)
[13:45:35] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Jerrie Kumalah - https://phabricator.wikimedia.org/T234433 (10jkumalah)
[13:45:39] <jbond42>	 _joe_: passenger keeps dying
[13:45:57] <_joe_>	 just too much load?
[13:46:04] <jbond42>	 I also noticed that the socket utilisation shot up shortly after I changed the puppet_ca server, so I'm reverting that
[13:46:07] <jbond42>	 yes could be
[13:46:24] <_joe_>	 lmk if you need me to look into it
[13:46:58] <jbond42>	 I also changed the config-master to point to puppetmaster2001; what effect would that have?
[13:47:23] <elukey>	 (I am rebooting an-conf1001 for tests)
[13:48:35] <icinga-wm>	 PROBLEM - Host an-conf1001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:48:47] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Jerrie Kumalah - https://phabricator.wikimedia.org/T234433 (10jkumalah) ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDJpUZ9XpqGc8Oz2rFbYdfiJHTp3JYDnOthxVwRfv+aIjFQzq7bPHsTqbCi8g/thANeCwk32l4PQaMOXpRezW/de+MZh...
[13:48:48] <_joe_>	 jbond42: null
[13:48:59] <_joe_>	 it should have no effect, it's very low traffic
[13:49:15] <jbond42>	 ack thanks
[13:56:29] <wikibugs>	 (03PS1) 10Jbond: puppet.wikimedia.org: move iot back to puppetmaster1001 [dns] - 10https://gerrit.wikimedia.org/r/540412
[13:56:58] <wikibugs>	 10Operations: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T232310 (10MoritzMuehlenhoff)
[13:57:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet.wikimedia.org: move iot back to puppetmaster1001 [dns] - 10https://gerrit.wikimedia.org/r/540412 (owner: 10Jbond)
[13:58:41] <icinga-wm>	 RECOVERY - Host an-conf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[14:06:05] <icinga-wm>	 PROBLEM - Host an-conf1001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:07:33] <icinga-wm>	 RECOVERY - puppetmaster backend https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 398 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[14:10:07] <icinga-wm>	 RECOVERY - Host an-conf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[14:15:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1006.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1006.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:15:51] <icinga-wm>	 PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[14:16:14] <moritzm>	 !log installing babeltrace bugfix update from buster point release
[14:16:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:43] <icinga-wm>	 PROBLEM - puppetmaster https on puppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[14:17:02] <elukey>	 onimisionipe, gehel --^
[14:17:19] <icinga-wm>	 RECOVERY - puppetmaster backend https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 398 bytes in 1.079 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[14:17:21] <wikibugs>	 10Operations: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T232310 (10MoritzMuehlenhoff)
[14:17:33] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1006.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[14:17:35] <wikibugs>	 10Operations: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T232310 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All done
[14:18:02] <wikibugs>	 (03PS2) 10Muehlenhoff: elastic1017: Remove puppet references [puppet] - 10https://gerrit.wikimedia.org/r/540407 (https://phabricator.wikimedia.org/T234045)
[14:18:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:21:13] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/540407 (https://phabricator.wikimedia.org/T234045) (owner: 10Muehlenhoff)
[14:23:09] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:23:51] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove DNS entry for elastic1017 [dns] - 10https://gerrit.wikimedia.org/r/540415 (https://phabricator.wikimedia.org/T234045)
[14:23:52] <elukey>	 wdqs seems fine now (at least from pybal's point of view)
[14:23:56] <elukey>	 should we follow up?
[14:23:57] <icinga-wm>	 PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[14:24:01] <elukey>	 gehel: --^
[14:24:18] <gehel>	 it's still overloaded, looks related to edits, so not much we can do
[14:24:31] <gehel>	 it's the external endpoint, instabilities are somewhat expected
[14:24:40] <elukey>	 yep yep
[14:25:03] <gehel>	 architectural issues...
[14:25:08] <elukey>	 out of curiosity - you mentioned that it is related to edits, what does that mean?
[14:26:06] <gehel>	 the edit rate on wikidata will impact the load on wdqs (as edits are ingested by wdqs)
[14:26:32] <gehel>	 a common problem is some bot doing a high number of edits and wdqs having issues in catching up
[14:26:50] <gehel>	 honestly, we should throttle our update process to protect the reads, at the cost of lag...
[14:27:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove DNS entry for elastic1017 [dns] - 10https://gerrit.wikimedia.org/r/540415 (https://phabricator.wikimedia.org/T234045) (owner: 10Muehlenhoff)
[14:27:47] <elukey>	 ahh okok
[14:27:56] <elukey>	 and ingesting the edits is done via the updater?
[14:28:03] <gehel>	 what makes me think this is edit related, is that I can also see minor lag on the internal cluster, which has a very low and consistent read load
[14:28:24] <gehel>	 of course, in the end the contention comes from the cumulative effect of reads and writes
[14:29:33] <gehel>	 elukey: yep, the updater is the process that consumes edits on wikidata and pushes them into blazegraph
[14:31:50] <moritzm>	 !log installing libxslt security updates on stretch
[14:31:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:14] <elukey>	 gehel: super thanks for the explanation
[14:35:15] <wikibugs>	 (03CR) 10Volans: backends: add Netbox backend (035 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov)
[14:35:25] <gehel>	 elukey: always a pleasure!
[14:41:06] <wikibugs>	 (03CR) 10Volans: "I had a couple of old comments, plus one post-merge." (034 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/537242 (https://phabricator.wikimedia.org/T230449) (owner: 10CRusnov)
[14:44:24] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to deployment for andrew-wmde - https://phabricator.wikimedia.org/T233202 (10Tobi_WMDE_SW) >>! In T233202#5538406, @Lucas_Werkmeister_WMDE wrote: >> Please reach out to fellow WMDE deployers for a training or if not possible, RelEng team members. >  > I’d...
[14:53:09] <wikibugs>	 (03CR) 10Herron: admin: add expiry_date and expiry_contact for user eyener (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540207 (https://phabricator.wikimedia.org/T233636) (owner: 10Herron)
[14:56:35] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:56:52] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on db2050 - https://phabricator.wikimedia.org/T234444 (10ops-monitoring-bot)
[14:57:51] <bblack>	 exceptions issue known?
[15:07:09] <wikibugs>	 (03PS2) 10Herron: admin: add expiry_date and expiry_contact for user eyener [puppet] - 10https://gerrit.wikimedia.org/r/540207 (https://phabricator.wikimedia.org/T233636)
[15:07:12] <wikibugs>	 (03PS1) 10Mforns: analytics:refinery:job:data_purge: Re-enable deletion of geoeditors [puppet] - 10https://gerrit.wikimedia.org/r/540421 (https://phabricator.wikimedia.org/T234238)
[15:07:25] <wikibugs>	 (03PS1) 10Jbond: puppetmaster1001: switch puppet.eqiad to eqiad [dns] - 10https://gerrit.wikimedia.org/r/540422
[15:07:59] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:08:11] <wikibugs>	 (03CR) 10Mforns: "I tested this and checked the checksum, good to go!" [puppet] - 10https://gerrit.wikimedia.org/r/540421 (https://phabricator.wikimedia.org/T234238) (owner: 10Mforns)
[15:08:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster1001: switch puppet.eqiad to eqiad [dns] - 10https://gerrit.wikimedia.org/r/540422 (owner: 10Jbond)
[15:08:48] <wikibugs>	 (03PS2) 10Jbond: puppetmaster1001: switch puppet.eqiad to eqiad [dns] - 10https://gerrit.wikimedia.org/r/540422
[15:09:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] elastic1017: Remove puppet references [puppet] - 10https://gerrit.wikimedia.org/r/540407 (https://phabricator.wikimedia.org/T234045) (owner: 10Muehlenhoff)
[15:10:46] <wikibugs>	 (03PS3) 10Herron: admin: add expiry_date and expiry_contact for user eyener [puppet] - 10https://gerrit.wikimedia.org/r/540207 (https://phabricator.wikimedia.org/T233636)
[15:11:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] admin: add expiry_date and expiry_contact for user eyener [puppet] - 10https://gerrit.wikimedia.org/r/540207 (https://phabricator.wikimedia.org/T233636) (owner: 10Herron)
[15:12:18] <wikibugs>	 (03PS1) 10Muehlenhoff: Simplify partman config for elastic* [puppet] - 10https://gerrit.wikimedia.org/r/540423
[15:12:49] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:12:51] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/540423 (owner: 10Muehlenhoff)
[15:13:34] <wikibugs>	 (03CR) 10Herron: [C: 03+2] admin: add expiry_date and expiry_contact for user eyener [puppet] - 10https://gerrit.wikimedia.org/r/540207 (https://phabricator.wikimedia.org/T233636) (owner: 10Herron)
[15:13:35] <godog>	 !log run swiftrepl eqiad -> codfw on ms-fe1005 (no deletes)
[15:13:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:52] <wikibugs>	 (03PS2) 10Muehlenhoff: Simplify partman config for elastic* [puppet] - 10https://gerrit.wikimedia.org/r/540423
[15:14:22] <wikibugs>	 10Operations, 10DC-Ops, 10decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (10MoritzMuehlenhoff)
[15:15:13] <icinga-wm>	 RECOVERY - puppetmaster https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 399 bytes in 1.086 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[15:15:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Simplify partman config for elastic* [puppet] - 10https://gerrit.wikimedia.org/r/540423 (owner: 10Muehlenhoff)
[15:15:59] <icinga-wm>	 RECOVERY - puppetmaster backend https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 398 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[15:19:37] <wikibugs>	 (03PS1) 10CDanis: grafana: remove obsolete deleteDatasources stanza [puppet] - 10https://gerrit.wikimedia.org/r/540424
[15:19:59] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.005076 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[15:22:23] <wikibugs>	 (03CR) 10Herron: elasticsearch: add toggle for rsyslog udp logback compat include (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) (owner: 10Herron)
[15:24:11] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[15:24:20] <wikibugs>	 (03CR) 10Mathew.onipe: elasticsearch: add toggle for rsyslog udp logback compat include (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) (owner: 10Herron)
[15:26:11] <wikibugs>	 (03CR) 10Herron: elasticsearch: add toggle for rsyslog udp logback compat include (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) (owner: 10Herron)
[15:26:30] <wikibugs>	 10Operations, 10ops-codfw, 10netops: msw-c1 down? - https://phabricator.wikimedia.org/T234411 (10ayounsi) a:03Papaul @papaul, can you check the LED status, cables (all properly connected), then power cycle the device?
[15:26:58] <wikibugs>	 (03Abandoned) 10Herron: elasticsearch: add toggle for rsyslog udp logback compat include [puppet] - 10https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) (owner: 10Herron)
[15:29:56] <godog>	 !log swift codfw-prod: add ms-be2051 T233638
[15:29:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:00] <stashbot>	 T233638: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638
[15:30:35] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:31:10] <godog>	 !log correction, add ms-be2052
[15:31:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:42] <wikibugs>	 (03PS6) 10Herron: logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739)
[15:34:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: remove obsolete deleteDatasources stanza [puppet] - 10https://gerrit.wikimedia.org/r/540424 (owner: 10CDanis)
[15:34:56] <wikibugs>	 (03CR) 10Herron: logstash: throttle duplicate normalized_message with level:ERR* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron)
[15:35:26] <wikibugs>	 (03PS2) 10CDanis: grafana: remove obsolete deleteDatasources stanza [puppet] - 10https://gerrit.wikimedia.org/r/540424
[15:35:39] <wikibugs>	 10Puppet, 10Patch-For-Review: ensure additional puppetmaster files are managed by puppet - https://phabricator.wikimedia.org/T234332 (10jbond) p:05Triage→03Normal
[15:36:12] <wikibugs>	 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10jbond) Tagging https://phabricator.wikimedia.org/T184562#4069362 as a useful reference
[15:36:40] <wikibugs>	 (03PS1) 10Jbond: puppetmaster_ca: move ca functions to puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/540431 (https://phabricator.wikimedia.org/T234315)
[15:37:37] <wikibugs>	 (03CR) 10CDanis: [V: 03+2] grafana: remove obsolete deleteDatasources stanza [puppet] - 10https://gerrit.wikimedia.org/r/540424 (owner: 10CDanis)
[15:38:16] <wikibugs>	 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10jbond)  tagging:  * https://wikitech.wikimedia.org/wiki/Puppet_CA_replacement  * https://phabricator.wikimedia.org/T184562#4069362
[15:38:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron)
[15:38:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster_ca: move ca functions to puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/540431 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond)
[15:38:30] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: asw2-a-eqiad <-> cr2-eqiad fiber issue - https://phabricator.wikimedia.org/T234416 (10ayounsi) a:03Cmjohnson Related to T203719.  @Cmjohnson same as when there are interfaces errors, but here monitor for new: `sfp-7/0/46 link 46 SFP receive power low warning set` in `sh...
[15:38:33] <wikibugs>	 (03PS2) 10Jbond: puppetmaster_ca: move ca functions to puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/540431 (https://phabricator.wikimedia.org/T234315)
[15:40:29] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:40:38] <chaomodus>	 shush
[15:43:23] <wikibugs>	 10Operations, 10ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10fgiunchedi) p:05Normal→03High Please note that putting these systems in production is becoming urgent, is there a status update and/or ETA?
[15:45:11] <icinga-wm>	 PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[15:46:53] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01513 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[15:47:33] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.6952 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[15:48:19] <wikibugs>	 (03PS1) 10Jbond: Revert "puppetmaster_ca: move ca functions to puppetmaster2001" [puppet] - 10https://gerrit.wikimedia.org/r/540434
[15:49:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Revert "puppetmaster_ca: move ca functions to puppetmaster2001" [puppet] - 10https://gerrit.wikimedia.org/r/540434 (owner: 10Jbond)
[15:50:18] <wikibugs>	 (03CR) 10Krinkle: Add a commit message guide (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540366 (owner: 10Alexandros Kosiaris)
[15:51:05] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:51:38] <godog>	 puppet master in trouble but no irc spam from puppet failures -> great win https://i.imgur.com/28wUHI0.gifv
[15:51:41] <icinga-wm>	 RECOVERY - puppetmaster backend https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 398 bytes in 0.168 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[15:52:16] <jbond42>	 yes, should be fixed again now, just wanted to pin down what caused the last error
[15:52:39] <herron>	 godog: this is great
[15:53:06] <wikibugs>	 (03CR) 10Brennen Bearnes: "Just in case they contain any ideas of interest, Tyler's version of this:" [puppet] - 10https://gerrit.wikimedia.org/r/540366 (owner: 10Alexandros Kosiaris)
[15:53:10] <godog>	 jbond42: *nod* thanks for taking care of that
[15:53:11] <jbond42>	 yes indeed, IRC would have been unusable for the last few hours
[15:53:19] <wikibugs>	 10Operations, 10netops: Telia IC-314534 (eqord/eqdfw 10Gbps wave) down - https://phabricator.wikimedia.org/T234335 (10ayounsi) 05Open→03Resolved a:03elukey Work completed, everything is up, thanks to you two!
[15:53:42] <wikibugs>	 (03CR) 10Volans: "We need to add more validation steps, not only on the gdnsd side, but on the content, to ensure that each $ORIGIN has likely good data (at" (034 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov)
[15:56:57] <godog>	 herron: https://66.media.tumblr.com/8ff56d9711d8702e77585b5896f3cb4b/tumblr_nfwyq36Pkg1qh9nffo1_400.gif
[15:58:25] <herron>	 hah pssssht
[15:58:45] <wikibugs>	 (03CR) 10Herron: "Krinkle do you have any reservations about deploying this?" [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron)
[15:59:10] <wikibugs>	 (03CR) 10Volans: "What's the status of this? I think we'll need to re-run it at some point once we'll be ready to switch to the generated ones." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529977 (https://phabricator.wikimedia.org/T228670) (owner: 10CRusnov)
[16:00:04] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191002T1600).
[16:00:04] <jouncebot>	 MatmaRex and Lucas_WMDE: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:08] <wikibugs>	 (03CR) 10CRusnov: "> Patch Set 2:" (034 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov)
[16:00:27] <icinga-wm>	 ACKNOWLEDGEMENT - Host restbase2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:27] <icinga-wm>	 ACKNOWLEDGEMENT - Host restbase2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:27] <icinga-wm>	 ACKNOWLEDGEMENT - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:27] <icinga-wm>	 ACKNOWLEDGEMENT - Host pc2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:27] <icinga-wm>	 ACKNOWLEDGEMENT - Host ores2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:27] <icinga-wm>	 ACKNOWLEDGEMENT - Host mc2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:27] <icinga-wm>	 ACKNOWLEDGEMENT - Host mc2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:28] <icinga-wm>	 ACKNOWLEDGEMENT - Host ganeti2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:28] <icinga-wm>	 ACKNOWLEDGEMENT - Host ganeti2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:29] <icinga-wm>	 ACKNOWLEDGEMENT - Host es2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:29] <icinga-wm>	 ACKNOWLEDGEMENT - Host es2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:30] <icinga-wm>	 ACKNOWLEDGEMENT - Host elastic2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:30] <icinga-wm>	 ACKNOWLEDGEMENT - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:31] <icinga-wm>	 ACKNOWLEDGEMENT - Host db2112.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:31] <icinga-wm>	 ACKNOWLEDGEMENT - Host db2087.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:32] <icinga-wm>	 ACKNOWLEDGEMENT - Host db2077.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:32] <icinga-wm>	 ACKNOWLEDGEMENT - Host cloudservices2002-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:33] <icinga-wm>	 ACKNOWLEDGEMENT - Host cloudcontrol2003-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411
[16:00:49] <wikibugs>	 (03CR) 10CRusnov: "> Patch Set 3:" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529977 (https://phabricator.wikimedia.org/T228670) (owner: 10CRusnov)
[16:01:01] <MatmaRex>	 hi
[16:01:02] <wikibugs>	 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10jbond) during the upgrade of the puppet master i attempted to move the puppetmaster_ca from puppetmaster1001 to puppetmaster2001.  however when i attempted this we saw a whole slew of errors·  L...
[16:01:20] <Lucas_WMDE>	 o/
[16:01:27] <Lucas_WMDE>	 I’ll be ready in a second
[16:03:02] <wikibugs>	 (03PS3) 10CRusnov: Add script to generate DNS records from Netbox [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183)
[16:03:02] <Lucas_WMDE>	 MatmaRex: do you want to go ahead and start merging your changes?
[16:03:03] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.002161 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[16:03:08] <Lucas_WMDE>	 they’ll take a while to go through CI anyways, probably
[16:03:19] <MatmaRex>	 Lucas_WMDE: i don't have +2 rights in wmf branches
[16:03:24] <Lucas_WMDE>	 oh, ok
[16:03:29] <Lucas_WMDE>	 then I can do it
[16:03:37] <MatmaRex>	 :)
[16:03:58] <Lucas_WMDE>	 …the version on master hasn’t been merged yet?
[16:07:22] <MatmaRex>	 yeah
[16:07:54] <Lucas_WMDE>	 I don’t think I’m supposed to deploy backports if they’re not merged in master yet…
[16:08:04] <Lucas_WMDE>	 is it okay if I deploy my backport first and come back to you?
[16:08:15] <Lucas_WMDE>	 (or, start deploying it, will take a while in CI anyways)
[16:08:25] <MatmaRex>	 sure
[16:09:20] <MatmaRex>	 i can try again in the next SWAT slot too, after hopefully other VE folks wake up and review the patches
[16:10:59] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron)
[16:13:07] <Lucas_WMDE>	 well, I also left a review on one of the patches d)
[16:13:10] <Lucas_WMDE>	 * :)
[16:15:32] <Lucas_WMDE>	 oh, and one of them is already backported to .24?
[16:15:40] <Lucas_WMDE>	 I was wondering why that one seemed to be missing
[16:15:40] <wikibugs>	 10Operations: Puppet breakage in automation-feedback VMs - https://phabricator.wikimedia.org/T234452 (10Andrew)
[16:16:06] <wikibugs>	 10Operations: Puppet breakage in automation-feedback VMs - https://phabricator.wikimedia.org/T234452 (10Andrew) I dug a little deeper, and the primary issue is local diffs in /var/lib/git/operations/puppet on af-puppetmaster02.automation-framework.eqiad.wmflabs.  If you commit those and are able to get a sensibl...
[16:16:36] <wikibugs>	 10Operations: Puppet breakage in automation-feedback VMs - https://phabricator.wikimedia.org/T234452 (10crusnov) Thanks for the heads up, we'll loop around to fix these up.
[16:16:45] <Krinkle>	 Lucas_WMDE: Landing now, maybe good to go if other stuff is done with swat
[16:16:48] <Krinkle>	 (in master)
[16:16:57] <Lucas_WMDE>	 ok
[16:17:00] <Krinkle>	 I also have a late addition to the swat - https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/Vector/+/540396/
[16:17:00] <Lucas_WMDE>	 then I’ll just +2 the lot
[16:17:09] <Krinkle>	 but I'm okay doing it afterwards if other stuff is still ahead, no problem.
[16:18:27] <Lucas_WMDE>	 MatmaRex: is this VisualEditor channel registered already? I don’t see it being used elsewhere
[16:19:08] <MatmaRex>	 Lucas_WMDE: it should be, as of yesterday evening
[16:19:17] * Lucas_WMDE pulls mediawiki-config again
[16:19:22] <MatmaRex>	 unless i did something wrong
[16:19:24] <Lucas_WMDE>	 ah, indeed :)
[16:19:27] <Lucas_WMDE>	 ok
[16:19:30] <MatmaRex>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/540235
[16:22:37] <Lucas_WMDE>	 ah damn, the Wikibase gate-and-submit is failing
[16:23:05] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.002882 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[16:34:34] <wikibugs>	 (03PS2) 10Elukey: analytics:refinery:job:data_purge: Re-enable deletion of geoeditors [puppet] - 10https://gerrit.wikimedia.org/r/540421 (https://phabricator.wikimedia.org/T234238) (owner: 10Mforns)
[16:35:06] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] analytics:refinery:job:data_purge: Re-enable deletion of geoeditors [puppet] - 10https://gerrit.wikimedia.org/r/540421 (https://phabricator.wikimedia.org/T234238) (owner: 10Mforns)
[16:39:06] <Lucas_WMDE>	 okay, let’s merge those VE backports
[16:39:07] <wikibugs>	 (03CR) 10Alexandros Kosiaris: Add a commit message guide (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540366 (owner: 10Alexandros Kosiaris)
[16:39:10] <Lucas_WMDE>	 MatmaRex: can they be tested at all?
[16:39:16] <Lucas_WMDE>	 (probably not?)
[16:39:29] <MatmaRex>	 Lucas_WMDE: not really. it's logging for a rare problem that we're trying to debug
[16:39:35] <Lucas_WMDE>	 yeah, ok
[16:39:46] <Lucas_WMDE>	 then I’ll just sync them and rely on the scap canaries, I think
[16:39:57] <Lucas_WMDE>	 skipping mwdebug
[16:40:10] <MatmaRex>	 alright. thanks
[16:42:17] <wikibugs>	 (03PS4) 10Ottomata: Revert "Disable public revision-score events until we figure out a good schema" [puppet] - 10https://gerrit.wikimedia.org/r/475818 (owner: 10Ppchelko)
[16:42:36] <Lucas_WMDE>	 Krinkle: I’ve +2ed your change, let’s hope there’s enough time for it to go through CI
[16:42:42] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.24/extensions/VisualEditor/: SWAT: [[gerrit:540428|ApiVisualEditorEdit: Add logging for funny etags (T233320)]] (duration: 01m 03s)
[16:42:45] <Lucas_WMDE>	 (the Wikibase change definitely won’t make it, I’m giving up on that one)
[16:42:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:47] <stashbot>	 T233320: VisualEditor <-> RESTBase communication and ETags - https://phabricator.wikimedia.org/T233320
[16:46:24] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.25/extensions/VisualEditor/: SWAT: [[gerrit:540427|ApiVisualEditor: Add logging for RESTBase HTTP errors (T233127)]] + [[gerrit:540429|ApiVisualEditorEdit: Add logging for funny etags (T233320)]] (duration: 01m 04s)
[16:46:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:30] <stashbot>	 T233127: HTTP 404 error in VE possibly when confronted with an edit conflict - https://phabricator.wikimedia.org/T233127
[16:47:01] <Krinkle>	 Lucas_WMDE: thx
[16:47:08] <Lucas_WMDE>	 adding it to the SWAT calendar now
[16:47:41] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Revert "Disable public revision-score events until we figure out a good schema" [puppet] - 10https://gerrit.wikimedia.org/r/475818 (owner: 10Ppchelko)
[16:50:29] <logmsgbot>	 !log otto@deploy1001 Started deploy [eventstreams/deploy@dbc9bbb]: (no justification provided)
[16:50:30] <logmsgbot>	 !log otto@deploy1001 Finished deploy [eventstreams/deploy@dbc9bbb]: (no justification provided) (duration: 00m 01s)
[16:50:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:40] <logmsgbot>	 !log otto@deploy1001 Started restart [eventstreams/deploy@dbc9bbb]: (no justification provided)
[16:50:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:36] <MatmaRex>	 (thanks for deploying)
[16:51:45] <Lucas_WMDE>	 (no problem)
[16:53:39] <logmsgbot>	 !log otto@deploy1001 Started restart [eventstreams/deploy@dbc9bbb]: Enabling revision-score stream in eventstreams
[16:53:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:23] <wikibugs>	 10Operations, 10hardware-requests: eqiad+codfw: 6x hardware request for swift backend (each site) - https://phabricator.wikimedia.org/T227314 (10RobH)
[16:59:42] <Lucas_WMDE>	 Krinkle: can your change be tested?
[16:59:53] <Lucas_WMDE>	 (I would guess no, looks like just an optimization?)
[17:01:32] <Lucas_WMDE>	 well, I’ll do a quick test on mwdebug
[17:02:45] <Lucas_WMDE>	 nothing looks broken, syncing
[17:02:56] <wikibugs>	 10Operations, 10hardware-requests: eqiad+codfw: 6x hardware request for swift backend (each site) - https://phabricator.wikimedia.org/T227314 (10RobH) 05Open→03Resolved a:03RobH all hardware requested was ordered so this is resolved
[17:03:46] <Krinkle>	 Lucas_WMDE: testing now
[17:03:53] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.25/skins/Vector/: SWAT: [[gerrit:540396|vector.js: Remove eager calculation of p-cactions width on page load]] (duration: 01m 00s)
[17:03:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:52] <wikibugs>	 (03PS1) 10Mforns: web::fetches::stats.pp Absent mediawiki history rsync [puppet] - 10https://gerrit.wikimedia.org/r/540442 (https://phabricator.wikimedia.org/T208612)
[17:05:18] <Lucas_WMDE>	 (I just realized I tested on wikidatawiki, which is still on .24 🤦)
[17:05:35] <Lucas_WMDE>	 (but testwikidatawiki also looks ok)
[17:05:58] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "not used in prod DB but fixes something on cloud" [puppet] - 10https://gerrit.wikimedia.org/r/539973 (owner: 10Dzahn)
[17:06:11] <wikibugs>	 (03PS6) 10Dzahn: mariadb::packages: support buster, drop libmariadbclient18 [puppet] - 10https://gerrit.wikimedia.org/r/539973
[17:07:18] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: bump shard size threshold for logstash [puppet] - 10https://gerrit.wikimedia.org/r/540444
[17:08:09] <Lucas_WMDE>	 !log Morning SWAT done
[17:08:11] <Krinkle>	 Lucas_WMDE: getting cache issue, one min..
[17:08:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:17] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] web::fetches::stats.pp Absent mediawiki history rsync [puppet] - 10https://gerrit.wikimedia.org/r/540442 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns)
[17:08:24] <wikibugs>	 (03PS2) 10Ottomata: web::fetches::stats.pp Absent mediawiki history rsync [puppet] - 10https://gerrit.wikimedia.org/r/540442 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns)
[17:08:27] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] web::fetches::stats.pp Absent mediawiki history rsync [puppet] - 10https://gerrit.wikimedia.org/r/540442 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns)
[17:08:34] <Krinkle>	 Lucas_WMDE: anyway, np, will follow up if needed.
[17:08:36] <Lucas_WMDE>	 I have to leave pretty soon :/
[17:08:39] <Lucas_WMDE>	 ok
[17:09:14] <wikibugs>	 (03CR) 10Dzahn: "what is using this: quarry, wikistats (cloud), cloud VPS using simplelamp in the future (to remove mysql module)" [puppet] - 10https://gerrit.wikimedia.org/r/539973 (owner: 10Dzahn)
[17:09:55] <Krinkle>	 OK, all good in prod.
[17:09:58] <Krinkle>	 Thanks Lucas_WMDE 
[17:10:07] <wikibugs>	 10Operations, 10Patch-For-Review, 10User-fgiunchedi: rsyslog's in:imtcp thread stuck on recvfrom loop from down/rebooted hosts - https://phabricator.wikimedia.org/T199406 (10fgiunchedi) 05Open→03Stalled Setting as stalled for now, the immediate issue has been bandaided
[17:11:38] <Lucas_WMDE>	 great! :)
[17:12:15] <wikibugs>	 (03PS1) 10Ottomata: Add ensure parameter to dumps::web::fetches [puppet] - 10https://gerrit.wikimedia.org/r/540445
[17:13:02] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Add ensure parameter to dumps::web::fetches [puppet] - 10https://gerrit.wikimedia.org/r/540445 (owner: 10Ottomata)
[17:13:09] <wikibugs>	 (03PS2) 10Ottomata: Add ensure parameter to dumps::web::fetches [puppet] - 10https://gerrit.wikimedia.org/r/540445
[17:13:11] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add ensure parameter to dumps::web::fetches [puppet] - 10https://gerrit.wikimedia.org/r/540445 (owner: 10Ottomata)
[17:17:06] <wikibugs>	 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10Traffic, and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10Jdforrester-WMF)
[17:20:20] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Goal, and 2 others: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759 (10dduvall) 05Open→03Stalled Work on this has stalled. I've uninstalled the Prometheus plugin from Jenki...
[17:20:24] <wikibugs>	 10Operations, 10Goal, 10User-fgiunchedi: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197 (10dduvall)
[17:21:43] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Goal, and 2 others: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759 (10dduvall) 05Stalled→03Declined Marking this as "declined" to remove the task from view. We can always...
[17:21:47] <wikibugs>	 10Operations, 10Goal, 10User-fgiunchedi: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197 (10dduvall)
[17:31:15] <wikibugs>	 (03PS1) 10Dzahn: wikistats (cloud): add deployment info to motd [puppet] - 10https://gerrit.wikimedia.org/r/540447
[17:32:30] <wikibugs>	 10Operations, 10Puppet, 10Patch-For-Review: Puppet: get row/rack info from Netbox - https://phabricator.wikimedia.org/T229397 (10crusnov) a:03crusnov
[17:35:38] <wikibugs>	 (03PS2) 10Dzahn: wikistats (cloud): add deployment info to motd [puppet] - 10https://gerrit.wikimedia.org/r/540447
[17:37:00] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikistats#How_to_deploy_latest_code" [puppet] - 10https://gerrit.wikimedia.org/r/540447 (owner: 10Dzahn)
[17:43:18] <wikibugs>	 (03CR) 10Hashar: "> Package[openjdk-8-dbg]/ensure: created on gerrit2001." [puppet] - 10https://gerrit.wikimedia.org/r/540094 (https://phabricator.wikimedia.org/T231872) (owner: 10Hashar)
[17:44:10] <wikibugs>	 10Puppet: Allow variables without hiera calls as lookup() default parameters - https://phabricator.wikimedia.org/T234459 (10fgiunchedi)
[17:45:07] <wikibugs>	 (03PS5) 10Filippo Giunchedi: profile: sanity checks for cluster [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232)
[17:45:09] <wikibugs>	 (03PS2) 10Filippo Giunchedi: DNM Revert "hieradata: add acmechief cluster" [puppet] - 10https://gerrit.wikimedia.org/r/540246
[17:46:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi)
[17:49:41] <wikibugs>	 (03CR) 10Nuria: "Thanks Marcel for taking care of this." [puppet] - 10https://gerrit.wikimedia.org/r/540421 (https://phabricator.wikimedia.org/T234238) (owner: 10Mforns)
[17:52:34] <wikibugs>	 (03PS1) 10Dzahn: wikistats (cloud): fix template path, refactor db, httpd to profiles [puppet] - 10https://gerrit.wikimedia.org/r/540450
[17:59:45] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18717/wikistats-greyhound.wikistats.eqiad.wmflabs/change.wikistats-greyhound.wikistats.eq" [puppet] - 10https://gerrit.wikimedia.org/r/540450 (owner: 10Dzahn)
[17:59:58] <wikibugs>	 (03PS2) 10Dzahn: wikistats (cloud): fix template path, refactor db, httpd to profiles [puppet] - 10https://gerrit.wikimedia.org/r/540450
[18:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191002T1800)
[18:01:14] <wikibugs>	 (03PS7) 10Herron: logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739)
[18:03:03] <wikibugs>	 (03CR) 10Paladox: [C: 03+1] gerrit: install openjdk dbg package (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/540094 (https://phabricator.wikimedia.org/T231872) (owner: 10Hashar)
[18:09:05] <MatmaRex>	 Krinkle: hey, do you have a minute? the logging patches i had swatted earlier don't seem to be actually logging anything. can you help me figure out if i did the logging wrong, or if i'm looking at the wrong place for them?
[18:09:31] <wikibugs>	 (03CR) 10Krinkle: Grant autocreateaccount to everyone on closed wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540380 (https://phabricator.wikimedia.org/T222117) (owner: 10Urbanecm)
[18:10:09] <wikibugs>	 (03PS1) 10Dzahn: wikistats (cloud): fix duplicate declaration of mariabd-server pkg [puppet] - 10https://gerrit.wikimedia.org/r/540454
[18:10:36] <wikibugs>	 (03PS2) 10Dzahn: wikistats (cloud): fix duplicate declaration of mariabd-server pkg [puppet] - 10https://gerrit.wikimedia.org/r/540454
[18:10:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wikistats (cloud): fix duplicate declaration of mariabd-server pkg [puppet] - 10https://gerrit.wikimedia.org/r/540454 (owner: 10Dzahn)
[18:10:43] <Krinkle>	 MatmaRex: ok, where are you looking?
[18:10:43] <MatmaRex>	 i go to logstash.wikimedia.org, i add a filter for "channel is VisualEditor", adjust the time range, and get "No results found"
[18:11:12] <MatmaRex>	 i'd give you a link, but the "Share" feature doesn't seem to work
[18:11:32] <Krinkle>	 Share -> bottom right panel -> Short URL
[18:11:34] <Krinkle>	 should work
[18:11:43] <Krinkle>	 I'm looking at https://logstash.wikimedia.org/goto/5f4d3be5e6279601de970da1fc46df88
[18:11:47] <Krinkle>	 which is showing no results indeed
[18:12:12] <Krinkle>	 MatmaRex: Can you repro a logger call as a user?
[18:12:23] <Krinkle>	 If so, try with XWD and []Log enabled.
[18:12:27] <wikibugs>	 (03PS1) 10Dzahn: wikistats (cloud): fix duplicate declare of mariadb-server package [puppet] - 10https://gerrit.wikimedia.org/r/540455
[18:12:33] <Krinkle>	 which should rule out any issue with filtering and config settings
[18:12:34] <bblack>	  /win 11
[18:12:34] <MatmaRex>	 Krinkle: thanks, that works, https://logstash.wikimedia.org/goto/78fdce6b25619569e5929c3d60eae4ee (i was taking the wrong link)
[18:12:36] <wikibugs>	 (03PS8) 10Herron: logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739)
[18:13:32] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on db2050 - https://phabricator.wikimedia.org/T234444 (10Marostegui) 05Open→03Invalid Ignore this, this host will be powered off tomorrow: {T230391}
[18:13:56] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] wikistats (cloud): fix duplicate declare of mariadb-server package [puppet] - 10https://gerrit.wikimedia.org/r/540455 (owner: 10Dzahn)
[18:14:25] <wikibugs>	 10Operations, 10netops: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 (10ayounsi) Updated change with the above feedbacks: `lang=diff [edit protocols bgp group IX4] +    damping; [edit protocols bgp group IX6] +    damping; [edit policy-options policy-statement BGP_IXP_...
[18:14:43] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Marostegui)
[18:14:46] <MatmaRex>	 Krinkle: it should be possible to trigger this by opening VE anywhere, doing `ve.init.target.etag = 'test'` in the console, and opening the save dialog (without saving changes)
[18:15:12] <XioNoX>	 !log add BGP route damping on IX sessions - ulsfo - T222424
[18:15:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:16] <stashbot>	 T222424: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424
[18:15:19] <Krinkle>	 MatmaRex: ok, try that with XWD/mwdebug/Log and check https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002
[18:15:21] <MatmaRex>	 Krinkle: what do you mean by " XWD and []Log"?
[18:15:28] <wikibugs>	 (03PS9) 10Herron: logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739)
[18:15:34] <Krinkle>	 MatmaRex: the WikimediaDebug browser extension
[18:15:41] <Krinkle>	 X-Wikimedia-Debug
[18:15:53] <Krinkle>	 https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Debug_logging
[18:15:56] <wikibugs>	 (03Abandoned) 10Dzahn: wikistats (cloud): fix duplicate declaration of mariabd-server pkg [puppet] - 10https://gerrit.wikimedia.org/r/540454 (owner: 10Dzahn)
[18:16:35] <MatmaRex>	 Krinkle: right, i just tried it, and i don't see any logged entries
[18:16:49] <MatmaRex>	 (there's an unrelated AbuseFilter one, actually)
[18:16:55] <wikibugs>	 (03PS17) 10Thcipriani: Add restbase chart (port from local-charts) [deployment-charts] - 10https://gerrit.wikimedia.org/r/517557 (https://phabricator.wikimedia.org/T228910) (owner: 10Jeena Huneidi)
[18:17:55] <wikibugs>	 (03CR) 10Herron: [C: 03+2] logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron)
[18:18:57] <Krinkle>	 MatmaRex: OK, so that would mean either the code isn't actually being run the way you think, or the Logger might be a NullLogger?
[18:19:16] <Krinkle>	 but from the commit I reviewed, that wasn't the case as you get one from the factory directly
[18:19:37] <MatmaRex>	 $this->logger = LoggerFactory::getInstance( 'VisualEditor' );
[18:20:46] <MatmaRex>	 Krinkle: can we deploy a patch where i just log anything in execute() and see if that gets recorded?
[18:21:45] <Krinkle>	 https://logstash.wikimedia.org/goto/df0a315d76a3bb27f7a88410155da141
[18:21:52] <Krinkle>	 I've run it from eval.php to see if it works
[18:21:59] <Krinkle>	 I did debug(), info() and warning()
[18:22:01] <Krinkle>	 only warning() made it
[18:22:17] <Krinkle>	 nope info() made it as well
[18:23:05] <Krinkle>	 but debug() didn't
[18:23:07] <Krinkle>	 tried it a few more times
[18:23:09] <wikibugs>	 10Operations, 10netops: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 (10ayounsi) For the record: ` cr4-ulsfo> show bgp neighbor | match "Suppressed due to damping"| except "    0"                           Suppressed due to damping:    1     Suppressed due to damping:...
[18:23:10] <MatmaRex>	 Krinkle: hm, okay. so i guess the bug is in my code
[18:25:04] <XioNoX>	 !log add BGP route damping on IX sessions - eqdfw - T222424
[18:25:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:07] <stashbot>	 T222424: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424
[18:26:34] <icinga-wm>	 PROBLEM - exim queue on mx1001 is CRITICAL: CRITICAL: 4145 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim
[18:27:35] <Krinkle>	 MatmaRex: See the comment in IS.php above the wmf var you modified
[18:27:46] <Krinkle>	 Looks like there is a special rule about level 'debug' and Logstash by default.
[18:27:55] <Krinkle>	 you can override that, but might be better to use info() instead.
[18:28:06] <XioNoX>	 !log add BGP route damping on IX sessions - eqord - T222424
[18:28:08] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:28:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:57] <wikibugs>	 (03PS1) 10Herron: logstash: fix ordering in filter-throttle-errors [puppet] - 10https://gerrit.wikimedia.org/r/540456
[18:29:00] <MatmaRex>	 Krinkle: i am using info() though. (and warning() for another message)
[18:29:27] <MatmaRex>	 Krinkle: the conf for 'AbuseFilter' channel looks the same, and that one works as expected
[18:29:31] <Krinkle>	 The one I saw in CR used debug(). If it's using info() or above already, then yeah, might be a code issue.
[18:29:40] <Krinkle>	 e.g. the condition not being reached
[18:29:59] <Krinkle>	 gotta go now :) hope this helps
[18:30:34] <MatmaRex>	 thanks
[18:30:43] <Krinkle>	 some messages just appeared on my query btw MatmaRex 
[18:30:52] <Krinkle>	 so maybe it's working now :)
[18:30:53] <Krinkle>	 bye!
[18:31:14] <wikibugs>	 (03CR) 10Herron: [C: 03+2] logstash: fix ordering in filter-throttle-errors [puppet] - 10https://gerrit.wikimedia.org/r/540456 (owner: 10Herron)
[18:33:07] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] dumps-misc.sh.erb: Remove designate_pool_manager from backups [puppet] - 10https://gerrit.wikimedia.org/r/539839 (https://phabricator.wikimedia.org/T233978) (owner: 10Marostegui)
[18:33:30] <wikibugs>	 (03PS2) 10Andrew Bogott: deployment-prep: Fix purge_host_regex [puppet] - 10https://gerrit.wikimedia.org/r/536794 (owner: 10Alex Monk)
[18:34:22] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] deployment-prep: Fix purge_host_regex [puppet] - 10https://gerrit.wikimedia.org/r/536794 (owner: 10Alex Monk)
[18:35:05] <wikibugs>	 10Operations, 10netops: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 (10ayounsi) Eqord: `     Suppressed due to damping:    4     Suppressed due to damping:    4     Suppressed due to damping:    1     Suppressed due to damping:    1 ` eqdfw: `     Suppressed due to da...
[18:35:56] <wikibugs>	 (03PS1) 10Ottomata: Run hdfs balancer with threshold of 5% [puppet] - 10https://gerrit.wikimedia.org/r/540457 (https://phabricator.wikimedia.org/T231828)
[18:37:20] <XioNoX>	 chaomodus: ^ (Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL)
[18:38:27] <wikibugs>	 10Operations, 10DC-Ops, 10SRE-tools: Host decommission improvements - https://phabricator.wikimedia.org/T231066 (10RobH)
[18:38:38] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] mariadb backups: Include extra valid sections on checking script [puppet] - 10https://gerrit.wikimedia.org/r/538885 (https://phabricator.wikimedia.org/T231208) (owner: 10Jcrespo)
[18:38:42] <wikibugs>	 10Operations, 10DC-Ops, 10SRE-tools: Host decommission improvements - https://phabricator.wikimedia.org/T231066 (10RobH) updated:  - Update the [[ https://phabricator.wikimedia.org/maniphest/task/edit/form/52/ | Phabricator template for host decommissioning ]] -- Done by @robh on 2019-10-02 - using https://e...
[18:38:42] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:38:58] <wikibugs>	 (03PS1) 10Paladox: Gerrit: Tweek replication config [puppet] - 10https://gerrit.wikimedia.org/r/540458
[18:43:25] <hauskater>	 paladox: Tweak maybe?
[18:43:34] <paladox>	 hashar that one, yup!
[18:43:37] <paladox>	 err
[18:43:41] <paladox>	 wrong ping i meant hauskater :)
[18:44:39] <wikibugs>	 10Operations, 10Analytics, 10Analytics-EventLogging, 10EventBus, and 3 others: Public EventGate endpoint for analytics event intake - https://phabricator.wikimedia.org/T233629 (10Ottomata)
[18:46:00] <wikibugs>	 (03PS1) 10Dzahn: install_server: reinstall gerrit1001 with buster [puppet] - 10https://gerrit.wikimedia.org/r/540460 (https://phabricator.wikimedia.org/T222391)
[18:46:57] <wikibugs>	 (03PS2) 10Dzahn: install_server: reinstall gerrit1001 with buster [puppet] - 10https://gerrit.wikimedia.org/r/540460 (https://phabricator.wikimedia.org/T222391)
[18:47:37] <wikibugs>	 (03PS2) 10Paladox: Gerrit: Tweak replication config [puppet] - 10https://gerrit.wikimedia.org/r/540458
[18:48:24] <wikibugs>	 (03PS3) 10Paladox: Gerrit: Tweak replication config [puppet] - 10https://gerrit.wikimedia.org/r/540458
[18:48:30] <wikibugs>	 (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/540458 (owner: 10Paladox)
[18:48:54] <wikibugs>	 (03PS1) 10Paladox: Gerrit: Rename "slaves" in replication config to replica [puppet] - 10https://gerrit.wikimedia.org/r/540462
[18:49:12] <wikibugs>	 (03PS2) 10Paladox: Gerrit: Rename "slaves" in replication config to replica [puppet] - 10https://gerrit.wikimedia.org/r/540462
[18:49:35] <wikibugs>	 (03PS3) 10Paladox: Gerrit: Rename "slaves" in replication config to replica [puppet] - 10https://gerrit.wikimedia.org/r/540462
[18:50:46] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] install_server: reinstall gerrit1001 with buster [puppet] - 10https://gerrit.wikimedia.org/r/540460 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn)
[18:50:50] <wikibugs>	 (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/540462 (owner: 10Paladox)
[18:50:58] <wikibugs>	 (03PS3) 10Dzahn: install_server: reinstall gerrit1001 with buster [puppet] - 10https://gerrit.wikimedia.org/r/540460 (https://phabricator.wikimedia.org/T222391)
[18:55:34] <paladox>	 mutante \o/
[18:55:50] <hauskater>	 ain't "buster" an insult? :?
[18:56:02] <wikibugs>	 (03CR) 10Dzahn: ""If pushing to multiple remotes, over differing types of network connections (e.g. LAN and also public Internet), its a good idea to put t" [puppet] - 10https://gerrit.wikimedia.org/r/540164 (owner: 10Paladox)
[18:58:03] <paladox>	 mutante it should only be temp
[18:58:28] <paladox>	 we will be reverting as soon as we are ready to make gerrit1001 the primary master
[18:59:26] <wikibugs>	 (03PS2) 10Ottomata: Add missing an-worker1088 to hadoop net_topology [puppet] - 10https://gerrit.wikimedia.org/r/540149 (https://phabricator.wikimedia.org/T209929)
[18:59:35] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Add missing an-worker1088 to hadoop net_topology [puppet] - 10https://gerrit.wikimedia.org/r/540149 (https://phabricator.wikimedia.org/T209929) (owner: 10Ottomata)
[19:00:04] <jouncebot>	 marxarelli: How many deployers does it take to do MediaWiki train - American version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191002T1900).
[19:01:02] <wikibugs>	 (03CR) 10Paladox: "> "If pushing to multiple remotes, over differing types of network" [puppet] - 10https://gerrit.wikimedia.org/r/540164 (owner: 10Paladox)
[19:02:21] <wikibugs>	 (03CR) 10Dzahn: "2 threads - seems to make sense if we have 2 remotes per " Each thread can push one project at a time, to one destination URL. "" [puppet] - 10https://gerrit.wikimedia.org/r/540458 (owner: 10Paladox)
[19:03:25] <wikibugs>	 (03PS4) 10Paladox: Gerrit: Tweak replication config [puppet] - 10https://gerrit.wikimedia.org/r/540458
[19:03:34] <wikibugs>	 10Operations, 10decommission: Decommission analytics1032 - https://phabricator.wikimedia.org/T233080 (10elukey) ` elukey@asw2-c-eqiad# show | compare [edit interfaces interface-range disabled]      member ge-7/0/34 { ... } +    member ge-3/0/12; [edit interfaces] -   ge-3/0/12 { -       description "analytics1...
[19:04:15] <wikibugs>	 (03CR) 10Paladox: "> 2 threads - seems to make sense if we have 2 remotes per " Each" [puppet] - 10https://gerrit.wikimedia.org/r/540458 (owner: 10Paladox)
[19:04:24] <wikibugs>	 (03PS4) 10Paladox: Gerrit: Rename "slaves" in replication config to replica [puppet] - 10https://gerrit.wikimedia.org/r/540462
[19:05:33] <wikibugs>	 10Operations, 10decommission: Decommission analytics1032 - https://phabricator.wikimedia.org/T233080 (10elukey)
[19:08:30] <wikibugs>	 (03PS1) 10Herron: logstash: add id to drop action in filter-throttle-errors [puppet] - 10https://gerrit.wikimedia.org/r/540465 (https://phabricator.wikimedia.org/T233739)
[19:09:50] <wikibugs>	 (03CR) 10Herron: "in retrospect this should be more useful in terms of metrics/monitoring than the throttle filters themselves" [puppet] - 10https://gerrit.wikimedia.org/r/540465 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron)
[19:12:59] <James_F>	 MatmaRex: I should wait for marxarelli to be done deploying the train before screwing around in production, if that's OK. :-)
[19:13:23] <MatmaRex>	 oh right. heh
[19:14:47] <marxarelli>	 James_F, MatmaRex: rolling it now
[19:14:55] * James_F stands well back. ;-)
[19:15:06] <wikibugs>	 (03CR) 10Herron: [C: 03+2] logstash: add id to drop action in filter-throttle-errors [puppet] - 10https://gerrit.wikimedia.org/r/540465 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron)
[19:16:30] <wikibugs>	 (03PS1) 10BBlack: Add Digicert 2019 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/540469 (https://phabricator.wikimedia.org/T209515)
[19:16:32] <wikibugs>	 (03PS1) 10BBlack: Deploy inactive digicert 2019 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/540470 (https://phabricator.wikimedia.org/T209515)
[19:18:04] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Add Digicert 2019 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/540469 (https://phabricator.wikimedia.org/T209515) (owner: 10BBlack)
[19:20:07] <wikibugs>	 (03PS1) 10Dduvall: group1 wikis to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540471
[19:20:09] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540471 (owner: 10Dduvall)
[19:21:01] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540471 (owner: 10Dduvall)
[19:21:43] <wikibugs>	 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ge...
[19:22:18] <logmsgbot>	 !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.25
[19:22:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:17] <logmsgbot>	 !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.25 (duration: 00m 59s)
[19:23:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:48] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to analytics cluster for Djellel Difallah - https://phabricator.wikimedia.org/T234473 (10DED)
[19:33:41] <marxarelli>	 !log 1.34.0-wmf.25 promoted to group1, cc: T220750. no rise in relevant error rates
[19:33:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:50] <stashbot>	 T220750: 1.34.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T220750
[19:34:54] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[19:34:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:04] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[19:37:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:13] <wikibugs>	 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['gerrit1001.wikimedia.org'] `  Of which those **FA...
[19:43:11] <James_F>	 marxarelli: LGTM.
[19:43:12] <wikibugs>	 (03PS1) 10Jgreen: Adjust nsca_frack.cfg.erb for new approach to monitoring endpoints. [puppet] - 10https://gerrit.wikimedia.org/r/540472 (https://phabricator.wikimedia.org/T212252)
[19:43:27] <marxarelli>	 James_F: same!
[19:43:52] <wikibugs>	 (03CR) 10Dzahn: "> Patch Set 2: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/540252 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn)
[19:44:30] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] Adjust nsca_frack.cfg.erb for new approach to monitoring endpoints. [puppet] - 10https://gerrit.wikimedia.org/r/540472 (https://phabricator.wikimedia.org/T212252) (owner: 10Jgreen)
[19:47:09] <Jeff_Green>	 !log deployed icinga fundraising-nsca collection configuration change
[19:47:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:02] <mutante>	 !log puppetmaster1001 - sudo puppet cert clean parsoid.discovery.wmnet (only created yesterday but does not have all the SANs it needs, updating with more SANs) (T233654)
[19:58:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:06] <stashbot>	 T233654: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654
[20:00:04] <jouncebot>	 cscott, arlolra, subbu, bearND, halfak, and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191002T2000).
[20:04:22] <subbu>	 no parsoid deploy today
[20:05:05] <mutante>	 subbu: working on creating the needed certificates for parsoid/PHP on appservers
[20:05:39] <subbu>	 ok!
[20:05:41] <mutante>	 we need one that has ALL the alt. names on it used by appservers PLUS parsoid.discovery and parsoid.svc. 
[20:05:55] <mutante>	 that's a lot of names
[20:06:07] <mutante>	 but after that i should be able to apply a change on parsoid2001
[20:06:27] <mutante>	 and turn it into the first "hybrid" one
[20:08:39] <James_F>	 Jeff_Green: Do you know off-hand if any of the FR code is still running on HHVM? It's the last bit of CI with it.
[20:13:42] <James_F>	 OK, I'm going to be mucking with unmerged patches in prod, please no unexpected scaps. ;-)
[20:14:35] <Jeff_Green>	 James_f we never ran HHVM
[20:14:50] <James_F>	 Jeff_Green: Aha, excellent. Thank you!
[20:14:57] <Jeff_Green>	 np!
[20:15:02] <icinga-wm>	 RECOVERY - Host db2112.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 47.57 ms
[20:15:08] <icinga-wm>	 RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 39.23 ms
[20:15:17] <James_F>	 MatmaRex: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/540464 is live on mwdebug1002.
[20:15:48] <icinga-wm>	 RECOVERY - Host db2077.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.40 ms
[20:16:35] <MatmaRex>	 James_F: thanks. well, that seems to get logged
[20:16:56] <MatmaRex>	 (i'm looking at https://logstash.wikimedia.org/goto/78fdce6b25619569e5929c3d60eae4ee)
[20:17:03] <icinga-wm>	 RECOVERY - Host db2087.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.33 ms
[20:17:05] <icinga-wm>	 RECOVERY - Host elastic2031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.26 ms
[20:17:05] <icinga-wm>	 RECOVERY - Host db2125.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.41 ms
[20:17:09] <icinga-wm>	 RECOVERY - Host ganeti2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.61 ms
[20:17:27] <icinga-wm>	 RECOVERY - Host mc2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms
[20:17:33] <icinga-wm>	 RECOVERY - Host mc2027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.70 ms
[20:17:33] <icinga-wm>	 RECOVERY - Host restbase2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms
[20:17:53] <icinga-wm>	 RECOVERY - Host restbase2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.15 ms
[20:17:57] <icinga-wm>	 RECOVERY - Host cloudservices2002-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.68 ms
[20:17:59] <icinga-wm>	 RECOVERY - Host cloudcontrol2003-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.73 ms
[20:19:11] <James_F>	 MatmaRex: Both warning and info, yeah.
[20:19:20] <icinga-wm>	 RECOVERY - Host es2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.65 ms
[20:19:31] <James_F>	 MatmaRex: Are you looking for things getting into kibana or into something else?
[20:19:40] <mutante>	 papaul: ^ mgmt switch ?
[20:19:47] <icinga-wm>	 RECOVERY - Host ganeti2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.05 ms
[20:20:03] <icinga-wm>	 RECOVERY - Host es2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms
[20:20:03] <icinga-wm>	 RECOVERY - Host ores2005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms
[20:20:03] <icinga-wm>	 RECOVERY - Host pc2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.06 ms
[20:20:29] <MatmaRex>	 James_F: yes. i'm trying to figure out how come i don't see the logs there generated by this code: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/540428/1/includes/ApiVisualEditorEdit.php
[20:20:55] <James_F>	 MatmaRex: Are you sure the if statement is triggering?
[20:21:32] <James_F>	 (And does ApiVisualEditorEdit have the scope for the logger from ApiVisualEditor? Presumably?)
[20:22:18] <MatmaRex>	 as sure as one can be. it worked for me locally (writing logs to $wgDebugLogFile)
[20:22:43] <James_F>	 Hmm.
[20:22:51] <James_F>	 Does it only trigger very rarely?
[20:23:04] <James_F>	 Or with some crappy browser/network proxy?
[20:23:16] <MatmaRex>	 for example this query should trigger it: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=visualeditoredit&format=json&paction=serializeforcache&page=User%3AMatma_Rex%2Fsandbox&token=&html=<b>asdfasdfasdf<%2Fb>&etag=test
[20:23:31] <MatmaRex>	 yes, it shouldn't trigger normally
[20:26:07] <James_F>	 (I'm resetting mwdebug1002 to origin/wmf/1.34.0-wmf.24.)
[20:26:59] <MatmaRex>	 ok
[20:27:29] <James_F>	 Why are some events shown as from the "VisualEditor" channel and some from the "visualeditor" channel?
[20:27:42] <James_F>	 Is it just that we're setting the wrong case-sensitive string in the match?
[20:28:39] <James_F>	 The "called on 'The Fighting Temeraire' with paction: 'parsefragment'" message comes from `visualeditor` but the logger ones go to `VisualEditor`.
[20:29:21] <MatmaRex>	 James_F: the lowercase is generated by `wfDebugLog( 'visualeditor',  .... )`, which i didn't notice there before
[20:29:38] <MatmaRex>	 James_F: and it's only logged with the WikimediaDebug extension with "Log" option enabled, which i tested accidentally once earlier today
[20:30:04] <MatmaRex>	 i don't know why it shows up in that search, i guess the search is case-insensitive
[20:30:23] <James_F>	 Kibana is.
[20:30:36] <James_F>	 udp2log isn't, I think?
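For readers following the channel-name confusion above, here is a minimal sketch of how a case-insensitive search can surface events from both `visualeditor` and `VisualEditor` while a case-sensitive match keeps them apart. The event dictionaries and the `filter_by_channel` helper are hypothetical illustrations, not how Kibana or udp2log are actually implemented.

```python
def filter_by_channel(events, channel, case_sensitive=True):
    """Return the events whose channel matches `channel`.

    Hypothetical helper for illustration only.
    """
    if case_sensitive:
        return [e for e in events if e["channel"] == channel]
    wanted = channel.casefold()
    return [e for e in events if e["channel"].casefold() == wanted]


# Two made-up events modelled on the messages quoted in the discussion.
events = [
    {"channel": "visualeditor", "message": "called with paction: 'parsefragment'"},
    {"channel": "VisualEditor", "message": "ApiVisualEditorEdit::postData: Testing test"},
]

# A case-sensitive match sees only the logger events.
strict = filter_by_channel(events, "VisualEditor")
# A case-insensitive match also picks up the wfDebugLog-style channel.
loose = filter_by_channel(events, "VisualEditor", case_sensitive=False)
```

Under these assumptions, `strict` holds one event and `loose` holds both, which would be consistent with the lowercase-channel messages showing up in a case-insensitive search.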
[20:30:41] <wikibugs>	 (03CR) 10MarcoAurelio: Grant autocreateaccount to everyone on closed wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540380 (https://phabricator.wikimedia.org/T222117) (owner: 10Urbanecm)
[20:30:49] <James_F>	 Meh.
[20:33:42] <wikibugs>	 (03PS3) 10Brennen Bearnes: mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T222494)
[20:33:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T222494) (owner: 10Brennen Bearnes)
[20:35:22] <wikibugs>	 (03PS4) 10Brennen Bearnes: mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391)
[20:35:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391) (owner: 10Brennen Bearnes)
[20:35:37] <wikibugs>	 (03CR) 10MarcoAurelio: "Closed wikis are candidates to be deleted at some point in the future. I note that one of the steps to delete a wiki, according to <https:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540380 (https://phabricator.wikimedia.org/T222117) (owner: 10Urbanecm)
[20:35:42] <MatmaRex>	 James_F: are you busy or do you have time to fiddle with this? i guess i can add a lot more logging
[20:36:34] <James_F>	 MatmaRex: I'm a bit busy, but I can deploy local hacks every few minutes to help out.
[20:37:11] <James_F>	 I'm just removing all traces of HHVM from CI, no big deal. ;-)
[20:38:14] <Krinkle>	 MatmaRex: Did you see the events that made it through right after we stopped testing?
[20:38:39] <wikibugs>	 (03CR) 10Herron: [C: 03+1] hieradata: bump shard size threshold for logstash [puppet] - 10https://gerrit.wikimedia.org/r/540444 (owner: 10Filippo Giunchedi)
[20:38:50] <MatmaRex>	 Krinkle: yes, they were unrelated, from some old logging code i didn't even know was there
[20:38:56] <Krinkle>	 ok
[20:38:58] <wikibugs>	 (03PS3) 10Dzahn: add certificate for parsoid.discovery/parsoid.svc [puppet] - 10https://gerrit.wikimedia.org/r/540252 (https://phabricator.wikimedia.org/T233654)
[20:39:12] <wikibugs>	 (03PS5) 10Brennen Bearnes: mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391)
[20:39:16] <MatmaRex>	 there's a wfDebugLog( 'visualeditor', … ) call elsewhere, and i tested with WikimediaDebug with "Log" enabled, so it got logged
[20:41:32] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "i followed the docs how to update it and dropped the old cert from puppet CA. then added the new one with all these SANs in private repo." [puppet] - 10https://gerrit.wikimedia.org/r/540252 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn)
[20:41:52] <wikibugs>	 (03PS4) 10Dzahn: add certificate for parsoid.discovery/parsoid.svc [puppet] - 10https://gerrit.wikimedia.org/r/540252 (https://phabricator.wikimedia.org/T233654)
[20:41:56] <MatmaRex>	 James_F: i updated https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/540464 , can you deploy it?
[20:43:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: bump shard size threshold for logstash [puppet] - 10https://gerrit.wikimedia.org/r/540444 (owner: 10Filippo Giunchedi)
[20:43:38] <James_F>	 MatmaRex: Done.
[20:43:45] <wikibugs>	 (03PS2) 10Filippo Giunchedi: hieradata: bump shard size threshold for logstash [puppet] - 10https://gerrit.wikimedia.org/r/540444
[20:44:24] <MatmaRex>	 okay, this is illuminating
[20:44:25] <MatmaRex>	 so
[20:44:38] <MatmaRex>	 if i pass the second parameter to info(), nothing gets logged
[20:44:47] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "Looks good to me! One minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539978 (https://phabricator.wikimedia.org/T207200) (owner: 10Filippo Giunchedi)
[20:47:28] <wikibugs>	 (03CR) 10Krinkle: mediawiki-dev: use wikimedia/mediawiki-core:dev (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391) (owner: 10Brennen Bearnes)
[20:47:35] <MatmaRex>	 i don't understand why
[20:49:32] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (10Jclark-ctr)  Do we want  9 host Row D only has 2 10g racks    Row A: 1053, 1054 (2 nodes, try to avoid A3, then avoid A6) Row B: 1055 (1 node), Avoid...
[20:51:32] <MatmaRex>	 Krinkle: James_F: do you have any idea why the message would completely disappear when i pass the second argument? (the array with extra data)
[20:52:17] <MatmaRex>	 did i make some insane typo somewhere
[20:53:29] <Krinkle>	 MatmaRex: link or paste of the code?
[20:53:57] <MatmaRex>	 Krinkle: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/540464/2/includes/ApiVisualEditorEdit.php line 280
[20:54:08] <MatmaRex>	 line 281 and 282 *
[20:54:08] <MatmaRex>	 Krinkle: the first one is recorded, the second is not
[20:54:59] <MatmaRex>	 i can see the first one here, but not the second: https://logstash.wikimedia.org/goto/8484a8c94a8851a60d89ef17a1d2d776 "ApiVisualEditorEdit::postData: Testing test"
[20:55:12] <wikibugs>	 (03CR) 10Brennen Bearnes: mediawiki-dev: use wikimedia/mediawiki-core:dev (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391) (owner: 10Brennen Bearnes)
[20:56:51] <wikibugs>	 (03PS5) 10Paladox: Gerrit: Tweak replication config [puppet] - 10https://gerrit.wikimedia.org/r/540458
[20:57:00] <wikibugs>	 (03PS5) 10Paladox: Gerrit: Rename "slaves" in replication config to replica [puppet] - 10https://gerrit.wikimedia.org/r/540462
[20:57:40] <wikibugs>	 (03PS6) 10Paladox: Gerrit: Rename "slaves" in replication config to replica_codfw [puppet] - 10https://gerrit.wikimedia.org/r/540462
[20:57:46] <wikibugs>	 (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/540462 (owner: 10Paladox)
[20:58:54] <James_F>	 MatmaRex: What happens if you try to log the array without the string?
[20:58:55] <wikibugs>	 (03PS1) 10Herron: logstash: remove filter/throttle/normalized_message_drop id [puppet] - 10https://gerrit.wikimedia.org/r/540482
[20:59:22] <MatmaRex>	 James_F: i don't know, but we can try if you want to deploy it
[20:59:29] <MatmaRex>	 wait, i'm not sure what you mean
[20:59:37] <MatmaRex>	 $this->logger->info( __METHOD__ . ": Testing {etag}", [] ); ?
[20:59:41] <MatmaRex>	 $this->logger->info( __METHOD__ . ": Testing", [ 'etag' => $etag ] ); ?
[20:59:43] <wikibugs>	 (03PS3) 10Jeena Huneidi: [DNM] Update scaffold template names to use chart name [deployment-charts] - 10https://gerrit.wikimedia.org/r/539220
[20:59:51] <Krinkle>	 MatmaRex: hm. nothing stands out as odd. Maybe it's being deduplicated somehow, but seems unlikely. Another option might be to ensure it is a string with 'etag' => "$etag", just in case that matters.
[20:59:58] <James_F>	 No, I meant `$this->logger->info( [ 'etag' => $etag ] );`
[21:00:10] <Krinkle>	 first param has to be message string.
[21:00:23] <James_F>	 Oh, yeah, maybe it's type validation failing?
[21:00:40] <Krinkle>	 but we do sometimes pass empty string as first param to send key/value context only
[21:00:50] <Krinkle>	 which works, although not very commonly used
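As background for the variants being compared above, here is a minimal Python sketch of PSR-3-style placeholder interpolation, where `{key}` in the message string is filled from a context mapping. The `interpolate` function is a hypothetical stand-in written for illustration, not MediaWiki's logger, and it does not explain why the message with a non-empty context went missing in this log.

```python
import re


def interpolate(message: str, context: dict) -> str:
    """Replace {key} placeholders with values from `context`.

    Placeholders without a matching key are left untouched.
    Hypothetical helper, for illustration only.
    """
    def substitute(match):
        key = match.group(1)
        return str(context[key]) if key in context else match.group(0)

    return re.sub(r"\{([A-Za-z0-9_.]+)\}", substitute, message)


# With context supplied, the placeholder is filled in.
filled = interpolate("postData: Testing {etag}", {"etag": "test"})
# With an empty context, the placeholder stays literal.
unfilled = interpolate("postData: Testing {etag}", {})
# An empty message with context only is still valid input.
context_only = interpolate("", {"etag": "test"})
```

In this sketch the context values only affect the rendered message; a real structured logger would typically also ship the context as separate fields, which is the part that appeared to misbehave in the discussion.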
[21:01:29] <wikibugs>	 (03CR) 10Herron: [C: 03+2] logstash: remove filter/throttle/normalized_message_drop id [puppet] - 10https://gerrit.wikimedia.org/r/540482 (owner: 10Herron)
[21:02:06] <MatmaRex>	 James_F: Krinkle: i updated https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/VisualEditor/+/540464 with a few different variants, if you want to try
[21:02:11] <wikibugs>	 (03CR) 10Jeena Huneidi: "> Patch Set 2:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/539220 (owner: 10Jeena Huneidi)
[21:02:52] <MatmaRex>	 i guess the practical solution is to put it all in the message string, since that works
[21:02:56] <MatmaRex>	 that will teach me not to try to be fancy
[21:03:01] <MatmaRex>	 structured logging my ass
[21:03:14] <James_F>	 MatmaRex: On mwdebug1002.
[21:03:56] <mutante>	 !log gerrit1001 changing UID of gerrit2 user to 114 and GID to 119 in /etc/passwd to match cobalt to avoid privilege issues after rsyncing data  (T222391)
[21:03:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:02] <stashbot>	 T222391: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391
[21:04:53] <MatmaRex>	 so line 287 and 288 also work, those that have the empty array
[21:05:01] <MatmaRex>	 but that's obviously unhelpful
[21:08:29] <mutante>	 !log gerrit1001 changing GID of gerrit2 user to 119 in /etc/group ; find / -uid 499 -exec chown gerrit2 {} \; find / -gid 1001 -exec chown gerrit2:gerrit2 {} \; (T222391)
[21:08:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
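The ownership fix logged above relies on `find` matching files by numeric UID or GID. A dry-run version of that matching step might look like the sketch below, which only lists paths and changes nothing. The `paths_owned_by` helper is hypothetical; the 499 and 1001 values in the comment are the ones quoted in the log entry.

```python
import os


def paths_owned_by(root, uid=None, gid=None):
    """List paths under `root` whose numeric owner or group matches.

    Dry run only: nothing is chowned. Unlike `find`, the root directory
    itself is not checked. Hypothetical helper for illustration.
    """
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # path vanished or is unreadable; skip it
            uid_match = uid is not None and st.st_uid == uid
            gid_match = gid is not None and st.st_gid == gid
            if uid_match or gid_match:
                matches.append(path)
    return matches


# Example, assuming the old IDs from the log entry:
#   paths_owned_by("/srv/gerrit", uid=499)
#   paths_owned_by("/srv/gerrit", gid=1001)
```

Reviewing such a list before running the real `chown` is one way to confirm that only the expected files still carry the old IDs.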
[21:13:50] <mutante>	 !log gerrit1001 - rebooting 
[21:13:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:15] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10bd808)
[21:14:29] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (10bd808)
[21:14:37] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10bd808)
[21:15:01] <wikibugs>	 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) >>! In T222391#5542166, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ` > ['gerri...
[21:15:19] <wikibugs>	 (03PS1) 10Herron: logstash: output mediawiki type to logstash-medaiwiki ES index [puppet] - 10https://gerrit.wikimedia.org/r/540486
[21:16:14] <wikibugs>	 (03PS2) 10Herron: logstash: output mediawiki type to logstash-medaiwiki ES index [puppet] - 10https://gerrit.wikimedia.org/r/540486
[21:16:55] <MatmaRex>	 Krinkle: James_F: how do you feel about doing this, then: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/540487 (let's properly merge and deploy this one everywhere)
[21:17:06] <mutante>	 !log cobalt (gerrit) rsyncing /srv/gerrit/git and /srv/gerrit/plugins data to gerrit1001 again after reinstall and fixing gerrit2 UID/GID (T222391)
[21:17:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:09] <stashbot>	 T222391: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391
[21:17:35] <James_F>	 MatmaRex: C+2'ed.
[21:18:31] <MatmaRex>	 James_F: actually that has an unmerged parent patch, that was already backported and deployed, do that one too?
[21:18:58] <James_F>	 Oh, sure.
[21:19:10] <James_F>	 Preserve the log.
[21:27:11] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "Yes please!" [puppet] - 10https://gerrit.wikimedia.org/r/540486 (owner: 10Herron)
[21:29:23] <wikibugs>	 (03Abandoned) 10Herron: kafka-main: move kafka1001 to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/534633 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron)
[21:29:52] <wikibugs>	 (03Abandoned) 10Herron: prometheus: add per-site systemd failed unit checks [puppet] - 10https://gerrit.wikimedia.org/r/536642 (https://phabricator.wikimedia.org/T230570) (owner: 10Herron)
[21:30:29] <wikibugs>	 (03Abandoned) 10Herron: check_systemd_state: downgrade 'degraded' status to warning [puppet] - 10https://gerrit.wikimedia.org/r/530442 (https://phabricator.wikimedia.org/T230570) (owner: 10Herron)
[21:51:02] <godog>	 looks like mx1001's queue is unhappy, lotsa deferred
[22:03:31] <James_F>	 MatmaRex: Proper patch is now live on mwdebug1002 (both wmf.24 and wmf.25).
[22:04:09] <MatmaRex>	 thanks, testing
[22:04:37] <MatmaRex>	 James_F: nice, it actually finally works
[22:04:42] <James_F>	 Woo-hoo.
[22:04:45] <James_F>	 Good to sync?
[22:05:40] <MatmaRex>	 James_F: yes please
[22:05:52] <James_F>	 Kk.
[22:06:47] <logmsgbot>	 !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.25/extensions/VisualEditor/includes/ApiVisualEditor.php: VE unstructured logging, part I (duration: 01m 00s)
[22:06:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:15] <wikibugs>	 (03PS13) 10Dzahn: gerrit: add role on gerrit1001 and remove gerrit::migration [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox)
[22:07:20] <paladox>	 \o/
[22:07:36] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+1] Gerrit: Rename "slaves" in replication config to replica_codfw [puppet] - 10https://gerrit.wikimedia.org/r/540462 (owner: 10Paladox)
[22:09:17] <godog>	 Jeff_Green dwisehaupt looks like some process on frdev1001 is spamming fr-tech-ops@ with messages, Subject: ALERT: psad DL3 [...]
[22:09:54] <Jeff_Green>	 godog: they're alerts from a service that monitors iptables activity
[22:09:58] <logmsgbot>	 !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.25/extensions/VisualEditor/includes/ApiVisualEditorEdit.php: VE unstructured logging, part II (duration: 00m 58s)
[22:10:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:24] <Jeff_Green>	 the alerts are valid, but it's way too noisy
[22:10:46] <godog>	 indeed, google is rate limiting delivering to your addresses atm
[22:11:05] <logmsgbot>	 !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.24/extensions/VisualEditor/includes/ApiVisualEditor.php: VE unstructured logging, part I (duration: 00m 59s)
[22:11:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:11:08] <godog>	 40k+ emails queued, that's noisy alright :)
[22:11:14] <Jeff_Green>	 good lord
[22:11:27] <dwisehaupt>	 all the emails. :)
[22:11:35] <MatmaRex>	 James_F: thank you!
[22:11:57] <Jeff_Green>	 I was hoping it would resolve but I guess we'll have to turn the service off until we can figure out the underlying problem
[22:12:15] <logmsgbot>	 !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.24/extensions/VisualEditor/includes/ApiVisualEditorEdit.php: VE unstructured logging, part II (duration: 00m 58s)
[22:12:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:21] <James_F>	 MatmaRex: Should be all done.
[22:13:35] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+1] "Lowering replication delay from the default seems like a good thing." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540458 (owner: 10Paladox)
[22:15:47] <wikibugs>	 (03CR) 10Paladox: Gerrit: Tweak replication config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540458 (owner: 10Paladox)
[22:17:56] <godog>	 Jeff_Green: LMK when done, I can remove the queued messages from root@frdev1001.frack.eqiad.wmnet
[22:18:08] <James_F>	 Production is clean.
[22:18:22] <Jeff_Green>	 godog: it will take about 10 min to propagate
[22:18:31] <Jeff_Green>	 but it's already in the puppet pipeline
[22:18:44] <Jeff_Green>	 are you going to purge them or flush the queue?
[22:20:21] <godog>	 Jeff_Green: purge, flushing the queue won't do much I think since it is google that's rate limiting delivery
[22:20:35] <wikibugs>	 (03PS1) 10Paladox: Gerrit: Change host to gerrit-new.wikimedia.org on gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/540499
[22:20:52] <Jeff_Green>	 if you purge now, it will probably take care of most of the issue, this has been a day-long accumulation of mail
[22:21:01] <Jeff_Green>	 so 5-10 minutes more won't be a huge quantity
[22:21:30] <godog>	 ok trying that now
[22:21:50] <wikibugs>	 (PS2) Paladox: Gerrit: Change host to gerrit-new.wikimedia.org on gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/540499
[22:22:00] <wikibugs>	 (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/540499 (owner: Paladox)
[22:22:19] <wikibugs>	 (CR) Dzahn: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/540499 (owner: Paladox)
[22:23:14] <wikibugs>	 (PS3) Dzahn: Gerrit: Change host to gerrit-new.wikimedia.org on gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/540499 (owner: Paladox)
[22:24:30] <wikibugs>	 (CR) Dzahn: [C: +2] Gerrit: Change host to gerrit-new.wikimedia.org on gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/540499 (owner: Paladox)
[22:25:36] <wikibugs>	 (PS14) Paladox: gerrit: add role on gerrit1001 and remove gerrit::migration [puppet] - https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391)
[22:25:44] <wikibugs>	 (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: Paladox)
[22:25:52] <wikibugs>	 (PS6) Paladox: Gerrit: Get cobalt to replicate to gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/540164
[22:27:05] <wikibugs>	 (CR) Dzahn: "this was a follow-up to looking at the catalog for applying the gerrit role https://puppet-compiler.wmflabs.org/compiler1002/18718/gerrit1" [puppet] - https://gerrit.wikimedia.org/r/540499 (owner: Paladox)
[22:28:20] <wikibugs>	 (PS1) Paladox: Gerrit: Add gerrit-new.wikimedia.org to acme [puppet] - https://gerrit.wikimedia.org/r/540500
[22:28:42] <wikibugs>	 (PS2) Paladox: Gerrit: Add gerrit-new.wikimedia.org to acme [puppet] - https://gerrit.wikimedia.org/r/540500
[22:29:39] <godog>	 !log remove queued messages from mx1001 for fr-tech-ops@, triggering sender rate limit from gmail
[22:29:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:26] <wikibugs>	 (CR) Thcipriani: "> > "If pushing to multiple remotes, over differing types of network" [puppet] - https://gerrit.wikimedia.org/r/540164 (owner: Paladox)
[22:35:55] <wikibugs>	 (CR) Paladox: "> > > "If pushing to multiple remotes, over differing types of" [puppet] - https://gerrit.wikimedia.org/r/540164 (owner: Paladox)
[22:36:23] <wikibugs>	 (PS7) Paladox: Gerrit: Get cobalt to replicate to gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/540164
[22:36:29] <wikibugs>	 (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/540164 (owner: Paladox)
[22:39:25] <godog>	 dwisehaupt: I see Jeff is offline, anyways should be better now
[22:40:04] <dwisehaupt>	 godog: wonderful. thanks for catching that, letting us know, and killing the tidal wave. :)
[22:40:40] <godog>	 dwisehaupt: hehe np, it caught my eye while looking at icinga
[22:48:18] <wikibugs>	 (CR) Alex Monk: "I removed the cherry-pick, ran puppet everywhere, and ferm failed on deployment-imagescaler01, deployment-memc0[56], and deployment-ircd. " [puppet] - https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T153468) (owner: Hashar)
[23:00:04] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191002T2300).
[23:00:04] <jouncebot>	 ebernhardson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:01:30] <icinga-wm>	 RECOVERY - exim queue on mx1001 is OK: OK: Less than 1000 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim
[23:01:30] <ebernhardson>	 \o
[23:01:56] <wikibugs>	 (PS2) Bstorm: dumps distribution: fail labstore1007 back as VPS NFS [puppet] - https://gerrit.wikimedia.org/r/540238
[23:02:07] <ebernhardson>	 i'll ship
[23:03:58] <wikibugs>	 (CR) Bstorm: [C: +2] dumps distribution: fail labstore1007 back as VPS NFS [puppet] - https://gerrit.wikimedia.org/r/540238 (owner: Bstorm)
[23:06:50] <wikibugs>	 (CR) Dzahn: [C: +2] Gerrit: Add gerrit-new.wikimedia.org to acme [puppet] - https://gerrit.wikimedia.org/r/540500 (owner: Paladox)
[23:06:57] <paladox>	 \o/
[23:06:58] <wikibugs>	 (PS3) Dzahn: Gerrit: Add gerrit-new.wikimedia.org to acme [puppet] - https://gerrit.wikimedia.org/r/540500 (owner: Paladox)
[23:07:03] <wikibugs>	 (PS6) Bstorm: toolforge-kubernetes: restructure pod security policies [puppet] - https://gerrit.wikimedia.org/r/537732 (https://phabricator.wikimedia.org/T227290)
[23:08:57] <wikibugs>	 (CR) Bstorm: [C: +2] "I *think* deleting those (which must be done to make tools psp's work) addresses all concerns for now.  I'll merge this, and we can move o" [puppet] - https://gerrit.wikimedia.org/r/537732 (https://phabricator.wikimedia.org/T227290) (owner: Bstorm)
[23:09:10] <wikibugs>	 (PS7) Bstorm: toolforge-kubernetes: restructure pod security policies [puppet] - https://gerrit.wikimedia.org/r/537732 (https://phabricator.wikimedia.org/T227290)
[23:10:31] <wikibugs>	 (PS15) Paladox: gerrit: add role on gerrit1001 and remove gerrit::migration [puppet] - https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391)
[23:21:21] <logmsgbot>	 !log ebernhardson@deploy1001 Synchronized php-1.34.0-wmf.25/extensions/CirrusSearch/: T234445: CirrusSearch: Fix Precondition failed: Must have a resultset set (duration: 01m 02s)
[23:21:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:25] <stashbot>	 T234445: Error when searching for exact phrase on English Wikipedia: "Precondition failed: Must have a resultset set" - https://phabricator.wikimedia.org/T234445
[23:21:59] <wikibugs>	 (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: Paladox)
[23:22:30] <logmsgbot>	 !log ebernhardson@deploy1001 Synchronized php-1.34.0-wmf.24/extensions/CirrusSearch/: T234445: CirrusSearch: Fix Precondition failed: Must have a resultset set (duration: 01m 00s)
[23:22:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:24:46] <ebernhardson>	 with that, SWAT should be complete
[23:29:06] <wikibugs>	 Operations, ops-eqiad, DC-Ops, serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (fgiunchedi) >>! In T227867#5536367, @Dzahn wrote: > self-healing?? >  > <+icinga-wm> RECOVERY - Memory correctable errors -EDAC- on mw1239 is OK: (C)4 ge (W)2 ge 1   Not self healing no in...
[23:32:27] <wikibugs>	 (PS7) Bstorm: toolforge-k8s: proposed role for all tools [puppet] - https://gerrit.wikimedia.org/r/537755 (https://phabricator.wikimedia.org/T227290)
[23:34:35] <wikibugs>	 Operations, ops-eqiad, DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (Cmjohnson) Open→Resolved @marostegui yes the board was replaced. Sorry about that I left that to John and the task was not closed.
[23:38:21] <XioNoX>	 !log disable cr2-eqiad:xe-4/0/0 - T234416
[23:38:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:24] <stashbot>	 T234416: asw2-a-eqiad <-> cr2-eqiad fiber issue - https://phabricator.wikimedia.org/T234416
[23:38:40] <wikibugs>	 (PS16) Paladox: gerrit: add role on gerrit1001 and remove gerrit::migration [puppet] - https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391)
[23:41:28] <wikibugs>	 (CR) Bstorm: [C: +2] toolforge-k8s: proposed role for all tools [puppet] - https://gerrit.wikimedia.org/r/537755 (https://phabricator.wikimedia.org/T227290) (owner: Bstorm)
[23:41:59] <XioNoX>	 !log enable cr2-eqiad:xe-4/0/0 - T234416
[23:42:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:28] <wikibugs>	 (PS4) Cwhite: initial commit [debs/prometheus-swagger-exporter] - https://gerrit.wikimedia.org/r/536376
[23:43:42] <wikibugs>	 Operations, ops-eqiad, netops: asw2-a-eqiad <-> cr2-eqiad fiber issue - https://phabricator.wikimedia.org/T234416 (ayounsi) Open→Resolved Better!  ` ayounsi@asw2-a-eqiad> show interfaces diagnostics optics xe-7/0/46 | match "rx|receive"      Receiver signal average optical power     :  0.0741...
[23:47:11] <wikibugs>	 (PS17) Dzahn: gerrit: add role on gerrit1001 and remove gerrit::migration [puppet] - https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: Paladox)
[23:50:25] <wikibugs>	 (CR) Dzahn: [C: +2] gerrit: add role on gerrit1001 and remove gerrit::migration [puppet] - https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: Paladox)
[23:51:15] <paladox>	 \o/
[23:53:20] <mutante>	 yolo, but we double checked there should be no more IP conflicts 
[23:53:59] <mutante>	 paladox: jdk8 installs
[23:54:06] <paladox>	 yay!
[23:54:09] <mutante>	 then some cert issue with the backups..we'll see
[23:54:16] <paladox>	 oh
[23:54:19] <mutante>	 scap config gets written..
[23:55:07] <paladox>	 :)
[23:55:50] <mutante>	 on the second run it gets the second v6 IP 
[23:56:05] <mutante>	 and that's when you disconnect
[23:56:47] <paladox>	 oh
[23:56:58] <wikibugs>	 Operations, ops-eqiad, netops: asw2-a-eqiad <-> cr2-eqiad fiber issue - https://phabricator.wikimedia.org/T234416 (Cmjohnson) I swapped both optics
[23:59:27] <wikibugs>	 Operations, ops-eqiad: apply hostname labels for krb1001/WMF5173 - https://phabricator.wikimedia.org/T233642 (Cmjohnson) Open→Resolved done
[23:59:32] <wikibugs>	 Operations, Analytics, User-Elukey: setup/install krb1001/WMF5173 - https://phabricator.wikimedia.org/T233141 (Cmjohnson)