[00:00:04] (CR) Dzahn: "> Patch Set 10:" [puppet] - https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn) [00:03:18] (PS4) Bstorm: sssd: Add a whole duplicate hierarchy of sssd images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058) [00:05:33] (CR) Dzahn: "> Patch Set 10: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn) [00:05:40] Krinkle: indeed [00:05:53] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01087 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:07:19] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.00432 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:09:05] RECOVERY - puppet last run on cloudnet2002-dev is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:13:05] (PS5) Bstorm: toolforge-kubernetes: restructure pod security policies [puppet] - https://gerrit.wikimedia.org/r/537732 (https://phabricator.wikimedia.org/T227290) [00:13:59] (PS12) Dzahn: parsoid: introduce parameter to use parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) [00:14:06] (PS6) Bstorm: toolforge-k8s: proposed role for all tools [puppet] - https://gerrit.wikimedia.org/r/537755 (https://phabricator.wikimedia.org/T227290) [00:16:17] (CR) Dzahn: "> - has_lvs removed" [puppet] - https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn) [00:18:15] (PS1) Dzahn: add fake key for parsoid.svc, delete fake key for wtp2001 [labs/private] - 
https://gerrit.wikimedia.org/r/540251 (https://phabricator.wikimedia.org/T233654) [00:18:36] (PS2) Dzahn: add fake key for parsoid.svc, delete fake key for wtp2001 [labs/private] - https://gerrit.wikimedia.org/r/540251 (https://phabricator.wikimedia.org/T233654) [00:22:46] (CR) Dzahn: [V: +2 C: +2] add fake key for parsoid.svc, delete fake key for wtp2001 [labs/private] - https://gerrit.wikimedia.org/r/540251 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn) [00:24:51] (CR) Dzahn: "i needed https://gerrit.wikimedia.org/r/c/labs/private/+/540251 to get parsoid.svc.key to make the compiler work. that was the other cert" [puppet] - https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn) [00:28:44] (CR) Dzahn: "so yea. this looks alright in compiler but isn't modules/secret/secrets/ssl/parsoid.svc.eqiad.wmnet.key missing in the actually private pr" [puppet] - https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn) [00:30:05] (CR) Dzahn: "> Patch Set 12:" [puppet] - https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn) [00:30:52] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.25/skins/Vector/: d30064229f9 (duration: 00m 59s) [00:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:21] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.25/resources/src: 5eb3ae1e888e353 (duration: 01m 00s) [00:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:09] (PS1) Dzahn: add certificate for parsoid.discovery/parsoid.svc [puppet] - https://gerrit.wikimedia.org/r/540252 (https://phabricator.wikimedia.org/T233654) [00:51:28] (PS2) Dzahn: add certificate for parsoid.discovery/parsoid.svc [puppet] - https://gerrit.wikimedia.org/r/540252 (https://phabricator.wikimedia.org/T233654) [00:52:32] (CR) Dzahn: "also created 
puppet CA cert for parsoid.svc / parsoid.discovery following" [puppet] - https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn) [01:09:10] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:10:42] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [03:11:48] PROBLEM - ps1-c1-codfw-infeed-load-tower-A-phase-X on ps1-c1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:11:52] PROBLEM - ps1-c1-codfw-infeed-load-tower-B-phase-Z on ps1-c1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:11:58] PROBLEM - ps1-c1-codfw-infeed-load-tower-A-phase-Y on ps1-c1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:12:36] PROBLEM - ps1-c1-codfw-infeed-load-tower-B-phase-Y on ps1-c1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:14:30] RECOVERY - ps1-c1-codfw-infeed-load-tower-B-phase-Z on ps1-c1-codfw is OK: SNMP OK - ps1-c1-codfw-infeed-load-tower-B-phase-Z 188 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:15:04] PROBLEM - Host db2077.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:17:50] RECOVERY - ps1-c1-codfw-infeed-load-tower-A-phase-Y on ps1-c1-codfw is OK: SNMP OK - ps1-c1-codfw-infeed-load-tower-A-phase-Y 405 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:20:56] RECOVERY - Host db2077.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 36.76 ms [03:21:14] PROBLEM - Host elastic2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:21:34] PROBLEM - ps1-c1-codfw-infeed-load-tower-B-phase-Z on ps1-c1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:22:28] PROBLEM - Host db2112.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:22:38] PROBLEM - Host restbase2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:22:54] PROBLEM - ps1-c1-codfw-infeed-load-tower-A-phase-Z on ps1-c1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:23:18] PROBLEM - ps1-c1-codfw-infeed-load-tower-A-phase-Y on ps1-c1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:23:44] PROBLEM - ps1-c1-codfw-infeed-load-tower-B-phase-X on ps1-c1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:24:40] PROBLEM - Host db2087.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:24:58] PROBLEM - Host mc2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:25:16] PROBLEM - Host cloudservices2002-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:25:22] PROBLEM - Host restbase2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:25:48] PROBLEM - Host cloudcontrol2003-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:26:18] PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [03:26:50] PROBLEM - Host es2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:27:54] PROBLEM - Host 
ganeti2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:27:56] PROBLEM - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:27:58] PROBLEM - Host es2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:28:02] PROBLEM - Host ores2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:28:08] PROBLEM - Host mc2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:28:12] PROBLEM - Host pc2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:28:30] RECOVERY - Host restbase2011.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 36.78 ms [03:28:42] PROBLEM - Host db2077.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:35:18] PROBLEM - Host ganeti2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:36:16] PROBLEM - Host restbase2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:03:33] Operations, Acme-chief, Traffic: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 (Vgutierrez) Open→Resolved a:Vgutierrez [04:07:16] !log restarting trafficserver-tls on cp5007 [04:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:08:40] RECOVERY - traffic_server tls process restarted on cp5007 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5007&var-layer=tls [04:46:16] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:51:08] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:53:54] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 
133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:58:48] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:14:03] Operations, ops-codfw, netops: msw-c1 down? - https://phabricator.wikimedia.org/T234411 (faidon) p:Triage→High [05:16:44] (CR) Giuseppe Lavagetto: "> > Patch Set 4: Verified-1" [puppet] - https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: Filippo Giunchedi) [05:49:18] RECOVERY - Host mc2027.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 36.78 ms [05:55:36] (CR) Giuseppe Lavagetto: [C: +1] "I tried to think about this, and given what you are trying to do with the prometheus exporter there is no way around it, and you will need" [software/service-checker] - https://gerrit.wikimedia.org/r/538711 (owner: Cwhite) [05:55:40] (PS4) Giuseppe Lavagetto: change EndpointMetrics from static to instance variable [software/service-checker] - https://gerrit.wikimedia.org/r/538711 (owner: Cwhite) [05:55:49] (CR) Giuseppe Lavagetto: [C: +2] change EndpointMetrics from static to instance variable [software/service-checker] - https://gerrit.wikimedia.org/r/538711 (owner: Cwhite) [05:56:55] (Merged) jenkins-bot: change EndpointMetrics from static to instance variable [software/service-checker] - https://gerrit.wikimedia.org/r/538711 (owner: Cwhite) [05:57:02] PROBLEM - Host mc2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [06:04:23] Operations, ops-eqiad, DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (Marostegui) Ping @Cmjohnson @Jclark-ctr was the mainboard replaced? Thanks! 
[06:09:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [06:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [06:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:15] Operations, ops-eqiad, DC-Ops, decommission: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1073.eqiad.wmnet` - db1073.eqiad.wmnet (**PASS**) - Downtimed host on Ic... [06:10:06] (PS1) Marostegui: site.pp: Remove db1073 references. [puppet] - https://gerrit.wikimedia.org/r/540256 (https://phabricator.wikimedia.org/T231892) [06:10:45] (PS1) Marostegui: wmnet: Remove production DNS entries from db1073 [dns] - https://gerrit.wikimedia.org/r/540257 (https://phabricator.wikimedia.org/T231892) [06:11:07] (CR) Marostegui: [C: +2] site.pp: Remove db1073 references. 
[puppet] - https://gerrit.wikimedia.org/r/540256 (https://phabricator.wikimedia.org/T231892) (owner: Marostegui) [06:11:20] (CR) Marostegui: [C: +2] wmnet: Remove production DNS entries from db1073 [dns] - https://gerrit.wikimedia.org/r/540257 (https://phabricator.wikimedia.org/T231892) (owner: Marostegui) [06:12:25] Operations, ops-eqiad, DC-Ops, decommission: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (Marostegui) a:RobH→Cmjohnson [06:12:56] Operations, ops-eqiad, DC-Ops, decommission: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (Marostegui) Host ready for on-site steps + switch port disablement [06:17:11] !log Fix replication on labsdb1011:s1 - T233986 [06:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:01] !log Fix replication on labsdb1011:s7 - T233986 [06:23:05] !log Fix replication on labsdb1011:s7 - T233986 [06:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:50] Operations, ops-eqiad, DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (elukey) [06:31:47] Operations, netops: Telia IC-314534 (eqord/eqdfw 10Gbps wave) down - https://phabricator.wikimedia.org/T234335 (elukey) p:Triage→Normal [06:32:26] Operations, netops: Telia IC-314534 (eqord/eqdfw 10Gbps wave) down - https://phabricator.wikimedia.org/T234335 (elukey) Link is down again as far as I can see from icinga and: ` elukey@re0.cr2-codfw> show interfaces descriptions |match down` xe-5/2/1 up down Transport: cr2-eqord:xe-0/1/0 (Teli... [06:32:29] Puppet, Beta-Cluster-Infrastructure: Puppet fail on deployment-mediawiki-07, missing private hiera variable - https://phabricator.wikimedia.org/T210497 (mobrovac) Open→Resolved a:fgiunchedi Puppet is running fine there, closing. 
[06:35:03] Operations, Puppet, Beta-Cluster-Infrastructure, Traffic: Puppet fails on deployment-cache-text05 - https://phabricator.wikimedia.org/T234412 (mobrovac) [06:47:12] Operations, netops: Telia IC-314534 (eqord/eqdfw 10Gbps wave) down - https://phabricator.wikimedia.org/T234335 (elukey) ` elukey@re0.cr2-codfw> show interfaces diagnostics optics xe-5/2/1 Physical interface: xe-5/2/1 Laser bias current : 46.512 mA Laser output power... [07:06:53] (CR) Muehlenhoff: [C: +1] "LGTM, one typo inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/540207 (https://phabricator.wikimedia.org/T233636) (owner: Herron) [07:07:34] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/539973 (owner: Dzahn) [07:07:55] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/540161 (owner: Elukey) [07:09:22] (PS2) Elukey: profile::kerberos::replication: add AAAA ferm rules [puppet] - https://gerrit.wikimedia.org/r/540161 [07:12:47] (CR) jerkins-bot: [V: -1] profile::kerberos::replication: add AAAA ferm rules [puppet] - https://gerrit.wikimedia.org/r/540161 (owner: Elukey) [07:12:58] ufff [07:14:24] not clear to me why it failed [07:15:33] (CR) Elukey: "recheck" [puppet] - https://gerrit.wikimedia.org/r/540161 (owner: Elukey) [07:20:29] (CR) Elukey: [C: +2] profile::kerberos::replication: add AAAA ferm rules [puppet] - https://gerrit.wikimedia.org/r/540161 (owner: Elukey) [07:23:08] (CR) ArielGlenn: [C: +1] "Fine by me." 
[puppet] - https://gerrit.wikimedia.org/r/540238 (owner: Bstorm) [07:39:44] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [07:42:56] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [07:46:04] !log upgrading remaining stretch hosts to ferm 2.4.2pre [07:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:37] (CR) Hashar: "recheck" [puppet] - https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) (owner: Herron) [07:53:42] (CR) jerkins-bot: [V: -1] elasticsearch: add toggle for rsyslog udp logback compat include [puppet] - https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) (owner: Herron) [07:54:29] Operations, netops: Telia IC-314534 (eqord/eqdfw 10Gbps wave) down - https://phabricator.wikimedia.org/T234335 (elukey) I missed an email from Telia, they are replacing a faulty card that apparently caused flaps and the impact that we saw. Hopefully we'll see recovery soon. [07:57:16] (CR) Hashar: "I have raised the CI job timeout from 3 minutes to 5 minutes, even though it should usually take less than 2 minutes. 
In those specific f" [puppet] - https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) (owner: Herron) [08:00:22] RECOVERY - Host cloudservices2002-dev.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 36.72 ms [08:00:24] RECOVERY - Host es2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms [08:00:26] RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.89 ms [08:00:26] RECOVERY - ps1-c1-codfw-infeed-load-tower-A-phase-X on ps1-c1-codfw is OK: SNMP OK - ps1-c1-codfw-infeed-load-tower-A-phase-X 512 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:00:28] RECOVERY - ps1-c1-codfw-infeed-load-tower-A-phase-Y on ps1-c1-codfw is OK: SNMP OK - ps1-c1-codfw-infeed-load-tower-A-phase-Y 384 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:00:28] RECOVERY - ps1-c1-codfw-infeed-load-tower-B-phase-Z on ps1-c1-codfw is OK: SNMP OK - ps1-c1-codfw-infeed-load-tower-B-phase-Z 177 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:00:28] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:00:30] RECOVERY - Host cloudcontrol2003-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.35 ms [08:00:44] RECOVERY - ps1-c1-codfw-infeed-load-tower-B-phase-X on ps1-c1-codfw is OK: SNMP OK - ps1-c1-codfw-infeed-load-tower-B-phase-X 272 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:00:56] RECOVERY - ps1-c1-codfw-infeed-load-tower-A-phase-Z on ps1-c1-codfw is OK: SNMP OK - ps1-c1-codfw-infeed-load-tower-A-phase-Z 678 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:01:30] RECOVERY - Host es2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 42.47 ms [08:01:30] RECOVERY - 
Host elastic2031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.88 ms [08:02:22] RECOVERY - Host ganeti2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 42.51 ms [08:02:22] RECOVERY - Host ores2005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.59 ms [08:02:26] RECOVERY - Host db2112.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.99 ms [08:02:36] RECOVERY - Host pc2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 43.56 ms [08:03:06] RECOVERY - Host db2077.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.68 ms [08:03:26] RECOVERY - Host ganeti2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.08 ms [08:03:26] RECOVERY - Host db2125.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.01 ms [08:03:40] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:04:32] RECOVERY - Host mc2027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.71 ms [08:04:44] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:04:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:04:50] RECOVERY - Host restbase2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.34 ms [08:04:54] RECOVERY - Host db2087.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.81 ms [08:05:06] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is 
CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:05:34] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:05:34] RECOVERY - Host mc2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.76 ms [08:05:42] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:05:44] RECOVERY - Host restbase2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.05 ms [08:05:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:06:10] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:06:10] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 
https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:06:22] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:06:30] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:06:44] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:06:56] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:07:10] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:07:18] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:07:22] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:07:44] mmmm api appservers slowness from the red dashboard afaics [08:07:46] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:07:58] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:08:06] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:08:06] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200&from=now-1h&to=now [08:08:32] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:09:07] Cc: vgutierrez _joe_ --^ [08:09:21] should be on its way to recover fully in theory [08:09:22] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:09:34] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:09:34] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:10:08] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:10:15] <_joe_> we had a spike of 5xx on the appserver side a few minutes ago [08:10:30] there are fluctuations in latency for apis afaics [08:10:50] <_joe_> yes [08:10:53] <_joe_> pretty severe even [08:11:06] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:11:10] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:11:23] <_joe_> ok [08:11:34] <_joe_> can someone check the logstash for 5xx for finding patterns? [08:11:44] seems correlated with the mgmt issue? [08:11:46] <_joe_> I will try to look at an api server to get what's happening there [08:11:55] so perhaps it wasn't just a management switch then? 
[08:11:59] <_joe_> paravoid: I doubt it, this is the api being slow in eqiad [08:12:00] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:12:12] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:12:15] <_joe_> unless the problem is much much deeper [08:12:46] <_joe_> we have some latency in the memcached response times [08:12:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:12:48] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:13:13] <_joe_> the api servers are not overloaded either [08:13:34] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:13:48] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:14:10] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:14:13] <_joe_> I can't find anything significant in running api processes [08:14:27] <_joe_> but I think we got here after the worst had passed [08:14:38] <_joe_> it was less than 5 minutes [08:15:34] would cross-DC networking issues explain any of that? [08:15:37] sorry I was afk, checking [08:15:48] <_joe_> paravoid: only intra-eqiad ones would [08:16:14] <_joe_> as if for instance getting data from memcached took 5 ms instead of 1, that can explain partially this [08:16:34] <_joe_> but if that was due to network errors, I'd see in the mcrouter logs [08:16:37] <_joe_> lemme see [08:17:29] from the memcached dashboard there doesn't seem to be anything big ongoing [08:18:11] <_joe_> no the last tko was tonight at 6 am [08:18:21] <_joe_> sorry yesterday night [08:18:41] <_joe_> elukey: there is something wrong with the mcrouter unit file btw, sigh [08:19:37] the issue seems to have subsided right? 
the appservers latency looks better now [08:19:41] <_joe_> yes [08:19:56] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&panelId=9&fullscreen&from=now-30m&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200 [08:21:37] <_joe_> I'll take some time to dig into this in a few [08:21:50] ack [08:24:45] so [08:24:48] Oct 2 08:04:25 asw2-a-eqiad fpc7 sfp-7/0/46 link 46 SFP receive power low warning set [08:25:03] with xe-7/0/46 being the cr2-eqiad <-> asw2-a-eqiad 10G [08:25:24] so that could explain some networking issues like lost packets [08:25:25] but [08:25:44] this seems to be timestamped at :04 which is after the issues started [08:29:27] 10Operations, 10ops-eqiad, 10netops: asw2-a-eqiad <-> cr2-eqiad fiber issue - https://phabricator.wikimedia.org/T234416 (10faidon) p:05Triage→03High [08:29:32] (filed above) [08:29:44] PROBLEM - Host restbase2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:29:46] PROBLEM - Host db2087.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:29:48] the wtf again [08:29:57] the timing of this is pretty weird [08:30:02] PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [08:30:26] PROBLEM - Host mc2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:30:38] PROBLEM - Host restbase2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:31:02] PROBLEM - Host cloudservices2002-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:31:08] PROBLEM - Host cloudcontrol2003-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:32:00] PROBLEM - Host es2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:32:00] PROBLEM - Host elastic2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:32:34] nice catch for the fiber issue [08:32:50] PROBLEM - Host ganeti2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:32:50] PROBLEM - Host es2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:32:50] PROBLEM 
- Host ores2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:32:56] PROBLEM - Host db2112.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:33:04] PROBLEM - Host pc2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:33:11] yeah I'm looking closer at the C1 mgmt issue, it seems entirely unrelated [08:33:38] PROBLEM - Host db2077.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:33:56] PROBLEM - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:34:38] PROBLEM - Host ganeti2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:35:20] PROBLEM - Host mc2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:36:24] just one of these days I guess :) [08:36:59] (03CR) 10Gehel: [C: 04-1] "See comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) (owner: 10Herron) [08:38:03] <_joe_> yeah :/ [08:48:44] RECOVERY - Host cloudcontrol2003-dev.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 36.76 ms [08:48:48] RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.30 ms [08:49:26] RECOVERY - Host es2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 39.04 ms [08:49:26] RECOVERY - Host elastic2031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 42.06 ms [08:50:10] RECOVERY - Host es2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.38 ms [08:50:14] RECOVERY - Host ganeti2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.99 ms [08:50:14] RECOVERY - Host ores2005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.73 ms [08:50:22] RECOVERY - Host db2112.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.42 ms [08:50:30] RECOVERY - Host pc2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.06 ms [08:50:31] 10Operations, 10Beta-Cluster-Infrastructure, 10DNS, 10Traffic, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - 
https://phabricator.wikimedia.org/T153468 (10MoritzMuehlenhoff) 05Open... [08:51:02] RECOVERY - Host db2077.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.42 ms [08:51:22] RECOVERY - Host db2125.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.04 ms [08:52:04] RECOVERY - Host ganeti2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.80 ms [08:52:46] RECOVERY - Host mc2027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.06 ms [08:53:02] RECOVERY - Host restbase2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.81 ms [08:53:04] RECOVERY - Host db2087.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.79 ms [08:53:44] RECOVERY - Host mc2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.70 ms [08:53:54] RECOVERY - Host restbase2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.08 ms [08:54:18] RECOVERY - Host cloudservices2002-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.69 ms [08:58:19] (03CR) 10Gehel: [C: 04-1] "Good first draft! A few issues inline." (0310 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: 10Mathew.onipe) [09:17:40] PROBLEM - Host mc2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:17:52] PROBLEM - Host restbase2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:17:54] PROBLEM - Host db2087.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:18:10] PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [09:18:32] PROBLEM - Host mc2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:18:42] PROBLEM - Host restbase2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:19:04] PROBLEM - Host cloudservices2002-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:19:10] PROBLEM - Host cloudcontrol2003-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:19:58] PROBLEM - Host es2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:19:58] PROBLEM - Host elastic2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:20:40] 
PROBLEM - Host es2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:20:44] PROBLEM - Host ganeti2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:20:44] PROBLEM - Host ores2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:20:52] PROBLEM - Host db2112.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:21:00] PROBLEM - Host pc2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:21:34] PROBLEM - Host db2077.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:21:54] PROBLEM - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:22:40] PROBLEM - Host ganeti2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:28:50] (03PS1) 10Giuseppe Lavagetto: Bump version number. [software/service-checker] - 10https://gerrit.wikimedia.org/r/540364 [09:29:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Bump version number. [software/service-checker] - 10https://gerrit.wikimedia.org/r/540364 (owner: 10Giuseppe Lavagetto) [09:34:05] (03CR) 10Ema: [C: 03+1] fifo-log-tailer: Retry on errors (031 comment) [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/539312 (owner: 10Vgutierrez) [09:41:29] !log rebalancing Ganeti/codfw Row A after rolling reboot of Ganeti nodes [09:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:49] (03PS1) 10Alexandros Kosiaris: restrouter: Revert the initialDelay seconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/540365 (https://phabricator.wikimedia.org/T223953) [09:54:58] 10Operations, 10DC-Ops, 10decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (10Gehel) a:05Gehel→03RobH [09:55:18] (03PS1) 10Alexandros Kosiaris: Add a commit message guide [puppet] - 10https://gerrit.wikimedia.org/r/540366 [09:55:56] (03CR) 10jerkins-bot: [V: 04-1] Add a commit message guide [puppet] - 10https://gerrit.wikimedia.org/r/540366 (owner: 10Alexandros Kosiaris) [09:56:51] (03CR) 10Alexandros Kosiaris: [C: 03+1] "@mobrovac, I am guessing this is fine 
to revert now ? Wanna do the honors?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/540365 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [10:01:43] (03PS2) 10Alexandros Kosiaris: Add a commit message guide [puppet] - 10https://gerrit.wikimedia.org/r/540366 [10:02:43] akosiaris: how is that different from https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines ? [10:05:43] 10Operations, 10DC-Ops, 10decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (10Gehel) Host is shutdown and has networking issues. As such the role(system::spare) was not applied and it is still in puppetdb. Since that cleanup step is part of the "uninterruptible" steps, it ha... [10:08:14] paravoid: it can be used as a template [10:08:26] that's the idea, git config commit.template .gitmessage [10:08:34] (03CR) 10Giuseppe Lavagetto: Add a commit message guide (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/540366 (owner: 10Alexandros Kosiaris) [10:08:46] and every time you git commit, you get a ready to edit thing [10:09:13] I may have diverged a bit, I 'll try to reconcile with https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines [10:13:39] (03PS1) 10Jbond: puppetmaster1001: migrate esams puppet traffic to codfw [dns] - 10https://gerrit.wikimedia.org/r/540367 (https://phabricator.wikimedia.org/T234315) [10:13:41] (03PS1) 10Jbond: puppetmaster1001: move eqaid puppet to codfw [dns] - 10https://gerrit.wikimedia.org/r/540368 (https://phabricator.wikimedia.org/T234315) [10:13:54] paravoid: essentially, just trying to give people a way to implement https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines a bit easier [10:15:06] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/540367 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [10:15:54] (03CR) 10Mobrovac: "Yup, but it'll have to go together with a new version of the chart that bumps RR's tag. 
I'll prepare that and merge both once all is ready" [deployment-charts] - 10https://gerrit.wikimedia.org/r/540365 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [10:17:11] !log gehel@cumin1001 START - Cookbook sre.hosts.decommission [10:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:27] !log gehel@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [10:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:29] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/540368 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [10:17:31] 10Operations, 10DC-Ops, 10decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by gehel@cumin1001 for hosts: `elastic1017.eqiad.wmnet` - elastic1017.eqiad.wmnet (**FAIL**) - Downtimed host on Icinga - Downtime... [10:17:53] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Isn't this it ?https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/540158/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/540365 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [10:18:00] (03PS1) 10Giuseppe Lavagetto: Only build the package for python3 [software/service-checker] - 10https://gerrit.wikimedia.org/r/540369 [10:18:03] (03PS1) 10Giuseppe Lavagetto: Allow properly running tests while using pybuild [software/service-checker] - 10https://gerrit.wikimedia.org/r/540370 [10:18:32] (03CR) 10Alexandros Kosiaris: [C: 03+1] "oh, you mean the image version tag? sorry, my mistake. 
Yup, fine by me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/540365 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [10:20:36] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Only build the package for python3 [software/service-checker] - 10https://gerrit.wikimedia.org/r/540369 (owner: 10Giuseppe Lavagetto) [10:20:47] 10Operations, 10DC-Ops, 10decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (10Gehel) After discussion with @Volans, I ran the [[ https://wikitech.wikimedia.org/wiki/Decom_script | decom script ]]. Host is now properly removed from icinga and puppetdb. [10:21:04] 10Operations, 10DC-Ops, 10decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (10Gehel) [10:21:42] (03Merged) 10jenkins-bot: Only build the package for python3 [software/service-checker] - 10https://gerrit.wikimedia.org/r/540369 (owner: 10Giuseppe Lavagetto) [10:29:10] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Traffic: Puppet fails on deployment-cache-text05 - https://phabricator.wikimedia.org/T234412 (10ema) p:05Triage→03Normal [10:33:33] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Allow properly running tests while using pybuild [software/service-checker] - 10https://gerrit.wikimedia.org/r/540370 (owner: 10Giuseppe Lavagetto) [10:34:13] (03CR) 10Lucas Werkmeister (WMDE): "No problem, thanks for fixing it :)" [puppet] - 10https://gerrit.wikimedia.org/r/513759 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [10:34:30] (03Merged) 10jenkins-bot: Allow properly running tests while using pybuild [software/service-checker] - 10https://gerrit.wikimedia.org/r/540370 (owner: 10Giuseppe Lavagetto) [10:36:44] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Traffic: Puppet fails on deployment-cache-text05 - https://phabricator.wikimedia.org/T234412 (10Vgutierrez) This is caused by adding the ATS-TLS instance to the text cluster. 
So you need to provide a valid configuration for the ats-tls profile. See:... [10:44:58] (03PS2) 10Muehlenhoff: Add repo sync for buster/grafana [puppet] - 10https://gerrit.wikimedia.org/r/540113 [10:46:46] (03CR) 10Muehlenhoff: [C: 03+2] Add repo sync for buster/grafana [puppet] - 10https://gerrit.wikimedia.org/r/540113 (owner: 10Muehlenhoff) [10:49:48] (03PS1) 10Mathew.onipe: elasticsearch: move rsyslog profile to cirrus profile [puppet] - 10https://gerrit.wikimedia.org/r/540373 [10:59:04] (03PS1) 10Giuseppe Lavagetto: Further fixes to debianization [software/service-checker] - 10https://gerrit.wikimedia.org/r/540375 [11:00:04] Amir1, Lucas_WMDE, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191002T1100). [11:00:04] raynor: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:17] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Traffic, 10Core Platform Team Workboards (Clinic Duty Team): Puppet fails on deployment-cache-text05 - https://phabricator.wikimedia.org/T234412 (10mobrovac) 05Open→03Resolved a:03mobrovac As per @Vgutierrez' instructions, I looked up the ATS... [11:00:28] o/ [11:00:53] so looks like I'm the only one with two changes, I can deploy on my own [11:00:57] can I proceed?
[11:01:14] raynor: No problem from my side [11:01:35] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:01:36] (03CR) 10Pmiazga: [C: 03+2] Do not set wgMFNoindexPages config flag in mobile.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540193 (https://phabricator.wikimedia.org/T206497) (owner: 10Pmiazga) [11:02:18] raynor: let me know once you're done, please [11:03:00] kk [11:03:30] (03Merged) 10jenkins-bot: Do not set wgMFNoindexPages config flag in mobile.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540193 (https://phabricator.wikimedia.org/T206497) (owner: 10Pmiazga) [11:03:51] (03PS1) 10Urbanecm: Enable partial blocks at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540377 (https://phabricator.wikimedia.org/T233754) [11:06:44] (03PS2) 10Giuseppe Lavagetto: Further fixes to debianization [software/service-checker] - 10https://gerrit.wikimedia.org/r/540375 [11:09:03] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Further fixes to debianization [software/service-checker] - 10https://gerrit.wikimedia.org/r/540375 (owner: 10Giuseppe Lavagetto) [11:11:43] (03PS3) 10Pmiazga: Set new MFMobileFormatterOptions config using old config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540107 (https://phabricator.wikimedia.org/T232690) [11:12:11] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:12:35] !log pmiazga@deploy1001 Synchronized wmf-config/mobile.php: SWAT: [[gerrit:540193|Do not set wgMFNoindexPages config flag in mobile.php (T206497)]] (duration: 01m 14s) [11:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:39] T206497: Enable 
$wgMFNoindexPages for: Italian, Dutch, Korean, Arabic, Chinese, and Hindi Wikipedias - https://phabricator.wikimedia.org/T206497 [11:13:51] (03CR) 10Alexandros Kosiaris: Add a commit message guide (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/540366 (owner: 10Alexandros Kosiaris) [11:14:03] (03PS1) 10Urbanecm: Grant autocreateaccount to everyone on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540380 (https://phabricator.wikimedia.org/T222117) [11:14:04] (03CR) 10Pmiazga: [C: 03+2] Set new MFMobileFormatterOptions config using old config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540107 (https://phabricator.wikimedia.org/T232690) (owner: 10Pmiazga) [11:14:18] <_joe_> !log uploaded service-checker 0.2.0 to stretch-wikimedia [11:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:43] (03PS3) 10Alexandros Kosiaris: Add a commit message guide [puppet] - 10https://gerrit.wikimedia.org/r/540366 [11:15:05] (03CR) 10Jbond: [C: 03+2] puppetmaster1001: migrate esams puppet traffic to codfw [dns] - 10https://gerrit.wikimedia.org/r/540367 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [11:15:07] (03Merged) 10jenkins-bot: Set new MFMobileFormatterOptions config using old config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540107 (https://phabricator.wikimedia.org/T232690) (owner: 10Pmiazga) [11:15:23] (03CR) 10Alexandros Kosiaris: "I guess that this could benefit from a wider audience. 
While voluntary and not enforced in any way, it would be nice to inform the rest of" [puppet] - 10https://gerrit.wikimedia.org/r/540366 (owner: 10Alexandros Kosiaris) [11:15:28] <_joe_> !log testing the package on restbase-dev1006 [11:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:14] <_joe_> heh as I feared [11:17:20] <_joe_> I need the puppet change ASAP [11:18:19] (03PS2) 10Urbanecm: Enable partial blocks at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540377 (https://phabricator.wikimedia.org/T233754) [11:20:50] !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:540107|Set new MFMobileFormatterOptions config using old config (T232690)]] (duration: 01m 01s) [11:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:53] T232690: Skip Expensive MobileFormatter Transformations On Pages With Extremely High Image/Heading counts - https://phabricator.wikimedia.org/T232690 [11:22:17] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:22:35] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:23:13] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:23:15] Urbanecm, I'm done, do you want to push sth more or can I close the SWAT? [11:23:33] raynor: I have one patch [11:23:36] <_joe_> uh what are those ripe alerts? 
[11:23:37] (03CR) 10Urbanecm: [C: 03+2] Enable partial blocks at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540377 (https://phabricator.wikimedia.org/T233754) (owner: 10Urbanecm) [11:23:45] kk, Urbanecm so SWAT is yours [11:23:49] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:23:55] thx raynor [11:24:21] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:24:27] !log update puppet.esams.wmnet to puppetmaster2001 [11:24:28] _joe_, no idea, most probably not related to my patches [11:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:47] (03Merged) 10jenkins-bot: Enable partial blocks at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540377 (https://phabricator.wikimedia.org/T233754) (owner: 10Urbanecm) [11:24:47] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:24:53] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:25:23] (03CR) 10Jbond: [C: 03+2] puppetmaster1001: move eqaid puppet to codfw [dns] - 10https://gerrit.wikimedia.org/r/540368 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [11:25:59] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - 
/usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:26:09] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:26:36] !log update puppet.eqiad.wmnet to puppetmaster2001 [11:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:48] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 01711d5: Enable partial blocks at ptwiki (T233754) (duration: 00m 55s) [11:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:51] T233754: Enable Partial Blocks on Portuguese Wikipedia - https://phabricator.wikimedia.org/T233754 [11:26:56] !log EU SWAT done [11:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:03] !log installing cryptsetup bugfix from buster 10.1 point release [11:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:19] <_joe_> raynor: surely not [11:41:24] (03PS1) 10Giuseppe Lavagetto: service-checker: bump to python3 version on stretch+ [puppet] - 10https://gerrit.wikimedia.org/r/540386 [11:44:58] (03PS2) 10Giuseppe Lavagetto: service-checker: bump to python3 version on stretch+ [puppet] - 10https://gerrit.wikimedia.org/r/540386 [11:45:35] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:47:39] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/18714/ LGTM, will merge later." 
[puppet] - 10https://gerrit.wikimedia.org/r/540386 (owner: 10Giuseppe Lavagetto) [11:50:45] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 3 probes of 507 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:51:27] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 507 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:51:59] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 2 probes of 507 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:52:53] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 21 probes of 459 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:53:01] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 2 probes of 507 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:54:01] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 25 probes of 459 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:54:11] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 21 probes of 459 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:55:57] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 24 probes of 459 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:59:02] (03CR) 10Giuseppe 
Lavagetto: [C: 03+2] service-checker: bump to python3 version on stretch+ [puppet] - 10https://gerrit.wikimedia.org/r/540386 (owner: 10Giuseppe Lavagetto) [12:11:05] (03PS1) 10Jbond: puppetmaster1001: update config-manager to prepare for reimage [dns] - 10https://gerrit.wikimedia.org/r/540390 (https://phabricator.wikimedia.org/T234315) [12:11:55] (03PS2) 10Jbond: puppetmaster1001: update config-master to prepare for reimage [dns] - 10https://gerrit.wikimedia.org/r/540390 (https://phabricator.wikimedia.org/T234315) [12:18:41] (03PS1) 10Jbond: puppetmaster1001: move ca to puppetmaster2001 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/540392 (https://phabricator.wikimedia.org/T234315) [12:19:23] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster1001: move ca to puppetmaster2001 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/540392 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [12:22:09] (03PS1) 10Jbond: pybal_config: remove puppetmaster1001 from pybal_config backend [puppet] - 10https://gerrit.wikimedia.org/r/540393 (https://phabricator.wikimedia.org/T234315) [12:26:49] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:27:11] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:40:47] 10Operations, 10SRE-Access-Requests: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10CorinnaHillebrand_WMDE) [12:53:02] (03PS2) 10Gehel: elasticsearch: move rsyslog profile to cirrus profile [puppet] - 10https://gerrit.wikimedia.org/r/540373 (owner: 10Mathew.onipe) [12:55:09] (03CR) 10Kwanele22: "free vpn" [puppet] - 10https://gerrit.wikimedia.org/r/509595 
(https://phabricator.wikimedia.org/T221654) (owner: 10Alex Monk) [12:55:28] (03PS1) 10Giuseppe Lavagetto: Add Conflicts: line to debian/control [software/service-checker] - 10https://gerrit.wikimedia.org/r/540399 [12:56:10] (03CR) 10Gehel: [C: 03+2] elasticsearch: move rsyslog profile to cirrus profile [puppet] - 10https://gerrit.wikimedia.org/r/540373 (owner: 10Mathew.onipe) [12:56:19] onimisionipe: ^ [12:57:20] (03CR) 10Muehlenhoff: puppetmaster1001: move ca to puppetmaster2001 for reimage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540392 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [12:57:57] (03PS1) 10Kwanele22: vpn [puppet] - 10https://gerrit.wikimedia.org/r/540400 [12:58:02] (03PS1) 10Jbond: puppetmaster1001: update puppet.wikimedia.org to point to codfw [dns] - 10https://gerrit.wikimedia.org/r/540401 [12:59:38] (03PS2) 10Jbond: puppetmaster1001: move ca to puppetmaster2001 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/540392 (https://phabricator.wikimedia.org/T234315) [12:59:50] (03CR) 10Jbond: puppetmaster1001: move ca to puppetmaster2001 for reimage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540392 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [13:00:28] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/540392 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [13:02:32] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/540393 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [13:03:33] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add Conflicts: line to debian/control [software/service-checker] - 10https://gerrit.wikimedia.org/r/540399 (owner: 10Giuseppe Lavagetto) [13:06:05] (03PS1) 10Pmiazga: Remove Minerva EventLogging error tracking configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540404 (https://phabricator.wikimedia.org/T233663) [13:14:19] (03PS3) 10Jbond: puppetmaster1001: move ca to 
puppetmaster2001 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/540392 (https://phabricator.wikimedia.org/T234315) [13:15:14] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Sadly those SANs won't be enough. You also need to include in the SAN the same domains that are included in the certs for the appservers a" [puppet] - 10https://gerrit.wikimedia.org/r/540252 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [13:16:10] (03CR) 10Jbond: [C: 03+2] puppetmaster1001: move ca to puppetmaster2001 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/540392 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [13:16:14] (03CR) 10Giuseppe Lavagetto: [C: 03+1] puppetmaster1001: update config-master to prepare for reimage [dns] - 10https://gerrit.wikimedia.org/r/540390 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [13:16:39] !log installing console-setup bugfix update from buster point release [13:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:54] (03CR) 10Jbond: [C: 03+2] puppetmaster1001: update config-master to prepare for reimage [dns] - 10https://gerrit.wikimedia.org/r/540390 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [13:18:47] !log installing mariadb-10.3 update from buster point release (just client side libs, tools) [13:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:51] (03PS2) 10Jbond: puppetmaster1001: update puppet.wikimedia.org to point to codfw [dns] - 10https://gerrit.wikimedia.org/r/540401 [13:20:58] 10Operations, 10DC-Ops, 10decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (10MoritzMuehlenhoff) [13:23:29] PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [13:24:00] (03PS1) 10Muehlenhoff: elastic1017: Remove
puppet references [puppet] - 10https://gerrit.wikimedia.org/r/540407 (https://phabricator.wikimedia.org/T234045) [13:24:45] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.7464 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:25:39] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.02954 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:30:02] (03PS1) 10Muehlenhoff: Cumin alias updates [puppet] - 10https://gerrit.wikimedia.org/r/540408 [13:30:59] PROBLEM - puppetmaster https on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [13:31:13] (03CR) 10Jbond: [C: 03+2] puppetmaster1001: update puppet.wikimedia.org to point to codfw [dns] - 10https://gerrit.wikimedia.org/r/540401 (owner: 10Jbond) [13:33:29] ^^ looking [13:35:53] RECOVERY - puppetmaster https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 398 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [13:39:14] (03CR) 10Elukey: [C: 03+2] Cumin alias updates [puppet] - 10https://gerrit.wikimedia.org/r/540408 (owner: 10Muehlenhoff) [13:40:18] jbond42: just to avoid issues, where should I puppet-merge? [13:40:50] elukey: you can still merge on 1001 [13:41:19] (03PS1) 10Jbond: Revert "puppetmaster1001: move ca to puppetmaster2001 for reimage" [puppet] - 10https://gerrit.wikimedia.org/r/540409 [13:42:44] ack [13:44:15] (03PS2) 10Jbond: Revert "puppetmaster1001: move ca to puppetmaster2001 for reimage" [puppet] - 10https://gerrit.wikimedia.org/r/540409 [13:44:41] <_joe_> jbond42: what is going wrong? 
[13:45:00] (03CR) 10Jbond: [C: 03+2] Revert "puppetmaster1001: move ca to puppetmaster2001 for reimage" [puppet] - 10https://gerrit.wikimedia.org/r/540409 (owner: 10Jbond) [13:45:35] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Jerrie Kumalah - https://phabricator.wikimedia.org/T234433 (10jkumalah) [13:45:39] _joe_: passenger keeps dying [13:45:57] <_joe_> just too much load? [13:46:04] i also noticed that the socket utilisation shot up shortly after i changed the puppet_ca server so i'm reverting that [13:46:07] yes could be [13:46:24] <_joe_> lmk if you need me to look into it [13:46:58] i also changed config-master to point to puppetmaster2001, what effect would that have? [13:47:23] (I am rebooting an-conf1001 for tests) [13:48:35] PROBLEM - Host an-conf1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:48:47] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Jerrie Kumalah - https://phabricator.wikimedia.org/T234433 (10jkumalah) ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDJpUZ9XpqGc8Oz2rFbYdfiJHTp3JYDnOthxVwRfv+aIjFQzq7bPHsTqbCi8g/thANeCwk32l4PQaMOXpRezW/de+MZh... 
[13:48:48] <_joe_> jbond42: null [13:48:59] <_joe_> it should have no effect, it's very low traffic [13:49:15] ack thanks [13:56:29] (03PS1) 10Jbond: puppet.wikimedia.org: move iot back to puppetmaster1001 [dns] - 10https://gerrit.wikimedia.org/r/540412 [13:56:58] 10Operations: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T232310 (10MoritzMuehlenhoff) [13:57:25] (03CR) 10Jbond: [C: 03+2] puppet.wikimedia.org: move iot back to puppetmaster1001 [dns] - 10https://gerrit.wikimedia.org/r/540412 (owner: 10Jbond) [13:58:41] RECOVERY - Host an-conf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [14:06:05] PROBLEM - Host an-conf1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:33] RECOVERY - puppetmaster backend https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 398 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:10:07] RECOVERY - Host an-conf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [14:15:19] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1006.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1006.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:15:51] PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:16:14] !log installing babeltrace bugfix update from buster point release [14:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:43] PROBLEM - puppetmaster https on puppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:17:02] onimisionipe, gehel --^ [14:17:19] RECOVERY - puppetmaster backend https on puppetmaster2001 is OK: HTTP OK: Status line output 
matched 400 - 398 bytes in 1.079 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:17:21] 10Operations: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T232310 (10MoritzMuehlenhoff) [14:17:33] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1006.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [14:17:35] 10Operations: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T232310 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All done [14:18:02] (03PS2) 10Muehlenhoff: elastic1017: Remove puppet references [puppet] - 10https://gerrit.wikimedia.org/r/540407 (https://phabricator.wikimedia.org/T234045) [14:18:33] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:21:13] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/540407 (https://phabricator.wikimedia.org/T234045) (owner: 10Muehlenhoff) [14:23:09] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:23:51] (03PS1) 10Muehlenhoff: Remove DNS entry for elastic1017 [dns] - 10https://gerrit.wikimedia.org/r/540415 (https://phabricator.wikimedia.org/T234045) [14:23:52] wdqs seems fine now (at least from pybal's point of view) [14:23:56] should we follow up? [14:23:57] PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:24:01] gehel: --^ [14:24:18] it's still overloaded, looks related to edits, so not much we can do [14:24:31] it's the external endpoint, instabilities are somewhat expected [14:24:40] yep yep [14:25:03] architectural issues... [14:25:08] as curiosiry - you mentioned that it is related to edits, what does it mean? 
[14:26:06] the edit rate on wikidata will impact the load on wdqs (as edits are ingested by wdqs) [14:26:32] a common problem is some bot doing a high number of edits and wdqs having issues in catching up [14:26:50] honestly, we should throttle our update process to protect the reads, at the cost of lag... [14:27:40] (03CR) 10Muehlenhoff: [C: 03+2] Remove DNS entry for elastic1017 [dns] - 10https://gerrit.wikimedia.org/r/540415 (https://phabricator.wikimedia.org/T234045) (owner: 10Muehlenhoff) [14:27:47] ahh okok [14:27:56] and ingesting the edits is done via the updater? [14:28:03] what makes me think this is edit related is that I can also see minor lag on the internal cluster, which has a very low and consistent read load [14:28:24] of course, in the end the contention comes from the cumulative effect of reads and writes [14:29:33] elukey: yep, the updater is the process that consumes edits on wikidata and pushes them into blazegraph [14:31:50] !log installing libxslt security updates on stretch [14:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:14] gehel: super thanks for the explanation [14:35:15] (03CR) 10Volans: backends: add Netbox backend (035 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [14:35:25] elukey: always a pleasure! [14:41:06] (03CR) 10Volans: "I had a couple of old comments, plus one post-merge." (034 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/537242 (https://phabricator.wikimedia.org/T230449) (owner: 10CRusnov) [14:44:24] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for andrew-wmde - https://phabricator.wikimedia.org/T233202 (10Tobi_WMDE_SW) >>! In T233202#5538406, @Lucas_Werkmeister_WMDE wrote: >> Please reach out to fellow WMDE deployers for a training or if not possible, RelEng team members. > > I’d... 
[14:53:09] (03CR) 10Herron: admin: add expiry_date and expiry_contact for user eyener (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540207 (https://phabricator.wikimedia.org/T233636) (owner: 10Herron) [14:56:35] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:56:52] 10Operations, 10ops-codfw: Degraded RAID on db2050 - https://phabricator.wikimedia.org/T234444 (10ops-monitoring-bot) [14:57:51] exceptions issue known? [15:07:09] (03PS2) 10Herron: admin: add expiry_date and expiry_contact for user eyener [puppet] - 10https://gerrit.wikimedia.org/r/540207 (https://phabricator.wikimedia.org/T233636) [15:07:12] (03PS1) 10Mforns: analytics:refinery:job:data_purge: Re-enable deletion of geoeditors [puppet] - 10https://gerrit.wikimedia.org/r/540421 (https://phabricator.wikimedia.org/T234238) [15:07:25] (03PS1) 10Jbond: puppetmaster1001: switch puppet.eqiad to eqiad [dns] - 10https://gerrit.wikimedia.org/r/540422 [15:07:59] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:08:11] (03CR) 10Mforns: "I tested this and checked the checksum, good to go!" 
[puppet] - 10https://gerrit.wikimedia.org/r/540421 (https://phabricator.wikimedia.org/T234238) (owner: 10Mforns) [15:08:30] (03CR) 10Jbond: [C: 03+2] puppetmaster1001: switch puppet.eqiad to eqiad [dns] - 10https://gerrit.wikimedia.org/r/540422 (owner: 10Jbond) [15:08:48] (03PS2) 10Jbond: puppetmaster1001: switch puppet.eqiad to eqiad [dns] - 10https://gerrit.wikimedia.org/r/540422 [15:09:07] (03CR) 10Muehlenhoff: [C: 03+2] elastic1017: Remove puppet references [puppet] - 10https://gerrit.wikimedia.org/r/540407 (https://phabricator.wikimedia.org/T234045) (owner: 10Muehlenhoff) [15:10:46] (03PS3) 10Herron: admin: add expiry_date and expiry_contact for user eyener [puppet] - 10https://gerrit.wikimedia.org/r/540207 (https://phabricator.wikimedia.org/T233636) [15:11:58] (03CR) 10Muehlenhoff: [C: 03+1] admin: add expiry_date and expiry_contact for user eyener [puppet] - 10https://gerrit.wikimedia.org/r/540207 (https://phabricator.wikimedia.org/T233636) (owner: 10Herron) [15:12:18] (03PS1) 10Muehlenhoff: Simplify partman config for elastic* [puppet] - 10https://gerrit.wikimedia.org/r/540423 [15:12:49] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:12:51] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/540423 (owner: 10Muehlenhoff) [15:13:34] (03CR) 10Herron: [C: 03+2] admin: add expiry_date and expiry_contact for user eyener [puppet] - 10https://gerrit.wikimedia.org/r/540207 (https://phabricator.wikimedia.org/T233636) (owner: 10Herron) [15:13:35] !log run swiftrepl eqiad -> codfw on ms-fe1005 (no deletes) [15:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:52] (03PS2) 10Muehlenhoff: Simplify partman config for elastic* [puppet] - 
10https://gerrit.wikimedia.org/r/540423 [15:14:22] 10Operations, 10DC-Ops, 10decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (10MoritzMuehlenhoff) [15:15:13] RECOVERY - puppetmaster https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 399 bytes in 1.086 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:15:37] (03CR) 10Muehlenhoff: [C: 03+2] Simplify partman config for elastic* [puppet] - 10https://gerrit.wikimedia.org/r/540423 (owner: 10Muehlenhoff) [15:15:59] RECOVERY - puppetmaster backend https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 398 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:19:37] (03PS1) 10CDanis: grafana: remove obsolete deleteDatasources stanza [puppet] - 10https://gerrit.wikimedia.org/r/540424 [15:19:59] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.005076 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:22:23] (03CR) 10Herron: elasticsearch: add toggle for rsyslog udp logback compat include (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) (owner: 10Herron) [15:24:11] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:24:20] (03CR) 10Mathew.onipe: elasticsearch: add toggle for rsyslog udp logback compat include (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) (owner: 10Herron) [15:26:11] (03CR) 10Herron: elasticsearch: add toggle for rsyslog udp logback compat include (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) (owner: 10Herron) [15:26:30] 10Operations, 10ops-codfw, 10netops: msw-c1 down? - https://phabricator.wikimedia.org/T234411 (10ayounsi) a:03Papaul @papaul, can you check the LED status, cables (all properly connected), then power cycle the device? [15:26:58] (03Abandoned) 10Herron: elasticsearch: add toggle for rsyslog udp logback compat include [puppet] - 10https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) (owner: 10Herron) [15:29:56] !log swift codfw-prod: add ms-be2051 T233638 [15:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:00] T233638: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 [15:30:35] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:31:10] !log correction, add ms-be2052 [15:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:42] (03PS6) 10Herron: logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) [15:34:44] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: remove obsolete deleteDatasources stanza [puppet] - 10https://gerrit.wikimedia.org/r/540424 (owner: 10CDanis) [15:34:56] (03CR) 10Herron: logstash: throttle duplicate normalized_message with level:ERR* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron) [15:35:26] (03PS2) 10CDanis: grafana: remove obsolete deleteDatasources stanza [puppet] - 10https://gerrit.wikimedia.org/r/540424 [15:35:39] 10Puppet, 10Patch-For-Review: ensure additional puppetmaster files are managed by puppet - https://phabricator.wikimedia.org/T234332 (10jbond) p:05Triage→03Normal [15:36:12] 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10jbond) Tagging https://phabricator.wikimedia.org/T184562#4069362 as a usefull refrence [15:36:40] (03PS1) 10Jbond: puppetmaster_ca: move ca functions to puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/540431 (https://phabricator.wikimedia.org/T234315) [15:37:37] (03CR) 10CDanis: [V: 03+2] grafana: remove obsolete deleteDatasources stanza [puppet] - 10https://gerrit.wikimedia.org/r/540424 (owner: 10CDanis) [15:38:16] 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10jbond) tagging: * https://wikitech.wikimedia.org/wiki/Puppet_CA_replacement * https://phabricator.wikimedia.org/T184562#4069362 
[15:38:22] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron) [15:38:24] (03CR) 10Jbond: [C: 03+2] puppetmaster_ca: move ca functions to puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/540431 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [15:38:30] 10Operations, 10ops-eqiad, 10netops: asw2-a-eqiad <-> cr2-eqiad fiber issue - https://phabricator.wikimedia.org/T234416 (10ayounsi) a:03Cmjohnson Related to T203719. @Cmjohnson same as when there are interfaces errors, but here monitor for new: `sfp-7/0/46 link 46 SFP receive power low warning set` in `sh... [15:38:33] (03PS2) 10Jbond: puppetmaster_ca: move ca functions to puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/540431 (https://phabricator.wikimedia.org/T234315) [15:40:29] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:40:38] shush [15:43:23] 10Operations, 10ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10fgiunchedi) p:05Normal→03High Please note that putting these systems in production is becoming urgent, is there a status update and/or ETA? 
[15:45:11] PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:46:53] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01513 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:47:33] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.6952 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:48:19] (03PS1) 10Jbond: Revert "puppetmaster_ca: move ca functions to puppetmaster2001" [puppet] - 10https://gerrit.wikimedia.org/r/540434 [15:49:12] (03CR) 10Jbond: [C: 03+2] Revert "puppetmaster_ca: move ca functions to puppetmaster2001" [puppet] - 10https://gerrit.wikimedia.org/r/540434 (owner: 10Jbond) [15:50:18] (03CR) 10Krinkle: Add a commit message guide (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540366 (owner: 10Alexandros Kosiaris) [15:51:05] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:51:38] puppet master in trouble but no irc spam from puppet failures -> great win https://i.imgur.com/28wUHI0.gifv [15:51:41] RECOVERY - puppetmaster backend https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 398 bytes in 0.168 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:52:16] yes should be fixed again now, just wanted to pin down what cause the last error [15:52:39] godog: this is great [15:53:06] (03CR) 10Brennen Bearnes: "Just in case they contain any ideas of interest, Tyler's version of this:" [puppet] - 
10https://gerrit.wikimedia.org/r/540366 (owner: 10Alexandros Kosiaris) [15:53:10] jbond42: *nod* thanks for taking care of that [15:53:11] yesindeed irc would have been unusable for the last few hours [15:53:19] 10Operations, 10netops: Telia IC-314534 (eqord/eqdfw 10Gbps wave) down - https://phabricator.wikimedia.org/T234335 (10ayounsi) 05Open→03Resolved a:03elukey Work completed, everything is up, thank to you two! [15:53:42] (03CR) 10Volans: "We need to add more validation steps, not only on the gdnsd side, but on the content, to ensure that each $ORIGIN has likely good data (at" (034 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [15:56:57] herron: https://66.media.tumblr.com/8ff56d9711d8702e77585b5896f3cb4b/tumblr_nfwyq36Pkg1qh9nffo1_400.gif [15:58:25] hah pssssht [15:58:45] (03CR) 10Herron: "Krinkle do you have any reservations about deploying this?" [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron) [15:59:10] (03CR) 10Volans: "What's the status of this? I think we'll need to re-run it at some point once we'll be ready to switch to the generated ones." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529977 (https://phabricator.wikimedia.org/T228670) (owner: 10CRusnov) [16:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191002T1600). [16:00:04] MatmaRex and Lucas_WMDE: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:08] (03CR) 10CRusnov: "> Patch Set 2:" (034 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [16:00:27] ACKNOWLEDGEMENT - Host restbase2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411 [16:00:27] ACKNOWLEDGEMENT - Host restbase2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411 [16:00:27] ACKNOWLEDGEMENT - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411 [16:00:27] ACKNOWLEDGEMENT - Host pc2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411 [16:00:27] ACKNOWLEDGEMENT - Host ores2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411 [16:00:27] ACKNOWLEDGEMENT - Host mc2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411 [16:00:27] ACKNOWLEDGEMENT - Host mc2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411 [16:00:28] ACKNOWLEDGEMENT - Host ganeti2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411 [16:00:28] ACKNOWLEDGEMENT - Host ganeti2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411 [16:00:29] ACKNOWLEDGEMENT - Host es2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411 [16:00:29] ACKNOWLEDGEMENT - Host es2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411 [16:00:30] ACKNOWLEDGEMENT - Host elastic2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411 [16:00:30] ACKNOWLEDGEMENT - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi 
https://phabricator.wikimedia.org/T234411 [16:00:31] ACKNOWLEDGEMENT - Host db2112.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411 [16:00:31] ACKNOWLEDGEMENT - Host db2087.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411 [16:00:32] ACKNOWLEDGEMENT - Host db2077.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411 [16:00:32] ACKNOWLEDGEMENT - Host cloudservices2002-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411 [16:00:33] ACKNOWLEDGEMENT - Host cloudcontrol2003-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T234411 [16:00:49] (03CR) 10CRusnov: "> Patch Set 3:" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529977 (https://phabricator.wikimedia.org/T228670) (owner: 10CRusnov) [16:01:01] hi [16:01:02] 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10jbond) during the upgrade of the puppet master i attempted to move the puppetmaster_ca from puppetmaster1001 to puppetmaster2001. however when i attempted this we saw a whole skew of errors· L... [16:01:20] o/ [16:01:27] I’ll be ready in a second [16:03:02] (03PS3) 10CRusnov: Add script to generate DNS records from Netbox [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183) [16:03:02] MatmaRex: do you want to go ahead and start merging your changes? 
[16:03:03] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.002161 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:03:08] they’ll take a while to go through CI anyways, probably [16:03:19] Lucas_WMDE: i don't have +2 rights in wmf branches [16:03:24] oh, ok [16:03:29] then I can do it [16:03:37] :) [16:03:58] …the version on master hasn’t been merged yet? [16:07:22] yeah [16:07:54] I don’t think I’m supposed to deploy backports if they’re not merged in master yet… [16:08:04] is it okay if I deploy my backport first and come back to you? [16:08:15] (or, start deploying it, will take a while in CI anyways) [16:08:25] sure [16:09:20] i can try again in the next SWAT slot too, after hopefully other VE folks wake up and review the patches [16:10:59] (03CR) 10Krinkle: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron) [16:13:07] well, I also left a review on one of the patches d) [16:13:10] * :) [16:15:32] oh, and one of them is already backported to .24? [16:15:40] I was wondering why that one seemed to be missing [16:15:40] 10Operations: Puppet breakage in automation-feedback VMs - https://phabricator.wikimedia.org/T234452 (10Andrew) [16:16:06] 10Operations: Puppet breakage in automation-feedback VMs - https://phabricator.wikimedia.org/T234452 (10Andrew) I dug a little deeper, and the primary issue is local diffs in /var/lib/git/operations/puppet on af-puppetmaster02.automation-framework.eqiad.wmflabs. If you commit those and are able to get a sensibl... [16:16:36] 10Operations: Puppet breakage in automation-feedback VMs - https://phabricator.wikimedia.org/T234452 (10crusnov) THanks for the heads up, we'll loop around to fix these up. 
[16:16:45] Lucas_WMDE: Landing now, maybe good to go if other stuff is done with swat [16:16:48] (in master) [16:16:57] ok [16:17:00] I also have a late addition to the swat - https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/Vector/+/540396/ [16:17:00] then I’ll just +2 the lot [16:17:09] but I'm okay doing it afterwards if other stuff is still ahead, no problem. [16:18:27] MatmaRex: is this VisualEditor channel registered already? I don’t see it being used elsewhere [16:19:08] Lucas_WMDE: it should be, as of yesterday evening [16:19:17] * Lucas_WMDE pulls mediawiki-config again [16:19:22] unless i did something wrong [16:19:24] ah, indeed :) [16:19:27] ok [16:19:30] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/540235 [16:22:37] ah damn, the Wikibase gate-and-submit is failing [16:23:05] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.002882 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:34:34] (03PS2) 10Elukey: analytics:refinery:job:data_purge: Re-enable deletion of geoeditors [puppet] - 10https://gerrit.wikimedia.org/r/540421 (https://phabricator.wikimedia.org/T234238) (owner: 10Mforns) [16:35:06] (03CR) 10Elukey: [C: 03+2] analytics:refinery:job:data_purge: Re-enable deletion of geoeditors [puppet] - 10https://gerrit.wikimedia.org/r/540421 (https://phabricator.wikimedia.org/T234238) (owner: 10Mforns) [16:39:06] okay, let’s merge those VE backports [16:39:07] (03CR) 10Alexandros Kosiaris: Add a commit message guide (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540366 (owner: 10Alexandros Kosiaris) [16:39:10] MatmaRex: can they be tested at all? [16:39:16] (probably not?) [16:39:29] Lucas_WMDE: not really. it's logging for a rare problem that we're trying to debug [16:39:35] yeah, ok [16:39:46] then I’ll just sync them and rely on the scap canaries, I think [16:39:57] skipping mwdebug [16:40:10] alright. 
thanks [16:42:17] (03PS4) 10Ottomata: Revert "Disable public revision-score events until we figure out a good schema" [puppet] - 10https://gerrit.wikimedia.org/r/475818 (owner: 10Ppchelko) [16:42:36] Krinkle: I’ve +2ed your change, let’s hope there’s enough time for it to go through CI [16:42:42] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.24/extensions/VisualEditor/: SWAT: [[gerrit:540428|ApiVisualEditorEdit: Add logging for funny etags (T233320)]] (duration: 01m 03s) [16:42:45] (the Wikibase change definitely won’t make it, I’m giving up on that one) [16:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:47] T233320: VisualEditor <-> RESTBase communication and ETags - https://phabricator.wikimedia.org/T233320 [16:46:24] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.25/extensions/VisualEditor/: SWAT: [[gerrit:540427|ApiVisualEditor: Add logging for RESTBase HTTP errors (T233127)]] + [[gerrit:540429|ApiVisualEditorEdit: Add logging for funny etags (T233320)]] (duration: 01m 04s) [16:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:30] T233127: HTTP 404 error in VE possibly when confronted with an edit conflict - https://phabricator.wikimedia.org/T233127 [16:47:01] Lucas_WMDE: thx [16:47:08] adding it to the SWAT calendar now [16:47:41] (03CR) 10Ottomata: [C: 03+2] Revert "Disable public revision-score events until we figure out a good schema" [puppet] - 10https://gerrit.wikimedia.org/r/475818 (owner: 10Ppchelko) [16:50:29] !log otto@deploy1001 Started deploy [eventstreams/deploy@dbc9bbb]: (no justification provided) [16:50:30] !log otto@deploy1001 Finished deploy [eventstreams/deploy@dbc9bbb]: (no justification provided) (duration: 00m 01s) [16:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:40] !log otto@deploy1001 
Started restart [eventstreams/deploy@dbc9bbb]: (no justification provided) [16:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:36] (thanks for deploying) [16:51:45] (no problem) [16:53:39] !log otto@deploy1001 Started restart [eventstreams/deploy@dbc9bbb]: Enabling revision-score stream in eventstreams [16:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:23] 10Operations, 10hardware-requests: eqiad+codfw: 6x hardware request for swift backend (each site) - https://phabricator.wikimedia.org/T227314 (10RobH) [16:59:42] Krinkle: can your change be tested? [16:59:53] (I would guess no, looks like just an optimization?) [17:01:32] well, I’ll do a quick test on mwdebug [17:02:45] nothing looks broken, syncing [17:02:56] 10Operations, 10hardware-requests: eqiad+codfw: 6x hardware request for swift backend (each site) - https://phabricator.wikimedia.org/T227314 (10RobH) 05Open→03Resolved a:03RobH all hardware requested was ordered so this is resolved [17:03:46] Lucas_WMDE: testing now [17:03:53] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.25/skins/Vector/: SWAT: [[gerrit:540396|vector.js: Remove eager calculation of p-cactions width on page load]] (duration: 01m 00s) [17:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:52] (03PS1) 10Mforns: web::fetches::stats.pp Absent mediawiki history rsync [puppet] - 10https://gerrit.wikimedia.org/r/540442 (https://phabricator.wikimedia.org/T208612) [17:05:18] (I just realized I tested on wikidatawiki, which is still on .24 🤦) [17:05:35] (but testwikidatawiki also looks ok) [17:05:58] (03CR) 10Dzahn: [C: 03+2] "not used in prod DB but fixes something on cloud" [puppet] - 10https://gerrit.wikimedia.org/r/539973 (owner: 10Dzahn) [17:06:11] (03PS6) 10Dzahn: mariadb::packages: support buster, drop libmariadbclient18 [puppet] - 10https://gerrit.wikimedia.org/r/539973 [17:07:18] (03PS1) 10Filippo 
Giunchedi: hieradata: bump shard size threshold for logstash [puppet] - 10https://gerrit.wikimedia.org/r/540444 [17:08:09] !log Morning SWAT done [17:08:11] Lucas_WMDE: getting cache issue, one min.. [17:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:17] (03CR) 10Ottomata: [C: 03+2] web::fetches::stats.pp Absent mediawiki history rsync [puppet] - 10https://gerrit.wikimedia.org/r/540442 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns) [17:08:24] (03PS2) 10Ottomata: web::fetches::stats.pp Absent mediawiki history rsync [puppet] - 10https://gerrit.wikimedia.org/r/540442 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns) [17:08:27] (03CR) 10Ottomata: [V: 03+2 C: 03+2] web::fetches::stats.pp Absent mediawiki history rsync [puppet] - 10https://gerrit.wikimedia.org/r/540442 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns) [17:08:34] Lucas_WMDE: anyway, np, will follow up if needed. [17:08:36] I have to leave pretty soon :/ [17:08:39] ok [17:09:14] (03CR) 10Dzahn: "what is using this: quarry, wikistats (cloud), cloud VPS using simplelamp in the future (to remove mysql module)" [puppet] - 10https://gerrit.wikimedia.org/r/539973 (owner: 10Dzahn) [17:09:55] OK, all good in prod. [17:09:58] Thanks Lucas_WMDE [17:10:07] 10Operations, 10Patch-For-Review, 10User-fgiunchedi: rsyslog's in:imtcp thread stuck on recvfrom loop from down/rebooted hosts - https://phabricator.wikimedia.org/T199406 (10fgiunchedi) 05Open→03Stalled Setting as stalled for now, the immediate issue has been bandaided [17:11:38] great! 
:) [17:12:15] (03PS1) 10Ottomata: Add ensure parameter to dumps::web::fetches [puppet] - 10https://gerrit.wikimedia.org/r/540445 [17:13:02] (03CR) 10Ottomata: [C: 03+2] Add ensure parameter to dumps::web::fetches [puppet] - 10https://gerrit.wikimedia.org/r/540445 (owner: 10Ottomata) [17:13:09] (03PS2) 10Ottomata: Add ensure parameter to dumps::web::fetches [puppet] - 10https://gerrit.wikimedia.org/r/540445 [17:13:11] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add ensure parameter to dumps::web::fetches [puppet] - 10https://gerrit.wikimedia.org/r/540445 (owner: 10Ottomata) [17:17:06] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10Traffic, and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10Jdforrester-WMF) [17:20:20] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Goal, and 2 others: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759 (10dduvall) 05Open→03Stalled Work on this has stalled. I've uninstalled the Prometheus plugin from Jenki... [17:20:24] 10Operations, 10Goal, 10User-fgiunchedi: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197 (10dduvall) [17:21:43] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Goal, and 2 others: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759 (10dduvall) 05Stalled→03Declined Marking this as "declined" to remove the task from view. We can always... 
[17:21:47] 10Operations, 10Goal, 10User-fgiunchedi: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197 (10dduvall) [17:31:15] (03PS1) 10Dzahn: wikistats (cloud): add deployment info to motd [puppet] - 10https://gerrit.wikimedia.org/r/540447 [17:32:30] 10Operations, 10Puppet, 10Patch-For-Review: Puppet: get row/rack info from Netbox - https://phabricator.wikimedia.org/T229397 (10crusnov) a:03crusnov [17:35:38] (03PS2) 10Dzahn: wikistats (cloud): add deployment info to motd [puppet] - 10https://gerrit.wikimedia.org/r/540447 [17:37:00] (03CR) 10Dzahn: [C: 03+2] "https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikistats#How_to_deploy_latest_code" [puppet] - 10https://gerrit.wikimedia.org/r/540447 (owner: 10Dzahn) [17:43:18] (03CR) 10Hashar: "> Package[openjdk-8-dbg]/ensure: created on gerrit2001." [puppet] - 10https://gerrit.wikimedia.org/r/540094 (https://phabricator.wikimedia.org/T231872) (owner: 10Hashar) [17:44:10] 10Puppet: Allow variables without hiera calls as lookup() default parameters - https://phabricator.wikimedia.org/T234459 (10fgiunchedi) [17:45:07] (03PS5) 10Filippo Giunchedi: profile: sanity checks for cluster [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) [17:45:09] (03PS2) 10Filippo Giunchedi: DNM Revert "hieradata: add acmechief cluster" [puppet] - 10https://gerrit.wikimedia.org/r/540246 [17:46:35] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [17:49:41] (03CR) 10Nuria: "Thanks Marcel for taking care of this." 
[puppet] - 10https://gerrit.wikimedia.org/r/540421 (https://phabricator.wikimedia.org/T234238) (owner: 10Mforns) [17:52:34] (03PS1) 10Dzahn: wikistats (cloud): fix template path, refactor db, httpd to profiles [puppet] - 10https://gerrit.wikimedia.org/r/540450 [17:59:45] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18717/wikistats-greyhound.wikistats.eqiad.wmflabs/change.wikistats-greyhound.wikistats.eq" [puppet] - 10https://gerrit.wikimedia.org/r/540450 (owner: 10Dzahn) [17:59:58] (03PS2) 10Dzahn: wikistats (cloud): fix template path, refactor db, httpd to profiles [puppet] - 10https://gerrit.wikimedia.org/r/540450 [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191002T1800) [18:01:14] (03PS7) 10Herron: logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) [18:03:03] (03CR) 10Paladox: [C: 03+1] gerrit: install openjdk dbg package (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/540094 (https://phabricator.wikimedia.org/T231872) (owner: 10Hashar) [18:09:05] Krinkle: hey, do you have a minute? the logging patches i had swatted earlier don't seem to be actually logging anything. can you help me figure out if i did the logging wrong, or if i'm looking at the wrong place for them? 
[18:09:31] (03CR) 10Krinkle: Grant autocreateaccount to everyone on closed wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540380 (https://phabricator.wikimedia.org/T222117) (owner: 10Urbanecm) [18:10:09] (03PS1) 10Dzahn: wikistats (cloud): fix duplicate declaration of mariabd-server pkg [puppet] - 10https://gerrit.wikimedia.org/r/540454 [18:10:36] (03PS2) 10Dzahn: wikistats (cloud): fix duplicate declaration of mariabd-server pkg [puppet] - 10https://gerrit.wikimedia.org/r/540454 [18:10:38] (03CR) 10jerkins-bot: [V: 04-1] wikistats (cloud): fix duplicate declaration of mariabd-server pkg [puppet] - 10https://gerrit.wikimedia.org/r/540454 (owner: 10Dzahn) [18:10:43] MatmaRex: ok, where are you looking? [18:10:43] i go to logstash.wikimedia.org, i add a filter for "channel is VisualEditor", adjust the time range, and get "No results found" [18:11:12] i'd give you a link, but the "Share" feature doesn't seem to work [18:11:32] Share -> bottom right panel -> Short URL [18:11:34] should work [18:11:43] I'm looking at https://logstash.wikimedia.org/goto/5f4d3be5e6279601de970da1fc46df88 [18:11:47] which is showing no results indeed [18:12:12] MatmaRex: Can you repro as user a logger call? [18:12:23] If so, try with XWD and []Log enabled. 
[18:12:27] (03PS1) 10Dzahn: wikistats (cloud): fix duplicate declare of mariadb-server package [puppet] - 10https://gerrit.wikimedia.org/r/540455 [18:12:33] which should rule out any issue with filtering and config settings [18:12:34] /win 11 [18:12:34] Krinkle: thanks, that works, https://logstash.wikimedia.org/goto/78fdce6b25619569e5929c3d60eae4ee (i was taking the wrong link) [18:12:36] (03PS8) 10Herron: logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) [18:13:32] 10Operations, 10ops-codfw: Degraded RAID on db2050 - https://phabricator.wikimedia.org/T234444 (10Marostegui) 05Open→03Invalid Ignore this, this host will be powered off tomorrow: {T230391} [18:13:56] (03CR) 10Dzahn: [C: 03+2] wikistats (cloud): fix duplicate declare of mariadb-server package [puppet] - 10https://gerrit.wikimedia.org/r/540455 (owner: 10Dzahn) [18:14:25] 10Operations, 10netops: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 (10ayounsi) Updated change with the above feedbacks: `lang=diff [edit protocols bgp group IX4] + damping; [edit protocols bgp group IX6] + damping; [edit policy-options policy-statement BGP_IXP_... 
[18:14:43] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Marostegui) [18:14:46] Krinkle: it should be possible to trigger this by opening VE anywhere, doing `ve.init.target.etag = 'test'` in the console, and opening the save dialog (without saving changes) [18:15:12] !log add BGP route damping on IX sessions - ulsfo - T222424 [18:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:16] T222424: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 [18:15:19] MatmaRex: ok, try that with XWD/mwdebug/Log and check https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002 [18:15:21] Krinkle: what do you mean by " XWD and []Log"? [18:15:28] (03PS9) 10Herron: logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) [18:15:34] MatmaRex: the WikimediaDebug browser extension [18:15:41] X-Wikimedia-Debug [18:15:53] https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Debug_logging [18:15:56] (03Abandoned) 10Dzahn: wikistats (cloud): fix duplicate declaration of mariabd-server pkg [puppet] - 10https://gerrit.wikimedia.org/r/540454 (owner: 10Dzahn) [18:16:35] Krinkle: right, i just tried it, and i don't see any logged entries [18:16:49] (there's an unrelated AbuseFilter one, actually) [18:16:55] (03PS17) 10Thcipriani: Add restbase chart (port from local-charts) [deployment-charts] - 10https://gerrit.wikimedia.org/r/517557 (https://phabricator.wikimedia.org/T228910) (owner: 10Jeena Huneidi) [18:17:55] (03CR) 10Herron: [C: 03+2] logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron) [18:18:57] MatmaRex: OK, so that would mean either the code isn't actually being run the way you think, or the Logger
might be a NullLogger? [18:19:16] but from the commit I reviewed, that wasn't the case as you get one from the factory directly [18:19:37] $this->logger = LoggerFactory::getInstance( 'VisualEditor' ); [18:20:46] Krinkle: can we deploy a patch where i just log anything in execute() and see if that gets recorded? [18:21:45] https://logstash.wikimedia.org/goto/df0a315d76a3bb27f7a88410155da141 [18:21:52] I've run it from eval.php to see if it works [18:21:59] I did debug(), info() and warning() [18:22:01] only warning() made it [18:22:17] nope info() made it as well [18:23:05] but debug() didn't [18:23:07] tried it a few more times [18:23:09] 10Operations, 10netops: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 (10ayounsi) For the record: ` cr4-ulsfo> show bgp neighbor | match "Suppressed due to damping"| except " 0" Suppressed due to damping: 1 Suppressed due to damping:... [18:23:10] Krinkle: hm, okay. so i guess the bug is in my code [18:25:04] !log add BGP route damping on IX sessions - eqdfw - T222424 [18:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:07] T222424: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 [18:26:34] PROBLEM - exim queue on mx1001 is CRITICAL: CRITICAL: 4145 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim [18:27:35] MatmaRex: See the comment in IS.php above the wmf var you modified [18:27:46] Looks like there is a special rule about level 'debug' and Logstash by default. [18:27:55] you can override that, but might be better to use info() instead. 
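Editorial sketch of the behaviour discussed above, where debug() calls did not reach Logstash while info() and warning() did. This is only an analogy written with Python's standard logging module; it is not the MediaWiki or Logstash configuration, and the logger name "VisualEditor-sketch", the ListHandler class and the shipped_levels helper are hypothetical names invented for illustration. The assumption is that the shipping side applies a minimum level, so anything below that level is silently dropped.

```python
import logging


class ListHandler(logging.Handler):
    """Collects emitted records in memory, standing in for a log shipper."""

    def __init__(self, level):
        super().__init__(level)
        self.messages = []

    def emit(self, record):
        self.messages.append((record.levelname, record.getMessage()))


def shipped_levels(threshold):
    """Log one message per level and return the levels that passed the handler."""
    logger = logging.getLogger("VisualEditor-sketch")  # hypothetical channel name
    logger.setLevel(logging.DEBUG)   # the logger itself accepts everything
    logger.propagate = False         # keep the root logger out of the picture
    handler = ListHandler(threshold)  # the handler applies the minimum level
    logger.addHandler(handler)
    try:
        logger.debug("debug message")
        logger.info("info message")
        logger.warning("warning message")
    finally:
        logger.removeHandler(handler)
    return [level for level, _ in handler.messages]


# With an INFO threshold the debug() call is dropped, as seen in the eval.php test.
print(shipped_levels(logging.INFO))
```

Lowering the threshold to logging.DEBUG lets all three messages through, which corresponds to the "you can override that" option mentioned in the conversation.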
[18:28:06] !log add BGP route damping on IX sessions - eqord - T222424 [18:28:08] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:57] (03PS1) 10Herron: logstash: fix ordering in filter-throttle-errors [puppet] - 10https://gerrit.wikimedia.org/r/540456 [18:29:00] Krinkle: i am using info() though. (and warning() for another message) [18:29:27] Krinkle: the conf for 'AbuseFilter' channel looks the same, and that one works as expected [18:29:31] The one I saw in CR used debug(). If it's using info() or above already, then yeah, might be a code issue. [18:29:40] e.g. the condition not being reached [18:29:59] gotta go now :) hope this helps [18:30:34] thanks [18:30:43] some messages just appeared on my query btw MatmaRex [18:30:52] so maybe it's working now :) [18:30:53] bye! 
[18:31:14] (03CR) 10Herron: [C: 03+2] logstash: fix ordering in filter-throttle-errors [puppet] - 10https://gerrit.wikimedia.org/r/540456 (owner: 10Herron) [18:33:07] (03CR) 10Andrew Bogott: [C: 03+1] dumps-misc.sh.erb: Remove designate_pool_manager from backups [puppet] - 10https://gerrit.wikimedia.org/r/539839 (https://phabricator.wikimedia.org/T233978) (owner: 10Marostegui) [18:33:30] (03PS2) 10Andrew Bogott: deployment-prep: Fix purge_host_regex [puppet] - 10https://gerrit.wikimedia.org/r/536794 (owner: 10Alex Monk) [18:34:22] (03CR) 10Andrew Bogott: [C: 03+2] deployment-prep: Fix purge_host_regex [puppet] - 10https://gerrit.wikimedia.org/r/536794 (owner: 10Alex Monk) [18:35:05] 10Operations, 10netops: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 (10ayounsi) Eqord: ` Suppressed due to damping: 4 Suppressed due to damping: 4 Suppressed due to damping: 1 Suppressed due to damping: 1 ` eqdfw: ` Suppressed due to da... [18:35:56] (03PS1) 10Ottomata: Run hdfs balancer with threshold of 5% [puppet] - 10https://gerrit.wikimedia.org/r/540457 (https://phabricator.wikimedia.org/T231828) [18:37:20] chaomodus: ^ (Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL) [18:38:27] 10Operations, 10DC-Ops, 10SRE-tools: Host decommission improvements - https://phabricator.wikimedia.org/T231066 (10RobH) [18:38:38] (03CR) 10Ottomata: [C: 03+1] mariadb backups: Include extra valid sections on checking script [puppet] - 10https://gerrit.wikimedia.org/r/538885 (https://phabricator.wikimedia.org/T231208) (owner: 10Jcrespo) [18:38:42] 10Operations, 10DC-Ops, 10SRE-tools: Host decommission improvements - https://phabricator.wikimedia.org/T231066 (10RobH) updated: - Update the [[ https://phabricator.wikimedia.org/maniphest/task/edit/form/52/ | Phabricator template for host decommissioning ]] -- Done by @robh on 2019-10-02 - using https://e... 
[18:38:42] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:38:58] (03PS1) 10Paladox: Gerrit: Tweek replication config [puppet] - 10https://gerrit.wikimedia.org/r/540458 [18:43:25] paladox: Tweak maybe? [18:43:34] hashar that one, yup! [18:43:37] err [18:43:41] wrong ping i meant hauskater :) [18:44:39] 10Operations, 10Analytics, 10Analytics-EventLogging, 10EventBus, and 3 others: Public EventGate endpoint for analytics event intake - https://phabricator.wikimedia.org/T233629 (10Ottomata) [18:46:00] (03PS1) 10Dzahn: install_server: reinstall gerrit1001 with buster [puppet] - 10https://gerrit.wikimedia.org/r/540460 (https://phabricator.wikimedia.org/T222391) [18:46:57] (03PS2) 10Dzahn: install_server: reinstall gerrit1001 with buster [puppet] - 10https://gerrit.wikimedia.org/r/540460 (https://phabricator.wikimedia.org/T222391) [18:47:37] (03PS2) 10Paladox: Gerrit: Tweak replication config [puppet] - 10https://gerrit.wikimedia.org/r/540458 [18:48:24] (03PS3) 10Paladox: Gerrit: Tweak replication config [puppet] - 10https://gerrit.wikimedia.org/r/540458 [18:48:30] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/540458 (owner: 10Paladox) [18:48:54] (03PS1) 10Paladox: Gerrit: Rename "slaves" in replication config to replica [puppet] - 10https://gerrit.wikimedia.org/r/540462 [18:49:12] (03PS2) 10Paladox: Gerrit: Rename "slaves" in replication config to replica [puppet] - 10https://gerrit.wikimedia.org/r/540462 [18:49:35] (03PS3) 10Paladox: Gerrit: Rename "slaves" in replication config to replica [puppet] - 10https://gerrit.wikimedia.org/r/540462 [18:50:46] (03CR) 10Dzahn: [C: 03+2] install_server: reinstall gerrit1001 with buster [puppet] - 10https://gerrit.wikimedia.org/r/540460 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [18:50:50] (03CR) 10Paladox:
"check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/540462 (owner: 10Paladox) [18:50:58] (03PS3) 10Dzahn: install_server: reinstall gerrit1001 with buster [puppet] - 10https://gerrit.wikimedia.org/r/540460 (https://phabricator.wikimedia.org/T222391) [18:55:34] mutante \o/ [18:55:50] ain't "buster" an insult? :? [18:56:02] (03CR) 10Dzahn: ""If pushing to multiple remotes, over differing types of network connections (e.g. LAN and also public Internet), its a good idea to put t" [puppet] - 10https://gerrit.wikimedia.org/r/540164 (owner: 10Paladox) [18:58:03] mutante it should only be temp [18:58:28] we will be reverting as soon as we are ready to make gerrit1001 the primary master [18:59:26] (03PS2) 10Ottomata: Add missing an-worker1088 to hadoop net_topology [puppet] - 10https://gerrit.wikimedia.org/r/540149 (https://phabricator.wikimedia.org/T209929) [18:59:35] (03CR) 10Ottomata: [C: 03+2] Add missing an-worker1088 to hadoop net_topology [puppet] - 10https://gerrit.wikimedia.org/r/540149 (https://phabricator.wikimedia.org/T209929) (owner: 10Ottomata) [19:00:04] marxarelli: How many deployers does it take to do MediaWiki train - American version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191002T1900). [19:01:02] (03CR) 10Paladox: "> "If pushing to multiple remotes, over differing types of network" [puppet] - 10https://gerrit.wikimedia.org/r/540164 (owner: 10Paladox) [19:02:21] (03CR) 10Dzahn: "2 threads - seems to make sense if we have 2 remotes per " Each thread can push one project at a time, to one destination URL. "" [puppet] - 10https://gerrit.wikimedia.org/r/540458 (owner: 10Paladox) [19:03:25] (03PS4) 10Paladox: Gerrit: Tweak replication config [puppet] - 10https://gerrit.wikimedia.org/r/540458 [19:03:34] 10Operations, 10decommission: Decommission analytics1032 - https://phabricator.wikimedia.org/T233080 (10elukey) ` elukey@asw2-c-eqiad# show | compare [edit interfaces interface-range disabled] member ge-7/0/34 { ... 
} + member ge-3/0/12; [edit interfaces] - ge-3/0/12 { - description "analytics1... [19:04:15] (03CR) 10Paladox: "> 2 threads - seems to make sense if we have 2 remotes per " Each" [puppet] - 10https://gerrit.wikimedia.org/r/540458 (owner: 10Paladox) [19:04:24] (03PS4) 10Paladox: Gerrit: Rename "slaves" in replication config to replica [puppet] - 10https://gerrit.wikimedia.org/r/540462 [19:05:33] 10Operations, 10decommission: Decommission analytics1032 - https://phabricator.wikimedia.org/T233080 (10elukey) [19:08:30] (03PS1) 10Herron: logstash: add id to drop action in filter-throttle-errors [puppet] - 10https://gerrit.wikimedia.org/r/540465 (https://phabricator.wikimedia.org/T233739) [19:09:50] (03CR) 10Herron: "in retrospect this should be more useful in terms of metrics/monitoring than the throttle filters themselves" [puppet] - 10https://gerrit.wikimedia.org/r/540465 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron) [19:12:59] MatmaRex: I should wait for marxarelli to be done deploying the train before screwing around in production, if that's OK. :-) [19:13:23] oh right. heh [19:14:47] James_F, MatmaRex: rolling it now [19:14:55] * James_F stands well back. 
;-) [19:15:06] (03CR) 10Herron: [C: 03+2] logstash: add id to drop action in filter-throttle-errors [puppet] - 10https://gerrit.wikimedia.org/r/540465 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron) [19:16:30] (03PS1) 10BBlack: Add Digicert 2019 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/540469 (https://phabricator.wikimedia.org/T209515) [19:16:32] (03PS1) 10BBlack: Deploy inactive digicert 2019 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/540470 (https://phabricator.wikimedia.org/T209515) [19:18:04] (03CR) 10BBlack: [C: 03+2] Add Digicert 2019 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/540469 (https://phabricator.wikimedia.org/T209515) (owner: 10BBlack) [19:20:07] (03PS1) 10Dduvall: group1 wikis to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540471 [19:20:09] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540471 (owner: 10Dduvall) [19:21:01] (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540471 (owner: 10Dduvall) [19:21:43] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ge... 
[19:22:18] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.25 [19:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:17] !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.25 (duration: 00m 59s) [19:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:48] 10Operations, 10SRE-Access-Requests: Requesting access to analytics cluster for Djellel Difallah - https://phabricator.wikimedia.org/T234473 (10DED) [19:33:41] !log 1.34.0-wmf.25 promoted to group1, cc: T220750. no rise in relevant error rates [19:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:50] T220750: 1.34.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T220750 [19:34:54] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:04] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:13] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['gerrit1001.wikimedia.org'] ` Of which those **FA... [19:43:11] marxarelli: LGTM. [19:43:12] (03PS1) 10Jgreen: Adjust nsca_frack.cfg.erb for new approach to monitoring endpoints. [puppet] - 10https://gerrit.wikimedia.org/r/540472 (https://phabricator.wikimedia.org/T212252) [19:43:27] James_F: same! [19:43:52] (03CR) 10Dzahn: "> Patch Set 2: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/540252 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [19:44:30] (03CR) 10Jgreen: [C: 03+2] Adjust nsca_frack.cfg.erb for new approach to monitoring endpoints. 
[puppet] - 10https://gerrit.wikimedia.org/r/540472 (https://phabricator.wikimedia.org/T212252) (owner: 10Jgreen) [19:47:09] !log deployed icinga fundraising-nsca collection configuration change [19:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:02] !log puppetmaster1001 - sudo puppet cert clean parsoid.discovery.wmnet (only created yesterday but does not have all the SANs it needs, updating with more SANs) (T233654) [19:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:06] T233654: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 [20:00:04] cscott, arlolra, subbu, bearND, halfak, and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191002T2000). [20:04:22] no parsoid deploy today [20:05:05] subbu: working on creating the needed certificates for parsoid/PHP on appservers [20:05:39] ok! [20:05:41] we need one that has ALL the alt. names on it used by appservers PLUS parsoid.discovery and parsoid.svc. [20:05:55] that's a lot of names [20:06:07] but after that i should be able to apply a change on parsoid2001 [20:06:27] and turn it into the first "hybrid" one [20:08:39] Jeff_Green: Do you know off-hand if any of the FR code is still running on HHVM? It's the last bit of CI with it. [20:13:42] OK, I'm going to be mucking with unmerged patches in prod, please no unexpected scaps. ;-) [20:14:35] James_f we never ran HHVM [20:14:50] Jeff_Green: Aha, excellent. Thank you! [20:14:57] np! [20:15:02] RECOVERY - Host db2112.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 47.57 ms [20:15:08] RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 39.23 ms [20:15:17] MatmaRex: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/540464 is live on mwdebug1002. 
[20:15:48] RECOVERY - Host db2077.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.40 ms [20:16:35] James_F: thanks. well, that seems to get logged [20:16:56] (i'm looking at https://logstash.wikimedia.org/goto/78fdce6b25619569e5929c3d60eae4ee) [20:17:03] RECOVERY - Host db2087.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.33 ms [20:17:05] RECOVERY - Host elastic2031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.26 ms [20:17:05] RECOVERY - Host db2125.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.41 ms [20:17:09] RECOVERY - Host ganeti2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.61 ms [20:17:27] RECOVERY - Host mc2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms [20:17:33] RECOVERY - Host mc2027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.70 ms [20:17:33] RECOVERY - Host restbase2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms [20:17:53] RECOVERY - Host restbase2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.15 ms [20:17:57] RECOVERY - Host cloudservices2002-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.68 ms [20:17:59] RECOVERY - Host cloudcontrol2003-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.73 ms [20:19:11] MatmaRex: Both warning and info, yeah. [20:19:20] RECOVERY - Host es2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.65 ms [20:19:31] MatmaRex: Are you looking for things getting into kibana or into something else? [20:19:40] papaul: ^ mgmt switch ? [20:19:47] RECOVERY - Host ganeti2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.05 ms [20:20:03] RECOVERY - Host es2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms [20:20:03] RECOVERY - Host ores2005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms [20:20:03] RECOVERY - Host pc2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.06 ms [20:20:29] James_F: yes. 
i'm trying to figure out how come i don't see the logs there generated by this code: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/540428/1/includes/ApiVisualEditorEdit.php [20:20:55] MatmaRex: Are you sure the if statement is triggering? [20:21:32] (And does ApiVisualEditorEdit have the scope for the logger from ApiVisualEditor? Presumably?) [20:22:18] as sure as one can be. it worked for me locally (writing logs to $wgDebugLogFile) [20:22:43] Hmm. [20:22:51] Does it only trigger very rarely? [20:23:04] Or with some crappy browser/network proxy? [20:23:16] for example this query should trigger it: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=visualeditoredit&format=json&paction=serializeforcache&page=User%3AMatma_Rex%2Fsandbox&token=&html=asdfasdfasdf<%2Fb>&etag=test [20:23:31] yes, it shouldn't trigger normally [20:26:07] (I'm resetting mwdebug1002 to origin/wmf/1.34.0-wmf.24.) [20:26:59] ok [20:27:29] Why are some events shown as from the "VisualEditor" channel and some from the "visualeditor" channel? [20:27:42] Is it just that we're setting the wrong case-sensitive string in the match? [20:28:39] The "called on 'The Fighting Temeraire' with paction: 'parsefragment'" message comes from `visualeditor` but the logger ones go to `VisualEditor`. [20:29:21] James_F: the lowercase is generated by `wfDebugLog( 'visualeditor', .... )`, which i didn't notice there before [20:29:38] James_F: and it's only logged with the WikimediaDebug extension with "Log" option enabled, which i tested accidentally once earlier today [20:30:04] i don't know why it shows up in that search, i guess the search is case-insensitive [20:30:23] Kibana is. [20:30:36] udp2log isn't, I think? [20:30:41] (03CR) 10MarcoAurelio: Grant autocreateaccount to everyone on closed wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540380 (https://phabricator.wikimedia.org/T222117) (owner: 10Urbanecm) [20:30:49] Meh.
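Editorial sketch of the channel-name confusion discussed above, where a search for "VisualEditor" also returned events from the lowercase "visualeditor" channel. It only illustrates the difference between exact and case-insensitive matching; the record layout and the match_channel helper are hypothetical and do not describe how Kibana or udp2log actually store or match channels.

```python
def match_channel(records, wanted, case_sensitive):
    """Return the records whose channel matches `wanted`."""
    if case_sensitive:
        return [r for r in records if r["channel"] == wanted]
    # casefold() gives a case-insensitive comparison
    return [r for r in records if r["channel"].casefold() == wanted.casefold()]


# Hypothetical sample events, one per channel spelling seen in the log.
records = [
    {"channel": "VisualEditor", "message": "logger call"},
    {"channel": "visualeditor", "message": "wfDebugLog-style call"},
    {"channel": "AbuseFilter", "message": "unrelated"},
]

exact = match_channel(records, "VisualEditor", case_sensitive=True)
loose = match_channel(records, "VisualEditor", case_sensitive=False)
print(len(exact), len(loose))
```

A case-insensitive search returns both spellings, which would explain why entries from the older lowercase channel appeared alongside the new ones.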
[20:33:42] (03PS3) 10Brennen Bearnes: mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T222494) [20:33:50] (03CR) 10jerkins-bot: [V: 04-1] mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T222494) (owner: 10Brennen Bearnes) [20:35:22] (03PS4) 10Brennen Bearnes: mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391) [20:35:29] (03CR) 10jerkins-bot: [V: 04-1] mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391) (owner: 10Brennen Bearnes) [20:35:37] (03CR) 10MarcoAurelio: "Closed wikis are candidates to be deleted at some point in the future. I note that one of the steps to delete a wiki, according to James_F: are you busy or do you have time to fiddle with this? i guess i can add a lot more logging [20:36:34] MatmaRex: I'm a bit busy, but I can deploy local hacks every few minutes to help out. [20:37:11] I'm just removing all traces of HHVM from CI, no big deal. ;-) [20:38:14] MatmaRex: Did you see the events that made it through right after we stopped testing? 
[20:38:39] (03CR) 10Herron: [C: 03+1] hieradata: bump shard size threshold for logstash [puppet] - 10https://gerrit.wikimedia.org/r/540444 (owner: 10Filippo Giunchedi) [20:38:50] Krinkle: yes, they were unrelated, from some old logging code i didn't even know was there [20:38:56] ok [20:38:58] (03PS3) 10Dzahn: add certificate for parsoid.discovery/parsoid.svc [puppet] - 10https://gerrit.wikimedia.org/r/540252 (https://phabricator.wikimedia.org/T233654) [20:39:12] (03PS5) 10Brennen Bearnes: mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391) [20:39:16] there's a wfDebugLog( 'visualeditor', … ) call elsewhere, and i tested with WikimediaDebug with "Log" enabled, so it got logged [20:41:32] (03CR) 10Dzahn: [C: 03+2] "i followed the docs how to update it and dropped the old cert from puppet CA. then added the new one with all these SANs in private repo." [puppet] - 10https://gerrit.wikimedia.org/r/540252 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [20:41:52] (03PS4) 10Dzahn: add certificate for parsoid.discovery/parsoid.svc [puppet] - 10https://gerrit.wikimedia.org/r/540252 (https://phabricator.wikimedia.org/T233654) [20:41:56] James_F: i updated https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/540464 , can you deploy it? [20:43:37] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: bump shard size threshold for logstash [puppet] - 10https://gerrit.wikimedia.org/r/540444 (owner: 10Filippo Giunchedi) [20:43:38] MatmaRex: Done. [20:43:45] (03PS2) 10Filippo Giunchedi: hieradata: bump shard size threshold for logstash [puppet] - 10https://gerrit.wikimedia.org/r/540444 [20:44:24] okay, this is illuminating [20:44:25] so [20:44:38] if i pass the second parameter to info(), nothing gets logged [20:44:47] (03CR) 10Herron: [C: 03+1] "Looks good to me! 
One minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539978 (https://phabricator.wikimedia.org/T207200) (owner: 10Filippo Giunchedi) [20:47:28] (03CR) 10Krinkle: mediawiki-dev: use wikimedia/mediawiki-core:dev (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391) (owner: 10Brennen Bearnes) [20:47:35] i don't understand why [20:49:32] 10Operations, 10ops-eqiad, 10Patch-For-Review: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (10Jclark-ctr) Do we want 9 host Row D only has 2 10g racks Row A: 1053, 1054 (2 nodes, try to avoid A3, then avoid A6) Row B: 1055 (1 node), Avoid... [20:51:32] Krinkle: James_F: do you have any idea why the message would completely disappear when i pass the second argument? (the array with extra data) [20:52:17] did i make some insane typo somewhere [20:53:29] MatmaRex: link or paste of the code? [20:53:57] Krinkle: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/540464/2/includes/ApiVisualEditorEdit.php line 280 [20:54:08] line 281 and 282 * [20:54:08] Krinkle: the first one is recorded, the second is not [20:54:59] i can see the first one here, but not the second: https://logstash.wikimedia.org/goto/8484a8c94a8851a60d89ef17a1d2d776 "ApiVisualEditorEdit::postData: Testing test" [20:55:12] (03CR) 10Brennen Bearnes: mediawiki-dev: use wikimedia/mediawiki-core:dev (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391) (owner: 10Brennen Bearnes) [20:56:51] (03PS5) 10Paladox: Gerrit: Tweak replication config [puppet] - 10https://gerrit.wikimedia.org/r/540458 [20:57:00] (03PS5) 10Paladox: Gerrit: Rename "slaves" in replication config to replica [puppet] - 10https://gerrit.wikimedia.org/r/540462 [20:57:40] (03PS6) 10Paladox: Gerrit: Rename "slaves" in replication config to replica_codfw [puppet] - 
10https://gerrit.wikimedia.org/r/540462 [20:57:46] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/540462 (owner: 10Paladox) [20:58:54] MatmaRex: What happens if you try to log the array without the string? [20:58:55] (03PS1) 10Herron: logstash: remove filter/throttle/normalized_message_drop id [puppet] - 10https://gerrit.wikimedia.org/r/540482 [20:59:22] James_F: i don't know, but we can try if you want to deploy it [20:59:29] wait, i'm not sure what you mean [20:59:37] $this->logger->info( __METHOD__ . ": Testing {etag}", [] ); ? [20:59:41] $this->logger->info( __METHOD__ . ": Testing", [ 'etag' => $etag ] ); ? [20:59:43] (03PS3) 10Jeena Huneidi: [DNM] Update scaffold template names to use chart name [deployment-charts] - 10https://gerrit.wikimedia.org/r/539220 [20:59:51] MatmaRex: hm. nothing stands out as odd. Maybe it's being deduplicated somehow, but seems unlikely. Another might be to ensure it is a string with 'etag => "$etag", just in case that matters. [20:59:58] No, I meant `$this->logger->info( [ 'etag' => $etag ] );` [21:00:10] first param has to be message string. [21:00:23] Oh, yeah, maybe it's type validation failing? 
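[editor's note] The two variants quoted above differ only in where the value travels: inside the message string, or in the context array passed as the second argument. Below is a minimal Python sketch of PSR-3-style placeholder interpolation, as a stand-in for the PHP logger under discussion. The names `interpolate` and `build_record` are hypothetical and are not MediaWiki or Logstash APIs. It shows only how a `{etag}` placeholder relates to a context key. It does not explain why the records with a non-empty context went missing, which the log does not establish.

```python
import re


def interpolate(message, context):
    """Replace {key} placeholders with values from context (PSR-3 style).

    Placeholders with no matching context key are left untouched.
    """
    def substitute(match):
        key = match.group(1)
        return str(context[key]) if key in context else match.group(0)

    return re.sub(r"\{([A-Za-z0-9_.]+)\}", substitute, message)


def build_record(channel, message, context):
    """Hypothetical structured record: rendered message plus the raw context."""
    return {
        "channel": channel,
        "message": interpolate(message, context),
        "context": dict(context),
    }


# Variant with the value carried in the context array.
with_context = build_record("VisualEditor", "postData: Testing {etag}", {"etag": "test"})

# Variant with an empty context: the placeholder stays in the message as-is.
without_context = build_record("VisualEditor", "postData: Testing {etag}", {})
```

In this sketch the context is kept as a separate field next to the rendered message, so a downstream consumer sees both the text and the key/value data. Putting everything in the message string, the workaround mentioned later in the log, gives up the separate field.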
[21:00:40] but we do sometimes pass empty string as first param to send key/value context only [21:00:50] which works, although not very commonly used [21:01:29] (03CR) 10Herron: [C: 03+2] logstash: remove filter/throttle/normalized_message_drop id [puppet] - 10https://gerrit.wikimedia.org/r/540482 (owner: 10Herron) [21:02:06] James_F: Krinkle: i updated https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/VisualEditor/+/540464 with a few different variants, if you want to try [21:02:11] (03CR) 10Jeena Huneidi: "> Patch Set 2:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/539220 (owner: 10Jeena Huneidi) [21:02:52] i guess the practical solution is to put it all in the message string, since that works [21:02:56] that will teach me not to try to be fancy [21:03:01] structured logging my ass [21:03:14] MatmaRex: On mwdebug1002. [21:03:56] !log gerrit1001 changing UID of gerrit2 user to 114 and GID to 119 in /etc/passwd to match cobalt to avoid privilege issues after rsyncing data (T222391) [21:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:02] T222391: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 [21:04:53] so line 287 and 288 also work, those that have the empty array [21:05:01] but that's obviously unhelpful [21:08:29] !log gerrit1001 changing GID of gerrit2 user to 119 in /etc/group ; find / -uid 499 -exec chown gerrit2 {} \; find / -gid 1001 -exec chown gerrit2:gerrit2 {} \; (T222391) [21:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:50] !log gerrit1001 - rebooting [21:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:15] 10Operations, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) 
- https://phabricator.wikimedia.org/T224585 (10bd808) [21:14:29] 10Operations, 10cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (10bd808) [21:14:37] 10Operations, 10cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10bd808) [21:15:01] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) >>! In T222391#5542166, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ` > ['gerri... [21:15:19] (03PS1) 10Herron: logstash: output mediawiki type to logstash-medaiwiki ES index [puppet] - 10https://gerrit.wikimedia.org/r/540486 [21:16:14] (03PS2) 10Herron: logstash: output mediawiki type to logstash-medaiwiki ES index [puppet] - 10https://gerrit.wikimedia.org/r/540486 [21:16:55] Krinkle: James_F: how do you feel about doing this, then: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/540487 (let's properly merge and deploy this one everywhere) [21:17:06] !log cobalt (gerrit) rsyncing /srv/gerrit/git and /srv/gerrit/plugins data to gerrit1001 again after reinstall and fixing gerrit2 UID/GID (T222391) [21:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:09] T222391: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 [21:17:35] MatmaRex: C+2'ed. [21:18:31] James_F: actually that has an unmerged parent patch, that was already backported and deployed, do that one too? [21:18:58] Oh, sure. [21:19:10] Preserve the log. [21:27:11] (03CR) 10Cwhite: [C: 03+1] "Yes please!" 
[puppet] - 10https://gerrit.wikimedia.org/r/540486 (owner: 10Herron) [21:29:23] (03Abandoned) 10Herron: kafka-main: move kafka1001 to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/534633 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [21:29:52] (03Abandoned) 10Herron: prometheus: add per-site systemd failed unit checks [puppet] - 10https://gerrit.wikimedia.org/r/536642 (https://phabricator.wikimedia.org/T230570) (owner: 10Herron) [21:30:29] (03Abandoned) 10Herron: check_systemd_state: downgrade 'degraded' status to warning [puppet] - 10https://gerrit.wikimedia.org/r/530442 (https://phabricator.wikimedia.org/T230570) (owner: 10Herron) [21:51:02] looks like mx1001's queue is unhappy, lotsa deferred [22:03:31] MatmaRex: Proper patch is now live on mwdebug1002 (both wmf.24 and wmf.25). [22:04:09] thanks, testing [22:04:37] James_F: nice, it actually finally works [22:04:42] Woo-hoo. [22:04:45] Good to sync? [22:05:40] James_F: yes please [22:05:52] Kk. [22:06:47] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.25/extensions/VisualEditor/includes/ApiVisualEditor.php: VE unstructured logging, part I (duration: 01m 00s) [22:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:15] (03PS13) 10Dzahn: gerrit: add role on gerrit1001 and remove gerrit::migration [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [22:07:20] \o/ [22:07:36] (03CR) 10Thcipriani: [C: 03+1] Gerrit: Rename "slaves" in replication config to replica_codfw [puppet] - 10https://gerrit.wikimedia.org/r/540462 (owner: 10Paladox) [22:09:17] Jeff_Green dwisehaupt looks like some process on frdev1001 is spamming fr-tech-ops@ with messages, Subject: ALERT: psad DL3 [...] 
[22:09:54] godog: they're alerts from a service that monitors iptables activity [22:09:58] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.25/extensions/VisualEditor/includes/ApiVisualEditorEdit.php: VE unstructured logging, part II (duration: 00m 58s) [22:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:24] the alerts are valid, but it's way too noisy [22:10:46] indeed, google is rate limiting delivering to your addresses atm [22:11:05] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.24/extensions/VisualEditor/includes/ApiVisualEditor.php: VE unstructured logging, part I (duration: 00m 59s) [22:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:08] 40k+ emails queued, that's noisy alright :) [22:11:14] good lord [22:11:27] all the emails. :) [22:11:35] James_F: thank you! [22:11:57] I was hoping it would resolve but I guess we'll have to turn the service off until we can figure out the underlying problem [22:12:15] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.24/extensions/VisualEditor/includes/ApiVisualEditorEdit.php: VE unstructured logging, part II (duration: 00m 58s) [22:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:21] MatmaRex: Should be all done. [22:13:35] (03CR) 10Thcipriani: [C: 03+1] "Lowering replication delay from the default seems like a good thing." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540458 (owner: 10Paladox) [22:15:47] (03CR) 10Paladox: Gerrit: Tweak replication config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540458 (owner: 10Paladox) [22:17:56] Jeff_Green: LMK when done, I can remove the queued messages from root@frdev1001.frack.eqiad.wmnet [22:18:08] Production is clean. [22:18:22] godog: it will take about 10 min to propagate [22:18:31] but it's already in the puppet pipeline [22:18:44] are you going to purge them or flush the queue? 
[22:20:21] Jeff_Green: purge, flushing the queue won't do much I think since it is google that's rate limiting delivery [22:20:35] (03PS1) 10Paladox: Gerrit: Change host to gerrit-new.wikimedia.org on gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/540499 [22:20:52] if you purge now, it will probably take care of most of the issue, this has been a day-long accumulation of mail [22:21:01] so 5-10 minutes more won't be a huge quantity [22:21:30] ok trying that now [22:21:50] (03PS2) 10Paladox: Gerrit: Change host to gerrit-new.wikimedia.org on gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/540499 [22:22:00] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/540499 (owner: 10Paladox) [22:22:19] (03CR) 10Dzahn: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/540499 (owner: 10Paladox) [22:23:14] (03PS3) 10Dzahn: Gerrit: Change host to gerrit-new.wikimedia.org on gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/540499 (owner: 10Paladox) [22:24:30] (03CR) 10Dzahn: [C: 03+2] Gerrit: Change host to gerrit-new.wikimedia.org on gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/540499 (owner: 10Paladox) [22:25:36] (03PS14) 10Paladox: gerrit: add role on gerrit1001 and remove gerrit::migration [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) [22:25:44] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [22:25:52] (03PS6) 10Paladox: Gerrit: Get cobalt to replicate to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/540164 [22:27:05] (03CR) 10Dzahn: "this was a follow-up to looking at the catalog for applying the gerrit role https://puppet-compiler.wmflabs.org/compiler1002/18718/gerrit1" [puppet] - 10https://gerrit.wikimedia.org/r/540499 (owner: 10Paladox) [22:28:20] (03PS1) 10Paladox: Gerrit: Add gerrit-new.wikimedia.org to acme [puppet] - 
10https://gerrit.wikimedia.org/r/540500 [22:28:42] (03PS2) 10Paladox: Gerrit: Add gerrit-new.wikimedia.org to acme [puppet] - 10https://gerrit.wikimedia.org/r/540500 [22:29:39] !log remove queued messages from mx1001 for fr-tech-ops@, triggering sender rate limit from gmail [22:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:26] (03CR) 10Thcipriani: "> > "If pushing to multiple remotes, over differing types of network" [puppet] - 10https://gerrit.wikimedia.org/r/540164 (owner: 10Paladox) [22:35:55] (03CR) 10Paladox: "> > > "If pushing to multiple remotes, over differing types of" [puppet] - 10https://gerrit.wikimedia.org/r/540164 (owner: 10Paladox) [22:36:23] (03PS7) 10Paladox: Gerrit: Get cobalt to replicate to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/540164 [22:36:29] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/540164 (owner: 10Paladox) [22:39:25] dwisehaupt: I see Jeff is offline, anyways should be better now [22:40:04] godog: wonderful. thanks for catching that, letting us know, and killing the tidal wave. :) [22:40:40] dwisehaupt: hehe np, it caught my eye while looking at icinga [22:48:18] (03CR) 10Alex Monk: "I removed the cherry-pick, ran puppet everywhere, and ferm failed on deployment-imagescaler01, deployment-memc0[56], and deployment-ircd. " [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T153468) (owner: 10Hashar) [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191002T2300). [23:00:04] ebernhardson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:01:30] RECOVERY - exim queue on mx1001 is OK: OK: Less than 1000 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim [23:01:30] \o [23:01:56] (03PS2) 10Bstorm: dumps distribution: fail labstore1007 back as VPS NFS [puppet] - 10https://gerrit.wikimedia.org/r/540238 [23:02:07] i'll ship [23:03:58] (03CR) 10Bstorm: [C: 03+2] dumps distribution: fail labstore1007 back as VPS NFS [puppet] - 10https://gerrit.wikimedia.org/r/540238 (owner: 10Bstorm) [23:06:50] (03CR) 10Dzahn: [C: 03+2] Gerrit: Add gerrit-new.wikimedia.org to acme [puppet] - 10https://gerrit.wikimedia.org/r/540500 (owner: 10Paladox) [23:06:57] \o/ [23:06:58] (03PS3) 10Dzahn: Gerrit: Add gerrit-new.wikimedia.org to acme [puppet] - 10https://gerrit.wikimedia.org/r/540500 (owner: 10Paladox) [23:07:03] (03PS6) 10Bstorm: toolforge-kubernetes: restructure pod security policies [puppet] - 10https://gerrit.wikimedia.org/r/537732 (https://phabricator.wikimedia.org/T227290) [23:08:57] (03CR) 10Bstorm: [C: 03+2] "I *think* deleting those (which must be done to make tools psp's work) addresses all concerns for now. 
I'll merge this, and we can move o" [puppet] - 10https://gerrit.wikimedia.org/r/537732 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [23:09:10] (03PS7) 10Bstorm: toolforge-kubernetes: restructure pod security policies [puppet] - 10https://gerrit.wikimedia.org/r/537732 (https://phabricator.wikimedia.org/T227290) [23:10:31] (03PS15) 10Paladox: gerrit: add role on gerrit1001 and remove gerrit::migration [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) [23:21:21] !log ebernhardson@deploy1001 Synchronized php-1.34.0-wmf.25/extensions/CirrusSearch/: T234445: CirrusSearch: Fix Precondition failed: Must have a resultset set (duration: 01m 02s) [23:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:25] T234445: Error when searching for exact phrase on English Wikipedia: "Precondition failed: Must have a resultset set" - https://phabricator.wikimedia.org/T234445 [23:21:59] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [23:22:30] !log ebernhardson@deploy1001 Synchronized php-1.34.0-wmf.24/extensions/CirrusSearch/: T234445: CirrusSearch: Fix Precondition failed: Must have a resultset set (duration: 01m 00s) [23:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:46] with that, SWAT should be complete [23:29:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10fgiunchedi) >>! In T227867#5536367, @Dzahn wrote: > self-healing?? > > <+icinga-wm> RECOVERY - Memory correctable errors -EDAC- on mw1239 is OK: (C)4 ge (W)2 ge 1 Not self healing no in... 
[23:32:27] (03PS7) 10Bstorm: toolforge-k8s: proposed role for all tools [puppet] - 10https://gerrit.wikimedia.org/r/537755 (https://phabricator.wikimedia.org/T227290) [23:34:35] 10Operations, 10ops-eqiad, 10DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (10Cmjohnson) 05Open→03Resolved @marostegui yes the board was replaced. Sorry about that I left that to John and the task was not closed. [23:38:21] !log disable cr2-eqiad:xe-4/0/0 - T234416 [23:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:24] T234416: asw2-a-eqiad <-> cr2-eqiad fiber issue - https://phabricator.wikimedia.org/T234416 [23:38:40] (03PS16) 10Paladox: gerrit: add role on gerrit1001 and remove gerrit::migration [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) [23:41:28] (03CR) 10Bstorm: [C: 03+2] toolforge-k8s: proposed role for all tools [puppet] - 10https://gerrit.wikimedia.org/r/537755 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [23:41:59] !log enable cr2-eqiad:xe-4/0/0 - T234416 [23:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:28] (03PS4) 10Cwhite: initial commit [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 [23:43:42] 10Operations, 10ops-eqiad, 10netops: asw2-a-eqiad <-> cr2-eqiad fiber issue - https://phabricator.wikimedia.org/T234416 (10ayounsi) 05Open→03Resolved Better! ` ayounsi@asw2-a-eqiad> show interfaces diagnostics optics xe-7/0/46 | match "rx|receive" Receiver signal average optical power : 0.0741... 
[23:47:11] (03PS17) 10Dzahn: gerrit: add role on gerrit1001 and remove gerrit::migration [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [23:50:25] (03CR) 10Dzahn: [C: 03+2] gerrit: add role on gerrit1001 and remove gerrit::migration [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [23:51:15] \o/ [23:53:20] yolo, but we double checked there should be no more IP conflicts [23:53:59] paladox: jdk8 installs [23:54:06] yay! [23:54:09] then some cert issue with the backups..we'll see [23:54:16] oh [23:54:19] scap config gets written.. [23:55:07] :) [23:55:50] on the second run it gets the second v6 IP [23:56:05] and that's when you disconnect [23:56:47] oh [23:56:58] 10Operations, 10ops-eqiad, 10netops: asw2-a-eqiad <-> cr2-eqiad fiber issue - https://phabricator.wikimedia.org/T234416 (10Cmjohnson) I swapped both optics [23:59:27] 10Operations, 10ops-eqiad: apply hostname labels for krb1001/WMF5173 - https://phabricator.wikimedia.org/T233642 (10Cmjohnson) 05Open→03Resolved done [23:59:32] 10Operations, 10Analytics, 10User-Elukey: setup/install krb1001/WMF5173 - https://phabricator.wikimedia.org/T233141 (10Cmjohnson)