[00:00:04] RECOVERY - Host mw2327 is UP: PING OK - Packet loss = 0%, RTA = 36.28 ms
[00:00:05] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200207T0000).
[00:00:05] Kemayo and MatmaRex: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:25] oh. this is already done
[00:00:35] sorry for the pings, i forgot to remove it
[00:00:36] Sorry, yes, James got it already.
[00:03:37] PROBLEM - Disk space on mw2327 is CRITICAL: connect to address 10.192.16.196 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw2327&var-datasource=codfw+prometheus/ops
[00:03:39] PROBLEM - MD RAID on mw2327 is CRITICAL: connect to address 10.192.16.196 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:03:55] PROBLEM - DPKG on mw2327 is CRITICAL: connect to address 10.192.16.196 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[00:03:59] PROBLEM - configured eth on mw2327 is CRITICAL: connect to address 10.192.16.196 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[00:03:59] PROBLEM - dhclient process on mw2327 is CRITICAL: connect to address 10.192.16.196 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[00:04:29] PROBLEM - Check systemd state on mw2327 is CRITICAL: connect to address 10.192.16.196 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:58] looks at 2327
[00:07:24] ah. one of the new installs..nvm
[00:08:58] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[00:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:17] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[00:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:53] RECOVERY - DPKG on mw2327 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[00:10:59] RECOVERY - configured eth on mw2327 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[00:10:59] RECOVERY - dhclient process on mw2327 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[00:11:15] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[00:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:50] RECOVERY - Disk space on mw2327 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw2327&var-datasource=codfw+prometheus/ops
[00:13:28] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[00:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:59] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2326.codfw.wmnet'] ` and were **ALL** successful.
[00:18:11] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2327.codfw.wmnet'] ` and were **ALL** successful.
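Editor's note on the alerts above: the repeated "connect to address … port 5666: Connection refused" output means Icinga's NRPE checks reached the host but nothing was listening on the NRPE port (5666) yet, which is expected mid-reimage and distinct from a host that is down (timeout). A minimal sketch of that distinction, assuming a generic TCP probe; the host value is a placeholder, not production tooling:

```python
import socket

def probe_nrpe(host: str, port: int = 5666, timeout: float = 5.0) -> str:
    """Classify a TCP probe the way the checks above read.

    'ok'      -> the port accepted a connection
    'refused' -> host reachable but nothing listening (the
                 "Connection refused" state seen during the reimage)
    'timeout' -> no answer at all (host likely down or filtered)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        return "refused"
    except (socket.timeout, OSError):
        return "timeout"
```

In the log, "refused" showed up while mw2327 was mid-install, and the checks flipped to RECOVERY once the NRPE daemon was back.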
[00:22:19] Operations, Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (Krenair) >>! In T243226#5855852, @jbond wrote: > This is similar to productions which still has `issuer=CN = Puppet CA: palladium.eqiad.wmnet`. Tha...
[00:22:50] RECOVERY - Check systemd state on mw2327 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:44] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2328.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[01:13:17] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2329.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[01:24:47] !log eqsin pdu work ongoing starting now. ps1-603 swapping per T242250
[01:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:24:50] T242250: rack/setup/install ps[12]-60[34]-eqsin - https://phabricator.wikimedia.org/T242250
[01:25:44] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[01:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:59] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[01:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:28:04] Operations, Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (Krenair) Got puppet running on -cache-text05, whole beta cluster broke, fixed acme-chief and ATS, going to sleep.
[01:29:33] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[01:29:33] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[01:29:49] PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[01:30:33] PROBLEM - Juniper alarms on cr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - 4 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[01:30:59] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:31:37] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:31:41] PROBLEM - OSPF status on cr1-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:32:41] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2328.codfw.wmnet'] ` and were **ALL** successful.
[01:33:12] expected
[01:33:17] mr1-eqsin is a single power feed device
[01:33:29] and the other is alarming that it lost a power feed as a router
[01:33:43] maybe...
[01:34:07] Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (ArthurPSmith) >>! In T243701#5855439, @ArielGlenn wrote: >>>! In T243701#5855352, @Lea_Lacroix_WMDE wro...
[01:34:42] XioNoX: You about?
[01:35:41] robh: what's up?
[01:35:47] see dcops
[01:36:52] looking
[01:42:55] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:43:43] RECOVERY - Juniper alarms on cr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[01:46:52] so the mr1-eqsin is booting back up
[01:48:35] RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 260.90 ms
[01:48:37] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 234.88 ms
[01:48:37] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 258.71 ms
[01:48:59] RECOVERY - OSPF status on cr1-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:56:19] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:03:29] PROBLEM - Juniper alarms on cr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - 2 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[02:13:32] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2329.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2329.codfw.wmnet'] `
[02:14:39] RECOVERY - Juniper alarms on cr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[02:15:37] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2329.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[02:15:45] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 46 probes of 521 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:20:49] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 32 probes of 521 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:27:54] expected
[02:28:08] the atlas is a single power feed device and won't stay up for power maint
[02:28:28] i don't silence these since i want to see them come up and down
[02:28:50] (i silence pages if i expect pages but not irc echos)
[02:32:01] PROBLEM - Host re0.cr2-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[02:32:19] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2330.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[02:32:53] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 84, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:33:46] PROBLEM - Host cp5004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:33:46] PROBLEM - Host cp5006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:33:46] PROBLEM - Host cp5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:33:53] PROBLEM - Host cp5010.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:33:53] PROBLEM - Host cp5012.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:33:53] PROBLEM - Host cp5008.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:34:09] PROBLEM - Host dns5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:34:11] PROBLEM - Host ganeti5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:34:11] PROBLEM - Host lvs5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:34:11] PROBLEM - Host lvs5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:35:58] expected
[02:36:04] the msw is losing power in 604
[02:37:31] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 86, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:37:41] RECOVERY - Host re0.cr2-eqsin is UP: PING OK - Packet loss = 0%, RTA = 256.04 ms
[02:39:29] RECOVERY - Host cp5004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 236.02 ms
[02:39:29] RECOVERY - Host cp5006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 236.00 ms
[02:39:29] RECOVERY - Host cp5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 260.67 ms
[02:39:35] RECOVERY - Host cp5010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 235.93 ms
[02:39:36] RECOVERY - Host cp5012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 235.93 ms
[02:39:36] RECOVERY - Host cp5008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 260.63 ms
[02:39:53] RECOVERY - Host dns5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 260.87 ms
[02:39:55] RECOVERY - Host ganeti5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 236.29 ms
[02:39:55] RECOVERY - Host lvs5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 235.79 ms
[02:39:55] RECOVERY - Host lvs5003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 235.75 ms
[02:44:36] PROBLEM - Host cr2-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[02:46:41] robh: that last one paged, confirming it's still expected and no action needed?
[02:46:49] the last one is not expected
[02:46:53] not necessarily ideal
[02:46:54] we're investigating and discussing in -dcops
[02:46:58] but we're not down yet
[02:46:58] ack
[02:47:02] robh: not expected but not critical for now
[02:47:04] thank you for checking
[02:47:04] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:47:07] need anything?
[02:47:09] PROBLEM - LVS HTTP IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[02:47:17] ospf on cr4-ulsfo is just related to the cr2-eqsin death
[02:47:25] here. scheduled work?
[02:47:31] got it
[02:47:32] and honestly the text-lb flap above, we had this during the planned cr2-eqsin earlier too
[02:47:54] Scheduled work, unintended but not unexpected alarm.
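The host up/down notifications above all share one status format ("PING OK - Packet loss = 0%, RTA = 260.87 ms" during recovery, "PING CRITICAL - Packet loss = 100%" when down, with no RTA). When grepping a log like this one, a small parser helps; this is an illustrative sketch for post-hoc log analysis, not part of the monitoring stack itself:

```python
import re

# Matches the ping status text as it appears in the log, e.g.
# "PING OK - Packet loss = 0%, RTA = 260.87 ms"
# "PING CRITICAL - Packet loss = 100%"  (no RTA when the host is down)
PING_RE = re.compile(
    r"PING (?P<state>OK|CRITICAL) - Packet loss = (?P<loss>\d+)%"
    r"(?:, RTA = (?P<rta>[\d.]+) ms)?"
)

def parse_ping(line: str):
    """Extract state, packet loss, and RTA from an Icinga ping status line."""
    m = PING_RE.search(line)
    if not m:
        return None
    return {
        "state": m.group("state"),
        "loss": int(m.group("loss")),
        "rta_ms": float(m.group("rta")) if m.group("rta") else None,
    }
```

For example, feeding it the dns5002.mgmt recovery line above yields state "OK", 0% loss, and an RTA of 260.87 ms.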
[02:47:58] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /v4/marker/pin-m+ffffff@2x.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /_info (Untitled test) timed out before a response was received: /v4/marker/pin-m+ffffff.png (Untitled test) timed out before a response was received: /priva
[02:47:58] (private tile service info for osm-intl) timed out before a response was received: /osm-intl/info.json (tile service info for osm-intl) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps/RunBook
[02:48:00] PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:48:09] cr1 is still up, so it should be fine
[02:48:20] well should be, but this is like earlier right?
[02:48:26] automatically failed over?
[02:48:30] we're getting text-lb pages we shouldn't be getting if everything's fine
[02:48:37] bblack: correct
[02:48:38] RECOVERY - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook
[02:48:40] but last time it recovered quickly too
[02:48:54] PROBLEM - Host ripe-atlas-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[02:48:54] PROBLEM - Host ripe-atlas-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[02:49:02] the atlas is expected.
[02:49:03] RECOVERY - LVS HTTP IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.511 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[02:49:16] PROBLEM - Host cr2-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[02:49:17] right, atlas and mr1 are single-psu
[02:49:26] yep
[02:49:55] that LVS flap is not expected indeed as nothing between the LVS and icinga should go through cr2
[02:50:19] yeah it'd be nice to get to the bottom of that sometime soon
[02:50:28] clearly something is not as we think it is, with all that
[02:50:37] but not tonight :)
[02:51:07] RECOVERY - Host cr2-eqsin is UP: PING OK - Packet loss = 0%, RTA = 248.54 ms
[02:51:12] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:51:20] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:51:26] great
[02:51:41] bblack: can we see if it had an actual user impact or if it was just monitoring?
[02:52:08] I suspect it's real in some sense, briefly
[02:52:15] hi
[02:52:23] even if it's just transport affecting icinga, that's our backhaul transport too for sure
[02:52:28] I am a bit late to the party
[02:52:34] cdanis: we're ok, no need to party
[02:53:27] bblack: yeah but even the transport is on cr1
[02:53:38] anyway yeah, for tomorrow
[02:54:32] RECOVERY - Host cr2-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 246.19 ms
[02:55:46] so it was just the text-lb ipv6 that went down?
[02:56:10] well cr2-eqsin got powered off unexpectedly during physical planned maintenance in eqsin
[02:56:15] ah
[02:56:42] and that flapped some of these other things when really it shouldn't, much like during the planned cr2-eqsin reboot this morning for software
[02:56:42] we did lose some traffic https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&var-site=eqsin&var-cache_type=text&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&from=1581042442213&to=1581044742701
[02:56:56] right, I saw those pages later but wasn't awake yet at the time :)
[02:57:00] so that's something to investigate later, as we don't expect a loss of cr2 to cause these things, by design
[02:57:16] I know I've heard that cr1-eqsin is.. special
[02:57:35] right, but also cr1-eqsin has our primary/active transport link
[02:57:46] right
[02:58:21] anyways, it sounds like dc work is wrapping up now, and we're "stable"
[02:58:56] XioNoX: https://librenms.wikimedia.org/graphs/type=device_processor/device=159/from=1581022500/to=1581044100/ haven't we seen this before?
[02:59:09] one of the routing engines in cr1-eqsin is 100% cpu?
[02:59:36] this is familiar somehow, I know we've seen this before
[02:59:52] yeah something like that probably
[02:59:54] looking
[03:00:13] we lose cr2's transits, more transit traffic comes over to cr1, saturates something or other, then we lose some traffic from transport and icinga notices
[03:00:17] something along those lines
[03:00:38] (PS1) CDanis: depool eqsin [dns] - https://gerrit.wikimedia.org/r/570776
[03:00:39] prepped this just in case
[03:00:49] cdanis: we have one already ;D
[03:00:50] (not planning to push it atm)
[03:00:51] maybe we should move our primary transport port to cr2 and move the tunnel to cr1
[03:00:53] oh lol ok
[03:00:53] bblack made it
[03:01:46] I can't see the 100% CPU on the cli https://www.irccloud.com/pastebin/dhUkd1nD/
[03:02:16] but load averages: 0.38, 0.92, 0.90
[03:02:20] hm
[03:02:46] ah ok
[03:03:05] librenms now says: 15.42% current
[03:03:10] (but 100% max)
[03:03:21] https://librenms.wikimedia.org/graphs/type=device_processor/device=159/from=1581037200/to=1581044700/?_token=bqkMLH01cDLuYDop1y2PEtP2zz6AiZtE9huttvpL
[03:03:33] PEM0 is back to online
[03:03:38] so that lines up with the load average
[03:04:47] cr1 freaked out a bit when cr2 went down, time to re-converge everything
[03:04:58] but I think now everything routing wise is back to normal
[03:05:24] RECOVERY - Host ripe-atlas-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 255.58 ms
[03:05:32] RECOVERY - Host ripe-atlas-eqsin is UP: PING OK - Packet loss = 0%, RTA = 259.93 ms
[03:06:59] yeah, I don't remember when, but I'm convinced we saw that before in a similar event (and that it also co-occurred with some routing loss that shouldn't've happened)
[03:08:29] anyway, seems ok now
[03:11:16] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 54 probes of 525 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:15:53] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2329.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2329.codfw.wmnet'] `
[03:17:08] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 37 probes of 525 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:25:14] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2329.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[03:32:34] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2330.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2330.codfw.wmnet'] `
[03:34:23] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2330.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
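On the CPU discussion above ("load averages: 0.38, 0.92, 0.90" on the CLI versus LibreNMS's "15.42% current"): a load average only maps to a rough utilization percentage once divided by the number of CPUs. The sketch below is a back-of-envelope illustration; the core count used is hypothetical, since the log does not say how many cores the cr1-eqsin routing engine has:

```python
def load_to_cpu_pct(load_avg: float, cores: int) -> float:
    """Rough CPU utilization estimate: load average scaled by core count.

    This treats load average purely as runnable demand per core; it is
    an approximation, not what LibreNMS itself computes.
    """
    return 100.0 * load_avg / cores
```

With the 1-minute load of 0.92 quoted above and a hypothetical 6-core routing engine, this gives roughly 15.3%, in the same ballpark as the 15.42% LibreNMS reported once the spike had passed, which is consistent with "so that lines up with the load average".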
[03:40:14] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[03:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:42:21] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[03:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:47:05] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2329.codfw.wmnet'] ` and were **ALL** successful.
[03:49:22] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[03:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:51:04] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2321.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[03:51:35] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[03:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:52:34] PROBLEM - Host mw2321 is DOWN: PING CRITICAL - Packet loss = 100%
[03:53:22] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2321.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2321.codfw.wmnet'] `
[03:56:07] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2331.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[03:57:04] RECOVERY - Host mw2321 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[03:57:20] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2330.codfw.wmnet'] ` and were **ALL** successful.
[03:59:28] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2332.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[04:04:09] (CR) Jdlrobson: [C: +1] "When will this be deployed? LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/562486 (https://phabricator.wikimedia.org/T242030) (owner: Ammarpad)
[04:11:05] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[04:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:13:26] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[04:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:14:25] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[04:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:41] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[04:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:18:10] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2331.codfw.wmnet'] ` and were **ALL** successful.
[04:21:25] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2332.codfw.wmnet'] ` and were **ALL** successful.
[04:24:42] (PS3) BryanDavis: Add "migrate" action for 2020 Kubernetes migration [software/tools-webservice] - https://gerrit.wikimedia.org/r/570702 (https://phabricator.wikimedia.org/T244293)
[04:31:06] (CR) BryanDavis: [V: +2 C: +2] "Tested by manually migrating the admin-beta tool back to the legacy k8s cluster and then moving it to the 2020 k8s cluster with this chang" [software/tools-webservice] - https://gerrit.wikimedia.org/r/570702 (https://phabricator.wikimedia.org/T244293) (owner: BryanDavis)
[04:31:28] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2333.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[04:31:35] (Merged) jenkins-bot: Add "migrate" action for 2020 Kubernetes migration [software/tools-webservice] - https://gerrit.wikimedia.org/r/570702 (https://phabricator.wikimedia.org/T244293) (owner: BryanDavis)
[04:33:20] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2334.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[04:41:17] (03CR) 10Ammarpad: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570680 (https://phabricator.wikimedia.org/T244488) (owner: 10Superzerocool) [04:42:26] (03CR) 10jerkins-bot: [V: 04-1] Throttle rule for National Gallery of Canada Library and Archives edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570680 (https://phabricator.wikimedia.org/T244488) (owner: 10Superzerocool) [04:46:26] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [04:46:37] (03PS1) 10BryanDavis: d/changelog: Prepare for 0.59 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570785 [04:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:47:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:47:55] (03PS2) 10BryanDavis: d/changelog: Prepare for 0.59 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570785 [04:49:14] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [04:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:32] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [04:49:50] (03CR) 10BryanDavis: "Are there any other pending changes that we want to get in before we cut the next release?" 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570785 (owner: 10BryanDavis) [04:53:58] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [04:56:02] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2333.codfw.wmnet'] ` and were **ALL** successful. [05:06:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:15:10] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 36 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:20:10] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 32 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:33:52] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2334.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2334.codfw.wmnet'] ` [05:40:58] (03PS1) 10Reedy: Set Kartographer servers to Wikimedia servers 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/570787 (https://phabricator.wikimedia.org/T244561) [05:42:00] (03CR) 10jerkins-bot: [V: 04-1] Set Kartographer servers to Wikimedia servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570787 (https://phabricator.wikimedia.org/T244561) (owner: 10Reedy) [05:42:38] (03CR) 10Reedy: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570787 (https://phabricator.wikimedia.org/T244561) (owner: 10Reedy) [06:11:56] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Marostegui) From what I can see this maintenance did not happen yesterday in the end - as the host is still off and its IPMI is unreachable. And as the IPMI is involved, I cannot power this host b... [06:18:08] (03PS1) 10Marostegui: db1105: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570789 (https://phabricator.wikimedia.org/T239453) [06:19:43] (03CR) 10Marostegui: [C: 03+2] db1105: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570789 (https://phabricator.wikimedia.org/T239453) (owner: 10Marostegui) [06:20:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1105:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10328 and previous config saved to /var/cache/conftool/dbconfig/20200207-062043-marostegui.json [06:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:47] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:23:10] (03PS1) 10Marostegui: db1126: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570790 (https://phabricator.wikimedia.org/T232446) [06:23:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1105:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10329 and previous config saved to /var/cache/conftool/dbconfig/20200207-062345-marostegui.json [06:23:47] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:17] (03CR) 10Marostegui: [C: 03+2] db1126: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570790 (https://phabricator.wikimedia.org/T232446) (owner: 10Marostegui) [06:25:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1126 T232446', diff saved to https://phabricator.wikimedia.org/P10330 and previous config saved to /var/cache/conftool/dbconfig/20200207-062502-marostegui.json [06:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:06] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 [06:26:31] !log Reboot db1107 for update - T242702 [06:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:34] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [06:27:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1105:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10331 and previous config saved to /var/cache/conftool/dbconfig/20200207-062731-marostegui.json [06:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:35] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:29:12] PROBLEM - dhclient process on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:20] PROBLEM - DPKG on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:24] PROBLEM - Disk space on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1004&var-datasource=eqiad+prometheus/ops [06:29:26] PROBLEM - Check 
whether ferm is active by checking the default input chain on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:38] PROBLEM - MD RAID on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:42] PROBLEM - configured eth on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:48] PROBLEM - Check whether ferm is active by checking the default input chain on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:50] PROBLEM - Check systemd state on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:54] PROBLEM - Check size of conntrack table on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:54] PROBLEM - Check systemd state on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:57] again? 
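The bursts of `connect to address … port 5666: Connection refused` above are Icinga failing to reach the NRPE agent on hosts that are mid-reimage, so every check on the host goes CRITICAL at once. A minimal Python sketch of that kind of TCP probe (function name and status strings are illustrative, not Icinga's actual plugin):

```python
import socket

def check_nrpe_port(host: str, port: int = 5666, timeout: float = 2.0) -> str:
    """Probe a host's NRPE port the way the monitoring server does before
    running any check. A refused connection usually means the NRPE daemon
    is not (yet) running, e.g. during an install or reimage."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "OK: port reachable"
    except ConnectionRefusedError:
        return f"CRITICAL: connect to address {host} port {port}: Connection refused"
    except OSError:
        # Covers timeouts and unreachable hosts as well.
        return f"CRITICAL: connect to address {host} port {port}: unreachable"
```

Once the host finishes installing and the agent starts, the same probe succeeds and the flood of RECOVERY lines follows.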
[06:30:00] PROBLEM - Disk space on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2002&var-datasource=codfw+prometheus/ops [06:30:02] PROBLEM - DPKG on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:08] (03PS1) 10Elukey: Remove failed drives on analytics1035 from its Hadoop config [puppet] - 10https://gerrit.wikimedia.org/r/570791 [06:30:10] PROBLEM - Check size of conntrack table on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:12] PROBLEM - Check systemd state on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:24] PROBLEM - configured eth on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:30] PROBLEM - puppet last run on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:30:36] PROBLEM - MD RAID on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:36] PROBLEM - MD RAID on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:37] PROBLEM - Disk space on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1008&var-datasource=eqiad+prometheus/ops [06:30:40] PROBLEM - Check whether ferm is active by checking the default input chain on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:48] PROBLEM - DPKG on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:48] PROBLEM - configured eth on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:50] PROBLEM - ores uWSGI web app on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:54] PROBLEM - dhclient process on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:31:00] PROBLEM - ores uWSGI web app on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:31:02] PROBLEM - Check size of conntrack table on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:31:12] RECOVERY - DPKG on ores1008 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:31:18] RECOVERY - Check whether ferm is active by checking the default input chain on ores1008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:31:38] PROBLEM - puppet last run on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:31:41] yep celery oom errors [06:31:45] sigh [06:31:48] RECOVERY - Check systemd state on ores1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:49] !log force a puppet run on all ores[12] nodes [06:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:04] RECOVERY - Check size of conntrack table on ores1008 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:32:18] RECOVERY - configured eth on ores1008 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:32:22] (03PS1) 10Marostegui: realm.pp: Add watchlist_expiry to private tables [puppet] - 10https://gerrit.wikimedia.org/r/570792 (https://phabricator.wikimedia.org/T240094) [06:32:30] RECOVERY - MD RAID on ores1008 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:32:30] RECOVERY - Disk space on ores1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1008&var-datasource=eqiad+prometheus/ops [06:32:30] RECOVERY - MD RAID on ores2002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:32:42] RECOVERY - DPKG on ores2002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:32:56] RECOVERY - Check size of conntrack table on ores2002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:32:59] RECOVERY - dhclient process on ores2002 is OK: PROCS OK: 0 processes with command name dhclient 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:33:12] RECOVERY - Disk space on ores1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1004&var-datasource=eqiad+prometheus/ops [06:33:26] RECOVERY - MD RAID on ores1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:33:30] RECOVERY - configured eth on ores2002 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:33:36] RECOVERY - Check whether ferm is active by checking the default input chain on ores2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:33:38] RECOVERY - Check systemd state on ores2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:33:42] RECOVERY - Check size of conntrack table on ores1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:33:48] RECOVERY - Disk space on ores2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2002&var-datasource=codfw+prometheus/ops [06:33:50] RECOVERY - DPKG on ores1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:34:00] RECOVERY - Check systemd state on ores1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1105:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10332 and previous config saved to /var/cache/conftool/dbconfig/20200207-063402-marostegui.json [06:34:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:05] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:34:28] RECOVERY - Check whether ferm is active by checking the default input chain on ores1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:34:36] RECOVERY - configured eth on ores1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:34:42] RECOVERY - dhclient process on ores1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:35:39] (03CR) 10Elukey: [C: 03+2] Remove failed drives on analytics1035 from its Hadoop config [puppet] - 10https://gerrit.wikimedia.org/r/570791 (owner: 10Elukey) [06:36:24] RECOVERY - puppet last run on ores2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:37:32] RECOVERY - puppet last run on ores1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:38:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1126 T232446', diff saved to https://phabricator.wikimedia.org/P10333 and previous config saved to /var/cache/conftool/dbconfig/20200207-063831-marostegui.json [06:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:35] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 [06:38:54] (03PS1) 10Marostegui: db1101: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570793 (https://phabricator.wikimedia.org/T239453) [06:40:37] (03CR) 10DannyS712: realm.pp: Add watchlist_expiry to private tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570792 (https://phabricator.wikimedia.org/T240094) (owner: 
10Marostegui) [06:42:36] (03PS2) 10Marostegui: realm.pp: Add watchlist_expiry to private tables [puppet] - 10https://gerrit.wikimedia.org/r/570792 (https://phabricator.wikimedia.org/T240094) [06:54:48] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10elukey) It happened again this morning, even if the deployment was rolled back.. from dmesg I can see again celery killed by the OOM. ` [Fri Feb 7 06:33:29 2020] Out of mem... [07:16:07] (03PS1) 10Vgutierrez: ATS: Provide a slowlog for ats-tls and ats-backend [puppet] - 10https://gerrit.wikimedia.org/r/570802 (https://phabricator.wikimedia.org/T244538) [07:18:16] (03CR) 10jerkins-bot: [V: 04-1] ATS: Provide a slowlog for ats-tls and ats-backend [puppet] - 10https://gerrit.wikimedia.org/r/570802 (https://phabricator.wikimedia.org/T244538) (owner: 10Vgutierrez) [07:19:27] (03PS2) 10Vgutierrez: ATS: Provide a slowlog for ats-tls and ats-backend [puppet] - 10https://gerrit.wikimedia.org/r/570802 (https://phabricator.wikimedia.org/T244538) [07:27:36] PROBLEM - SSH cp5003.mgmt on cp5003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:28:22] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [07:28:50] (03PS3) 10Vgutierrez: ATS: Provide a slowlog for ats-tls and ats-backend [puppet] - 10https://gerrit.wikimedia.org/r/570802 (https://phabricator.wikimedia.org/T244538) [07:29:55] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 
probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:30:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1126 T232446', diff saved to https://phabricator.wikimedia.org/P10334 and previous config saved to /var/cache/conftool/dbconfig/20200207-073026-marostegui.json [07:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:33] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 [07:30:49] (03CR) 10Marostegui: [C: 03+2] db1101: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570793 (https://phabricator.wikimedia.org/T239453) (owner: 10Marostegui) [07:31:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1101:3317 T239453', diff saved to https://phabricator.wikimedia.org/P10335 and previous config saved to /var/cache/conftool/dbconfig/20200207-073130-marostegui.json [07:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:33] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [07:34:57] 10Operations, 10SRE-tools, 10tox-wikimedia, 10Patch-For-Review: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10Legoktm) [07:35:26] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 34 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:37:00] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [07:38:48] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:40:12] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [07:40:39] (03PS1) 10Marostegui: report_users: Add dbproxy1012 IP [software] - 10https://gerrit.wikimedia.org/r/570809 [07:41:24] (03CR) 10Marostegui: [C: 03+2] report_users: Add dbproxy1012 IP [software] - 10https://gerrit.wikimedia.org/r/570809 (owner: 10Marostegui) [07:42:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3317 T239453', diff saved to https://phabricator.wikimedia.org/P10336 and previous config saved to /var/cache/conftool/dbconfig/20200207-074258-marostegui.json [07:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:01] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [07:44:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2085:3318 T239453', diff saved to https://phabricator.wikimedia.org/P10337 and previous config saved to /var/cache/conftool/dbconfig/20200207-074407-marostegui.json [07:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:12] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Fully repool db1126 T232446', diff saved to https://phabricator.wikimedia.org/P10338 and previous config saved to /var/cache/conftool/dbconfig/20200207-074511-marostegui.json [07:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:14] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 [07:46:30] I am going to do some tests on ores2002, if you see alarms it's me [07:47:03] (03PS1) 10Marostegui: db2085: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570810 (https://phabricator.wikimedia.org/T239453) [07:48:04] (03CR) 10Marostegui: [C: 03+2] db2085: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/570810 (https://phabricator.wikimedia.org/T239453) (owner: 10Marostegui) [07:48:31] !log Remove revision partitions from db2085:3318 T239453 [07:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:34] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [07:52:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3317 T239453', diff saved to https://phabricator.wikimedia.org/P10339 and previous config saved to /var/cache/conftool/dbconfig/20200207-075234-marostegui.json [07:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase base weight for db1126', diff saved to https://phabricator.wikimedia.org/P10340 and previous config saved to /var/cache/conftool/dbconfig/20200207-075323-marostegui.json [07:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:53] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10elukey) Simply running `systemctl reload uwsgi-ores` doesn't seem to trigger the issue anymore.
On ores1008 there is a difference in impact though: ` Feb 05 06:25:12 ores10... [08:03:33] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10elukey) Might not be related, but https://github.com/unbit/uwsgi/issues/296 seems interesting, pasting here as reference. A lot of people report `--ignore-sigpipe` as useful... [08:13:13] (03CR) 10Muehlenhoff: [C: 04-2] switch apt.wikimedia.org from install1002 to install1003 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/569682 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [08:16:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] wikifeeds: Redefine CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/570726 (https://phabricator.wikimedia.org/T244535) (owner: 10Alexandros Kosiaris) [08:16:47] (03PS3) 10Alexandros Kosiaris: wikifeeds: Redefine CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/570726 (https://phabricator.wikimedia.org/T244535) [08:17:22] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:18:27] (03PS1) 10Alexandros Kosiaris: wikifeeds: package wikifeeds-0.0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/570825 [08:18:59] (03CR) 10Alexandros Kosiaris: [C: 03+2] wikifeeds: package wikifeeds-0.0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/570825 (owner: 10Alexandros Kosiaris) [08:19:08] 10Operations, 10ops-codfw, 10serviceops-radar: codfw: new mw servers not getting an IP when default to Stretch - https://phabricator.wikimedia.org/T244438 (10MoritzMuehlenhoff) 05Open→03Resolved This is confirmed working by Papaul when using the stretch-bootif tftpboot environment, closing. 
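The celery OOM kills on the ores hosts above were spotted by reading dmesg by hand. A hypothetical helper (not a tool from this log) that pulls OOM-killer victims out of dmesg-style output, matching the common Linux line `Out of memory: Kill process <pid> (<name>) score …`:

```python
import re

# Matches the usual Linux OOM-killer log line; pid and process name captured.
OOM_LINE = re.compile(r"Out of memory: Kill process (\d+) \(([^)]+)\)")

def oom_victims(dmesg_output: str):
    """Return (pid, process_name) pairs for every OOM kill found."""
    return [(int(pid), name) for pid, name in OOM_LINE.findall(dmesg_output)]

# Sample line in the shape quoted on T242705 (pid and score made up).
sample = ("[Fri Feb  7 06:33:29 2020] Out of memory: "
          "Kill process 12345 (celery) score 901 or sacrifice child")
```

Running `oom_victims(sample)` would flag the celery process, which is the signal the operators were looking for.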
[08:19:15] (03Merged) 10jenkins-bot: wikifeeds: package wikifeeds-0.0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/570825 (owner: 10Alexandros Kosiaris) [08:21:42] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . [08:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:55] !log deploy https://gerrit.wikimedia.org/r/570726 T244535 to avoid CPU throttling of wikifeeds [08:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:57] T244535: wikifeeds - fix the CPU limits so that it doesn't get starved - https://phabricator.wikimedia.org/T244535 [08:44:33] !log installing libexif security updates [08:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1101:3317 T239453', diff saved to https://phabricator.wikimedia.org/P10341 and previous config saved to /var/cache/conftool/dbconfig/20200207-084447-marostegui.json [08:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:50] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [08:45:50] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 37 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:50:44] (03CR) 10ArielGlenn: [C: 03+1] "Looks fine to me, thumbs up specifically to adding it to the end of the line" [puppet] - 10https://gerrit.wikimedia.org/r/570735 (https://phabricator.wikimedia.org/T244545) (owner: 10Dzahn) [08:51:34] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:51:47] (03PS1) 10Alexandros Kosiaris: admin: Remove the limitrange overrides for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/570830 [08:51:49] (03PS1) 10Alexandros Kosiaris: wikifeeds: slightly lower the CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/570831 (https://phabricator.wikimedia.org/T244535) [08:51:54] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 521 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:52:40] (03PS1) 10Alexandros Kosiaris: wikifeeds: package wikifeeds-0.0.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/570832 [08:53:19] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Remove the limitrange overrides for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/570830 (owner: 10Alexandros Kosiaris) [08:53:37] (03Merged) 10jenkins-bot: admin: Remove the limitrange overrides for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/570830 (owner: 10Alexandros Kosiaris) [08:53:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] wikifeeds: slightly lower the CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/570831 (https://phabricator.wikimedia.org/T244535) (owner: 10Alexandros Kosiaris) [08:54:03] (03CR) 10Alexandros Kosiaris: [C: 03+2] wikifeeds: package wikifeeds-0.0.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/570832 (owner: 10Alexandros Kosiaris) [08:54:06] (03Merged) 10jenkins-bot: wikifeeds: slightly lower the CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/570831 (https://phabricator.wikimedia.org/T244535) (owner: 10Alexandros Kosiaris) [08:54:19] (03Merged) 10jenkins-bot: wikifeeds: package wikifeeds-0.0.9 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/570832 (owner: 10Alexandros Kosiaris) [08:54:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1090:3312, db1090:3317 for upgrade', diff saved to https://phabricator.wikimedia.org/P10342 and previous config saved to /var/cache/conftool/dbconfig/20200207-085432-marostegui.json [08:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:45] !log Upgrade db1090:3312, db1090:3317 [08:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:39] (03CR) 10Filippo Giunchedi: [C: 03+2] cassandra: restbase-dev logs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/569564 (https://phabricator.wikimedia.org/T242585) (owner: 10Filippo Giunchedi) [08:57:57] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [08:58:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1090:3312, db1090:3317', diff saved to https://phabricator.wikimedia.org/P10343 and previous config saved to /var/cache/conftool/dbconfig/20200207-085846-marostegui.json [08:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:38] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . [08:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:55] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [09:01:24] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . 
[09:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:09] !log restart cassandra on restbase-dev1004 to test logging pipeline onboard [09:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:31] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [09:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:28] 10Operations, 10DBA, 10Phabricator, 10Release-Engineering-Team: Upgrade and restart m3 (phabricator) master (db1128) - https://phabricator.wikimedia.org/T244566 (10Marostegui) [09:05:35] (03Abandoned) 10WMDE-leszek: Set wmgUseEntitySourceBasedFederation to true for all wikibase-enabled wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569262 (https://phabricator.wikimedia.org/T241971) (owner: 10WMDE-leszek) [09:05:39] 10Operations, 10DBA, 10Phabricator, 10Release-Engineering-Team: Upgrade and restart m3 (phabricator) master (db1128) - https://phabricator.wikimedia.org/T244566 (10Marostegui) p:05Triage→03Medium [09:07:06] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:08:36] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10elukey) Ok so I did some tests (even with `--ignore-sigpipe`), but nothing good to report. I also straced master and worker processes of uwsgi right after reload, and I can s... 
[09:09:56] !log roll restart cassandra instance on restbase-dev [09:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:51] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on mw1239 is CRITICAL: 8.001 ge 4 Muehlenhoff Server will be decommed in T239054 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops [09:12:56] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:15:16] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 34 probes of 521 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:20:56] (03PS1) 10Filippo Giunchedi: cassandra: default to sending logs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/570862 (https://phabricator.wikimedia.org/T242585) [09:31:23] 10Operations, 10observability, 10serviceops, 10vm-requests: Provision grafana VM in codfw - https://phabricator.wikimedia.org/T244357 (10fgiunchedi) [09:31:50] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [09:32:03] 10Operations, 10observability, 10serviceops, 10vm-requests: Provision grafana VM in codfw - 
https://phabricator.wikimedia.org/T244357 (10fgiunchedi) >>! In T244357#5853220, @Dzahn wrote: > added vm-requests tag and pasted vm-request form. please add the missing data above. Done, thank you! [09:36:36] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 38 probes of 521 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:37:24] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [09:37:27] (03CR) 10Giuseppe Lavagetto: [C: 03+1] profile::mediawiki::webserver: set keepalive_requests to 200 [puppet] - 10https://gerrit.wikimedia.org/r/570677 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [09:42:28] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 521 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:45:42] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:48:20] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:50:33] (03CR) 10Ema: [C: 03+2] profile::mediawiki::webserver: set keepalive_requests to 200 [puppet] - 10https://gerrit.wikimedia.org/r/570677 (https://phabricator.wikimedia.org/T241145) (owner: 10Ema) [09:50:40] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:52:13] (03PS1) 10Alexandros Kosiaris: wikifeeds: Bump capacity by 50% [deployment-charts] - 10https://gerrit.wikimedia.org/r/570864 (https://phabricator.wikimedia.org/T244535) [09:53:46] !log A:mw: increase keepalive_requests from 100 to 200 https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570670/ T241145 [09:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:51] T241145: Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 [09:54:54] (03PS4) 10Vgutierrez: ATS: Provide a milestone log for ats-tls and ats-backend [puppet] - 10https://gerrit.wikimedia.org/r/570802 (https://phabricator.wikimedia.org/T244538) [09:57:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] wikifeeds: Bump capacity by 50% [deployment-charts] - 10https://gerrit.wikimedia.org/r/570864 (https://phabricator.wikimedia.org/T244535) (owner: 10Alexandros Kosiaris) [09:57:20] (03CR) 10Ema: [C: 04-1] ATS: Provide a milestone log for ats-tls and ats-backend (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570802 (https://phabricator.wikimedia.org/T244538) (owner: 10Vgutierrez) [09:57:36] (03Merged) 10jenkins-bot: wikifeeds: Bump capacity by 50% [deployment-charts] - 10https://gerrit.wikimedia.org/r/570864 (https://phabricator.wikimedia.org/T244535) (owner: 10Alexandros Kosiaris) [09:58:24] 
PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:01:38] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [10:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:00] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [10:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:18] !log increase capacity for wikifeeds by 50% T244535 [10:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:20] T244535: wikifeeds - fix the CPU limits so that it doesn't get starved - https://phabricator.wikimedia.org/T244535 [10:03:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:06:06] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [10:07:40] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:11:17] (03PS1) 10Muehlenhoff: Add library hint for libexif [puppet] - 10https://gerrit.wikimedia.org/r/570871 [10:14:11] (03PS5) 10Vgutierrez: ATS: Provide a milestone log for ats-tls and ats-backend [puppet] - 10https://gerrit.wikimedia.org/r/570802 (https://phabricator.wikimedia.org/T244538) [10:14:33] !log depool & reimage cp4022 as buster - T242093 [10:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:37] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [10:15:50] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4022.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [10:16:21] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for libexif [puppet] - 10https://gerrit.wikimedia.org/r/570871 (owner: 10Muehlenhoff) [10:20:03] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:20:55] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 520 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:21:20] (03CR) 10Vgutierrez: "pcc: https://puppet-compiler.wmflabs.org/compiler1001/20674/" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570802 (https://phabricator.wikimedia.org/T244538) (owner: 10Vgutierrez) [10:23:46] !log depool and reimage ncredir5001 as buster - T243391 [10:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:49] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 [10:24:31] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [10:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:11] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 34 probes of 520 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:27:23] 10Operations, 10serviceops: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10jijiki) Given we have 5 canary appservers in eqiad + 2 debug servers, I would recommend we add another 2 in codfw [10:28:25] (03CR) 10Ema: [C: 03+1] "The code looks good, shall we cherry-pick and test in labs before merging?" 
[puppet] - 10https://gerrit.wikimedia.org/r/570802 (https://phabricator.wikimedia.org/T244538) (owner: 10Vgutierrez) [10:30:03] 10Operations, 10Gerrit-Privilege-Requests, 10Release-Engineering-Team, 10SRE-Access-Requests: Request for +2 access to mediawiki-config - https://phabricator.wikimedia.org/T244508 (10jijiki) p:05Triage→03Medium [10:30:37] 10Operations, 10Gerrit-Privilege-Requests, 10Release-Engineering-Team, 10SRE-Access-Requests: Request for +2 access to mediawiki-config - https://phabricator.wikimedia.org/T244508 (10jijiki) @thcipriani is that a task for your end? I am not sure :) [10:31:51] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:31:56] 10Operations, 10serviceops: Test and deploy mcrouter 0.41 - https://phabricator.wikimedia.org/T244476 (10jijiki) p:05Triage→03Medium [10:34:21] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for CherRaye Glenn - https://phabricator.wikimedia.org/T244410 (10jijiki) @CGlenn Hello! 
Please let us know which tools you are planning to use, thank you:) [10:34:32] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for CherRaye Glenn - https://phabricator.wikimedia.org/T244410 (10jijiki) p:05Triage→03Medium [10:36:43] !log conduct experiments with stopping/starting uwsgi-ores on ores2001 T242705 [10:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:46] T242705: Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 [10:37:25] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:37:40] 10Operations, 10observability: Upgrade Grafana to 6.6 - https://phabricator.wikimedia.org/T244208 (10jijiki) p:05Triage→03Medium [10:37:44] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:03] 10Operations, 10LDAP-Access-Requests: Get access to Superset - https://phabricator.wikimedia.org/T244490 (10jijiki) p:05Triage→03Medium [10:39:32] 10Operations, 10Gerrit-Privilege-Requests, 10Release-Engineering-Team, 10SRE-Access-Requests: Request for +2 access to mediawiki-config - https://phabricator.wikimedia.org/T244508 (10Urbanecm) @jijiki Anyone listed at https://gerrit.wikimedia.org/r/admin/groups/21,members plus members of ops LDAP group can... [10:39:53] 10Operations, 10serviceops: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10Urbanecm) @jijiki Don't we have mwdebug2001 and mwdebug2002 in codfw too? 
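[editor's note] The keepalive_requests bump logged earlier (100 → 200 on the MediaWiki webservers, for T241145, "Improve ATS backend connection reuse") is about cutting connection churn between the ATS backends and their origins. A minimal sketch of why raising the per-connection request cap helps, using hypothetical request volumes (this models the arithmetic only, not the actual Apache/ATS configuration):

```python
import math

def connections_needed(total_requests: int, keepalive_requests: int) -> int:
    """Minimum TCP connections an origin must accept to serve
    total_requests, if each keep-alive connection is closed after
    keepalive_requests requests (illustrative model)."""
    return math.ceil(total_requests / keepalive_requests)

# Doubling the per-connection cap from 100 to 200, as in the logged
# change, halves connection setups for the same request volume.
print(connections_needed(10_000, 100))  # 100
print(connections_needed(10_000, 200))  # 50
```

Fewer connection setups means fewer TLS/TCP handshakes against the appservers, which is the reuse improvement the task tracks.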
[10:40:01] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:30] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10akosiaris) Let me add my own finding. Doing `systemctl stop uwsgi-ores` triggers the issue. It's during the stop phase that uwsgi workers go haywire on CPU and memory usage.... [10:42:29] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [10:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:55] 10Operations, 10serviceops: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10jijiki) @Urbanecm they do not get user traffic, so they are good enough for testing, but not good enough for canary deploys [10:44:41] 10Operations, 10serviceops: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10Urbanecm) Is that different from what eqiad debug servers do? I'm trying to understand why you said "Given we have 5 canary appservers in eqiad //+ 2 debug servers//" (emphasis mine). [10:46:59] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4022.ulsfo.wmnet'] ` and were **ALL** successful. [10:47:27] (03PS2) 10Superzerocool: Throttle rule for National Gallery of Canada Library and Archives edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570680 (https://phabricator.wikimedia.org/T244488) [10:51:17] 10Operations, 10serviceops: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10jijiki) @Urbanecm yes, so that is a total of 7 canary app servers in eqiad, of which 5 get real user traffic.
Since we will be switching to codfw, it makes sense to have a similar setup in codfw. [10:52:08] 10Operations, 10serviceops: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10Urbanecm) @jijiki Got it, thanks! [10:53:03] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 36 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:55:06] (03PS6) 10Vgutierrez: ATS: Provide a milestone log for ats-tls and ats-backend [puppet] - 10https://gerrit.wikimedia.org/r/570802 (https://phabricator.wikimedia.org/T244538) [10:56:19] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 37 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:57:47] 10Operations, 10Scap, 10serviceops: Make canary wait time configurable - https://phabricator.wikimedia.org/T217924 (10jijiki) shall we move this forward?
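[editor's note] The flapping RIPE Atlas alerts throughout this log all follow the same pattern: CRITICAL when strictly more probes fail than the configured threshold, OK at or below it (e.g. 36 of 524 fails, 35 of 524 recovers, with "alerts on 35"). A sketch of that state logic as implied by the messages themselves (illustrative; not the actual check plugin):

```python
def atlas_status(failed: int, total: int, alerts_on: int = 35) -> str:
    """Reproduce the state transitions seen in the log: CRITICAL once
    strictly more than `alerts_on` probes fail, OK otherwise."""
    state = "CRITICAL" if failed > alerts_on else "OK"
    return f"{state} - failed {failed} probes of {total} (alerts on {alerts_on})"

print(atlas_status(36, 524))  # CRITICAL - failed 36 probes of 524 (alerts on 35)
print(atlas_status(35, 524))  # OK - failed 35 probes of 524 (alerts on 35)
```

With the threshold one probe below the flapping failure count, a single probe recovering or failing toggles the state, which is why the ulsfo and esams checks alternate all day.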
[10:57:49] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 520 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:58:39] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:59:09] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:00:21] (03PS13) 10Jbond: ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - 10https://gerrit.wikimedia.org/r/562852 [11:01:55] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:07:57] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [11:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:02] !log undo wikifeeds experiments [11:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Thanks for generating the d/changelog update in a separate commit." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570785 (owner: 10BryanDavis) [11:09:19] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 520 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:12:09] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [11:24:04] !log pooling cp4022 with buster - T242093 [11:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:07] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [11:25:22] !log pooling ncredir5001 running buster - T243391 [11:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:25] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 [11:26:31] 10Operations, 10Traffic: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 (10Vgutierrez) [11:26:33] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [11:29:19] 10Operations, 10Wikimedia-Incident: Investigate whether we can automatically share incident status docs with WMDE - https://phabricator.wikimedia.org/T244395 (10Ladsgroup) Usually and in large incident it's me and Adam that help with the incident from WMDE and I don't remember anyone else from WMDE helping in... 
[11:33:29] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 520 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:39:01] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 34 probes of 520 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:40:47] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:41:37] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:48:11] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [11:57:57] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [11:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:03] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:07] 10Operations: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10ops-monitoring-bot) Icinga downtime for 1 day, 0:00:00 set by akosiaris@cumin1001 on 14 host(s) and their services with reason: enable VT ` ganeti[1009-1022].eqiad.wmnet ` [11:58:27] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:08:09] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:08:10] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [12:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:14] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:20] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10ops-monitoring-bot) Icinga downtime for 1 day, 
0:00:00 set by akosiaris@cumin1001 on 10 host(s) and their services with reason: enable VT ` ganeti[2009-2018].codfw.wmnet ` [12:10:04] WARNING: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/3.7/dist-packages/ldapsupportlib.py] [12:10:52] jbond42: ^ [12:11:00] I think this is 618865b16e [12:11:07] it's showing up throughout the fleet [12:13:17] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:14:01] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:19:17] akosiaris: thanks looking [12:20:17] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [12:26:29] (03PS1) 10Jbond: ldap: fix case statement [puppet] - 10https://gerrit.wikimedia.org/r/570880 [12:28:51] 10Operations, 10puppet-compiler: puppet-compiler fails to compile production catalog for restbase2014 - https://phabricator.wikimedia.org/T238053 (10jbond) possible patch to puppet untested and could be a bad idea https://phabricator.wikimedia.org/P10344 [12:29:30] (03CR) 10Jbond: [C: 03+2] ldap: fix case statement [puppet] - 10https://gerrit.wikimedia.org/r/570880
(owner: 10Jbond) [12:29:33] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:29:41] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:32:37] (03PS1) 10Jbond: ldap: fix python path [puppet] - 10https://gerrit.wikimedia.org/r/570881 [12:34:57] (03CR) 10Jbond: [C: 03+2] ldap: fix python path [puppet] - 10https://gerrit.wikimedia.org/r/570881 (owner: 10Jbond) [12:35:27] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 34 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:36:14] (03PS9) 10Hnowlan: mediawiki: check mw versions match those on the deploy server [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) [12:38:16] (03CR) 10Ammarpad: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570680 (https://phabricator.wikimedia.org/T244488) (owner: 10Superzerocool) [12:38:39] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: check mw versions match those on the deploy server [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) (owner: 10Hnowlan) [12:39:43] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:41:09] (03PS10) 10Hnowlan: mediawiki: check mw versions match those on the deploy server [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) [12:42:05] !log depool & reimage cp4021 as buster - T242093 [12:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:09] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [12:42:22] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4021.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002071242_vgutie... [12:44:17] (03PS1) 10Vgutierrez: install_server: Reimage ncredir@esams as buster [puppet] - 10https://gerrit.wikimedia.org/r/570883 (https://phabricator.wikimedia.org/T243391) [12:46:10] (03CR) 10Vgutierrez: [C: 03+2] install_server: Reimage ncredir@esams as buster [puppet] - 10https://gerrit.wikimedia.org/r/570883 (https://phabricator.wikimedia.org/T243391) (owner: 10Vgutierrez) [12:51:50] !log depool and reimage ncredir3002 as buster - T243391 [12:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:53] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 [13:00:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:03:17] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [13:03:19] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:32] (03PS7) 10Vgutierrez: ATS: Provide a milestone log for ats-tls and ats-backend [puppet] - 10https://gerrit.wikimedia.org/r/570802 (https://phabricator.wikimedia.org/T244538) [13:05:31] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:02] (03CR) 10Phuedx: [C: 04-2] "Do not deploy until at least Monday, 17th February." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570625 (https://phabricator.wikimedia.org/T196159) (owner: 10Polishdeveloper) [13:07:38] (03CR) 10Ammarpad: [C: 04-1] Throttle rule for National Gallery of Canada Library and Archives edit-a-thon (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570680 (https://phabricator.wikimedia.org/T244488) (owner: 10Superzerocool) [13:11:17] (03CR) 10Marostegui: mysql: adapt Cumin queries to select DBs (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/570161 (https://phabricator.wikimedia.org/T243935) (owner: 10Volans) [13:14:28] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4021.ulsfo.wmnet'] ` and were **ALL** successful. [13:15:39] (03CR) 10Jcrespo: [C: 04-1] "Not technically a -1, but think the wikireplica private columns list should be updated at the same time." 
[puppet] - 10https://gerrit.wikimedia.org/r/570792 (https://phabricator.wikimedia.org/T240094) (owner: 10Marostegui) [13:17:50] (03CR) 10Marostegui: "> Not technically a -1, but think the wikireplica private columns" [puppet] - 10https://gerrit.wikimedia.org/r/570792 (https://phabricator.wikimedia.org/T240094) (owner: 10Marostegui) [13:23:04] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:26:15] !log pooling cp4021 with buster - T242093 [13:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:19] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [13:28:26] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:29:55] (03PS1) 10Muehlenhoff: Export DEBIAN_FRONTEND=noninteractive in the debdeploy client [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/570887 [13:31:23] 10Operations, 10netops: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (10ayounsi) a:05ayounsi→03faidon As it matches the reboot of cr3-knams I'd say the optic on that side needs to be replaced. (maybe the power fluctuation damaged it?). But as the optic looks fine on the CLI, maybe i... 
[13:31:56] (03PS1) 10Vgutierrez: install_server: Reimage text@ulsfo as buster [puppet] - 10https://gerrit.wikimedia.org/r/570888 (https://phabricator.wikimedia.org/T242093)
[13:32:00] (03PS1) 10Jbond: puppet_compiler: update the puppet master --compile command to support pson [puppet] - 10https://gerrit.wikimedia.org/r/570889 (https://phabricator.wikimedia.org/T238053)
[13:32:42] (03CR) 10Ammarpad: "I scheduled it (https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=prev&oldid=1853677) and was present on IRC but it was no" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562486 (https://phabricator.wikimedia.org/T242030) (owner: 10Ammarpad)
[13:32:56] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[13:33:05] (03CR) 10jerkins-bot: [V: 04-1] puppet_compiler: update the puppet master --compile command to support pson [puppet] - 10https://gerrit.wikimedia.org/r/570889 (https://phabricator.wikimedia.org/T238053) (owner: 10Jbond)
[13:33:35] (03PS2) 10Jbond: puppet_compiler: update the puppet master --compile command to support pson [puppet] - 10https://gerrit.wikimedia.org/r/570889 (https://phabricator.wikimedia.org/T238053)
[13:34:29] (03CR) 10jerkins-bot: [V: 04-1] puppet_compiler: update the puppet master --compile command to support pson [puppet] - 10https://gerrit.wikimedia.org/r/570889 (https://phabricator.wikimedia.org/T238053) (owner: 10Jbond)
[13:36:36] (03CR) 10Jcrespo: [C: 04-1] "While we don't a proper orchestration ready- I think we have enough pieces (dbcontrol, zarcillo monitoring and mysql.py utility) to skip u" [software/spicerack] - 10https://gerrit.wikimedia.org/r/570161 (https://phabricator.wikimedia.org/T243935) (owner: 10Volans)
[13:37:22] ^not sure if you agree
[13:37:27] sorry, wrong channel
[13:38:03] (03PS3) 10Jbond: puppet_compiler: update the puppet master --compile command to support pson [puppet] - 10https://gerrit.wikimedia.org/r/570889 (https://phabricator.wikimedia.org/T238053)
[13:42:16] 10Operations, 10netops: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (10faidon) Please file a #procurement task for Willy/Rob to execute on :)
[13:49:12] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 (10ema) >>! In T241145#5856750, @Gilles wrote:> > We will keep an eye on the trend in coming days to check how much of a dent it...
[13:54:25] (03CR) 10Ema: [C: 03+1] install_server: Reimage text@ulsfo as buster [puppet] - 10https://gerrit.wikimedia.org/r/570888 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez)
[13:54:41] (03PS3) 10Superzerocool: Throttle rule for National Gallery of Canada Library and Archives edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570680 (https://phabricator.wikimedia.org/T244488)
[13:56:33] (03PS1) 10Hoo man: Wikibase Client: Fix setting name typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570892 (https://phabricator.wikimedia.org/T244529)
[13:56:50] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris) Those 4 machines will have to be done one by one in order as @Robh points out. Overall, about an hour of advance notice should suffice, but let's do one each day ? I 'll add tentati...
[13:57:16] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:57:58] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris)
[13:58:21] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris) There's nothing rushing us on this btw, feel free to proposed alternative maint windows.
[13:58:46] (03CR) 10Vgutierrez: "blocked till ATS gets LogFilterIP fixed :(" [puppet] - 10https://gerrit.wikimedia.org/r/570802 (https://phabricator.wikimedia.org/T244538) (owner: 10Vgutierrez)
[14:02:25] 10Operations, 10Traffic, 10Patch-For-Review: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (10Vgutierrez) So I had a beautiful CR ready to log this data, but ATS ability of filter logs based on IPs is currently broken, so I just applied the CR manually on cp1075...
[14:05:29] 10Operations, 10netops: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (10ayounsi) Opened T244574.
[14:05:58] (03CR) 10Addshore: [C: 03+1] Wikibase Client: Fix setting name typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570892 (https://phabricator.wikimedia.org/T244529) (owner: 10Hoo man)
[14:06:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:07:20] (03CR) 10Ammarpad: [C: 03+1] Throttle rule for National Gallery of Canada Library and Archives edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570680 (https://phabricator.wikimedia.org/T244488) (owner: 10Superzerocool)
[14:08:14] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Wikibase Client: Fix setting name typo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570892 (https://phabricator.wikimedia.org/T244529) (owner: 10Hoo man)
[14:09:13] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for CherRaye Glenn - https://phabricator.wikimedia.org/T244410 (10CGlenn) Hi @jijiki ! I will be using Turnilo & Superset. :)
[14:09:36] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-
[14:09:42] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[14:09:50] PROBLEM - PHP7 rendering on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:09:50] 10Operations, 10netops: BFD session alerts due to inconsistent status on cr3-knams - https://phabricator.wikimedia.org/T240659 (10ayounsi) cr3-knams got upgraded to 18 yesterday. Waiting to see if the issue happen again.
[14:09:50] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before
[14:09:50] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre
[14:09:50] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:09:52] PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:09:52] PROBLEM - Nginx local proxy to apache on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:09:52] PROBLEM - PHP7 rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:09:56] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:09:56] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before
[14:09:56] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre
[14:09:56] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:09:58] PROBLEM - Apache HTTP on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:09:58] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be
[14:09:58] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:04] PROBLEM - Nginx local proxy to apache on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:08] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be
[14:10:08] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:14] PROBLEM - Apache HTTP on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:14] PROBLEM - Apache HTTP on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:16] PROBLEM - Nginx local proxy to apache on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:16] PROBLEM - PHP7 rendering on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:10:20] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be
[14:10:20] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:20] PROBLEM - Apache HTTP on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:22] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out be
[14:10:22] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:22] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out be
[14:10:22] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:22] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be
[14:10:22] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:24] PROBLEM - Nginx local proxy to apache on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:24] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be
[14:10:24] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:24] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out be
[14:10:24] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:25] PROBLEM - PHP7 rendering on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:10:26] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}
[14:10:26] ections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was
[14:10:26] n}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:10:27] PROBLEM - Nginx local proxy to apache on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:27] PROBLEM - Apache HTTP on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:28] PROBLEM - PHP7 rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:10:28] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed ou
[14:10:29] se was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:29] PROBLEM - Apache HTTP on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:30] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before
[14:10:30] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre
[14:10:31] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:10:31] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before
[14:10:32] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre
[14:10:32] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:10:33] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before
[14:10:33] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre
[14:10:34] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:10:34] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}
[14:10:38] ections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was
[14:10:38] n}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:10:38] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be
[14:10:38] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:38] PROBLEM - PHP7 rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:10:38] PROBLEM - PHP7 rendering on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:10:38] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before
[14:10:38] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre
[14:10:39] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:10:39] PROBLEM - Nginx local proxy to apache on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:40] PROBLEM - PHP7 rendering on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:10:40] PROBLEM - Apache HTTP on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:41] PROBLEM - Nginx local proxy to apache on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:41] PROBLEM - Nginx local proxy to apache on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:42] PROBLEM - Nginx local proxy to apache on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:42] PROBLEM - Nginx local proxy to apache on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:43] PROBLEM - Apache HTTP on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:43] PROBLEM - PHP7 rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:10:44] PROBLEM - Apache HTTP on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:44] PROBLEM - PHP7 rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:10:45] PROBLEM - Nginx local proxy to apache on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:45] PROBLEM - Nginx local proxy to apache on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:46] PROBLEM - PHP7 rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:10:46] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not
[14:10:47] xistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[14:10:47] PROBLEM - Apache HTTP on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:48] PROBLEM - Nginx local proxy to apache on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:48] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) is
[14:10:49] trieve a random article title returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)
[14:10:49] st retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:10:50] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:50] PROBLEM - Nginx local proxy to apache on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:51] PROBLEM - PHP7 rendering on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:10:52] PROBLEM - Nginx local proxy to apache on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:52] PROBLEM - Nginx local proxy to apache on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:52] PROBLEM - PHP7 rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:10:54] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:54] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received: /{domain}/v1/page/su
[14:10:54] t summary for test page) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1
[14:10:54] itle} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[14:10:56] PROBLEM - Apache HTTP on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:58] PROBLEM - Nginx local proxy to apache on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:58] PROBLEM - PHP7 rendering on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:10:58] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikifeeds_8889: Servers kubernetes1005.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled: api_80: Servers mw1284.eqiad.wmnet, mw1229.eqiad.wmnet, mw1280.eqiad.wmnet, mw1227.eqiad.wmnet, mw1232.eqiad.wmnet, mw1346.eqiad.wmnet, mw1344.eqiad.wmnet, mw1287.eqiad.wmnet, mw1348.eqiad.wmnet, mw1288.eqiad
[14:10:58] iad.wmnet, mw1314.eqiad.wmnet, mw1279.eqiad.wmnet, mw1226.eqiad.wmnet, mw1317.eqiad.wmnet, mw1233.eqiad.wmnet, mw1222.eqiad.wmnet, mw1283.eqiad.wmnet, mw1340.eqiad.wmnet, mw1343.eqiad.wmnet, mw1225.eqiad.wmnet, mw1347.eqiad.wmnet, mw1345.eqiad.wmnet, mw1223.eqiad.wmnet, mw1286.eqiad.wmnet, mw1282.eqiad.wmnet, mw1276.eqiad.wmnet, mw1221.eqiad.wmnet, mw1230.eqiad.wmnet, mw1234.eqiad.wmnet, mw1339.eqiad.wmnet, mw1224.eqiad.wmnet, mw
[14:10:58] mw1231.eqiad.wmnet, mw1312.eqiad.wmnet, mw1228.eqiad.wmnet, mw1342.eqiad.wmnet, mw1316.eqiad.wmnet, mw1289.eqiad.wmnet, mw1315.eqiad.wmnet, mw1341.eqiad.wmnet, mw1285.eqiad.wmnet, mw12 https://wikitech.wikimedia.org/wiki/PyBal
[14:10:58] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[14:11:02] PROBLEM - PHP7 rendering on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:11:02] PROBLEM - PHP7 rendering on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:11:02] PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[14:11:04] PROBLEM - PHP7 rendering on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:11:04] PROBLEM - PHP7 rendering on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:11:04] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve exte
[14:11:04] Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received:
[14:11:04] sform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[14:11:04] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out be
[14:11:04] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:11:05] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be
[14:11:05] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:11:06] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be
[14:11:06] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:11:07] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be
[14:11:07] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:11:08] PROBLEM - Apache HTTP on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:11:08] PROBLEM - PHP7 rendering on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:11:09] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}
[14:11:09] ections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was
[14:11:10] n}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:11:10] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target}
(Caption translation suggestions) timed out before [14:11:11] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [14:11:11] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:11:12] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [14:11:12] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [14:11:13] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:11:13] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:11:14] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [14:11:14] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:11:15] PROBLEM - Apache HTTP on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:11:15] PROBLEM - Apache HTTP on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:11:16] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:11:16] PROBLEM - Nginx local proxy to apache on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:11:17] PROBLEM - Nginx local proxy to apache on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:11:17] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out befor [14:11:19] received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:11:19] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [14:11:19] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [14:11:19] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:11:20] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain} [14:11:20] ections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed 
out before a response was [14:11:50] (PS2) Muehlenhoff: Switch logstash hosts to standard Partman recipe [puppet] - https://gerrit.wikimedia.org/r/570600 (https://phabricator.wikimedia.org/T156955) [14:12:57] * apergos peeks in [14:13:05] RECOVERY - Nginx local proxy to apache on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.184 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:05] RECOVERY - PHP7 rendering on mw1287 is OK: HTTP OK: HTTP/1.1 200 OK - 79819 bytes in 8.854 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:13:06] RECOVERY - PHP7 rendering on mw1339 is OK: HTTP OK: HTTP/1.1 200 OK - 79819 bytes in 8.241 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:13:06] RECOVERY - PHP7 rendering on mw1297 is OK: HTTP OK: HTTP/1.1 200 OK - 79819 bytes in 7.899 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:13:07] RECOVERY - PHP7 rendering on mw1315 is OK: HTTP OK: HTTP/1.1 200 OK - 79819 bytes in 8.275 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:13:07] RECOVERY - Apache HTTP on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:08] RECOVERY - PHP7 rendering on mw1316 is OK: HTTP OK: HTTP/1.1 200 OK - 79819 bytes in 8.535 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:13:08] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.283 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:09] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out
before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [14:13:09] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.159 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:10] <_joe_> oh sigh again? [14:13:14] RECOVERY - Nginx local proxy to apache on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.730 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:14] RECOVERY - Apache HTTP on mw1340 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.245 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:14] RECOVERY - Apache HTTP on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.246 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:14] RECOVERY - Apache HTTP on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.798 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:16] RECOVERY - Nginx local proxy to apache on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.630 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:16] RECOVERY - Nginx local proxy to apache on mw1297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.712 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:16] RECOVERY - Nginx local proxy to apache on mw1315 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.543 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:16] <_joe_> what's happening this time? 
[14:13:20] RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 79819 bytes in 8.413 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:13:20] RECOVERY - Nginx local proxy to apache on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.661 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:22] good morning 👀 [14:13:24] RECOVERY - PHP7 rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 79819 bytes in 9.477 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:13:26] RECOVERY - PHP7 rendering on mw1347 is OK: HTTP OK: HTTP/1.1 200 OK - 79819 bytes in 9.921 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:13:30] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received: /{domain}/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /{domain}/v1 [14:13:30] year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{d [14:13:30] t-read articles for date with no data (with aggregated=true)) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the 
featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve featured article info for unsupported site (with [14:13:30] ) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [14:13:32] RECOVERY - PHP7 rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 79826 bytes in 5.702 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:13:34] RECOVERY - Apache HTTP on mw1231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:34] RECOVERY - Nginx local proxy to apache on mw1221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.853 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:34] RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.820 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:34] PROBLEM - Apache HTTP on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:13:36] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.745 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:36] RECOVERY - Apache HTTP on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.772 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:42] PROBLEM - Nginx local proxy to apache on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:14:20] RECOVERY - PHP7 rendering on mw1340 is OK: HTTP OK: HTTP/1.1 200 OK - 79819 bytes in 9.475 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:14:34] PROBLEM 
- Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:14:36] RECOVERY - Nginx local proxy to apache on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.413 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:14:50] RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.615 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:15:00] RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.632 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:15:04] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.107 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:15:04] PROBLEM - Apache HTTP on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:15:04] PROBLEM - PHP7 rendering on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:15:32] PROBLEM - Apache HTTP on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:15:38] RECOVERY - PHP7 rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 79819 bytes in 9.350 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:15:52] RECOVERY - Nginx local proxy to apache on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.932 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:16:02] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.931 second response time https://wikitech.wikimedia.org/wiki/Application_servers 
[14:16:04] RECOVERY - Apache HTTP on mw1343 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.256 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:16:04] RECOVERY - Apache HTTP on mw1341 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.879 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:16:04] RECOVERY - PHP7 rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 79826 bytes in 9.438 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:16:08] RECOVERY - Apache HTTP on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.489 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:16:12] RECOVERY - PHP7 rendering on mw1313 is OK: HTTP OK: HTTP/1.1 200 OK - 79819 bytes in 4.869 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:16:14] RECOVERY - Apache HTTP on mw1315 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.252 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:16:14] RECOVERY - PHP7 rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 79819 bytes in 8.922 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:16:16] RECOVERY - Apache HTTP on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.155 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:16:16] RECOVERY - Nginx local proxy to apache on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.860 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:16:24] PROBLEM - Nginx local proxy to apache on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:16:32] RECOVERY - Nginx local proxy to apache on mw1233 is OK: 
HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.600 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:16:32] RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.887 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:16:54] RECOVERY - PHP7 rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 79819 bytes in 8.932 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:16:54] RECOVERY - PHP7 rendering on mw1222 is OK: HTTP OK: HTTP/1.1 200 OK - 79819 bytes in 9.259 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:17:02] RECOVERY - PHP7 rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 79826 bytes in 3.145 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:17:04] RECOVERY - Nginx local proxy to apache on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 3.860 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:17:04] RECOVERY - PHP7 rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 79819 bytes in 4.696 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:17:06] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.072 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:17:06] RECOVERY - Apache HTTP on mw1221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.530 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:17:06] RECOVERY - Nginx local proxy to apache on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.559 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:17:06] RECOVERY - Nginx local proxy to apache on mw1314 is OK: 
HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 5.980 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:17:06] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.618 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:17:07] RECOVERY - Nginx local proxy to apache on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.120 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:17:07] RECOVERY - Nginx local proxy to apache on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.660 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:17:25] I see api.svc.eqiad.wmnet LVS HTTP IPv4 #page CRITICAL 2020-02-07 14:14:08 0d 0h 7m 39s 3/3 CRITICAL - Socket timeout after 10 seconds in the icinga dashboard but it didn't page? Or was it about to? [14:17:39] it paged me [14:17:40] XioNoX: it did page for me [14:17:45] it paged me as well [14:17:48] same here, paged [14:17:51] XioNoX: it paged, not here because icinga-wm was kicked out [14:17:58] at 14:12 UTC [14:17:58] ah, just got it :) [14:18:01] excess flood [14:18:08] I got it 1m ago [14:19:02] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:19:02] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:19:04] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [14:19:04] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:19:04] RECOVERY - graphoid endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[14:19:06] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [14:19:06] RECOVERY - graphoid endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [14:19:06] RECOVERY - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [14:19:08] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:19:08] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:19:08] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:19:08] RECOVERY - graphoid endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [14:19:08] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [14:19:09] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [14:19:09] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [14:19:10] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:19:10] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [14:19:11] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:19:11] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [14:19:12] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:19:12] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:19:14] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:19:27] (CR) Vgutierrez: [C: +2] install_server: Reimage text@ulsfo as buster [puppet] - https://gerrit.wikimedia.org/r/570888 (https://phabricator.wikimedia.org/T242093) (owner: Vgutierrez) [14:20:06] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:20:06] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [14:20:06] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:20:08] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:20:08] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:20:08] RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [14:20:08] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:20:08] RECOVERY -
wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:20:09] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:20:09] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:10] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:20:10] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:11] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:20:11] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[14:20:12] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:20:12] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:13] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:13] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:14] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:14] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:15] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:15] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:16] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:16] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:17] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:17] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:18] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:20:18] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:19] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:19] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:20] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:20] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:21] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:21] RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[14:20:22] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:22] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:23] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:21:16] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:21:16] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:21:16] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:21:16] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:21:16] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:21:17] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:21:17] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:21:18] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:21:18] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:21:20] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:21:56] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.425 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[14:21:58] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:21:58] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:21:58] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:22:00] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:22:00] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:22:00] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:23:02] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[14:23:06] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:23:10] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:23:11] !log pooling ncredir3002 running buster - T243391
[14:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:14] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391
[14:23:16] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:24:03] (PS1) Jhedden: icinga: update sms contact for jhedden [puppet] - https://gerrit.wikimedia.org/r/570896
[14:24:18] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[14:24:24] Operations, Wikimedia-Incident: Investigate whether we can automatically share incident status docs with WMDE - https://phabricator.wikimedia.org/T244395 (RLazarus)
[14:24:46] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:25:56] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:26:25] Operations, Wikimedia-Incident: Investigate whether we can automatically share incident status docs with WMDE - https://phabricator.wikimedia.org/T244395 (RLazarus) Thanks! I'll bring this up in the SRE meeting on Monday and go ahead if no one objects.
[14:28:48] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:30:06] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:32:31] !log depool & reimage cp4031 as buster - T242093
[14:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:34] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093
[14:32:54] Operations, Traffic, Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4031.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima...
[14:33:56] !log depool and reimage ncredir3001 as buster - T243391
[14:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:59] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391
[14:35:06] (PS1) Elukey: presto: refactor TLS passwords parameter to be more sharable [puppet] - https://gerrit.wikimedia.org/r/570899 (https://phabricator.wikimedia.org/T243312)
[14:35:10] (CR) Hoo man: [C: +2] Wikibase Client: Fix setting name typo [mediawiki-config] - https://gerrit.wikimedia.org/r/570892 (https://phabricator.wikimedia.org/T244529) (owner: Hoo man)
[14:36:19] (Merged) jenkins-bot: Wikibase Client: Fix setting name typo [mediawiki-config] - https://gerrit.wikimedia.org/r/570892 (https://phabricator.wikimedia.org/T244529) (owner: Hoo man)
[14:38:02] !log hoo@deploy1001 Synchronized wmf-config/Wikibase.php: Wikibase Client: Fix setting name typo (T244529) (duration: 01m 20s)
[14:38:04] (CR) Jhedden: [C: +2] icinga: update sms contact for jhedden [puppet] - https://gerrit.wikimedia.org/r/570896 (owner: Jhedden)
[14:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:09] T244529: mw.wikibase.getLabelByLang not return item label for some items - https://phabricator.wikimedia.org/T244529
[14:40:22] (PS1) Elukey: Add presto_clusters_secrets in common.yaml [labs/private] - https://gerrit.wikimedia.org/r/570900
[14:40:36] !log hoo@deploy1001 Scap failed!: 9/11 canaries failed their endpoint checks(http://en.wikipedia.org)
[14:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:40:49] (CR) Elukey: [V: +2 C: +2] Add presto_clusters_secrets in common.yaml [labs/private] - https://gerrit.wikimedia.org/r/570900 (owner: Elukey)
[14:40:50] PROBLEM - PHP7 rendering on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:40:54] PROBLEM - PHP7 rendering on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:40:56] PROBLEM - Apache HTTP on mw1275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:40:56] PROBLEM - PHP7 rendering on mw1271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:40:56] PROBLEM - PHP7 rendering on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:40:56] PROBLEM - Nginx local proxy to apache on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:40:56] PROBLEM - PHP7 rendering on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:40:57] PROBLEM - Nginx local proxy to apache on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:40:57] PROBLEM - Nginx local proxy to apache on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:00] PROBLEM - Apache HTTP on mw1250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:04] PROBLEM - Nginx local proxy to apache on mw1249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:04] PROBLEM - Nginx local proxy to apache on mw1256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:04] PROBLEM - Apache HTTP on mw1241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:04] PROBLEM - Apache HTTP on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:04] PROBLEM - Apache HTTP on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:06] PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:06] PROBLEM - PHP7 rendering on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:41:08] How do I force the deploy
[14:41:10] PROBLEM - Apache HTTP on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:10] PROBLEM - Apache HTTP on mw1319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:10] PROBLEM - Nginx local proxy to apache on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:10] it's a revert
[14:41:12] PROBLEM - PHP7 rendering on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:41:12] somebody unplugged the wrong cable
[14:41:16] PROBLEM - Apache HTTP on mw1246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:16] PROBLEM - Nginx local proxy to apache on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:16] PROBLEM - Apache HTTP on mw1254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:16] PROBLEM - Apache HTTP on mw1243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:16] PROBLEM - PHP7 rendering on mw1255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:41:18] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:18] not sure why but my last change broke it
[14:41:18] PROBLEM - Apache HTTP on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:18] PROBLEM - Nginx local proxy to apache on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:18] PROBLEM - Nginx local proxy to apache on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:19] PROBLEM - Apache HTTP on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:19] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:20] PROBLEM - Apache HTTP on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:20] PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:21] I guess
[14:41:24] sorry if wrong channel, I can't visit all wikis now
[14:41:59] Cohaf: yep it is, we are working on it :)
[14:42:01] Cohaf: that's what happens when the Apache server crashes :)
[14:42:16] Got it, reverting with --force now
[14:42:18] thanks, I had 502 all round
[14:42:25] <_joe_> hoo: damnit yes
[14:42:29] hoo: there was an occurrence of the same problem before, it is probably not your change
[14:42:43] I recalled then I was able to access
[14:42:56] from Singapore
[14:42:58] but let's revert in any case
[14:43:02] (CR) Jcrespo: [C: +1] "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/570792 (https://phabricator.wikimedia.org/T240094) (owner: Marostegui)
[14:43:25] #rip
[14:43:32] !log ladsgroup@mwmaint1002:~$ mwscript createAndPromote.php --wiki=zhwiki --force "Amir Sarabadani (WMDE)" --sysop (T244578)
[14:43:34] hoo: how's the revert ?
[14:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:35] T244578: Tracking task: 2020-02-07 MW API server outage(s) - https://phabricator.wikimedia.org/T244578
[14:43:43] (PS1) Hoo man: Revert "Wikibase Client: Fix setting name typo" [mediawiki-config] - https://gerrit.wikimedia.org/r/570901
[14:43:43] !log hoo@deploy1001 Synchronized wmf-config/Wikibase.php: REVERT: Wikibase Client: Fix setting name typo (T244529) (duration: 01m 40s)
[14:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:46] T244529: mw.wikibase.getLabelByLang not return item label for some items - https://phabricator.wikimedia.org/T244529
[14:43:47] godog: Done
[14:44:07] (CR) Hoo man: [C: +2] "For consistency" [mediawiki-config] - https://gerrit.wikimedia.org/r/570901 (owner: Hoo man)
[14:44:10] <_joe_> we're back
[14:44:19] hoo: thank you
[14:44:48] thanks
[14:44:52] Seems that typo actually hid a very nasty bug :S
[14:44:57] (Merged) jenkins-bot: Revert "Wikibase Client: Fix setting name typo" [mediawiki-config] - https://gerrit.wikimedia.org/r/570901 (owner: Hoo man)
[14:45:02] <_joe_> also it's friday :)
[14:45:44] Yes, that calls for bad luck :S
[14:45:45] RECOVERY - Nginx local proxy to apache on mw1266 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 1.714 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:45:45] RECOVERY - phpfpm_up reduced availability on icinga1001 is OK: (C)0.8 le (W)0.9 le 0.9534 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:45:49] RECOVERY - Apache HTTP on mw1266 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.679 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:45:50] RECOVERY - Apache HTTP on mw1272 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:45:53] RECOVERY - PHP7 rendering on mw1319 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.139 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:45:55] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:45:55] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:45:55] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:45:55] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:45:55] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:45:56] RECOVERY - Apache HTTP on mw1274 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.315 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:45:57] RECOVERY - Nginx local proxy to apache on mw1271 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.412 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:45:57] RECOVERY - Nginx local proxy to apache on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 1.132 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:45:57] RECOVERY - Apache HTTP on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.232 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:45:57] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[14:45:59] RECOVERY - PHP7 rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:45:59] RECOVERY - PHP7 rendering on mw1238 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:45:59] RECOVERY - PHP7 rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:46:00] RECOVERY - Nginx local proxy to apache on mw1275 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.613 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:46:01] RECOVERY - Nginx local proxy to apache on mw1328 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:46:01] RECOVERY - PHP7 rendering on mw1272 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:46:03] RECOVERY - Nginx local proxy to apache on mw1267 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.572 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:46:03] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204,205} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-met
[14:46:03] RECOVERY - PHP7 rendering on mw1327 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:46:03] RECOVERY - PHP7 rendering on mw1332 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:46:05] RECOVERY - Nginx local proxy to apache on mw1242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:46:11] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 71.45 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:46:11] RECOVERY - Apache HTTP on mw1326 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:46:19] RECOVERY - Apache HTTP on mw1256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:46:19] RECOVERY - Nginx local proxy to apache on mw1258 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:46:20] RECOVERY - PHP7 rendering on mw1269 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:46:23] RECOVERY - Nginx local proxy to apache on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.716 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:46:23] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:46:23] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:46:23] RECOVERY - PHP7 rendering on mw1330 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:47:56] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.6208 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[14:48:02] (PS3) Muehlenhoff: Switch logstash hosts to standard Partman recipe [puppet] - https://gerrit.wikimedia.org/r/570600 (https://phabricator.wikimedia.org/T156955)
[14:48:03] RECOVERY - ATS TLS has reduced HTTP availability #page on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1
[14:48:04] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[14:48:04] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:48:04] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:48:14] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:49:04] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[14:49:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:49:18] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:49:19] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[14:49:20] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:49:20] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:49:22] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[14:49:22] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[14:49:22] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[14:49:22] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[14:50:22] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:50:26] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:50:30] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:51:16] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[14:52:16] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[14:52:38] (CR) Muehlenhoff: "@Keith: That was copy&pasta on my end from an older commit message, now fixed." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/570600 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[14:52:46] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[14:53:44] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[14:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:06] (PS2) Elukey: presto: refactor TLS passwords parameter to be more sharable [puppet] - https://gerrit.wikimedia.org/r/570899 (https://phabricator.wikimedia.org/T243312)
[14:54:47] Operations, Traffic, MW-1.35-notes (1.35.0-wmf.18; 2020-02-04), Performance Issue, and 2 others: Time-out error; Babel/WikibaseRepo being somehow uncached, overloading the API, and causing general outage - https://phabricator.wikimedia.org/T243713 (Addshore) Open→Resolved a:Addshore I...
[14:56:00] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:29] (PS3) Elukey: presto: refactor TLS passwords parameter to be more sharable [puppet] - https://gerrit.wikimedia.org/r/570899 (https://phabricator.wikimedia.org/T243312)
[14:59:44] Operations, Wikimedia-Incident: Tracking task: 2020-02-07 MW API server outage(s) - https://phabricator.wikimedia.org/T244578 (CDanis)
[14:59:47] (CR) Aezell: [C: +1] realm.pp: Add watchlist_expiry to private tables [puppet] - https://gerrit.wikimedia.org/r/570792 (https://phabricator.wikimedia.org/T240094) (owner: Marostegui)
[15:00:03] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[15:00:16] Operations, Traffic, Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4031.ulsfo.wmnet'] ` Of which those **FAILED**: ` ['cp4031.ulsfo.wmnet'] `
[15:00:18] (PS1) Vgutierrez: Release 8.0.5-1wm15 [debs/trafficserver] - https://gerrit.wikimedia.org/r/570904 (https://phabricator.wikimedia.org/T244538)
[15:00:30] (CR) jerkins-bot: [V: -1] Release 8.0.5-1wm15 [debs/trafficserver] - https://gerrit.wikimedia.org/r/570904 (https://phabricator.wikimedia.org/T244538) (owner: Vgutierrez)
[15:04:41] (PS4) Elukey: presto: refactor TLS passwords parameter to be more sharable [puppet] - https://gerrit.wikimedia.org/r/570899 (https://phabricator.wikimedia.org/T243312)
[15:07:00] (CR) jerkins-bot: [V: -1] presto: refactor TLS passwords parameter to be more sharable [puppet] - https://gerrit.wikimedia.org/r/570899 (https://phabricator.wikimedia.org/T243312) (owner: Elukey)
[15:08:00] (CR) Marostegui: [C: +2] realm.pp: Add watchlist_expiry to private tables [puppet] - https://gerrit.wikimedia.org/r/570792 (https://phabricator.wikimedia.org/T240094) (owner: Marostegui)
[15:08:21] elukey: is your change good to go?
[15:09:39] (PS5) Elukey: presto: refactor TLS passwords parameter to be more sharable [puppet] - https://gerrit.wikimedia.org/r/570899 (https://phabricator.wikimedia.org/T243312)
[15:09:40] marostegui: for labs_private? yes yes
[15:10:01] yeah, b/hieradata/role/common/analytics_test_cluster/coordinator.yaml
[15:10:08] elukey: and hieradata/common.yaml
[15:10:10] yes thanks!
[15:10:14] ok, merging!
[15:11:54] !log Restart all instances on db2094 and db2095 to pick up a new replication filter - T240094 [15:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:57] T240094: Create required table for new Watchlist Expiry feature - https://phabricator.wikimedia.org/T240094 [15:14:38] (03PS6) 10Elukey: presto: refactor TLS passwords parameter to be more sharable [puppet] - 10https://gerrit.wikimedia.org/r/570899 (https://phabricator.wikimedia.org/T243312) [15:14:52] 10Operations, 10Wikimedia-Incident: Tracking task: 2020-02-07 MW API server outage(s) - https://phabricator.wikimedia.org/T244578 (10CDanis) First outage (14:12-14:17) caused by very expensive templates (thanks Amir for the [[ https://zh.wikipedia.org/wiki/Special:%E7%94%A8%E6%88%B7%E8%B4%A1%E7%8C%AE/Amir_Sara... [15:15:11] 10Operations, 10Wikimedia-Incident: Tracking task: 2020-02-07 MW API server outage(s) - https://phabricator.wikimedia.org/T244578 (10CDanis) 05Open→03Resolved a:03CDanis Followup actionables under discussion but outages over. 
[15:18:42] !log Restart all instances on db1124 and db1125 to pick up a new replication filter - T240094 [15:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:45] T240094: Create required table for new Watchlist Expiry feature - https://phabricator.wikimedia.org/T240094 [15:20:50] !log pooling ncredir3001 running buster - T243391 [15:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:54] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 [15:21:26] !log pooling cp4031 with buster - T242093 [15:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:28] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [15:24:25] (03PS1) 10Elukey: Add fake TLS secrets for the Presto Analytics cluster [labs/private] - 10https://gerrit.wikimedia.org/r/570905 [15:24:45] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake TLS secrets for the Presto Analytics cluster [labs/private] - 10https://gerrit.wikimedia.org/r/570905 (owner: 10Elukey) [15:25:49] !log depool & reimage cp4030 as buster - T242093 [15:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:46] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4030.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima... 
[15:27:04] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/20690/" [puppet] - 10https://gerrit.wikimedia.org/r/570899 (https://phabricator.wikimedia.org/T243312) (owner: 10Elukey) [15:27:34] (03CR) 10Elukey: [C: 03+2] presto: refactor TLS passwords parameter to be more sharable [puppet] - 10https://gerrit.wikimedia.org/r/570899 (https://phabricator.wikimedia.org/T243312) (owner: 10Elukey) [15:29:36] 10Operations, 10Traffic: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 (10Vgutierrez) [15:32:58] (03PS1) 10Andrew Bogott: keystone key sync: update rsync remote path [puppet] - 10https://gerrit.wikimedia.org/r/570908 (https://phabricator.wikimedia.org/T243418) [15:35:26] 10Operations: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (10ayounsi) One use case I have of the install1002 server is: # Download a Junos software image from Juniper to install1002 # Move it to `/srv/junos/` # Fetch it over https with for example: `file copy "htt... [15:35:40] (03CR) 10Andrew Bogott: [C: 03+2] keystone key sync: update rsync remote path [puppet] - 10https://gerrit.wikimedia.org/r/570908 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [15:41:41] (03PS1) 10Elukey: Add presto client to stat and notebook hosts [puppet] - 10https://gerrit.wikimedia.org/r/570909 (https://phabricator.wikimedia.org/T243312) [15:42:47] (03CR) 10Marostegui: "Just to sum up. We had a chat in IRC (Jaime, Riccardo and myself) and we already clarified that this is only used for the switchover." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/570161 (https://phabricator.wikimedia.org/T243935) (owner: 10Volans) [15:47:54] (03CR) 10Eevans: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/570862 (https://phabricator.wikimedia.org/T242585) (owner: 10Filippo Giunchedi) [15:48:40] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [15:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:49] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 520 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:48:55] (03CR) 10Jbond: [C: 03+1] "LGTM" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/570887 (owner: 10Muehlenhoff) [15:50:58] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:10] (03PS1) 10Cparle: Make sure constraints are defined for commons as well as wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570912 (https://phabricator.wikimedia.org/T244572) [15:53:33] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Export DEBIAN_FRONTEND=noninteractive in the debdeploy client [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/570887 (owner: 10Muehlenhoff) [15:53:34] 10Operations, 10netops: Upgrade routers - https://phabricator.wikimedia.org/T243080 (10ayounsi) Current dates are: Feb. 11th - 21:00UTC - 1h - cr1-eqsin - eqsin will be depooled (this is when eqsin sees the less traffic) Feb. 
12th - 13:00UTC - 2h - cr2/3-esams [15:53:52] 10Operations, 10Traffic, 10Patch-For-Review: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (10Vgutierrez) To mitigate the DNS based delay that can be seen on the milestone log, I've backported https://github.com/apache/trafficserver/pull/6332 in https://gerrit.w... [15:53:53] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 33 probes of 520 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:55:29] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4030.ulsfo.wmnet'] ` Of which those **FAILED**: ` ['cp4030.ulsfo.wmnet'] ` [15:57:10] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Make sure constraints are defined for commons as well as wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570912 (https://phabricator.wikimedia.org/T244572) (owner: 10Cparle) [15:58:45] (03CR) 10Hnowlan: [C: 03+2] cassandra: default to sending logs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/570862 (https://phabricator.wikimedia.org/T242585) (owner: 10Filippo Giunchedi) [16:00:14] (03CR) 10Hnowlan: [C: 03+1] "This LGTM but I'm not sure I'm qualified to +2. 
It'd be nice to hold off until Monday when pcc will hopefully not show errors for these ho" [puppet] - 10https://gerrit.wikimedia.org/r/570094 (https://phabricator.wikimedia.org/T244178) (owner: 10Clarakosi) [16:01:58] !log andrew@deploy1001 Started deploy [horizon/deploy@bc777d6]: Fix for T243422 [16:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:03] T243422: Horizon hiera UI: investigate data type handling - https://phabricator.wikimedia.org/T243422 [16:03:11] !log removing GRE MTU mitigations from cp[135]xxx - T232602 [16:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:15] T232602: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 [16:04:36] !log pooling cp4030 with buster - T242093 [16:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:38] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [16:05:23] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [16:05:43] !log andrew@deploy1001 Finished deploy [horizon/deploy@bc777d6]: Fix for T243422 (duration: 03m 45s) [16:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:18] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Investigate using the rich_data opsion to support Binary and binary_file for binary data - https://phabricator.wikimedia.org/T236481 (10Aklapper) >>! In T236481#5851908, @jbond wrote: > for some reason this change is not attached to the ticket https... 
[16:06:37] (03PS2) 10Elukey: Add presto client to stat and notebook hosts [puppet] - 10https://gerrit.wikimedia.org/r/570909 (https://phabricator.wikimedia.org/T243312) [16:06:40] (03PS2) 10Ayounsi: Add option to prepend our AS# to peers [homer/public] - 10https://gerrit.wikimedia.org/r/569639 [16:06:56] (03PS6) 10Aklapper: puppet_compiler: add rich_data support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/557050 (https://phabricator.wikimedia.org/T236481) (owner: 10Jbond) [16:07:38] (03PS2) 10Ayounsi: Add option to clamp TCP-MSS [homer/public] - 10https://gerrit.wikimedia.org/r/569636 [16:07:43] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/570600 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [16:08:25] (03PS3) 10Ayounsi: Add option to clamp TCP-MSS [homer/public] - 10https://gerrit.wikimedia.org/r/569636 [16:10:41] (03PS3) 10Elukey: Add presto client to stat and notebook hosts [puppet] - 10https://gerrit.wikimedia.org/r/570909 (https://phabricator.wikimedia.org/T243312) [16:12:32] (03PS11) 10Hnowlan: mediawiki: check mw versions match those on the deploy server [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) [16:13:32] (03PS4) 10Elukey: Add presto client to stat and notebook hosts [puppet] - 10https://gerrit.wikimedia.org/r/570909 (https://phabricator.wikimedia.org/T243312) [16:13:57] !log remove MSS clamping from eqiad/eqord/knams/esams [16:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:08] (03CR) 10Herron: [C: 03+1] Switch logstash hosts to standard Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/570600 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [16:16:10] (03CR) 10CDanis: [V: 03+2 C: 03+2] add cdanis as super-user, also add 'next UID' tracker comment [homer/public] - 10https://gerrit.wikimedia.org/r/570437 (owner: 10CDanis) [16:16:33] (03CR) 10Elukey: 
[C: 03+2] Add presto client to stat and notebook hosts [puppet] - 10https://gerrit.wikimedia.org/r/570909 (https://phabricator.wikimedia.org/T243312) (owner: 10Elukey) [16:17:15] is gerrit incredibly slow for anyone else? [16:17:31] works fine for me [16:18:04] 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Release-Engineering-Team-TODO, and 2 others: Horizon hiera UI: investigate data type handling - https://phabricator.wikimedia.org/T243422 (10Andrew) 05Open→03Resolved This is quite a bit better now. [16:18:08] 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Release-Engineering-Team-TODO, and 2 others: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (10Andrew) [16:20:15] (03PS1) 10Muehlenhoff: Bump debian/changelog for new release [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/570917 [16:22:25] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/570600 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [16:22:52] (03PS1) 10Elukey: profile::presto::client: set file ownership to root:root [puppet] - 10https://gerrit.wikimedia.org/r/570918 [16:24:29] (03PS12) 10Hnowlan: mediawiki: check mw versions match those on the deploy server [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) [16:25:01] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 38 probes of 603 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:25:21] (03CR) 10Addshore: "Reason for outage and revert in https://phabricator.wikimedia.org/T244529#5859927" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570901 (owner: 10Hoo man) [16:25:35] (03CR) 10Muehlenhoff: [C: 03+2] Bump debian/changelog for new release [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/570917 
(owner: 10Muehlenhoff) [16:29:01] (03CR) 10Elukey: [C: 03+2] profile::presto::client: set file ownership to root:root [puppet] - 10https://gerrit.wikimedia.org/r/570918 (owner: 10Elukey) [16:30:29] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 3 probes of 603 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:31:34] (03PS1) 10Filippo Giunchedi: Add grafana2001 [dns] - 10https://gerrit.wikimedia.org/r/570920 (https://phabricator.wikimedia.org/T244357) [16:32:06] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Investigate using the rich_data option to support Binary and binary_file for binary data - https://phabricator.wikimedia.org/T236481 (10Aklapper) [16:34:26] (03PS2) 10Filippo Giunchedi: Add grafana2001 [dns] - 10https://gerrit.wikimedia.org/r/570920 (https://phabricator.wikimedia.org/T244357) [16:36:19] (03CR) 10Filippo Giunchedi: [C: 03+2] Add grafana2001 [dns] - 10https://gerrit.wikimedia.org/r/570920 (https://phabricator.wikimedia.org/T244357) (owner: 10Filippo Giunchedi) [16:38:54] !log filippo@cumin1001 START - Cookbook sre.ganeti.makevm [16:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:58] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/570889 (https://phabricator.wikimedia.org/T238053) (owner: 10Jbond) [16:42:17] (03CR) 10Muehlenhoff: "@akosiaris: Fine to simply ditch oresrdb* entries here for now?" 
[puppet] - 10https://gerrit.wikimedia.org/r/566290 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [16:46:51] !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [16:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:43] (03PS1) 10Filippo Giunchedi: Add grafana2001 [puppet] - 10https://gerrit.wikimedia.org/r/570925 (https://phabricator.wikimedia.org/T244357) [16:53:08] (03PS2) 10Filippo Giunchedi: Add grafana2001 [puppet] - 10https://gerrit.wikimedia.org/r/570925 (https://phabricator.wikimedia.org/T244357) [16:54:36] (03CR) 10Filippo Giunchedi: [C: 03+2] Add grafana2001 [puppet] - 10https://gerrit.wikimedia.org/r/570925 (https://phabricator.wikimedia.org/T244357) (owner: 10Filippo Giunchedi) [17:01:55] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10Halfak) I just deployed a change to beta that dramatically reduced the memory usage of uwsgi processes. See https://github.com/wikimedia/ores/pull/337 This change address... [17:07:09] (03PS1) 10Elukey: presto: fix client settings for stat/notebook hosts [puppet] - 10https://gerrit.wikimedia.org/r/570927 [17:12:44] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Jclark-ctr) performed flea power drain. powered on host [17:13:20] (03CR) 10Elukey: [C: 03+2] presto: fix client settings for stat/notebook hosts [puppet] - 10https://gerrit.wikimedia.org/r/570927 (owner: 10Elukey) [17:13:24] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Marostegui) Thanks - IPMI is back. I will take it from here. Thank you! 
[17:13:36] 10Operations: Upgrade ping VMs to buster - https://phabricator.wikimedia.org/T244584 (10ayounsi) p:05Triage→03Low [17:15:36] 10Operations: Upgrade rpki VMs to buster - https://phabricator.wikimedia.org/T244585 (10ayounsi) p:05Triage→03Low [17:17:25] folks, anyone following along: unscheduled deploy happening now (rollback to .16 on groups 0, 1) [17:18:23] twentyafterfour I'll watch and be in here now [17:19:04] !log Start MySQL on es1019 after onsite maintenance T243963 [17:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:08] T243963: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 [17:19:09] (03PS1) 1020after4: rolling back all wikis to 1.35.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570928 [17:20:16] (03CR) 1020after4: [C: 03+2] rolling back all wikis to 1.35.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570928 (owner: 1020after4) [17:21:33] (03Merged) 10jenkins-bot: rolling back all wikis to 1.35.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570928 (owner: 1020after4) [17:22:38] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: roll back all wikis to 1.35.0-wmf.16 refs T233866 [17:22:39] ok it's sync'd .. saw a small increase in 60 second timeout exceptions [17:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:41] T233866: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 [17:23:08] but I don't think we broke production with this one?
[17:23:13] twentyafterfour: I see something about google vision something else [17:23:24] TypeError from line 36 of /srv/mediawiki/php-1.35.0-wmf.16/extensions/MachineVision/src/Handler/GoogleCloudVisionHandler.php: Argument 4 passed to MediaWiki\Extension\MachineVision\Handler\GoogleCloudVisionHandler::__construct() must be an instance of MediaWiki\Extension\MachineVision\Handler\WikidataDepictsSetter, instance of MediaWiki\Extension\MachineVision\Handler\LabelResolver given, called in [17:23:24] /srv/mediawiki/php-1.35.0-wmf.16/vendor/wikimedia/object-factory/src/ObjectFactory.php on line 184 [17:23:37] quite a low volume but there and new [17:23:48] ack ^ [17:23:56] we'll have to revert a config change that was applied for wmf.18 [17:24:34] mdholloway: which one? :) [17:24:56] yeah we definitely increased error rate with a rollback to wmf.16 ... we really need to fix the way config is done. config should be included with the branch rather than separate, IMO [17:25:03] grrrrr [17:25:15] twentyafterfour: +1 to that [17:25:20] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:25:32] It looks like this exception is the only one [17:25:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1019 after on-site maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10349 and previous config saved to /var/cache/conftool/dbconfig/20200207-172541-marostegui.json [17:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:45] T243963: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 [17:25:59] so the thing is we rolled back all wikis to .16 but before only group 2 was rolled back? 
[17:26:08] correct [17:26:20] should we have only touched group 2 also this time? [17:26:38] or was the point to fix not being on the same version [17:26:40] mutante: that wouldn't have solved the problem [17:26:44] *nod* [17:26:45] no; groups 0 and 1 were producing rendered pages with garbage data [17:26:49] ok [17:26:50] and that needed to be stopped [17:26:59] (03PS1) 10Mholloway: Revert "Remove handler deleted from the MachineVision extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570929 [17:27:10] addshore: ^ [17:27:13] twentyafterfour: ^^ [17:27:17] so other fallout will have to be dealt with until monday, that or we go with the roll forward option which seemed to be less favored by everyone [17:27:35] mdholloway: thanks [17:27:49] (03CR) 1020after4: [C: 03+2] Revert "Remove handler deleted from the MachineVision extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570929 (owner: 10Mholloway) [17:28:14] well.. the fatals alert usually goes off during any deploy.. the question is does it recover soon or not [17:28:18] yes, the thought was that rolling back to 16 would land us in a previous known good state [17:28:37] well let's see what this additional config update does for things [17:28:49] (03Merged) 10jenkins-bot: Revert "Remove handler deleted from the MachineVision extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570929 (owner: 10Mholloway) [17:29:01] I think we are ok, deploying this config change now [17:29:07] at least the last bar on https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&from=1581096344918&to=1581096527700 is smaller [17:29:10] also, +1 from me for having a way of tying config changes to branches [17:30:06] per wiki is tricky in cases like this where we need to move backwards. 
[17:30:35] fatals spike in grafana going down [17:30:52] hopefully more than that though [17:31:29] once the sync-file is complete, then we'll worry :) [17:31:34] heh [17:31:39] in this case our hope was that if we waited a while after wmf.18 settled on group1 before deploying the related code change to wmf.18 via backport, and the config change, we'd be safe from rollback [17:31:47] nope :) [17:31:53] :-D [17:31:59] you shall never be safe!! [17:32:03] (the extension is only active on commons, hence group1 being our concern) [17:32:17] yep, it's only exceptions for commons [17:32:18] !log twentyafterfour@deploy1001 Synchronized wmf-config/InitialiseSettings.php: sync https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/570929 refs T233866 (duration: 01m 02s) [17:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:21] T233866: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 [17:32:50] mdholloway: thanks for paying attention today [17:33:06] :) [17:33:10] I still see them rolling in? could be the IS.php sync bug? [17:33:10] 💯 [17:33:53] yeah I still see the same exception after syncing the config [17:34:10] how can that be checked? (sync bug or not) [17:34:11] touch wmf-config/InitialiseSettings.php; do the sync again. [17:34:18] ah ha [17:34:21] doing that [17:34:35] hooray for bugs on bugs on bugs............
[17:34:37] can be checked by looking at the cache on the machine still reporting the error [17:34:44] * thcipriani does [17:34:56] too many stacks on the item [17:36:13] !log twentyafterfour@deploy1001 Synchronized wmf-config/InitialiseSettings.php: sync InitializeSettings again for lols refs T233866 (duration: 01m 03s) [17:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:23] looks like they are gone [17:36:29] hey I think that worked [17:36:30] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Marostegui) John found this: https://www.dell.com/support/article/es/es/esbsdt1/sln316859/idrac7-idrac8-idrac-unresponsive-or-sluggish-performance?lang=en which is an update from May 2019, so mayb... [17:36:34] aaaah, silence [17:36:38] yep, looks like that stopped it [17:36:45] waiting for a minute to go by so I can reload grafana [17:37:28] thanks for jumping in mdholloway and thanks to everyone who has been standing by. very good teamwork all! [17:38:06] Thanks for the deploy! [17:38:08] who on the dev side can stick around to babysit just in case?
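The "touch the file, then sync again" workaround above makes sense if the sync's change detection is mtime-based; that is an assumption here, and the paths below are illustrative only, not the production layout. A minimal sketch of the intuition:

```python
# Sketch: `touch` bumps a file's mtime without changing its content, so an
# mtime-keyed cache (assumed here) sees the file as "changed" and rebuilds.
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    cfg = os.path.join(tmp, "InitialiseSettings.php")  # illustrative name
    with open(cfg, "w") as f:
        f.write("<?php\n")
    before = os.path.getmtime(cfg)
    # Emulate `touch` ten seconds later: same content, newer mtime.
    os.utime(cfg, (before + 10, before + 10))
    after = os.path.getmtime(cfg)
    print(after > before)  # prints True
```

Content-hash-based change detection would not be fooled this way, which is why the trick only works for mtime-style caches.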
I've already said I can be here for another 1 hour on the sre side [17:38:16] * addshore is here all evening [17:38:18] i'll be here [17:38:22] awesome [17:38:26] but then, i really need to go skiing :P [17:38:27] grafana looking MUCH better [17:38:38] expecting icinga recovery [17:38:39] \o/ [17:38:40] logstash now looking good, and I'll be around, it's midday here [17:38:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1019 after on-site maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10350 and previous config saved to /var/cache/conftool/dbconfig/20200207-173850-marostegui.json [17:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:53] T243963: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 [17:38:56] thanks everybody [17:39:10] here's hoping for a quiet rest of friday/weekend for all! [17:39:12] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:39:18] indeed, nicely handled a bad situation [17:39:25] apergos: +1000 [17:39:38] there it is. rescheduling one more icinga check [17:39:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1016 crash - https://phabricator.wikimedia.org/T241882 (10JHedden) This host has been taken out of service, we can perform maintenance on it anytime. [17:41:00] is there somewhere we are keeping suggestions/questions/action items from the last couple days? [17:41:12] like 'config + branch deploy and roll back together' [17:42:00] or " I wonder if we should default to rolling back _all_ wikis when a group2 rollout kills prod. I only rolled back group2 since group0/1 didn't have any noticeable issues. Should I have defaulted to roll back all?"
from twentyafterfour [17:42:10] we can move this out of operations and back to -sre if folks like [17:42:25] the incident report? [17:42:27] apergos: probably in the incident followups [17:42:36] great minds ^ [17:43:12] (03CR) 10CDanis: [C: 03+1] Add option to clamp TCP-MSS [homer/public] - 10https://gerrit.wikimedia.org/r/569636 (owner: 10Ayounsi) [17:43:15] fatals alert came back shortly for another spike.. but is over now [17:43:24] i was going to tag T233866 with the patch putting back the MachineVision config change; i'll note it in the incident report followups as well [17:43:24] T233866: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 [17:43:28] yeah .... I think that question is task worthy. I've usually defaulted to rolling back just group2 unless rollback of all was more expedient but it's unfortunate that either choice has uncertain consequences depending on the week [17:43:44] https://wikitech.wikimedia.org/wiki/Incident_documentation/20200206-mediawiki leaving this here for folks to add to [17:43:52] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:44:04] uh oh [17:44:09] do not like [17:44:38] (03CR) 10CDanis: [C: 03+1] "lg w/ optional nit" (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/569639 (owner: 10Ayounsi) [17:44:52] see grafana..
there are 3 spikes [17:44:55] 2 are the deploys [17:45:03] and then another smaller one, but big enough to cause alerts [17:45:30] the icinga alert is just always delayed a bit and i made it "check right now" [17:45:51] in grafana, past 2 hours shows things more clearly [17:46:07] it looks like the latency increased _before_ the rollback to wmf.16 [17:46:27] unless the deploy markers are inaccurate [17:46:36] the deploy markers are at the _end_ of the scap invocation [17:46:42] so it's hard to be sure [17:46:43] I was going to say that [17:46:44] right [17:47:05] I've been meaning to change the way scap logs things so that we get a marker for the start of sync-wikiversions as well as the end of it [17:47:10] \o/ [17:47:17] +1 [17:47:18] twentyafterfour: add it to the list! :) [17:47:19] that would be wonderful [17:47:37] I think other scap commands log at the beginning and end, it's just wikiversions that doesn't [17:47:39] plus there is also some uncertainty/lag inherent to metric collecting [17:47:46] right [17:47:51] we could go see when jenkins merged the patch and guess 1 minute after (for these ones) if we want to be precise [17:48:03] so is this increase bad enough to worry about or should we leave it be [17:49:09] I think let it sit for the moment [17:49:59] the recovery should be coming in shortly [17:50:09] the last 3 hour view in grafana looks like it's back to normal [17:50:27] i tried rescheduling the check.. it takes a minute [17:51:07] Hmm. Did any of the activities over the last 12-24 hours end up running scap-sync-l10n, or otherwise regenerating the CDB files on the Group 1 wikis _after_ the normal deploy on Wednesday? [17:51:15] it's mostly back to normal but there is a slight increase I think [17:51:16] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds.
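The "marker at the start as well as the end" idea discussed above can be sketched as a wrapper that emits both events around the sync. This is a stand-in only: `sal_log` and the message format here are hypothetical, not scap's real SAL/!log API.

```python
# Hedged sketch of logging START and END markers around sync-wikiversions,
# so dashboards can bracket the deploy window instead of guessing from the
# end-of-run marker alone. All names/formats here are illustrative.
import time

events = []  # stand-in for the Server Admin Log

def sal_log(message):
    events.append(message)  # production would post to the SAL instead

def sync_wikiversions(message):
    sal_log(f"START - sync-wikiversions: {message}")
    t0 = time.monotonic()
    # ... rebuild and distribute the wikiversions files here ...
    duration = time.monotonic() - t0
    sal_log(f"END - sync-wikiversions: {message} (duration: {duration:.0f}s)")

sync_wikiversions("roll back all wikis to 1.35.0-wmf.16")
print(len(events))  # prints 2
```

With both markers logged, the "did latency rise before or after the sync?" question above becomes answerable directly from the timestamps.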
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:51:26] there's the recovery... whew [17:51:41] xover: probably not, I will do that now [17:51:55] anyone object to syncing l10n now? [17:51:59] icinga is always about 10(?) minutes behind grafana [17:52:08] (there's a problem that's mysteriously gone now, that we expected to have to live with until next week) [17:52:19] no objection from me [17:52:39] xover: should I _not_ sync l10n? [17:52:49] https://wikitech-static.wikimedia.org/wiki/Server_Admin_Log i don't see anything here [17:53:01] xover: care to elaborate about the l10n problem? [17:53:01] I know nothing and object to nothing! :-) [17:53:03] we fixed another unexpected thing ? nice [17:53:21] "problem ..gone" is a good thing [17:53:26] fixing things is good, unexpectedly fixing things is disturbing [17:53:48] twentyafterfour: T240858 [17:53:49] T240858: Clean up implementation for "follow" cases - https://phabricator.wikimedia.org/T240858 [17:54:26] https://phabricator.wikimedia.org/T240858 ah ha you got there first [17:54:27] hrmm.. there is another spike https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&from=now-3h&to=now [17:54:36] to about 50 [17:54:43] yeah I saw this discussion earlier, another thing to add to the ever-growing list [17:54:52] (discussion in releng scrollback) [17:54:55] but that also happened about 2h50m ago without a deploy [17:55:56] I was just checking and the previously missing interface message now appeared to be there. [17:56:30] ErrorException from line 228 of /srv/mediawiki/php-1.35.0-wmf.16/vendor/psy/psysh/src/ConfigPaths.php: PHP Notice: Writing to /home/urbanecm/.config/psysh is not allowed. [17:56:33] wth? 
[17:56:53] wut [17:56:55] Hence the question of whether any of the other activity might have re-run the l10n sync, because otherwise I don't understand how this got un-broken. [17:57:04] that's a pretty strange thing to see in logstash [17:57:08] ok really?? [17:57:11] twentyafterfour: I guess that's because I have shell.php opened? [17:57:19] oh it's a maintenance script doh [17:57:21] lol [17:57:37] It said enwiki and I didn't see mwmaint1002 [17:57:39] even maintenance scripts should not write to home dirs [17:57:39] was getting ready to seriously side-eye you :-D [17:57:44] they should write to /var/log/mediawiki/ [17:58:14] if those were logs.. otherwise ignore me [17:58:36] twentyafterfour: xover I don't know how we would have the .16 l10n files now... it's a good question [18:02:08] icinga cleaned up again - except the usual "netbox reports" that seem to be constantly alerting [18:05:15] where do the l10n files live again? I used to know this [18:06:48] in /srv/mediawiki/cache [18:06:54] er /branch/cache [18:07:41] branch [18:07:50] I went to /srv/mediawiki and failed [18:08:47] /srv/mediawiki/php-1.35.0-wmf.16/cache/l10n/ [18:08:54] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 38 probes of 521 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:10:03] ok but so, ... hmm... confused. why, when we roll back, would there be any issue with the l10n files on the rolled back wikis [18:10:19] I mean wouldn't we expect everything to be fine? [18:10:30] Group 1 wasn't rolled back was it? [18:10:34] apergos: I would expect it to be fine, yes [18:10:38] xover: all groups rolled back [18:10:44] everything is on .16 now [18:10:50] Oh, then that explains it.
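[Editor's note: the question above ("how would we have the .16 l10n files now") comes down to when the CDB cache under /srv/mediawiki/<branch>/cache/l10n/ was last rewritten. A small sketch, assuming only that the cache is a flat directory of `*.cdb` files; the helper name is made up:]

```python
import os
import time

def newest_cdb_age_minutes(l10n_dir):
    """Minutes since the newest l10n CDB file in l10n_dir was (re)written.

    A small age right after an incident suggests the cache was just
    regenerated; a large one means nothing has touched it since the deploy.
    Returns None if the directory holds no CDB files.
    """
    mtimes = [
        os.path.getmtime(os.path.join(l10n_dir, name))
        for name in os.listdir(l10n_dir)
        if name.endswith(".cdb")
    ]
    if not mtimes:
        return None
    return (time.time() - max(mtimes)) / 60

# e.g. newest_cdb_age_minutes("/srv/mediawiki/php-1.35.0-wmf.16/cache/l10n")
```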
[18:10:54] :D [18:10:55] so xover that's the answer to your question [18:11:00] sorry if that wasn't clear [18:11:07] thank you ocd, sometimes you are useful [18:11:17] Excellent, thanks! [18:12:10] * twentyafterfour wishes it were easier to communicate the specific status of all deployments. maybe every !log message should include a link to https://tools.wmflabs.org/versions/ [18:12:20] heh [18:12:34] TIL that link, thanks! [18:13:01] it's my go-to link too for all this [18:14:00] is the train still blocked twentyafterfour ? [18:14:17] (magic 8-ball says: yes) [18:14:20] hauskatze: essentially yes until Monday at least [18:14:28] apergos: :D [18:14:33] ;-) [18:14:38] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 34 probes of 521 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:14:52] twentyafterfour: ack, I'll keep an eye on the status then. I may need to cherry-pick a patch to wmf.16 if it is still blocked by then [18:17:21] hauskatze: sure, there should be no problem cherry-picking on Monday just let me know [18:17:38] PROBLEM - Host cloudvirt1016.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:18:50] RECOVERY - Host cloudvirt1016.mgmt is UP: PING WARNING - Packet loss = 66%, RTA = 1.18 ms [18:19:15] i gotta go for a while. be back later [18:19:49] logs for cloudvirt1016 are in -cloud it looks [18:20:11] known and they will downtime it [18:20:17] dcops doing work or so [18:20:55] I'm around for about another 15 minutes and then finally going into lurk mode [18:21:37] anyone doing work related to grafana2001?
[18:23:14] just saw 800503d6c43fc9ec611532af705d49ae7d12513d, all good [18:23:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1016 crash - https://phabricator.wikimedia.org/T241882 (10Jclark-ctr) replaced failed dimm A8 [18:26:50] fixed: https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-job=grafana2001.codfw.wmnet-Monthly-1st-Fri-production-var-lib-grafana&from=1581096384344&to=1581099984344 [18:30:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1022 memory errors causing host to crash - https://phabricator.wikimedia.org/T243536 (10JHedden) This server has running workloads that need to be drained prior to maintenance. I'll schedule a maintenance window and get it ready. [18:30:39] (03PS1) 10Mholloway: Revert "Revert "Remove handler deleted from the MachineVision extension"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570935 [18:31:09] (03CR) 10Mholloway: [C: 04-2] "HOLD - Deploy immediately after deploying https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/MachineVision/+/570932/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570935 (owner: 10Mholloway) [18:35:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1016 crash - https://phabricator.wikimedia.org/T241882 (10JHedden) 05Open→03Resolved Thanks, @Jclark-ctr for replacing the DIMM. I verified the host looks good now. [18:35:55] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531 (10JHedden) [18:36:23] (03CR) 10Faidon Liambotis: [C: 04-1] "Are you sure this works as intended? 
I wouldn't expect an MX to change TCP flags on forwarded traffic, and would only expect that to apply" (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/569636 (owner: 10Ayounsi) [18:42:19] and my hour of attention span has come to an end. I may be responsive to pings but with delay. see ya! [18:52:03] 10Operations, 10ops-esams: Terminate OE10,11,12,13 Racks - https://phabricator.wikimedia.org/T237055 (10wiki_willy) Followed up with Iron Mountain, who responded back that their Legal team wouldn't allow moving OE14 to the same contract as OE15,16, regardless of how much our account rep stressed that it would... [19:28:02] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Maps (Tilerator): Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939 (10Mholloway) It looks like, in the course of dealing with {T239728}, replication was increased to twice daily on 2019/12/19 [ [[ https://gerrit.wikimedia... [19:44:06] (03CR) 10Jforrester: [C: 03+1] "Good to go whenever." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570787 (https://phabricator.wikimedia.org/T244561) (owner: 10Reedy) [19:45:26] (03CR) 10Brion VIBBER: "The docker image is working well on the other patch, will finish this one up in a bit this weekend." 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/569341 (https://phabricator.wikimedia.org/T166024) (owner: 10Brion VIBBER) [19:45:58] (03PS6) 10Krinkle: mediawiki: Add reqId/file/line to php7-fatal-error.php's 'message' field [puppet] - 10https://gerrit.wikimedia.org/r/554599 [19:46:06] (03CR) 10CDanis: [C: 03+1] "> Patch Set 3: Code-Review-1" (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/569636 (owner: 10Ayounsi) [19:51:10] (03PS1) 10Jhedden: ceph: configure jumbo frames on OSD interfaces [puppet] - 10https://gerrit.wikimedia.org/r/570947 (https://phabricator.wikimedia.org/T225320) [20:02:03] (03CR) 10Bstorm: "> Patch Set 2: Code-Review+1" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570785 (owner: 10BryanDavis) [20:07:03] (03CR) 10Bstorm: [C: 03+1] "excited to see how this works out :)" [puppet] - 10https://gerrit.wikimedia.org/r/570947 (https://phabricator.wikimedia.org/T225320) (owner: 10Jhedden) [20:11:12] (03CR) 10BryanDavis: "> We use macs though, no gbp dch command." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570785 (owner: 10BryanDavis) [20:12:20] (03CR) 10Bstorm: "> Patch Set 2:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570785 (owner: 10BryanDavis) [20:31:10] (03PS2) 10Jhedden: ceph: configure jumbo frames on OSD interfaces [puppet] - 10https://gerrit.wikimedia.org/r/570947 (https://phabricator.wikimedia.org/T225320) [20:32:42] !log ganeti: attempting to reinstall install1003 which failed last time [20:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:42] (03CR) 10Jhedden: [C: 03+2] "PCC results: https://puppet-compiler.wmflabs.org/compiler1001/20699/" [puppet] - 10https://gerrit.wikimedia.org/r/570947 (https://phabricator.wikimedia.org/T225320) (owner: 10Jhedden) [20:40:24] (03PS3) 10Bstorm: d/changelog: Prepare for 0.59 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570785 (owner: 10BryanDavis) [20:42:05] !log ceph: OSD failover and recovery testing on cloudcephosd1003.wikimedia.org T240718 [20:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:08] T240718: Perform failover tests on Ceph storage cluster - https://phabricator.wikimedia.org/T240718 [20:43:43] 10Operations, 10vm-requests: VM requests for install_server replacements - https://phabricator.wikimedia.org/T244390 (10Dzahn) 05Open→03Resolved VMs have been created. OS install now worked at second attempt. 
[20:43:46] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) [20:44:47] (03CR) 10Bstorm: "Updated with that change suggested by Arturo" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570785 (owner: 10BryanDavis) [20:46:31] (03CR) 10Bstorm: "> Patch Set 3:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570785 (owner: 10BryanDavis) [20:46:54] (03CR) 10BryanDavis: [C: 03+1] d/changelog: Prepare for 0.59 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570785 (owner: 10BryanDavis) [20:47:19] (03CR) 10Bstorm: [C: 03+2] d/changelog: Prepare for 0.59 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570785 (owner: 10BryanDavis) [20:47:22] !log OS install on new install_server VMs worked on second attempt, issues are gone. signed puppet certs for install1003.eqiad.wmnet, install2003.codfw.wmnet, initial puppet runs (T224576) [20:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:25] T224576: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 [20:50:09] (03PS2) 10Bstorm: Run Python tests using pytest, not nose [puppet] - 10https://gerrit.wikimedia.org/r/568856 (owner: 10Legoktm) [20:51:49] (03CR) 10Bstorm: [C: 03+1] "I love this. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/568856 (owner: 10Legoktm) [21:05:03] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:07:26] (03PS10) 10Joal: Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) [21:09:53] (03CR) 10jerkins-bot: [V: 04-1] Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) (owner: 10Joal) [21:10:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 34 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:13:19] (03PS11) 10Joal: Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) [21:14:03] (03CR) 10Faidon Liambotis: [C: 04-1] "> > Patch Set 3: Code-Review-1" (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/569636 (owner: 10Ayounsi) [21:16:36] (03CR) 10jerkins-bot: [V: 04-1] Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) (owner: 10Joal) [21:20:54] (03PS12) 10Joal: Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) [21:21:03] (03CR) 10CDanis: [C: 03+1] "> Patch Set 3:" [homer/public] - 10https://gerrit.wikimedia.org/r/569636 (owner: 10Ayounsi) [21:55:09] 
10Operations, 10netops: some outbound is TCP failing from fundraising cluster as of approx 2020-02-07 16:15UTC - https://phabricator.wikimedia.org/T244610 (10Jgreen) [21:55:27] 10Operations, 10netops: some outbound is TCP failing from fundraising cluster as of approx 2020-02-07 16:15UTC - https://phabricator.wikimedia.org/T244610 (10Jgreen) p:05Triage→03Unbreak! [21:58:29] 10Operations, 10Beta-Cluster-Infrastructure, 10RESTBase, 10Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (10Pchelolo) RESTBase itself seem to be working correctly. Something is wrong with routing before RESTBase. If you look at https://en.wikipedia.beta.wmfl... [22:06:41] (03PS1) 10Jhedden: ceph: configure osd cluster network on boot [puppet] - 10https://gerrit.wikimedia.org/r/570954 (https://phabricator.wikimedia.org/T225320) [22:09:10] (03CR) 10Jhedden: [C: 03+2] "PCC results https://puppet-compiler.wmflabs.org/compiler1001/20700/" [puppet] - 10https://gerrit.wikimedia.org/r/570954 (https://phabricator.wikimedia.org/T225320) (owner: 10Jhedden) [22:10:04] (03PS2) 10Jhedden: ceph: configure osd cluster network on boot [puppet] - 10https://gerrit.wikimedia.org/r/570954 (https://phabricator.wikimedia.org/T225320) [22:12:49] (03CR) 10Jhedden: "PCC results https://puppet-compiler.wmflabs.org/compiler1003/20701/" [puppet] - 10https://gerrit.wikimedia.org/r/570954 (https://phabricator.wikimedia.org/T225320) (owner: 10Jhedden) [22:12:59] (03CR) 10Jhedden: [C: 03+2] ceph: configure osd cluster network on boot [puppet] - 10https://gerrit.wikimedia.org/r/570954 (https://phabricator.wikimedia.org/T225320) (owner: 10Jhedden) [22:15:29] (03PS1) 10BryanDavis: Fix local 'type' variable shadowing global type() function [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570957 (https://phabricator.wikimedia.org/T244611) [22:15:56] (03CR) 10Faidon Liambotis: [C: 04-1] "> I do think it's worth noting that this was applied at the router" [homer/public] - 
10https://gerrit.wikimedia.org/r/569636 (owner: 10Ayounsi) [22:16:47] James_F: is it possible to clear stuck Echo notifications? [22:18:44] (03CR) 10BryanDavis: [V: 03+2 C: 03+2] "Tested via hot patch on toolsbeta-sgebastion-04" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570957 (https://phabricator.wikimedia.org/T244611) (owner: 10BryanDavis) [22:19:38] (03Merged) 10jenkins-bot: Fix local 'type' variable shadowing global type() function [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570957 (https://phabricator.wikimedia.org/T244611) (owner: 10BryanDavis) [22:20:57] !log ceph: round 2 OSD failover and recovery testing on cloudcephosd1003.wikimedia.org T240718 [22:21:00] hauskatze: You can delete echo notifs via the db but not sure how sane or safe it is [22:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:03] T240718: Perform failover tests on Ceph storage cluster - https://phabricator.wikimedia.org/T240718 [22:32:33] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 33104928 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:33:37] (03PS1) 10Bstorm: prepare to deploy 0.60 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570959 [22:34:29] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 14488 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:36:24] 10Operations, 10netops: some outbound is TCP failing from fundraising cluster as of approx 2020-02-07 16:15UTC - https://phabricator.wikimedia.org/T244610 (10Dwisehaupt) On our end, I fixed the typo that would have alerted us to this much earlier in the day. [frack::puppet] 6247eb80 Fix typo in critical alter...
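[Editor's note: the tools-webservice bug merged above, "local 'type' variable shadowing global type() function", is a classic Python pitfall. A minimal, made-up illustration (not the actual patch): a parameter named `type` hides the builtin inside the function, so calling `type(...)` there blows up:]

```python
def describe_bad(value, type):
    # The parameter `type` shadows the builtin of the same name, so
    # inside this function `type` is whatever the caller passed in
    # (here a string), and calling it raises TypeError.
    try:
        return type(value).__name__
    except TypeError:
        return "shadowed!"

def describe_good(value, kind):
    # Renaming the parameter restores access to the builtin type().
    return type(value).__name__

assert describe_bad(1, "web") == "shadowed!"
assert describe_good(1, "web") == "int"
```

Linters such as flake8-builtins flag this pattern automatically; renaming the local (as the patch's title suggests was done) is the usual fix.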
[22:45:09] (03CR) 10Bstorm: [C: 03+2] prepare to deploy 0.60 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570959 (owner: 10Bstorm) [23:02:41] hauskatze: Yes, RoanKattouw has done it from time to time [23:04:30] James_F: Alright thanks. I've filed a task and referenced it to the old (same) issue [23:04:36] Kk. [23:04:42] Good weekend [23:06:02] if you have "tftp, dhcp, proxy and APT repo" and call that combination "installserver" and then you take the APT repo part out of it, what do you call the new role? "installserver_light" ? [23:06:39] tftp_dhcp_proxy ? [23:07:58] not-an-installserver? [23:08:05] mutante: #somethingaintgonnawork ? :P [23:08:24] well the actual installserver is the tftp part since it sends the installer [23:08:29] tftp/dhcp [23:08:37] and the APT repo is just a webserver [23:08:45] and supposed to move to a separate VM [23:09:09] so i need a new name for all the installserver things minus APT [23:09:13] installserverless [23:09:18] and seriously wrote installserver_light [23:09:21] but seemed silly [23:09:56] oh well, it's just naming bikeshedding :) [23:11:39] 10Operations, 10Puppet, 10Patch-For-Review: PuppetDB misbehaving on 2017-07-15 - https://phabricator.wikimedia.org/T170740 (10Volans) 05Open→03Resolved a:03Volans Since the last update a lot of things have changed in our PuppetDB installation, including PuppetDB and OS versions. As we didn't have re-occ... [23:16:29] 10Operations, 10Beta-Cluster-Infrastructure, 10RESTBase, 10Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (10Jdforrester-WMF) >>! In T244586#5860963, @Mholloway wrote: > To help isolate possible culprits, when was this last working? Yesterday, but not sure w...
[23:20:48] 10Operations, 10Beta-Cluster-Infrastructure, 10RESTBase, 10Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (10Krenair) Did fix puppet on cache-text05 yesterday, it did a lot of stuff to replace some nginx/varnish stuff with ATS. May be related? [23:21:24] 10Operations, 10Beta-Cluster-Infrastructure, 10RESTBase, 10Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (10Jdforrester-WMF) >>! In T244586#5861250, @Krenair wrote: > Did fix puppet on cache-text05 yesterday, it did a lot of stuff to replace > some nginx/var... [23:21:31] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) [23:22:12] 10Operations, 10Beta-Cluster-Infrastructure, 10RESTBase, 10Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (10Mholloway) It looks like routing for /api/rest_v1/ in Beta is set up in the prefix puppet settings for deployment-cache-text (as seen [[ https://gerri... [23:23:04] 10Operations, 10Beta-Cluster-Infrastructure, 10RESTBase, 10Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (10MarcoAurelio) Not sure T243226 might be related. [23:24:19] 10Operations, 10Beta-Cluster-Infrastructure, 10RESTBase, 10Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (10Jdforrester-WMF) deployment-restbase01.deployment-prep.eqiad.wmflabs reports `The last Puppet run was at Mon Jan 20 10:54:08 UTC 2020 (26669 minutes a... [23:25:03] 10Operations, 10Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (10dpifke) puppetdb on deployment-puppetdb03 was killed by kernel OOM at Feb 7 09:50:29, per syslog. I just now ran `systemctl start puppetdb` on th... 
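[Editor's note: the "last Puppet run was at Mon Jan 20 ... (26669 minutes ago)" figure above is derived from the agent's last-run state file. A sketch of computing that staleness from the file's mtime, assuming the classic open-source agent path (/var/lib/puppet/state/); AIO installs use /opt/puppetlabs/puppet/cache/state/ instead, and the threshold here is an arbitrary example:]

```python
import os
import time

# Assumption: classic Puppet agent state path; adjust for AIO installs.
SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"

def minutes_since_last_run(summary_path=SUMMARY):
    """Minutes since the agent last completed a run, from the file's mtime.

    Returns None if the state file is missing (agent never ran, or the
    path differs on this host).
    """
    try:
        return (time.time() - os.path.getmtime(summary_path)) / 60
    except OSError:
        return None

age = minutes_since_last_run()
if age is None or age > 60 * 24:  # example threshold: one day
    print("puppet agent looks stale; check the agent and puppetdb")
```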
[23:25:15] 10Operations, 10Beta-Cluster-Infrastructure, 10RESTBase, 10Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (10Pchelolo) >>! In T244586#5861273, @Jdforrester-WMF wrote: > deployment-restbase01.deployment-prep.eqiad.wmflabs reports `The last Puppet run was at Mo... [23:26:01] 10Operations, 10Beta-Cluster-Infrastructure, 10RESTBase, 10Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (10Jdforrester-WMF) Ah, right, that's the please-upgrade-puppet task that @MarcoAurelio linked above. [23:27:42] (03PS1) 10BryanDavis: Add missing newline following wait_for() messages [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570963 [23:27:44] (03PS1) 10BryanDavis: Fix undefined backend_clazz reference [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570964 [23:27:46] (03PS1) 10BryanDavis: Minor whitespace and doc string updates [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570965 [23:31:37] (03CR) 10Bstorm: [C: 03+2] Add missing newline following wait_for() messages [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570963 (owner: 10BryanDavis) [23:32:13] (03CR) 10Bstorm: [C: 03+2] "That should do it" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570964 (owner: 10BryanDavis) [23:32:43] (03CR) 10Bstorm: [C: 03+2] "And the easy one" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570965 (owner: 10BryanDavis) [23:36:50] 10Operations, 10Beta-Cluster-Infrastructure, 10observability: Beta puppet patch "prometheus: make ferm DNS record type configurable" - https://phabricator.wikimedia.org/T244624 (10Krinkle) [23:36:53] (03CR) 10Krinkle: "I've filed T244624" [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T153468) (owner: 10Hashar) [23:36:58] (03PS1) 10Bstorm: prepare for release 0.61 to fix a few issues [software/tools-webservice] - 
10https://gerrit.wikimedia.org/r/570966 [23:37:37] (03CR) 10Dave Pifke: [C: 03+1] "This change was successfully deployed & tested in the beta cluster. I used FatalError.php to generate an error after applying, and verifi" [puppet] - 10https://gerrit.wikimedia.org/r/554599 (owner: 10Krinkle) [23:38:20] (03PS2) 10Bstorm: prepare for release 0.61 to fix a few issues [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570966 [23:38:36] (03PS1) 10Dzahn: site: add new install servers with private IP, spare role [puppet] - 10https://gerrit.wikimedia.org/r/570967 (https://phabricator.wikimedia.org/T224576) [23:39:57] (03CR) 10BryanDavis: [C: 03+1] prepare for release 0.61 to fix a few issues [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570966 (owner: 10Bstorm) [23:40:47] (03CR) 10Dzahn: [C: 03+2] site: add new install servers with private IP, spare role [puppet] - 10https://gerrit.wikimedia.org/r/570967 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [23:40:58] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/20704/install1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/570967 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [23:42:17] (03CR) 10Bstorm: [C: 03+2] prepare for release 0.61 to fix a few issues [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/570966 (owner: 10Bstorm) [23:43:43] 10Operations, 10Traffic, 10Patch-For-Review: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (10Krinkle) [23:43:46] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10Krinkle) [23:55:17] (03PS1) 10Dzahn: installserver: introduce new role without APT, rename existing role [puppet] - 10https://gerrit.wikimedia.org/r/570969 (https://phabricator.wikimedia.org/T224576) [23:57:54] (03PS2) 10Dzahn: 
installserver: create new role without HTTP/APT, rename existing role [puppet] - 10https://gerrit.wikimedia.org/r/570969 (https://phabricator.wikimedia.org/T224576)