[03:00:46] 10serviceops, 10Operations, 10Release-Engineering-Team, 10Performance-Team (Radar), and 2 others: All debug hosts give (likely spurious) message: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) - https://phabricator.wikimedia.org/T214734 (10Krinkle)
[03:01:54] 10serviceops, 10Analytics, 10Event-Platform, 10Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (10Krinkle)
[03:03:42] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10Krinkle)
[03:04:04] 10serviceops, 10Operations, 10Performance-Team: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10Krinkle)
[03:04:26] 10serviceops, 10Core Platform Team, 10Performance-Team, 10Release-Engineering-Team-TODO, and 4 others: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (10Krinkle)
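(Editor's note: T245464 above proposes replacing microtime() with a monotonic clock for MediaWiki performance measurements. The motivation is that microtime() reads the wall clock, which NTP can step backwards or forwards mid-measurement, while a monotonic clock only moves forward. Below is a minimal sketch of the difference, in Python rather than PHP; the function names are illustrative, not MediaWiki code.)

```python
import time

def timed(fn):
    """Measure fn's duration with a monotonic clock.

    time.time() (like PHP's microtime(true)) reads the wall clock, which
    can jump when NTP adjusts the system clock, so a duration computed
    from it can be wrong or even negative. time.monotonic() only moves
    forward, so the difference of two readings is always >= 0.
    """
    start = time.monotonic()
    fn()
    return time.monotonic() - start

if __name__ == "__main__":
    print(timed(lambda: sum(range(1_000_000))))
```

(PHP 7.3+ offers the same guarantee natively via hrtime(true), which returns a monotonic nanosecond counter.)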
[06:27:54] 10serviceops, 10Analytics, 10Event-Platform, 10Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (10Joe) The number of errors has gone up during the weekend, making this even more absurd. This is the debug log for a session that fails:...
[07:21:45] 10serviceops, 10Analytics, 10Event-Platform, 10Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (10Joe) This seems to be a recurring issue with envoy and some upstream applications, see for instance https://github.com/envoyproxy/envoy/...
[08:31:27] we have the same problem as last week: the subteam meeting is at the same time as the full team meeting. What do we want to do? _joe_ akosiaris effie -- we can't ask the others because they should be sleeping now
[08:33:56] <_joe_> let's cancel and update the doc independently?
[08:37:48] that seems fine to me. If we decline the invite then the others will see when they come online
[08:44:15] +1
[08:47:51] I've blanked out the status doc
[08:50:05] thanks
[10:29:46] 10serviceops, 10Analytics, 10Event-Platform, 10Patch-For-Review, 10Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (10akosiaris) >>! In T247484#5971229, @Joe wrote: > This seems to be a recurring issue with envoy and some upstream a...
[12:11:06] <_joe_> akosiaris: puppet ran everywhere and the errors have disappeared now
[12:11:23] <_joe_> I am ok with this approach to pick up the pace of the transition
[12:36:19] 10serviceops, 10Operations, 10Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Volans) ganeti1009 is set as Staged in Netbox and missing in PuppetDB, so it's reported by the Netbox report. What should be the correct state for now?
[13:34:51] 10serviceops, 10Operations, 10Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10akosiaris) >>! In T228924#5971978, @Volans wrote: > ganeti1009 is set as Staged in Netbox and missing in PuppetDB, so it's reported by the Netbox report. > What sh...
[13:35:57] _joe_: ok, let's wait it out until tomorrow though? We've had the impression the errors stopped at least 3 times up to now
[13:36:09] I honestly did not expect so many errors during the weekend
[13:42:43] <_joe_> akosiaris: ah sure, I have other things to fix btw
[14:27:43] 10serviceops, 10Operations, 10Thumbor, 10Wikimedia-Logstash, and 2 others: Stream Thumbor logs to logstash - https://phabricator.wikimedia.org/T212946 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is complete (i.e. {T242609}), resolving. Feel free to reopen though!
[14:44:16] serviceopsen: our status meeting is at the same time as the SRE weekly this week and next, due to Daylight Confusion Time -- do we want to cancel or reschedule?
[14:45:30] it looks like calendars are clear if we bump it an hour earlier to match (1530 ETC)
[14:45:37] except possibly mutante's
[14:45:49] er, *UTC
[14:47:12] 10serviceops, 10Core Platform Team, 10WMF-JobQueue: RunSingleJob.php timeout too low at 180 seconds - https://phabricator.wikimedia.org/T247622 (10Joe) a:03Joe
[14:47:35] aw man, ignore me, I thought I'd read scrollback but I missed that whole conversation :(
[14:47:41] canceling sgtm
[15:06:20] we already declined en masse hoping that would answer the question :-)
[15:13:18] 10serviceops, 10Core Platform Team, 10WMF-JobQueue: RunSingleJob.php timeout too low at 180 seconds - https://phabricator.wikimedia.org/T247622 (10Joe) 05Open→03Resolved The max execution time has been raised to 1200 seconds on the jobrunner. The errors should disappear from now on.
[15:16:16] 10serviceops, 10Analytics, 10Event-Platform, 10Patch-For-Review, 10Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (10Joe) a:03Joe
[15:16:58] <_joe_> rlazarus, akosiaris, apergos: do you have anything to add to the SRE doc?
[15:17:43] Not this time, thanks for checking
[15:21:03] _joe_: /me looking
[15:21:41] <_joe_> akosiaris: I added eventstreams for now
[15:22:15] <_joe_> btw retry_on: reset possibly didn't do the trick, it's either too restrictive or not enough
[15:22:25] <_joe_> I hate that I have to look at the sources
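(Editor's note: the retry_on: reset at 15:22 is envoy's retry-policy value that re-sends a request whose upstream connection was reset before any response arrived. Below is a minimal client-side sketch of the same retry-on-reset idea in Python; the port and path are hypothetical, and this is neither Wikimedia's envoy configuration nor the EventBus code.)

```python
import http.client

def post_with_retry_on_reset(host, port, path, body, max_retries=1):
    """POST to an upstream, retrying only if the connection is reset
    before a response arrives (roughly envoy's retry_on: reset)."""
    attempt = 0
    while True:
        conn = http.client.HTTPConnection(host, port, timeout=5)
        try:
            conn.request("POST", path, body,
                         {"Content-Type": "application/json"})
            resp = conn.getresponse()
            return resp.status, resp.read()
        except (ConnectionResetError, http.client.RemoteDisconnected):
            attempt += 1
            if attempt > max_retries:
                raise  # surfaces as "Unable to deliver all events"
        finally:
            conn.close()

# Hypothetical call against a local sidecar proxy:
# post_with_retry_on_reset("127.0.0.1", 6004, "/v1/events", '{"k": "v"}')
```

(The tuning problem _joe_ describes is visible here: retrying only on reset misses other transient failures, while retrying more broadly risks delivering non-idempotent events twice.)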
[15:23:07] I don't think so -- I was going to give a switchover update, but I might wait until there's a little more certainty on whether we're still on schedule after the virus changes
[15:23:31] depends in part on whether we'd still want to do a switchover if we couldn't do eqiad maintenance at the same time, for example
[15:24:09] I guess I'll post the dates anyway and just say it's uncertain
[15:26:37] <_joe_> yeah, I wouldn't give anything for certain in this situation if it's not scheduled at least in May
[15:27:25] yeah, "not for certain" is a given :) I'm mostly just deciding whether to share the uncertain stuff or wait until we know more
[18:36:53] 10serviceops, 10Operations: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10Dzahn)
[18:37:25] 10serviceops, 10Operations: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10Dzahn)
[18:37:40] 10serviceops, 10Operations: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10Dzahn)
[18:38:02] 10serviceops, 10Operations: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10Dzahn)
[18:38:51] 10serviceops, 10Operations, 10Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10Dzahn)
[18:38:59] 10serviceops, 10Operations, 10Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10Dzahn)
[19:06:37] 10serviceops, 10Performance-Team, 10Wikimedia-Site-requests, 10Technical-Debt: Remove "Cache-control: no-cache" hack from wmf-config - https://phabricator.wikimedia.org/T247783 (10Krinkle)
[19:40:35] 10serviceops, 10Operations, 10ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10wiki_willy) a:05Christopher→03Cmjohnson
[20:37:22] 10serviceops, 10Operations, 10Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1221.eqiad.wmnet` - mw1221.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Found...
[20:46:03] 10serviceops, 10Operations, 10Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[1222-1226].eqiad.wmnet` - mw1222.eqiad.wmnet (**PASS**) - Downtimed host on Icinga...
[22:06:33] Pchelolo: https://github.com/wikimedia/restbase/pull/1245 is waiting for your review